I am trying to develop some crawlers linked to a database and a web site to make some charts about p2p networks, beginning with the edonkey network. It includes some graphs, mainly similar to the ed2khistory ones, but covering more servers.
My work is still at its beginning. I still have a lot of things to do, but the first graphs are beginning to talk: I can follow how files spread, see which files are the most shared, and, more generally, study the dynamics of the network.
The site is p2p-top50, and I'll try here to say in a few words what this way of watching the network has allowed me to 'discover'.
First, the most obvious: a little file (22 bytes), certainly a virus, has been lying around on the network since the beginning and replicates very well. For at least two months, it has been seen on 6000 to 16000 computers, under a great variety of appealing names (see a graph and the list of filenames it can take).
The ed2khistory for this file confirms the activity (on the razorback servers). It is difficult to estimate the spreading rate, but as the file is extremely small, I guess it is downloaded instantly, so the difference between the number of sources and the number of complete sources shows us that the file is actively spreading.
Second (but I guess you all know that), it shows that this network is mainly a European one. I am working on an indicator to estimate more precisely where the downloaders of each file are located, but clearly the top 50 speaks for itself: among the most downloaded files, there are French, German, Spanish, ... ones. We do not observe a great US domination, as we do in bittorrent. To me, this explains why bittorrent is said to be the most popular p2p client in the world and little is said about emule: the US media are much more represented on the web than any other, and US ISPs mainly see bittorrent traffic. I would be curious to see the results of a European study.
This leads me to a third aspect: I wanted an estimate of the number of times a popular file is downloaded and the amount of bandwidth that represents. I will take the most popular file of the moment as an example. It is copyrighted material, I know, but I can't do anything about that. Perhaps this could even be an obvious deduction: the most downloaded files are very often copyrighted material, but this is another debate ...
Back to our example: we can see on the graph that the file began spreading on 8 August 2005. I have comments on the first two days of spreading, but that will be another post. This file is clearly in a spreading state: the number of sources as well as the number of complete sources is increasing. The estimated number of downloads is 130000, representing 87 Tb of bandwidth in approx. 2 weeks (I make these estimates with an average download rate of 10kb/s; I still have no clear idea of what the 'real' average download rate is, but I think this one is a minimum for popular files). This shows the great efficiency of the network. My figures show great variability, but I think this is because of my crawlers, not because of the network. A look at ed2khistory confirms this assumption, so I still have room for improvement.
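The exact estimation method is not detailed in this post. Purely as an illustration, here is a minimal Python sketch of one way such an estimate could be built from crawler samples. It assumes that every incomplete source (sources minus complete sources) downloads at the average rate, and that 10kb/s means 10,000 bytes per second. The function name, the sample layout and the 700 MB file size in the usage lines are all hypothetical.

```python
from typing import List, Tuple

# One crawler observation: (unix_time_seconds, sources, complete_sources).
# This layout is an assumption made for the sketch, not the site's real schema.
Sample = Tuple[float, int, int]


def estimate_transfer(samples: List[Sample],
                      file_size_bytes: float,
                      rate_bytes_per_s: float = 10_000.0) -> Tuple[float, float]:
    """Return (bytes_transferred, equivalent_full_downloads).

    Assumption: between two consecutive samples, every incomplete source
    seen in the earlier sample downloads at `rate_bytes_per_s`.
    """
    total_bytes = 0.0
    for (t0, sources, complete), (t1, _, _) in zip(samples, samples[1:]):
        incomplete = max(sources - complete, 0)  # peers still downloading
        total_bytes += incomplete * rate_bytes_per_s * (t1 - t0)
    return total_bytes, total_bytes / file_size_bytes


# Hypothetical usage: three hourly samples for a 700 MB file.
samples = [(0, 1000, 400), (3600, 1200, 500), (7200, 1500, 700)]
total_bytes, downloads = estimate_transfer(samples, 700_000_000)
print(f"{total_bytes / 1e9:.1f} GB transferred, about {downloads:.0f} downloads")
```

With this kind of estimate, the result scales directly with the assumed rate, so a wrong average download rate shifts every figure by the same factor.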
If I take an older popular file which began spreading earlier, I get an estimate of nearly 300000 downloads, representing 216 Tb. I think this is a good order of magnitude for the number of downloads of a popular file. I did not have any idea of the number of downloads before doing those calculations. To me, this is a new kind of statistic available about the network, and an interesting one ....
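As a quick sanity check on the two estimates, one can divide the total volume by the number of downloads to get the file size each figure implies. This small sketch assumes that "Tb" here means terabytes (10^12 bytes); the function name is only for illustration.

```python
TB = 10**12  # assumption: "Tb" in the post means terabytes (10^12 bytes)
MB = 10**6


def implied_file_size_mb(total_tb: float, downloads: int) -> float:
    """File size in MB implied by a total volume and a download count."""
    return total_tb * TB / downloads / MB


recent = implied_file_size_mb(87, 130_000)   # the file that began spreading in August
older = implied_file_size_mb(216, 300_000)   # the older popular file

print(f"recent file: about {recent:.0f} MB per download")
print(f"older file:  about {older:.0f} MB per download")
```

Under that assumption, the two figures imply roughly 670 MB and 720 MB per download, so both estimates are consistent with files of around 700 MB.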
And last but not least, since the beginning of August, I can clearly see the activity of the fake servers, for example on this file or this one. We can see on those graphs that the dynamics of the downloads are completely false: the number of sources is almost always the same as the number of complete sources, as if it took no time to download those files!! Of course, this is due to servers falsely claiming they have a lot of sources. If we compare with the dynamics on the razorback servers, we can see that those files are 'rare' on 'true' servers and poorly exchanged, even if we can see the effect of the fake servers (an increase in downloads on 8 August).
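The pattern described above (sources almost always equal to complete sources) can be turned into a simple heuristic. The sketch below is only an illustration, not the site's actual detection: the function name and both thresholds are hypothetical and would need tuning on real data.

```python
from typing import List, Tuple


def looks_fake(samples: List[Tuple[int, int]],
               min_ratio: float = 0.95,
               min_share: float = 0.9) -> bool:
    """Flag a file whose complete sources track its sources almost exactly.

    samples: (sources, complete_sources) pairs reported over time.
    min_ratio and min_share are hypothetical thresholds: a sample counts as
    'flat' when complete/sources >= min_ratio, and the file is flagged when
    at least min_share of its samples are flat.
    """
    usable = [(s, c) for s, c in samples if s > 0]
    if not usable:
        return False  # no data, no verdict
    flat = sum(1 for s, c in usable if c / s >= min_ratio)
    return flat / len(usable) >= min_share


# Hypothetical numbers: a file advertised by a fake server vs. a file really spreading.
print(looks_fake([(5000, 4990), (5100, 5100), (5050, 5040)]))  # True
print(looks_fake([(1000, 400), (1200, 500), (1500, 700)]))     # False
```

Such a check only makes sense for large files: a very small file really can be downloaded almost instantly, so its two counts may stay close without any fake server being involved.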
To me, the explanation is clear: their goal is not to log your transactions. We've seen that the number of downloads per file is something like hundreds of thousands, so what would they do with those logs? Instead, they are trying the same strategy that worked on gnutella: put junk on the network ...
I think they tried that without fake servers first, but it didn't work. Now they are counting on people doing global searches and sorting by availability to spread their fakes. We will see if this strategy is 'better'; to me it seems that it's not working (except on my charts, but that is another story, I said I still have some work ...).
Just a final word to say that the development of this site has already taken me a lot of time, but it is a real pleasure. I hope this kind of information, which was not available before, will be useful to the community, and that I will be able to improve it thanks to your comments.
Excuse my bad English, but I'm French ...
Sebasto.
This post has been edited by Some Support: 22 August 2005 - 11:26 AM