Official eMule-Board: Handle Downloads Of Almost Identical Files As One - Official eMule-Board


Handle Downloads Of Almost Identical Files As One, especially in case of same file size and almost identical hashset

#1 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 28 May 2012 - 03:38 PM

If you have two (or more) downloads of files that are the same size and have almost identical hashsets, eMule should exchange downloaded data between those files to avoid unnecessary use of upload bandwidth. That way we would need to download just one file plus the chunks that differ. With the use of AICH (and if we are lucky), we might need just 180 KB of the second file to complete it instead of hundreds of megabytes or whatever its size is.

Of course it should not matter if one of the files is already completed; the user should be asked for permission to import the identical chunks anyway.
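A minimal sketch of the comparison described above, assuming each download's hashset is available as a list of per-part hash values. The function names are hypothetical and not part of eMule. The two size constants are background facts about ed2k/AICH rather than something stated in this thread: parts are 9,728,000 bytes and AICH blocks are 180 KB.

```python
from typing import List, Sequence

PART_SIZE = 9_728_000       # ed2k part ("chunk") size in bytes
AICH_BLOCK_SIZE = 184_320   # AICH block size in bytes (180 KB)
# Ceiling division: number of AICH blocks needed to cover one full part.
BLOCKS_PER_PART = -(-PART_SIZE // AICH_BLOCK_SIZE)


def matching_parts(hashset_a: Sequence[bytes], hashset_b: Sequence[bytes]) -> List[int]:
    """Return the indices of parts whose hashes are equal in both hashsets.

    Hypothetical helper: only handles hashsets with the same part count,
    i.e. files of (roughly) the same size.
    """
    if len(hashset_a) != len(hashset_b):
        return []
    return [i for i, (a, b) in enumerate(zip(hashset_a, hashset_b)) if a == b]


def differing_parts(hashset_a: Sequence[bytes], hashset_b: Sequence[bytes]) -> List[int]:
    """Return the indices of parts that would still have to be downloaded."""
    same = set(matching_parts(hashset_a, hashset_b))
    return [i for i in range(len(hashset_a)) if i not in same]


# Placeholder hash values, just to show the shape of the result.
first = [b"h0", b"h1", b"h2", b"h3"]
second = [b"h0", b"hX", b"h2", b"h3"]
print(matching_parts(first, second))   # [0, 2, 3]
print(differing_parts(first, second))  # [1]
```

Only part 1 differs in this example, so only that part (or, with AICH, only the differing 180 KB blocks inside it) would need to come from the network.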

Another useful thing is shown in this screenshot:

Posted Image

The files marked in red are almost identical; just one out of over 60 chunks is different. Furthermore, as you can see, the second file is dead (never seen complete since I started the download), while that same chunk is already downloaded in the first one. So my eMule could actually have fixed that already if it had imported this chunk from the other download.

Another useful thing, maybe for more advanced users, would be to import chunks from files that are no longer shared, or from files with a different file size (if you have a "shorter version" of that file).


PS: I'm pretty sure similar feature requests have been posted, but I couldn't find them to post the dead-file-rescue-thingy, so I made my own.
How to post correctly! (pay special attention to point 2)
For everyone who has downloaded something and doesn't know what to do with it: endun.gen.

My Computers: LinkDesk LinkLap
BOINC ...and you can always say you're working on a science project.

#2 tHeWiZaRdOfDoS

  • Man, what a bunch of jokers...
  • Group: Members
  • Posts: 5,391
  • Joined: 28-December 02

Posted 28 May 2012 - 05:50 PM

Link64, on 28 May 2012 - 04:38 PM, said:

If you have two (or more) downloads of files that are the same size and have almost identical hashsets, eMule should exchange downloaded data between those files to avoid unnecessary use of upload bandwidth. That way we would need to download just one file plus the chunks that differ. With the use of AICH (and if we are lucky), we might need just 180 KB of the second file to complete it instead of hundreds of megabytes or whatever its size is.

That sounds nice but can't be done automatically. Of course you can always use a mod and import parts :)

As we are just using checksums, collisions ARE possible and ARE happening and even though a block or chunk may have the same hash as another, it can contain different data.

#3 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 28 May 2012 - 07:14 PM

tHeWiZaRdOfDoS, on 28 May 2012 - 07:50 PM, said:

That sounds nice but can't be done automatically. Of course you can always use a mod and import parts :)

Well, the user should be asked of course if eMule discovers something like that. Until the user has decided whether he wants to import the chunks, eMule could prefer to download those that can't be imported, or alternatively even begin restoring the file as far as possible, including AICH recovery of the differing chunks, so that the user can view some results when he gets back to his computer and can then decide whether to keep it like that or download the file the traditional way. It's not like we can't make that configurable, so that everyone can choose his preferred default behavior.
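The "prefer to download those that can't be imported" idea can be sketched roughly as follows. This is only an illustration under assumptions: `plan_parts` and its parameters are hypothetical names, hashsets are plain lists of per-part hashes, and completed parts are tracked as sets of indices.

```python
from typing import List, Sequence, Set, Tuple


def plan_parts(
    target_hashset: Sequence[bytes],
    donor_hashset: Sequence[bytes],
    target_done: Set[int],
    donor_done: Set[int],
) -> Tuple[List[int], List[int]]:
    """Split the target's missing parts into two lists.

    download_first: parts the donor file cannot provide.
    importable:     parts the donor already has with a matching hash,
                    which could wait until the user has decided.
    """
    download_first: List[int] = []
    importable: List[int] = []
    for index, wanted in enumerate(target_hashset):
        if index in target_done:
            continue  # nothing to do for parts that are already complete
        donor_has_it = (
            index in donor_done
            and index < len(donor_hashset)
            and donor_hashset[index] == wanted
        )
        if donor_has_it:
            importable.append(index)
        else:
            download_first.append(index)
    return download_first, importable
```

With this split, the importable parts cost no bandwidth while the user's decision is pending, and nothing is lost if he declines: they are simply queued for normal download afterwards.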

BTW, I know there are different ways of doing that (with mods or other tools), and I have done that several times, but I'm asking for it here because I think it would be useful to have it in the official eMule, so that all I (and other eMule users) need to do is click OK when eMule discovers by itself that it could possibly import some parts of a file from another one.



tHeWiZaRdOfDoS, on 28 May 2012 - 07:50 PM, said:

As we are just using checksums, collisions ARE possible and ARE happening and even though a block or chunk may have the same hash as another, it can contain different data.

I know, but back to the real world: if I have two files of exactly the same size and with basically the same hashset (I'm not talking about files smaller than one chunk), how often does it happen that they are actually two completely different files? I've never seen that, but I have seen a lot of files which differ in just a "few" bytes.

We already work with checksums to verify everything and to recover partially damaged chunks; I think we can get this done with them as well, it's basically the same stuff.

#4 fox88

  • Golden eMule
  • Group: Members
  • Posts: 3,708
  • Joined: 13-May 07

Posted 29 May 2012 - 07:05 AM

Link64, on 28 May 2012 - 10:14 PM, said:

We already work with checksums to verify everything and to recover partially damaged chunks; I think we can get this done with them as well, it's basically the same stuff.

We could try that because the probability of having exactly the same checksum with different contents should be low.
But there is another question: how often would eMule see exactly the same chunks in real files? Thinking about the files I have downloaded, my estimate would be close to 0. Is it worth the trouble of implementing at all?

#5 tHeWiZaRdOfDoS

  • Man, what a bunch of jokers...
  • Group: Members
  • Posts: 5,391
  • Joined: 28-December 02

Posted 29 May 2012 - 07:22 AM

Link64, on 28 May 2012 - 08:14 PM, said:

I know, but back to the real world: if I have two files of exactly the same size and with basically the same hashset (I'm not talking about files smaller than one chunk), how often does it happen that they are actually two completely different files? I've never seen that, but I have seen a lot of files which differ in just a "few" bytes.

We already work with checksums to verify everything and to recover partially damaged chunks; I think we can get this done with them as well, it's basically the same stuff.

To compare the hashsets you have to exchange them which may cost a LOT of (in most cases unnecessary) overhead...

I can remember a similar request that I found very useful that included checking your local files for blocks/chunks with the same hashset... AFAIK there was at least one P2P program out there that uses that system (splitting files into tiny chunks) to protect their users from getting sued.

#6 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 29 May 2012 - 07:46 AM

fox88, on 29 May 2012 - 09:05 AM, said:

But there is another question: how often would eMule see exactly the same chunks in real files? Thinking about the files I have downloaded, my estimate would be close to 0. Is it worth the trouble of implementing at all?

Well, that probably depends on what files you are using eMule for. If you are using it only for fairly new files, you'll probably not see many such "twins". You will, however, see them with old files that were released many years ago, maybe even before eMule existed. The "shorter versions" of files very probably come from Kazaa (FastTrack) or a similar network, where files were not downloaded in chunks but simply from beginning to end (IIRC), while versions with zero-filled ed2k-sized chunks must have been created from incomplete ed2k downloads. This feature would be especially interesting for files that differ in just a few bytes (or KB), which however usually make the difference between a good and a broken video file.

I wouldn't have posted this FR if I didn't see the use for it often enough.

#7 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 29 May 2012 - 07:53 AM

tHeWiZaRdOfDoS, on 29 May 2012 - 09:22 AM, said:

To compare the hashsets you have to exchange them which may cost a LOT of (in most cases unnecessary) overhead...

You get the hashset for the file when you start the download anyway. AICH hashsets would only be needed for chunks which are not identical, so the overhead would be the same as for any other AICH-based chunk recovery.

#8 niRRity

  • Avid Post Editor
  • Group: Members
  • Posts: 1,219
  • Joined: 28-January 03

Posted 09 June 2012 - 07:56 AM

This should be the eMule developers' holy grail. If implemented correctly it can greatly increase file availability and prevent incomplete files in the network.

#9 Some Support

  • Last eMule
  • Group: Yes
  • Posts: 3,411
  • Joined: 27-June 03

Posted 11 June 2012 - 10:29 PM

I don't think this can be implemented properly. First of all, with the current hash system it would only work if really only some bytes are different at one specific position of a file. I suppose that happens, but much more common are changes which affect the whole file (like truncating or adding bytes), for example when someone edits the metadata of a picture or mp3 file. So the possible use of such a feature would be very limited to begin with.

But it goes on: eMule cannot just download parts of a different file just because it seems to be more available. What if some malicious attacker spreads malware by changing some parts of an otherwise trusted/verified file which a user wants to download via a link? Making sure that a downloaded file is 100% identical to the one which was requested is one of the top priorities of filesharing clients. So eMule couldn't do this automatically, but would have to ask the user to decide, and the decision to be made would probably be too complex for most users.

Now I do agree that there are some cases where such a function would have advantages, but in the end it really doesn't seem to be worth it and adds unnecessary complexity to the interface menus.

#10 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 12 June 2012 - 08:28 AM

Some Support, on 12 June 2012 - 12:29 AM, said:

I don't think this can be implemented properly. First of all, with the current hash system it would only work if really only some bytes are different at one specific position of a file. I suppose that happens, but much more common are changes which affect the whole file (like truncating or adding bytes), for example when someone edits the metadata of a picture or mp3 file. So the possible use of such a feature would be very limited to begin with.

In the case of small files (1 chunk) like pictures or mp3s there is no hashset, only the hash, so this feature will not kick in there anyway. It would apply only to large files with some small changes within the file, or possibly some bytes added at the end of the file (if bytes are added at the beginning of the file, the entire hashset changes, so it won't work there).
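A small demonstration of that point, as a sketch under stated assumptions: it uses a toy part size of 8 bytes instead of the real ed2k part size, and SHA-1 from Python's hashlib as a stand-in for the MD4 part hashes ed2k actually uses. The helper names are made up for illustration.

```python
import hashlib
from typing import List


def hashset(data: bytes, part_size: int) -> List[bytes]:
    """Hash every fixed-size part of the data (SHA-1 as a stand-in for MD4)."""
    return [
        hashlib.sha1(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]


def unchanged_parts(a: List[bytes], b: List[bytes]) -> List[int]:
    """Indices at which both hashsets hold the same part hash."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


PART = 8                      # toy part size, for illustration only
original = bytes(range(32))   # exactly 4 parts
base = hashset(original, PART)

patched = bytearray(original)
patched[10] ^= 0xFF                 # flip one byte inside part 1
appended = original + b"\x00\x01"   # extra bytes at the end
prepended = b"\xff" + original      # one extra byte at the beginning

print(unchanged_parts(base, hashset(bytes(patched), PART)))  # [0, 2, 3]
print(unchanged_parts(base, hashset(appended, PART)))        # [0, 1, 2, 3]
print(unchanged_parts(base, hashset(prepended, PART)))       # []
```

The appended case keeps every original part here only because the original length is an exact multiple of the part size; if the last part were partial, its hash would change as well.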



Some Support, on 12 June 2012 - 12:29 AM, said:

eMule cannot just download parts of a different file just because it seems to be more available. What if some malicious attacker spreads malware by changing some parts of an otherwise trusted/verified file which a user wants to download via a link?

If someone downloads a file via a link, he probably has just this file in his downloads. I'm not asking for an automatic search for files which could possibly be used to complete the download; eMule should only use the files it has in its share or download list anyway.



Some Support, on 12 June 2012 - 12:29 AM, said:

Making sure that a downloaded file is 100% identical to the one which was requested is one of the top priorities of filesharing clients.

Isn't that done at the end of the download anyway? Spreading bad/malicious data can also be done through the original file with a bad client, so I don't see much difference here. Sometimes we get bad data which has to be downloaded again; nothing unusual.



Some Support, on 12 June 2012 - 12:29 AM, said:

So eMule couldn't do this automatically, but would have to ask the user to decide, and the decision to be made would probably be too complex for most users.

Hmm... could eMule (with default settings) not just try to first get all the needed chunks, then try in the background to make a file out of them and see if it matches the hashes? What eMule does right now is not that different: it gets parts of the data from all over the world, and when it thinks it has all it needs for that file, it checks the entire file...
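The "assemble first, then check against the hashes" idea could look roughly like this. It is a sketch, not eMule's actual verification code: `trial_assemble` is a hypothetical name, parts are held in memory for brevity, and SHA-1 stands in for the MD4 part hashes.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def part_hash(data: bytes) -> bytes:
    """Stand-in part hash (ed2k itself uses MD4 for part hashes)."""
    return hashlib.sha1(data).digest()


def trial_assemble(
    expected: Sequence[bytes],
    candidates: Dict[int, bytes],
) -> Tuple[Dict[int, bytes], List[int]]:
    """Accept only candidate parts whose hash matches the wanted file.

    expected:   the hashset of the file the user actually requested.
    candidates: part index -> data taken from any source (network or a
                similar local file).
    Returns the accepted parts and the indices that still need downloading.
    """
    accepted: Dict[int, bytes] = {}
    still_missing: List[int] = []
    for index, wanted in enumerate(expected):
        data = candidates.get(index)
        if data is not None and part_hash(data) == wanted:
            accepted[index] = data
        else:
            still_missing.append(index)
    return accepted, still_missing
```

Because every part is checked against the hashset of the requested file, a part taken from a similar file is treated no differently from one received from a remote client: it is either accepted or downloaded again.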



Some Support, on 12 June 2012 - 12:29 AM, said:

Now I do agree that there are some cases where such a function would have advantages, but in the end it really doesn't seem to be worth it and adds unnecessary complexity to the interface menus.

Enable thru preferences.ini?

#11 DatHebIkWeer

  • Advanced Member
  • Group: Members
  • Posts: 66
  • Joined: 07-July 12

Posted 23 July 2012 - 03:36 PM

I would say your request would not really work based only on the full file hash, since a similar hash does not necessarily mean a similar file. But if the file is essentially the same, the hashes of most blocks will be identical. In that case it would be possible. But dangerous. Keep in mind the hash is not the actual file. It is just an indicator that the file is probably identical or definitely is not. What will you do if the chunk looked identical but actually is totally different? eMule has no way of identifying that anymore.
Even in cases where the file looks identical to the user or the file is actually almost identical (2 versions of the same video: same quality, same subtitles, same everything, but somewhere at the beginning 1 byte is added), this will be extremely difficult. To achieve a merge in a situation like that, eMule would have to hash the file several times over, maybe hundreds of times, and still not be successful. Given the time and performance that takes, it may be easier to just download the obvious chunks in the normal way.

In very rare cases this would actually be possible. And it might be a saviour in a case where an incomplete download has been reshared as a new file.
In that case the partial hashes of all the completed chunks would be identical to the original.
Maybe it's something the machine can't decide, but the user can.

By the way, if it is about videos or files like that, repairing a broken download with another similar broken download will be easier and more successful with a dedicated editor for that type of file.

#12 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 23 July 2012 - 05:07 PM

DatHebIkWeer, on 23 July 2012 - 05:36 PM, said:

I would say your request would not really work based only on the full file hash, since a similar hash does not necessarily mean a similar file. But if the file is essentially the same, the hashes of most blocks will be identical.

Hence I wrote "almost identical hashset"; the file hash alone (with or without the AICH hash) is useless here, and there is also no such thing as a "similar" hash.



DatHebIkWeer, on 23 July 2012 - 05:36 PM, said:

In that case it would be possible. But dangerous. Keep in mind the hash is not the actual file. It is just an indicator that the file is probably identical or definitely is not. What will you do if the chunk looked identical but actually is totally different? eMule has no way of identifying that anymore.

We have more than one hash to work with, so at least at the end of a download eMule should notice that something is wrong. Also, this is not intended for files where one out of 100 blocks has the same hash, but pretty much for the opposite case. I mean, we are already doing nothing else than that: we take blocks of data from all over the world, write them in the desired order and then check whether we got what we wanted.



DatHebIkWeer, on 23 July 2012 - 05:36 PM, said:

Even in cases where the file looks identical to the user or the file is actually almost identical (2 versions of the same video: same quality, same subtitles, same everything, but somewhere at the beginning 1 byte is added), this will be extremely difficult.

If 1 byte is added at the beginning, the entire hashset is different, so it does not work at all. That's also not a case this feature is intended for; neither a human nor eMule can recognize before the download that parts of such files are (binary) identical.



DatHebIkWeer, on 23 July 2012 - 05:36 PM, said:

By the way, if it is about videos or files like that, repairing a broken download with another similar broken download will be easier and more successful with a dedicated editor for that type of file.

Not really, a simple hex editor is way better: you can simply copy the entire chunk from one file to the other. Actually a hex editor is the only way to get the file completed; with "normal" editing tools you can be pretty sure you won't get exactly the same file (binary the same, not human the same; the latter is uninteresting, as it only adds one more file to the network). And that's the job I wanted eMule to do: simply copy the chunk from one file to another.
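What the hex-editor procedure amounts to can be sketched in a few lines. This is an illustration under assumptions, not how eMule or any existing tool does it: `copy_part` is a hypothetical name, both files are assumed to be preallocated to the same size, and SHA-1 stands in for the MD4 part hash.

```python
import hashlib


def copy_part(src_path: str, dst_path: str, index: int,
              part_size: int, expected_hash: bytes) -> bool:
    """Copy part `index` from src to dst at the same offset.

    The copy happens only if the source data hashes to the value the
    destination file expects for that part; otherwise dst is left alone.
    """
    offset = index * part_size
    with open(src_path, "rb") as src:
        src.seek(offset)
        data = src.read(part_size)
    if hashlib.sha1(data).digest() != expected_hash:
        return False  # not binary identical to the wanted part
    with open(dst_path, "r+b") as dst:  # r+b: overwrite in place, no truncation
        dst.seek(offset)
        dst.write(data)
    return True
```

Checking the hash before writing keeps the destination binary-identical to the requested file, which is the point made above about "binary the same, not human the same".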

This post has been edited by Link64: 23 July 2012 - 08:28 PM


#13 root676

  • Newbie
  • Group: Members
  • Posts: 11
  • Joined: 28-May 12

Posted 24 July 2012 - 01:38 AM

That may be a golden gate, but I agree with SomeSupport that the feature would probably be just too advanced for most users. How many users do you think are out there who can decide whether the chunks they are downloading are good or bad? I think probably not many, because not many would have such a great understanding of the reality they are living in. Maybe it would work if the feature were added in some mod.

#14 DatHebIkWeer

  • Advanced Member
  • Group: Members
  • Posts: 66
  • Joined: 07-July 12

Posted 28 July 2012 - 10:17 AM

Of course this would only work in a very limited number of cases, because the user needs to be downloading both files and has to know they are similar.

I don’t know what your exact situation is, and maybe I'm saying something stupid now because I haven’t tried it yet, but if you have the file containing the chunks missing in the other one, would it be possible to import the first file into eMule using the .part.met file of the other one?
If the files are mostly identical, that may filter out the chunks you can use and make them available for the file you want to complete. You could use that import as the basis for a new download of the second file or use a trick to merge it into it (like 2 clients running at the same time).

Or else a file merge feature that would import the file and just take the good chunks to put into the other file would be wonderful.

This post has been edited by DatHebIkWeer: 28 July 2012 - 10:18 AM


#15 Link64

  • Golden eMule
  • Group: Members
  • Posts: 2,009
  • Joined: 25-January 04

Posted 28 July 2012 - 01:08 PM

DatHebIkWeer, on 28 July 2012 - 12:17 PM, said:

I don’t know what your exact situation is, and maybe I'm saying something stupid now because I haven’t tried it yet, but if you have the file containing the chunks missing in the other one, would it be possible to import the first file into eMule using the .part.met file of the other one?
If the files are mostly identical, that may filter out the chunks you can use and make them available for the file you want to complete. You could use that import as the basis for a new download of the second file or use a trick to merge it into it (like 2 clients running at the same time).

That's what tools like Metfileregenerator can do; eMule can't. I've done that a few times already: downloaded one file first, used it to complete the 2nd file with Metfileregenerator as far as possible, imported it into eMule, and downloaded the missing chunks (i.e. those that are different). Then, when I had both files, I could decide which one has no (or at least fewer) errors.



DatHebIkWeer, on 28 July 2012 - 12:17 PM, said:

Or else a file merge feature that would import the file and just take the good chunks to put into the other file would be wonderful.

Well, yes, that would be part of the whole thing, since one file can be downloaded already.
