Official eMule-Board: Backward Compatible Protocol Extension - Official eMule-Board

Jump to content


  • (2 Pages)
  • +
  • 1
  • 2

Backward Compatible Protocol Extension

#1 User is offline   loisel 

  • Newbie
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 24-February 03

Posted 24 February 2003 - 05:49 AM

Hello,

I'm about to go off and see if I can extend the protocol in mldonkey in a backwards-compatible manner. My problem is that I find the granularity of the hash is insufficient.

My dev question is, before I go implement this in mldonkey, does emule have a similar project going on that I should be aware of? I don't think we want another zillion and a half protocol extensions to complicate everyone's life like the source exchange protocols.

Here is a short description of my protocol extension:

For each file A of size larger than 10 KB (say), I will create a file A.hash. This A.hash file will contain an array of md5 or sha hashes of file A, each hash covering only 10KB of file A. The usual md5 hash of file A will also be included in file A.hash. A client aware of this extension would never show both A and A.hash in a search window, of course.

If the A.hash file is available, it is downloaded first. Then, the client knows much more precisely what the final file should look like, and so can decide whether fragments are corrupted early on. The same usual 9MB granularity hashing is left in for backwards-compatibility.

As an additional note, if A.hash is also of size more than 10KB (say), then a hash of it will be generated, although I would probably call it A.hash1 instead of A.hash.hash, and so on. One would first get the very last hash file, then work one's way backwards.


I apologize for not contributing this extension to emule, but I do not have access to MSVS7.net.

My main motivation is to minimize the effect of adversarial clients who send bogus information to corrupt large chunks of files. If every 10KB is checkable separately, it makes intentional corruption much more difficult and detectable.

I hope I found the proper channel, I can't figure out who I need to contact.

Sebastien Loisel
loisel at math dot mcgill dot ca

PS (Edit -loisel): Thanks for pointing that out, VQB. (Please read the Forum Rules. Edit -VQB)

This post has been edited by loisel: 24 February 2003 - 07:27 AM

0

#2 User is offline   roversr13 

  • -=/ SR 13 /=-
  • PipPipPipPip
  • Group: Members
  • Posts: 216
  • Joined: 25-December 02

Posted 24 February 2003 - 06:03 AM

First, 10k looks kinda TOO small. I doubt it is good to decrease size of hashed content beyond few hundreds kb. Maybe 180k (which happens to be size of compressed block emule transfers in one go)? Second, having hashes lying around would be kinda annoying, so where it is to have such files for currently downloading files, for other complete shared files it is better to store data in some centralized database, like known.met works now. Third, I don't see any gain in recursive hashing.

This post has been edited by roversr13: 24 February 2003 - 06:05 AM

0

#3 User is offline   loisel 

  • Newbie
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 24-February 03

Posted 24 February 2003 - 06:42 AM

Well,

1) 10KB was just an example.
2) The details are irrelevant, as my question is not about the details of this protocol extension, but whether there's already something of that form in the works for emule.
3) The recursive hashing is so that the problem with hard-to-detect-easily-exploited corrupted files isn't moved to hard-to-detect-easily-exploited corrupted hash files (this would happen if the hash file were large, like a few megabytes, which might happen on a file several gigabytes in size.)
4) For a discussion of why a very small granularity is desirable, see this thread.

However, I have a question to add. If I patched lmule instead of mldonkey, would that be more interesting to the people here?

Sebastien Loisel

This post has been edited by loisel: 24 February 2003 - 08:05 AM

0

#4 User is offline   SlugFiller 

  • The one and only master slug
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 6,988
  • Joined: 15-September 02

Posted 24 February 2003 - 03:21 PM

Well, the bandwidth load would be high, but I admit that doing that(plus getting said list from multiple users for safety) would allow detection of mid-corrupts. Won't help against MD4 cracking, though, but currently that is more rumor than issue(although fully feasable).
Why haven't you clicked yet?

SlugFiller rule #1: Unsolicited PMs is the second most efficient method to piss me off.
SlugFiller rule #2: The first most efficient method is unsolicited eMails.
SlugFiller rule #3: If it started in a thread, it should end in the same thread.
SlugFiller rule #4: There is absolutely no reason to perform the same discussion twice in parallel, especially if one side is done via PM.
SlugFiller rule #5: Does it say "Group: Moderators" under my name? No? Then stop telling me about who you want to ban! I really don't care! Go bother a moderator.
SlugFiller rule #6: I can understand English, Hebrew, and a bit of Japanese(standard) and Chinese(mandarin), but if you speak to me in anything but English, do expect to be utterly ignored, at best.
0

#5 User is offline   loisel 

  • Newbie
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 24-February 03

Posted 24 February 2003 - 05:09 PM

Again, I'd like to emphasize that I'm not trying to discuss implementation details, but I'd like to know if there's some sort of project prevent this attack to the edonkey protocol.

As for the bandwidth, even the recursive one would only require something like 0.5% more transfer (which you may or may not view as big.) As for the resilience, so long as you begin with the md4 hash to A.hash (and not the md4 hash to A), it is no easier to attack A.hash than it is to attack A.

I'll give this thread a couple of days before I decide what to do. Hopefully, if someone's working on this already, by then they'll pipe up.

Sebastien Loisel
0

#6 User is offline   Wildnomad 

  • Newbie
  • Pip
  • Group: Members
  • Posts: 11
  • Joined: 15-January 03

Post icon  Posted 24 February 2003 - 08:02 PM

I've been thinking something similar for the last three days, and I've just one question to ask... What if the "bad-guys" put multiple 'A.hash' files on the network?

That would be a problem for the clients as they could choose the wrong one and spread it.

Got the idea? I've been thinking how to differentiate the good hash tables from the bad ones... In one hand, they could be optionally and manually loaded from web pages you trust.

On the other hand, there is another (this one automated) approach which involves public-key cryptography. This way, you can install any number of certificates (available from trusted web-pages, like ShareReactor) into your client. When a new release is out, a signed hash table is created along with it. Then you just download, use and spread the hash tables signed by one of the trusted certificates you own...

On the later I've got tons of ideas, and I think it would work very well. It's more difficult to describe the method than implementing it ;)

Any comments welcomed!
0

#7 User is offline   SlugFiller 

  • The one and only master slug
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 6,988
  • Joined: 15-September 02

Posted 24 February 2003 - 09:05 PM

xrmb already found a proper and simpler solution for malicious corrupt-chunks attacks.
Why haven't you clicked yet?

SlugFiller rule #1: Unsolicited PMs is the second most efficient method to piss me off.
SlugFiller rule #2: The first most efficient method is unsolicited eMails.
SlugFiller rule #3: If it started in a thread, it should end in the same thread.
SlugFiller rule #4: There is absolutely no reason to perform the same discussion twice in parallel, especially if one side is done via PM.
SlugFiller rule #5: Does it say "Group: Moderators" under my name? No? Then stop telling me about who you want to ban! I really don't care! Go bother a moderator.
SlugFiller rule #6: I can understand English, Hebrew, and a bit of Japanese(standard) and Chinese(mandarin), but if you speak to me in anything but English, do expect to be utterly ignored, at best.
0

#8 User is offline   loisel 

  • Newbie
  • Pip
  • Group: Members
  • Posts: 8
  • Joined: 24-February 03

Posted 25 February 2003 - 12:21 AM

Are you referring to this thread? If so, I don't think that's either proper nor simpler. It will fix the immediate problem with people sending you zero chunks, but the malicious corrupt-chunkers just need to start sending things that aren't zeros. Not all malicious users care if they send a corrupt packet that doesn't compress; some malicious users' goal is to ruin your download, and they don't care if they waste their bandwidth. xrmb's proposal is a heuristic to detect corrupt chunks, and it is a very simple heuristic; it is very easy to imagine a way to defeat this heuristic.

I want to know if there's a more solid algorithm for detecting corrupt chunks in the pipeline. (I'm beginning to think there isn't.)
0

#9 User is offline   Elandal 

  • Nobody in particular
  • PipPipPipPipPipPipPip
  • Group: Debugger
  • Posts: 579
  • Joined: 25-November 02

Posted 25 February 2003 - 04:58 AM

Jumping into the thread..

So, you're thinking about adding files containing hashes, that a new client that understood the files could fetch and use for additional integrity testing? Considering that that'd need a new client, although not a new protocol request, how would that differ from extending the protocol itself by adding a new "extended hash request" command instead of new filetype?

Still, now that I read a little about various hashes, I'm not sure how much md5 or SHA would add in protecting against malicious clients. Yes, they're better hashes than md4, but based on the same basic algorithm. Tiger looked a nice hash that adds protection against known md4 attacks that might work in some form against md5 and SHA, too, so why not use Tiger? I'm OK with SHA, too, should that be chosen, but..

Also, you're proposing adding second hash layer with different hash algorithm and smaller chunk size, but still basic sequential block hash mechanism.
Simple calculation with eg. block size 150kB and file size 700MB (assuming that it's typical in ed2k network) comes to about 4800 hash blocks. With md5 (128bit) that'd be 77kB of hashes, with SHA1 (160bit) 96kB, with SHA-256 153kB, and with Tiger (192bit) 115kB.
With binary treehash, the amounts would be about twice those. I haven't checked different treehash types or implementations, but I assume going from binary tree to wider would be OK, and would decrease total hashset size from twice to easily just 1.4 times the blockhashset size.

Of course these hashsets must be stored somewhere, and it must be possible to request them, as with current md4 blockhashset. Going with binary tree Tiger and 150kB blocksize, that means about 330 bytes per megabyte of the content in hashsets. Not a big deal mostly, but with large shared files lists it starts taking space.. From 220GB it'd be 70MB in hashsets.. And then assuming I had, in known files list, additionally roughly ten times that, so the hashset would come to 770MB total..
Of course with old files the hashsets could be prunes to just eg. 4 levels below filehash, for total of about 760 bytes per file, which would on large files (100MB+) save at least 98% of the hashset size, bringing it back to manageable levels. Of course how and when to remove finegrained hashses would require some thinking.

Now, assuming a simple file with hashset in it, starting download of a 700MB file would require first getting 115kB (192bit Tiger sequential block hashset with blocksize of 150kB) or perhaps twice that (binary tree hashset). And with the file, you either get it or don't - you can't choose which parts of the hashset you'd need.
With protocol extension, the client could request just a part of the hashset. Assuming it's the first contact, and it doesn't have any hashsets, it'd probably request filehash, second layer, and path to the block it's requesting, bringing this down to about 14 hashses or so - a very small amount that can be moved in a control packet. Later on, it could request, when requesting a block, the hashses it's missing to get the path to that block.

Now, when deciding on the hashing mechanism to be adopted, perhaps other clients should be investigated, too? I've understood that Shareza (or however it's spelled) and Gnutella2 both use Tigertree with merkle trees, but what about other clients? It would be beneficial if the hashing algorithm would be the same as with some other client, so independent of actual file search and transfer protocols, same file fingerprints could be used in different clients.
Oh yes, and what was that Bitzi bitprint? Was it based on Tiger, too?
Which blocksizes do these use? are the hash fingerprints compatible with multiple clients already? And if so, could that hash algorithm be adopted so that the same fingerprint would be useful in emule, too?
0

#10 User is offline   Champ 

  • Splendid Member
  • PipPipPipPip
  • Group: Members
  • Posts: 122
  • Joined: 14-November 02

Post icon  Posted 25 February 2003 - 07:43 AM

Hi,
i agree if there is an change in the Hashing there should be some Points that we should take care.

1. An Secure Hash function (MD4+MD5 is definitly none), i would sugest SHA1 or RIPMED-160
2. After an long Discusion in the Chat i agree that the Hash-Tree System provide an Nice Way to allow
- user defined granularity of checking.
3. We should not look to hard how other Files Descriptions of other P2P system where Build. Because
till now eDonkey system seemsthe most stable platform that have also many users.

So if we made an CUT than an real one, by defining an new "hash Protocol" . Also we should try
too ask mlDonkey and ShareReaktor if they will support it and maybe even Jed but i do nit thing he
will change.

My Tip is:

- SHA1 or RIPMED-160
- Minimum Block Size of 10k
- Using an Hashtree, but not with binary , but 16 Nodes Agregated
+ This reduce dramatical the count of levels and it safe overhead of memory,badnwith and cpu
0

#11 User is offline   VQB 

  • carrots, lots and lots of carrots
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 7,095
  • Joined: 23-September 02

Posted 25 February 2003 - 10:15 AM

Champ, on Feb 24 2003, 11:43 PM, said:

Minimum Block Size of 10k

This seems too fine for two reasons (both of which may be off-base):

(1) is it really important to be able to verify what amounts to a few seconds of transfer?

(2) might it be easier to find collisions for 10k block sizes? Wouldn't it be possible to try all possible block contents to see if one resulted in the same hash or do these hashing routines provide for unique values for all 8^10k combinations? I suppose "easier" could be relative since 8^10k is a pretty big number. ;)

A basic block size of 100kb would reduce the hashset size by 90% and this is a known and certain "expense" relative to the possibility (perhaps increasing) of detecting bad data. The files I typically DL have minimal corruption problems so my concerns are naturally more towards considering the overhead.

-VQB
0

#12 User is offline   Champ 

  • Splendid Member
  • PipPipPipPip
  • Group: Members
  • Posts: 122
  • Joined: 14-November 02

Posted 25 February 2003 - 10:29 AM

The point about the overhead is not compleatly true.
Hash Tree have the advantage that you can tell your client to request only larger hash.level
Because larger file blocks can are hashed by the way first hash small blocks and than hash that hash values.
so the client can chose what hash deep he wan't to verify.

Second the point about collisions is wrong in two way's.
1. Brute force scan of even 512 Bit is considered as unpractical.
2. Wich each Byte / Bit more is the block length, you give more "free variable" in the hash term.
As you should know from school with more variables it is easyier to find solutions.
If you think for each bit of the HASH as an forumular of 10240*8 variables. And Currently 128 Formular that need to fit.
And your sugestion with 102400*8 variables. Here you can choose more values you like that make the solve less dificult.
0

#13 User is offline   SlugFiller 

  • The one and only master slug
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 6,988
  • Joined: 15-September 02

Posted 25 February 2003 - 11:54 AM

There is not such thing as people that "just want to corrupt your files", if they wanted that they wouldn't send random data, as was mentioned in another thread it's easy for them to plant 64-byte spots of corrupt data that wouldn't be detected by MD4, so you'd complete the file, but then, all of the sudden it's like "What's this? Why isn't the zip opening? Why does the movie have errors? Why did the program freeze?". There's really very little we can do against that.

What your solution offers are ways to check for chunk-corruption before the 9mb ends so you won't have to redownload 9 whole mbs. As it stands, the only current attack which works to corrupt a chunk is the 0-filled chunk attack. Since a real malicious attack wouldn't allow itself to be detected with either a hash or a sub-hash, and a leeching based attack wouldn't use incompressable data, I think xrmb's patch is enough to validate a fix.
Why haven't you clicked yet?

SlugFiller rule #1: Unsolicited PMs is the second most efficient method to piss me off.
SlugFiller rule #2: The first most efficient method is unsolicited eMails.
SlugFiller rule #3: If it started in a thread, it should end in the same thread.
SlugFiller rule #4: There is absolutely no reason to perform the same discussion twice in parallel, especially if one side is done via PM.
SlugFiller rule #5: Does it say "Group: Moderators" under my name? No? Then stop telling me about who you want to ban! I really don't care! Go bother a moderator.
SlugFiller rule #6: I can understand English, Hebrew, and a bit of Japanese(standard) and Chinese(mandarin), but if you speak to me in anything but English, do expect to be utterly ignored, at best.
0

#14 User is offline   VQB 

  • carrots, lots and lots of carrots
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 7,095
  • Joined: 23-September 02

Posted 25 February 2003 - 02:12 PM

@Champ

Thanks for the explanation. As I wrote... I thought I could be off-base. I like the user-selectable depth on the "fineness" of the hashset.

-VQB
0

#15 User is offline   Elandal 

  • Nobody in particular
  • PipPipPipPipPipPipPip
  • Group: Debugger
  • Posts: 579
  • Joined: 25-November 02

Posted 26 February 2003 - 07:15 PM

Regarding the "user-selectable depth", there is one pretty big problem..
Unless a client first gets hashses for the blocksize that is actually hashed (remember that with treehashes, higher level hashses are hashes of hashes), there is no way to check blocks except by getting all required for the hash level (eg. with 16-node aggregate the second level is 16*blocksize), hashing the blocks, hashing the hashses to get second level hash, and then comparing. And if the hash doesn't match, the client has absolutely no idea which of the blocks is corrupted.

And if a the hash depth is selected by the user, eg. I would definitely not agree in storing hashes for less than 100k - it'd just take too much diskspace. So, with eg. 10k blocksize and 16-node tree, I'd store second level hashes only, or 160kB aggregate hash. If another client asked for single-block hash, my client could not provide it.

So, each client must have hashes available down to the blocksize for each file they share. Which means that blocksize must be such that the diskspace requirements aren't big.

What's some one gig of hashes compared to 220 gigs of files? Not much, you might say.. But more than I'm willing to reserve for hashsets. So, the atomic block must be of reasonable size.
0

#16 User is offline   Champ 

  • Splendid Member
  • PipPipPipPip
  • Group: Members
  • Posts: 122
  • Joined: 14-November 02

Posted 26 February 2003 - 09:26 PM

Hi,
we are not talking about an raltion 1 to 220 we are talking about 1/512
20Byte to 10240byte
because you only need to store the lowest level.
And this mean less than 1MB for one CD of data much less than currently even one bad block cause.
0

#17 User is offline   Elandal 

  • Nobody in particular
  • PipPipPipPipPipPipPip
  • Group: Debugger
  • Posts: 579
  • Joined: 25-November 02

Posted 26 February 2003 - 10:05 PM

I was assuming 192bit hash, which would be 24 bytes, using binary merkle tree. Also, I was assuming storing the whole hashstree, which in binary tree would be about twice the lowest layer. As I did note earlier, if wider tree is used, it'll use less storage.

So, with 160bit hash, not tree but sequential blockhash, and 10kB blocksize, yes - it'd be 1/512. But, with 192bit hash it'd already be 1/426, and adding just 16-node tree structure would bring it to about 1/400.

However, if that much would be used by hashses, I probably should move the mule to my shared files volume, where one gig isn't that much. On smaller volumes it is quite a lot.

Also note that while I agree in that fighting corruption is important, and that 9500kB is pretty large a block, 10k is really, really small when used in conjunction with 16-node tree. 10kB is really not much, and even 100kB wouldn't be much. The important thing of course would be to require each client to deliver full blocks after that, so any client could be flagged after it has delivered just one block.
0

#18 User is offline   Lobuz 

  • Premium Member
  • PipPipPipPipPip
  • Group: Members
  • Posts: 289
  • Joined: 01-November 02

Posted 27 February 2003 - 04:07 AM

I see that it's starting to create some cross-client "eDonkey Developers Community". That's cool B)

Now before we choose the algorythms we should investigate any weaknesses of proposals. There were some discussions and solutions:

https://www1.ietf.or...t/msg00121.html
http://zgp.org/piper...ber/000998.html

I hope this will be fruitful for all.

Regards
Lobuz

ps. Did I mention that I love idea of the hashtree. It could increase the speed of files propagation and reliability. :D
0

#19 User is offline   Champ 

  • Splendid Member
  • PipPipPipPip
  • Group: Members
  • Posts: 122
  • Joined: 14-November 02

Posted 27 February 2003 - 07:48 AM

Hi,
there papers describing the hashtree and and problem see problems that are not too critical for your protocol
1. The one attack szenario say that one file contain S1 , S2 with hash h1 , h2 and an second file h1 , h2 as data.
But if we use blocklen greater than hashlength h1,h2 would not be used as single block but become mixed.
2. We identify files ONLY by hash values without there size, even we know the size. That could be used to.
Because with given size it build always the samte hashtree, because we use fixed block len. and do not say split the file
in an fixed count of blocks.
0

#20 User is offline   BadWolf63 

  • Premium Member
  • PipPipPipPipPip
  • Group: Members
  • Posts: 261
  • Joined: 11-November 02

Posted 27 February 2003 - 11:04 AM

well... maybe i'm saying a very stupid thing and if so forgive me please...

Why don't leave things as they are and use a doubled hash (md4 & another one) on the same blocks? it should increase very little the overhead (always based on 700Mb files) and i think it should be very difficoult to fake both of them at the same time....


BadWolf
0

  • Member Options

  • (2 Pages)
  • +
  • 1
  • 2

1 User(s) are reading this topic
0 members, 1 guests, 0 anonymous users