Rc4crypt Optimization (approx. 50% Faster)

Rc4crypt Optimization (approx. 50% Faster) Remove useless code

#1 netfinity

Master of WARP

Group: Members
Posts: 1658
Joined: 23-April 04

Posted 01 May 2008 - 09:01 PM

The RC4Crypt routine is one of the most CPU-consuming in the eMule code and can easily be improved by removing the '% 256' operations. Those operations are pointless, since an unsigned byte can only hold values in the range 0 to 255 and storing the result back into a uint8 already discards everything above the low 8 bits. However, the existence of the remainder-by-256 operations causes the compiler to cast from uint8 to int, clear all bits except the low 8, and then convert back to uint8. That is a complete waste and doubles the amount of code the compiler generates.

void RC4Crypt(const uchar* pachIn, uchar* pachOut, uint32 nLen, RC4_Key_Struct* key){
	ASSERT( key != NULL && nLen > 0 );
	if (key == NULL)
		return;
	
	uint8 byX = key->byX;
	uint8 byY = key->byY;
	uint8* pabyState = &key->abyState[0];
	uint8 byXorIndex;

	for (uint32 i = 0; i < nLen; i++)
	{
		byX = (byX + 1) /*% 256*/;
		byY = (pabyState[byX] + byY) /*% 256*/;
		swap_byte(&pabyState[byX], &pabyState[byY]);
		byXorIndex = (pabyState[byX] + pabyState[byY]) /*% 256*/;
		
		if (pachIn != NULL)
			pachOut[i] = pachIn[i] ^ pabyState[byXorIndex];
	}
	key->byX = byX;
	key->byY = byY;
}
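
To illustrate why the '% 256' is redundant, here is a minimal sketch. It is not eMule code: the helper names AddWithMod, AddWithoutMod and VariantsAgree are made up for this example, and std::uint8_t stands in for eMule's uint8. The only assumption is that converting an int back to an 8-bit unsigned type keeps the low 8 bits, which C++ guarantees for unsigned types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index update written WITH the explicit remainder.
inline std::uint8_t AddWithMod(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>((a + b) % 256);
}

// Hypothetical helper: the same update WITHOUT the remainder.
// a + b is computed as int; converting the result to an 8-bit unsigned
// type reduces it modulo 256, so nothing is lost by dropping '% 256'.
inline std::uint8_t AddWithoutMod(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Returns true if both variants agree for all 256 * 256 input pairs.
inline bool VariantsAgree()
{
    for (int a = 0; a < 256; ++a)
    {
        for (int b = 0; b < 256; ++b)
        {
            const std::uint8_t x = static_cast<std::uint8_t>(a);
            const std::uint8_t y = static_cast<std::uint8_t>(b);
            if (AddWithMod(x, y) != AddWithoutMod(x, y))
                return false;
        }
    }
    return true;
}

// Small self-check that can be called from any test or main function.
inline void CheckVariants()
{
    assert(VariantsAgree());
}
```

Calling CheckVariants() from a test or from main should pass silently. Whether dropping the remainder also shrinks the generated code depends on the compiler and its settings, so the disassembly is still worth checking.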

eMule v0.50a [NetF WARP v0.3a]
- Compiled for 32 and 64 bit Windows versions
- Optimized for fast (100Mbit/s) Internet connections
- Faster file completion via Dynamic Block Requests and dropping of stalling sources
- Faster searching via KAD with equal or reduced overhead
- Less GUI lockups through multi-threaded disk IO operations
- VIP "Payback" queue
- Fakealyzer (helps you choose the right files)
- Quality Of Service to keep eMule from disturbing VoIP and other important applications (Vista/7/8 only!)

#2 Some Support

Last eMule

Group: Yes
Posts: 3667
Joined: 27-June 03

Posted 01 May 2008 - 09:44 PM

Indeed, when working with unsigned values those are probably not needed. They were just added for safety back then. However, if they really use up considerable CPU resources, they should probably be removed.

#3 netfinity

Master of WARP

Group: Members
Posts: 1658
Joined: 23-April 04

Posted 01 May 2008 - 09:54 PM

You can easily see the difference in code size if you halt eMule while in debug mode and then show the disassembly for the RC4Crypt function.

#4 tHeWiZaRdOfDoS

Man, what a bunch of jokers...

Group: Members
Posts: 5630
Joined: 28-December 02

Posted 02 May 2008 - 05:44 AM

And you can also use ++i instead of i++ - this will also give a (very small) speed improvement...

The first Kad only client: kMule is available, now!

Free and legal downloads - now on eMuleFuture!

#5 Some Support

Last eMule

Group: Yes
Posts: 3667
Joined: 27-June 03

Posted 02 May 2008 - 10:05 AM

tHeWiZaRdOfDoS, on May 2 2008, 05:44 AM, said:

And you can also use ++i instead of i++ - this will also give a (very small) speed improvement...

That's a myth. A single ++i isn't faster than i++ (with i being a built-in type like int); the compiler will create exactly the same assembler code, at least the MS compiler (and every somewhat intelligent one).

#6 tHeWiZaRdOfDoS

Man, what a bunch of jokers...

Group: Members
Posts: 5630
Joined: 28-December 02

Posted 02 May 2008 - 10:43 AM

Well the postfix ++ would work like

int j = i;
i +=1;
return j;

while the prefix ++ would work like

i+=1;
return i;

which already shows that the latter is faster, because it does not need to maintain a copy of the old value... but honestly I didn't check the ASM code created by the compilers; maybe they are (nowadays) smart enough to detect whether the return value is needed at all...

#7 Some Support

Last eMule

Group: Yes
Posts: 3667
Joined: 27-June 03

Posted 02 May 2008 - 11:03 AM

i++; ++i; and i += 1; (with i being a built-in type) will result in the same assembler code because they have the same semantic meaning (in this case). Back when computers were fed with floppy disks and punched cards that might have been different for some compilers, but it's the 21st century.

Anyway, I checked netfinity's suggestion and it turns out it really was a bad idea to add those %s; even release-optimized code is indeed much larger.

#8 netfinity

Master of WARP

Group: Members
Posts: 1658
Joined: 23-April 04

Posted 02 May 2008 - 12:46 PM

I did the tests by compiling the code with Visual C++ 2005 in 64-bit release build mode, and that was when I noticed the difference made by removing the '% 256' code. And since the 64-bit compiler is generally much more effective at optimizing the code, I was quite certain it would be at least as bad with the 32-bit compiler.

As for the claim that ++i is faster than i++, I'd say it's true for composite objects but not for the basic types. I recall that the old 68k compiler I had for my Amiga some 20 years ago already handled this. What a smart compiler would recognize is that the value j in 'int j = i; i += 1; return j;' is never referenced, and it would therefore be eliminated. There are of course cases where the compiler might fail, and therefore it is safer to write ++i or i += 1, which produce fast code even with a non-optimizing compiler.
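
The point about composite objects can be sketched as follows. Counter is a hypothetical class invented for this example (it is not from eMule); it counts copy-constructor calls so the extra copy made by the postfix operator becomes visible. This only shows the conventional way the two operators are written, not what a particular compiler emits.

```cpp
#include <cassert>

// Hypothetical composite type that counts how often it is copied.
struct Counter
{
    int value;
    static int copies;

    explicit Counter(int v) : value(v) {}
    Counter(const Counter& other) : value(other.value) { ++copies; }

    // Prefix ++: modify in place and return a reference, no copy needed.
    Counter& operator++()
    {
        ++value;
        return *this;
    }

    // Postfix ++: has to remember the old state, so at least one copy.
    Counter operator++(int)
    {
        Counter old(*this); // the copy the prefix form avoids
        ++value;
        return old;         // a further copy here may be elided
    }
};

int Counter::copies = 0;

// Number of copy-constructor calls caused by one prefix increment.
inline int CopiesForPrefix()
{
    Counter::copies = 0;
    Counter c(0);
    ++c;
    assert(c.value == 1);
    return Counter::copies;
}

// Number of copy-constructor calls caused by one postfix increment.
inline int CopiesForPostfix()
{
    Counter::copies = 0;
    Counter c(0);
    c++;
    assert(c.value == 1);
    return Counter::copies;
}
```

With this class, CopiesForPrefix() reports no copies, while CopiesForPostfix() reports at least one, because the copy constructor has a visible side effect and the copy into old cannot be optimized away. For a plain int there is no such side effect, which is why both forms compile to the same code there.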

A small side note about optimizing return values in Visual C++, which is somewhat related to the postfix ++ case when using class objects:

Consider the following code.

CSomeClass CSomeClass::DoSomething(int i)
{
	CSomeClass Result(i);
	return Result;
}

This code will first construct the object Result from i and then, at the return statement, call the copy constructor, because Result is a local object that has to be copied into the return value. This copy operation can be eliminated (requires full optimizations to be enabled) by constructing the object inside the return statement. The code would then look like this.

CSomeClass CSomeClass::DoSomething(int i)
{
	return CSomeClass(i);
}
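
A rough way to observe this is to count copy-constructor calls for both forms. The sketch below is an assumption-laden illustration, not eMule code: g_copyCount, MakeNamed, MakeUnnamed, CopiesNamed and CopiesUnnamed are hypothetical names, and the factory functions are static only to keep the example short. The actual counts depend on the compiler, the language standard and the optimization settings: compilers are allowed to elide the copy in the named case too, and since C++17 the unnamed case is guaranteed to make no copy.

```cpp
#include <cassert>

// Hypothetical instrumentation: counts copy-constructor calls.
static int g_copyCount = 0;

class CSomeClass
{
public:
    explicit CSomeClass(int i) : m_value(i) {}
    CSomeClass(const CSomeClass& other) : m_value(other.m_value) { ++g_copyCount; }

    int Value() const { return m_value; }

    // Named local object: the copy into the return value MAY be elided.
    static CSomeClass MakeNamed(int i)
    {
        CSomeClass Result(i);
        return Result;
    }

    // Object constructed inside the return statement.
    static CSomeClass MakeUnnamed(int i)
    {
        return CSomeClass(i);
    }

private:
    int m_value;
};

// Copies observed when returning a named local object.
inline int CopiesNamed()
{
    g_copyCount = 0;
    CSomeClass r = CSomeClass::MakeNamed(7);
    assert(r.Value() == 7);
    return g_copyCount;
}

// Copies observed when constructing inside the return statement.
inline int CopiesUnnamed()
{
    g_copyCount = 0;
    CSomeClass r = CSomeClass::MakeUnnamed(7);
    assert(r.Value() == 7);
    return g_copyCount;
}
```

Printing CopiesNamed() and CopiesUnnamed() in a debug build and in a fully optimized build shows how the compiler at hand treats the two forms. The unnamed form should never need more copies than the named one.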

#9 fafner

Advanced Member

Group: Members
Posts: 79
Joined: 02-October 04

Posted 05 May 2008 - 02:34 PM

Quote

The RC4Crypt routine is one of the most CPU consuming in the eMule code and can easily be improved by removing the '% 256' operations.

There's a similar statement in RC4CreateKey.

Official eMule-Board: Rc4crypt Optimization (approx. 50% Faster) - Official eMule-Board