Official eMule-Board: Rc4crypt Optimization (approx. 50% Faster) - Official eMule-Board


Rc4crypt Optimization (approx. 50% Faster) Remove useless code

#1 User is offline   netfinity 

  • Master of WARP
  • Group: Members
  • Posts: 1658
  • Joined: 23-April 04

Posted 01 May 2008 - 09:01 PM

The RC4Crypt routine is one of the most CPU-intensive in the eMule code and can easily be improved by removing the '% 256' operations. Those operations are pointless: an unsigned byte can only hold values in the range 0 to 255, and each result is assigned straight back to a uint8, so the remainder changes nothing. However, the presence of the remainder-by-256 operations causes the compiler to cast from uint8 to int, clear all but the low 8 bits, and then convert back to uint8. That is a complete waste and doubles the amount of code the compiler generates.

void RC4Crypt(const uchar* pachIn, uchar* pachOut, uint32 nLen, RC4_Key_Struct* key){
	ASSERT( key != NULL && nLen > 0 );
	if (key == NULL)
		return;
	
	uint8 byX = key->byX;
	uint8 byY = key->byY;
	uint8* pabyState = &key->abyState[0];
	uint8 byXorIndex;

	for (uint32 i = 0; i < nLen; i++)
	{
		byX = (byX + 1) /*% 256*/;
		byY = (pabyState[byX] + byY) /*% 256*/;
		swap_byte(&pabyState[byX], &pabyState[byY]);
		byXorIndex = (pabyState[byX] + pabyState[byY]) /*% 256*/;
		
		if (pachIn != NULL)
			pachOut[i] = pachIn[i] ^ pabyState[byXorIndex];
	}
	key->byX = byX;
	key->byY = byY;
}
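
The claim that '% 256' is a no-op here can be checked exhaustively. The following is a minimal sketch, not eMule code: the helper names AddWithModulo, AddWithoutModulo and VariantsAgree are made up for illustration, and std::uint8_t stands in for eMule's uint8 typedef.

```cpp
#include <cstdint>

// Index update as originally written: the operands are promoted to int,
// added, reduced modulo 256, then narrowed back to a byte.
std::uint8_t AddWithModulo(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>((a + b) % 256);
}

// Index update as proposed: the narrowing conversion alone keeps the low 8 bits.
std::uint8_t AddWithoutModulo(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Returns true if both variants agree for every pair of byte values.
bool VariantsAgree()
{
    for (int a = 0; a < 256; ++a)
    {
        for (int b = 0; b < 256; ++b)
        {
            const std::uint8_t x = static_cast<std::uint8_t>(a);
            const std::uint8_t y = static_cast<std::uint8_t>(b);
            if (AddWithModulo(x, y) != AddWithoutModulo(x, y))
                return false;
        }
    }
    return true;
}
```

Calling VariantsAgree() covers all 65536 input pairs. Whether the two variants also compile to the same machine code depends on the compiler and its settings, which is what the rest of the thread is about.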

eMule v0.50a [NetF WARP v0.3a]
- Compiled for 32 and 64 bit Windows versions
- Optimized for fast (100Mbit/s) Internet connections
- Faster file completion via Dynamic Block Requests and dropping of stalling sources
- Faster searching via KAD with equal or reduced overhead
- Less GUI lockups through multi-threaded disk IO operations
- VIP "Payback" queue
- Fakealyzer (helps you choose the right files)
- Quality Of Service to keep eMule from disturbing VoIP and other important applications (Vista/7/8 only!)

#2 User is offline   Some Support 

  • Last eMule
  • Group: Yes
  • Posts: 3667
  • Joined: 27-June 03

Posted 01 May 2008 - 09:44 PM

Indeed, when working with unsigned values those are probably not needed. They were just added for safety back then; however, if they really use up considerable CPU resources, they should probably be removed.

#3 User is offline   netfinity 

  • Master of WARP
  • Group: Members
  • Posts: 1658
  • Joined: 23-April 04

Posted 01 May 2008 - 09:54 PM

You can easily see the difference in code size if you halt eMule while in debug mode and then show the disassembly for the RC4Crypt function.

#4 User is offline   tHeWiZaRdOfDoS 

  • Man, what a bunch of jokers...
  • Group: Members
  • Posts: 5630
  • Joined: 28-December 02

Posted 02 May 2008 - 05:44 AM

And you can also use ++i instead of i++ - this will also give a (very small) speed improvement...

#5 User is offline   Some Support 

  • Last eMule
  • Group: Yes
  • Posts: 3667
  • Joined: 27-June 03

Posted 02 May 2008 - 10:05 AM

tHeWiZaRdOfDoS, on May 2 2008, 05:44 AM, said:

And you can also use ++i instead of i++ - this will also give a (very small) speed improvement...


That's a myth. A single ++i isn't faster than i++ (with i being a built-in type like int); the compiler will create exactly the same assembler code, at least the MS compiler (and every somewhat intelligent one).

#6 User is offline   tHeWiZaRdOfDoS 

  • Man, what a bunch of jokers...
  • Group: Members
  • Posts: 5630
  • Joined: 28-December 02

Posted 02 May 2008 - 10:43 AM

Well the postfix ++ would work like

int j = i;
i +=1;
return j;

while the prefix ++ would work like

i+=1;
return i;

which already shows that the latter is faster because it does not need to maintain a copy of the old value... but honestly I didn't check the ASM code created by the compilers; maybe they are (nowadays) smart enough to detect whether the return value is needed at all...

#7 User is offline   Some Support 

  • Last eMule
  • Group: Yes
  • Posts: 3667
  • Joined: 27-June 03

Posted 02 May 2008 - 11:03 AM

i++; ++i; and i += 1; (with i being a built-in type) will result in the same assembler code because they have the same semantic meaning (in this case). Back when computers were fed with floppy disks and punched cards that might have been different for some compilers, but it's the 21st century ;)
Anyway, I checked netfinity's suggestion and it turns out it really was a bad idea to add those %s; even release-optimized code is indeed much larger.

#8 User is offline   netfinity 

  • Master of WARP
  • Group: Members
  • Posts: 1658
  • Joined: 23-April 04

Posted 02 May 2008 - 12:46 PM

I did the tests by compiling the code with Visual C++ 2005 in 64-bit release build mode, and that is when I noticed the difference from removing the '% 256' code. And since the 64-bit compiler is generally much more effective at optimizing the code, I was quite certain it would be at least as bad with the 32-bit compiler.

As for the statement that ++i is faster than i++, I'd say it's true for composite objects but not for the basic types. I recall that the old 68k compiler I had for my Amiga some 20 years ago handled this. Actually, what a smart compiler would reckon is that the value j from the 'int j = i; i +=1; return j;' statement is never referenced, and it would therefore be eliminated. There are of course cases where the compiler might fail, and therefore it is safer to write ++i or i+=1, which produces fast code even with a non-optimizing compiler.
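
To illustrate the composite-object case, here is a minimal sketch with a made-up class, CountingIterator, that counts its own copy constructions. The class and the helpers CopiesWithPrefix and CopiesWithPostfix are hypothetical and only serve to make the extra copy of the postfix form visible.

```cpp
#include <cstddef>

// Hypothetical composite type that counts how often it is copy-constructed.
struct CountingIterator
{
    static std::size_t copies;  // copy constructions since the last reset
    int position;

    CountingIterator() : position(0) {}
    CountingIterator(const CountingIterator& other) : position(other.position) { ++copies; }

    // Prefix form: increment in place and return a reference, no copy needed.
    CountingIterator& operator++()
    {
        ++position;
        return *this;
    }

    // Postfix form: has to hand back the old value, so it copies first.
    CountingIterator operator++(int)
    {
        CountingIterator old(*this);
        ++position;
        return old;
    }
};

std::size_t CountingIterator::copies = 0;

// Advances an iterator `steps` times with ++it and reports the copies made.
std::size_t CopiesWithPrefix(int steps)
{
    CountingIterator::copies = 0;
    CountingIterator it;
    for (int n = 0; n < steps; ++n)
        ++it;
    return CountingIterator::copies;
}

// Advances an iterator `steps` times with it++ and reports the copies made.
std::size_t CopiesWithPostfix(int steps)
{
    CountingIterator::copies = 0;
    CountingIterator it;
    for (int n = 0; n < steps; ++n)
        it++;
    return CountingIterator::copies;
}
```

The prefix loop makes no copies, while the postfix loop makes at least one per step. The copy constructor here has a visible side effect (the counter), so the compiler cannot drop it; for a plain int there is no such side effect, and the unused old value can simply be discarded.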

A small side note about optimizing return values in Visual C++ which is somewhat related to the postfix ++ case when using class objects.

Consider the following code.

CSomeClass CSomeClass::DoSomething(int i)
{
	CSomeClass Result(i);
	return Result;
}

This code will first construct the object Result from i and then, at the return statement, call the copy constructor, because Result is a local object that is destroyed when the function returns and so cannot be handed to the caller directly. This copy operation can be eliminated (requires full optimizations to be enabled) by constructing the object inside the return statement. The code would then look like this.

CSomeClass CSomeClass::DoSomething(int i)
{
	return CSomeClass(i);
}
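
The difference between the two forms can be observed by counting copy-constructor calls. This is a rough sketch under assumptions: CTracked is a hypothetical stand-in for CSomeClass, and MakeNamed, MakeUnnamed, CopiesForNamed and CopiesForUnnamed are made-up free functions. The numbers depend on compiler, language standard and optimization settings: since C++17 the unnamed form is guaranteed not to copy, while eliding the copy of a named local remains optional.

```cpp
// Hypothetical stand-in for CSomeClass that counts copy-constructor calls.
class CTracked
{
public:
    static int copies;  // copy constructions since the last reset

    explicit CTracked(int i) : m_value(i) {}
    CTracked(const CTracked& other) : m_value(other.m_value) { ++copies; }

    int Value() const { return m_value; }

private:
    int m_value;
};

int CTracked::copies = 0;

// Named local returned by value: the compiler may copy it or elide the copy.
CTracked MakeNamed(int i)
{
    CTracked Result(i);
    return Result;
}

// Object constructed inside the return statement.
CTracked MakeUnnamed(int i)
{
    return CTracked(i);
}

// Reports how many copies one call to MakeNamed caused.
int CopiesForNamed()
{
    CTracked::copies = 0;
    CTracked result = MakeNamed(1);
    (void)result;
    return CTracked::copies;
}

// Reports how many copies one call to MakeUnnamed caused.
int CopiesForUnnamed()
{
    CTracked::copies = 0;
    CTracked result = MakeUnnamed(1);
    (void)result;
    return CTracked::copies;
}
```

Comparing CopiesForNamed() and CopiesForUnnamed() in a debug build and in a fully optimized build shows whether a given compiler still needs the second form to avoid the copy.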

#9 User is offline   fafner 

  • Advanced Member
  • Group: Members
  • Posts: 79
  • Joined: 02-October 04

Posted 05 May 2008 - 02:34 PM

Quote

The RC4Crypt routine is one of the most CPU consuming in the eMule code and can easily be improved by removing the '% 256' operations.

There's a similar statement in RC4CreateKey.
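
The eMule source for RC4CreateKey is not quoted in this thread, so the following is only a sketch of the textbook RC4 key schedule with byte-sized indices, not the actual eMule function. The names Rc4State, Rc4CreateKeySketch and Rc4HexSketch are hypothetical; the field names merely mirror those used in RC4Crypt above. The only remainder left is the one by the key length, which is genuinely needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical key state; field names mirror those used in RC4Crypt.
struct Rc4State
{
    std::uint8_t abyState[256];
    std::uint8_t byX;
    std::uint8_t byY;
};

// Textbook RC4 key schedule. The swap index is a byte, so it wraps without '% 256'.
// Assumes keyLen > 0.
void Rc4CreateKeySketch(const std::uint8_t* keyData, std::size_t keyLen, Rc4State& key)
{
    for (int i = 0; i < 256; ++i)
        key.abyState[i] = static_cast<std::uint8_t>(i);

    std::uint8_t j = 0;
    for (std::size_t i = 0; i < 256; ++i)
    {
        // The remainder by keyLen stays: it wraps the key, not a byte-sized index.
        j = static_cast<std::uint8_t>(j + key.abyState[i] + keyData[i % keyLen]);
        std::swap(key.abyState[i], key.abyState[j]);
    }
    key.byX = 0;
    key.byY = 0;
}

// Encrypts `text` with `keyText` and returns the result as upper-case hex,
// using the same '% 256'-free index updates as the RC4Crypt loop.
std::string Rc4HexSketch(const std::string& keyText, const std::string& text)
{
    Rc4State key;
    Rc4CreateKeySketch(reinterpret_cast<const std::uint8_t*>(keyText.data()),
                       keyText.size(), key);

    static const char digits[] = "0123456789ABCDEF";
    std::string hex;
    for (std::size_t n = 0; n < text.size(); ++n)
    {
        key.byX = static_cast<std::uint8_t>(key.byX + 1);
        key.byY = static_cast<std::uint8_t>(key.abyState[key.byX] + key.byY);
        std::swap(key.abyState[key.byX], key.abyState[key.byY]);
        const std::uint8_t xorIndex =
            static_cast<std::uint8_t>(key.abyState[key.byX] + key.abyState[key.byY]);
        const std::uint8_t out = static_cast<std::uint8_t>(
            static_cast<std::uint8_t>(text[n]) ^ key.abyState[xorIndex]);
        hex += digits[out >> 4];
        hex += digits[out & 0x0F];
    }
    return hex;
}
```

With the widely published test vector (key "Key", plaintext "Plaintext"), Rc4HexSketch should return "BBF316E8D940AF0AD3", which gives a quick check that dropping the remainders did not change the cipher.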
