Official eMule-Board: Emule with Intel Compiler 7.0 fail - Official eMule-Board


Emule with Intel Compiler 7.0 fail

#1 User is offline   Dummy 

  • Splendid Member
  • PipPipPipPip
  • Group: Members
  • Posts: 116
  • Joined: 13-September 02

Posted 16 December 2002 - 09:13 PM

I am trying to compile eMule 2.3b with Intel Compiler 7.0, but it fails, even though I have already disabled all of the Intel compiler's optimizations. Has anyone tried this?

#2 User is offline   xrmb 

  • Magnificent Member
  • PipPipPipPipPipPip
  • Group: Members
  • Posts: 442
  • Joined: 29-September 02

Posted 17 December 2002 - 01:04 AM

I only disabled the exceptions, then I could compile it...

5 min compile
10 min link + optimizer on a 2.4 GHz P4

= a 2.2 MB running eMule version... but it doesn't seem faster, because I don't know where eMule is slow :)

#3 User is offline   Dummy 

  • Splendid Member
  • PipPipPipPip
  • Group: Members
  • Posts: 116
  • Joined: 13-September 02

Posted 17 December 2002 - 01:24 AM

Wow!!! Then it seems I need to disable many of them, set them back to MS, and try again ^_^

#4 User is offline   Tahattmeruh2 

  • The One And Only
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 632
  • Joined: 29-November 02

Posted 12 February 2003 - 10:49 AM

xrmb, on Dec 17 2002, 01:04 AM, said:

I only disabled the exceptions, then I could compile it...

5 min compile
10 min link + optimizer on a 2.4 GHz P4

= a 2.2 MB running eMule version... but it doesn't seem faster, because I don't know where eMule is slow :)

Where can I disable the exceptions? I couldn't find an option for that.

Tahattmeruh

#5 User is offline   arafat 

  • Newbie
  • Pip
  • Group: Members
  • Posts: 6
  • Joined: 20-February 03

Posted 20 February 2003 - 06:13 PM

I am now running an Intel-compiled build of eMule 26d (thanks to Dummy) on an OCL Pentium II system.
eMule now runs more stably and with less CPU consumption. Even the multitasking environment (Windows XP) runs smoother, and eMule's responsiveness is much higher (about 50 percent).
I run eMule at Low priority on my system, because Normal priority is not needed for a program that works in the background. I am using a traffic-shaping PPPoE driver (cFos, www.cfos.de), and it is worth it as well.
I don't use an upload limit; the PPPoE driver takes care of that, so I can still read my e-mail and write this message.


Amen

#6 User is offline   .:fl0yd:. 

  • Magnificent Member
  • PipPipPipPipPipPip
  • Group: Members
  • Posts: 355
  • Joined: 17-February 03

Posted 20 February 2003 - 07:26 PM

xrmb, on Dec 17 2002, 01:04 AM, said:

but seems not faster, because i dont know where emule is slow :)

Heavy use of 16bit integers, something that is especially bad on P4's. Each and every time a developer puts in a int16/sint16/uint16 and accesses it, the entire pipeline stalls. There's nothing even a beast like ic7 can do about dull-minded developers that don't know their hardware....

#7 User is offline   stobo 

  • Advanced Member
  • PipPipPip
  • Group: Members
  • Posts: 96
  • Joined: 26-November 02

Posted 21 February 2003 - 08:43 AM

the whole pipeline stalls?? way to go intel :D

(not to start a war but..)

#8 User is offline   .:fl0yd:. 

  • Magnificent Member
  • PipPipPipPipPipPip
  • Group: Members
  • Posts: 355
  • Joined: 17-February 03

Posted 21 February 2003 - 11:48 AM

stobo, on Feb 21 2003, 08:43 AM, said:

the whole pipeline stalls??

Yip, that is it -- whenever you access a 16bit-value in a 32bit-application, the entire pipeline needs to be flushed. This is necessary since the cpu has to switch to 16bit mode to process those values. Go through the compiled machine code and look for 0x66 prefixed opcodes -- the prefix is used to switch modes.

To eliminate the need for a flame war: This behaviour isn't just present on Intel cpu's -- running the code on AMD processors gives the same result. As a side note, this effect is also present in other architectures, that can run in different modes, e.g. StrongARM's RISC and Thumb-mode.

So what have we learned if nothing else? Avoid 16bit values whenever possible. 8bit values are a different story though. When accessing an 8bit subpart of a register, the worst thing that can happen is a read-after-write stall, which is far less intrusive. Moreover, x86 cpu's have a number of special cases where stalling is completely eliminated, e.g. using sub al, 127 instead of add al, 128 will not force a read-after-write stall. Compilers do know about these things.

#9 User is offline   SunMaster 

  • Premium Member
  • PipPipPipPipPip
  • Group: Members
  • Posts: 277
  • Joined: 03-December 02

Posted 21 February 2003 - 12:24 PM

Quote

Yip, that is it -- whenever you access a 16bit-value in a 32bit-application, the entire pipeline needs to be flushed. This is necessary since the cpu has to switch to 16bit mode to process those values. Go through the compiled machine code and look for 0x66 prefixed opcodes -- the prefix is used to switch modes.


http://www.intel.com...um/instform.htm

http://www.packetsto...bly/opcode.html

A Pentium doesn't "switch mode" when it accesses a data size other than the native one associated with a segment. The value you are referring to has been called an override prefix for as long as I can remember, and it never had any penalty associated with it apart from the overhead of the one added instruction byte. The same kind of prefixing applies to addressing.

At no stage did the Pentium need to "switch to 16 bit mode" in order to process non-32-bit data, and the overhead of sticking exclusively to 32-bit data and addresses would have been very large in many contexts. Considering that Intel CPUs have variable instruction lengths, it seems odd that a change of mode would have to be done. Maybe this has changed in later Pentiums. Do you care to supply a pointer to support your claim?

#10 User is offline   .:fl0yd:. 

  • Magnificent Member
  • PipPipPipPipPipPip
  • Group: Members
  • Posts: 355
  • Joined: 17-February 03

Posted 21 February 2003 - 12:47 PM

SunMaster, on Feb 21 2003, 12:24 PM, said:

do you care to supply a pointer to support your claim ?

Well, you were close already, only 8 to 18 years late.... Intel's developer site offers all those software optimization white papers. Is that enough?

#11 User is offline   SunMaster 

  • Premium Member
  • PipPipPipPipPip
  • Group: Members
  • Posts: 277
  • Joined: 03-December 02

Posted 21 February 2003 - 12:52 PM

Nope, I'm afraid you'd have to do better than that.

#12 User is offline   stobo 

  • Advanced Member
  • PipPipPip
  • Group: Members
  • Posts: 96
  • Joined: 26-November 02

Posted 24 February 2003 - 11:20 PM

Without giving any reference, I would just say that it would be worth the task, for both Intel and AMD, to weed out any stupid "mode changes" you claim "db 66h" to have.

That would get them, like, a 1000% boost in applications that only access 16-bit data in arrays. If the Pentium 4 flushed all its 18 stages in both pipelines, that would be like, instead of stopping your car at a red light, stepping out of it, changing the tyres and checking the oil before continuing.

No, I don't think so. db 66h probably isn't even a micro-op, but rather a single bit set in the instruction decoder unit. And yes, it belongs right before mov ax,dx, thank you.

#13 User is offline   SunMaster 

  • Premium Member
  • PipPipPipPipPip
  • Group: Members
  • Posts: 277
  • Joined: 03-December 02

Posted 25 February 2003 - 12:13 AM

Oh, I made a little test for it, but my CPU is a k7 so I don't know how it works on intel.

I didn't plan to wade through whitepapers to do this, but created a little testprogram, simple but afaik should test this.

#include <stdio.h>
#include <time.h>

/* Times three loops of two loads each: 32+16 bit, 32+32 bit and 16+16 bit.
   Uses MSVC-style inline assembly, so it builds as 32-bit x86 code only. */
int main (int argc, char *argv[])
{
    unsigned long data_32 = 0xaabbccdd;
    unsigned short data_16 = 0xeeff;

    clock_t start, elapsed1, elapsed2, elapsed3;
    double duration1, duration2, duration3;

    start = clock ();

    /* Loop 1: one 32-bit load and one 16-bit load per iteration */
    __asm
    {
        mov     ecx, 0xffffffff
    lab1:
        mov     eax, dword ptr [data_32]
        mov     bx, word ptr [data_16]
        loopnz  lab1
    }
    elapsed1 = clock ();

    /* Loop 2: two 32-bit loads per iteration */
    __asm
    {
        mov     ecx, 0xffffffff
    lab2:
        mov     eax, dword ptr [data_32]
        mov     ebx, dword ptr [data_32]
        loopnz  lab2
    }
    elapsed2 = clock ();

    /* Loop 3: two 16-bit loads per iteration */
    __asm
    {
        mov     ecx, 0xffffffff
    lab3:
        mov     ax, word ptr [data_16]
        mov     bx, word ptr [data_16]
        loopnz  lab3
    }
    elapsed3 = clock ();

    /* Convert clock ticks to seconds for each loop */
    duration1 = (double) (elapsed1 - start) / CLOCKS_PER_SEC;
    duration2 = (double) (elapsed2 - elapsed1) / CLOCKS_PER_SEC;
    duration3 = (double) (elapsed3 - elapsed2) / CLOCKS_PER_SEC;
    printf ("%f, %f, %f\n", duration1, duration2, duration3);
    return 0;
}


Five runs of this little piece of code gave the following average values on my XP 2000+:

Loop1 (16 + 32) = 12.42 sec (min 12.22, max 12.55)
Loop2 (32 + 32) = 11.51 sec (min 11.34, max 11.89)
Loop3 (16 + 16) = 14.31 sec (min 14.24, max 14.47)

In other words, mixing 16+32 in a loop doing nothing but moving data took 7.9% more time than 32-bit only. Using 16-bit only took 24.4% more time.

Anyone care to run it on Intels to compare? Not that it would have any relevance to anything, just to satisfy my curiosity.

#14 User is offline   LoneStar 

  • Golden eMule
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 1,005
  • Joined: 27-September 02

Posted 25 February 2003 - 12:39 AM

Hmmm, trying for you, but no go yet. My zero flag is set when the loopnz comes up, so it doesn't loop. I'm looking into it and will post when I figure something out (been a LOOOONG time since I've assembled :))

The clock() call is setting the flag. I'm just going to replace the loopnz's with loop's. Should accomplish the same thing, without worrying about the flag.
-D

This post has been edited by LoneStar: 25 February 2003 - 12:46 AM


#15 User is offline   SunMaster 

  • Premium Member
  • PipPipPipPipPip
  • Group: Members
  • Posts: 277
  • Joined: 03-December 02

Posted 25 February 2003 - 12:51 AM

Just change loopnz to loop, which is what it should have been.

#16 User is offline   LoneStar 

  • Golden eMule
  • PipPipPipPipPipPipPip
  • Group: Members
  • Posts: 1,005
  • Joined: 27-September 02

Posted 25 February 2003 - 12:58 AM

Running now, I'll post when they're complete...

Processor    16/32       32/32       16/16       (16/32 / 16/16, relative to 32/32)

P4-1.8GHz      7.2436      6.5434      7.2252    (  10.7% /  10.4% )
P3-800MHz     30.9602     30.7186     46.4858    (   0.8% /  51.3% )
P2-333MHz     81.7472     79.2048    125.7926    (   3.2% /  58.8% )
P5-233MHz    111.1665    129.7195    129.6685    ( -14.3% / -0.04% )

Times are in seconds, averaged over 5 trials; I've left the max/min values out, but they deviated only slightly. We are, of course, limited by the Windows timer resolution here, so it's not super-accurate, but good for comparison nonetheless. The percentages are the extra time relative to the 32-bit-only loop (negative means faster).

But yikes, look at the comparisons there. The P4 gets hit the hardest between 32/32 and 32/16, but doesn't lose nearly as much with the 16/16 as the other processors do. Except, and this is the most interesting part in my mind, the plain old Pentium. Very interesting. The saddest thing is that, clock for clock, the P5 outperforms the PII's with 16/16!! And its mixed mode is the fastest... I'd guess that 32-bit and 16-bit operations can coexist in the pipelines (or whatever, take separate pipes). Maybe out of x pipelines, y are reserved for 32-bit ops. Thus, two 16-bit operations can't be executed simultaneously on procs with only 2 pipes. Maybe some can't have the same width operation in more than one pipe (could explain the P5 performance). But I don't remember the architectures that well. Was the P5 the first with 2 pipelines? How many do P4's have?

All were run on Win2k. Plain processors (no xeons or anything :P)
-D

This post has been edited by LoneStar: 25 February 2003 - 01:42 AM


#17 User is offline   .:fl0yd:. 

  • Magnificent Member
  • PipPipPipPipPipPip
  • Group: Members
  • Posts: 355
  • Joined: 17-February 03

Posted 25 February 2003 - 03:12 AM

Funny thing you didn't disable interrupts. Also quite funny that you don't rely on rdtsc to do your timings. I'd rather count clock cycles than use a coarse timer function.

#18 User is offline   .:fl0yd:. 

  • Magnificent Member
  • PipPipPipPipPipPip
  • Group: Members
  • Posts: 355
  • Joined: 17-February 03

Posted 25 February 2003 - 03:23 AM

As an addition a quote from the Intel Architecture Optimization Reference Manual (p.2-8):

Quote

On Pentium II and Pentium III processors, when a 32-bit register (for example, eax) is read immediately after a 16- or 8-bit register (for example, al, ah, ax) is written, the read is stalled until the write retires, after a minimum of seven clock cycles.

Special cases of reading and writing small and large register pairs are implemented in PII/PIII's, but only for those 8bit-subregisters. No special cases are available for 16bit-subregisters.
