Gentoo Forums
Testing your CFLAGS
_never_
Apprentice

Joined: 10 Jun 2004
Posts: 285
Location: BW, Germany

Posted: Fri Feb 25, 2005 8:44 am    Post subject: Testing your CFLAGS

Testing your CFLAGS

I have found many introductions to CFLAGS and suggestions on how to set them for various architectures. What I'm providing here is a way to actually test your CFLAGS. For this purpose I have written a small test program in C. Don't look for any deeper meaning in it. It's just a test program written so that compiler flags have a noticeable influence.


1. Preparing

First create some test directory and change to it:

Code:
mkdir ~/cftest
cd ~/cftest


Now you need to save the source code of the test program. For that type:

Code:
cat > cftest.c


Then paste the following into the terminal:

Code:
#include <math.h>
#include <stdio.h>
#include <time.h>


#define ITERATIONS (64)
#define USIZE (sizeof(unsigned) * 8)


struct teststruc {
  double a;
  float b;
  unsigned long long c;
  unsigned d;
};


int func2(char *str, size_t strl) {
  int i, k;
  short int f = 1;

  for (i = 0; i < 16; i++) {
    f += str[i % strl];
  }
  for (i = 0; i < strl; i++) {
    for (k = i; k < 64; k++) {
      str[k % strl]++;
    }
    for (k = 23; k < 35; k++) {
      str[(k * 3) % strl]--;
    }
    for (k = 14; k < 29; k++) {
      str[k % strl]--;
    }
    f += str[i] * str[(i + 1) % strl];
  }
  return (f >= 0 ? f % 13 : -f % 13);
}


double func1(double res, double *ref) {
  struct teststruc t;
  double tmp1;
  unsigned long long tmp2;
  unsigned rv;
  int i, k;

  t.a = 0.6532;
  t.b = 1;
  t.c = 13;
  t.d = time(0);
  for (i = 0; i < 5273; i++) {
    for (k = 0; k < 16; k++) {
      t.a *= 1.1212;
    }
    t.a += t.b * *ref;
    t.b *= res * 52.3734;
    t.c += t.a * t.b * res;
    tmp1 = t.a;
    t.a = fmod(t.b, 62.63);
    t.b = fmod(tmp1, 74.12);
    for (k = 0; k < 16; k++) {
      t.a *= 1.4141;
    }
    rv = 1 + t.c % 26;
    t.d += i;
    tmp2 = t.d >> rv;
    t.d <<= (USIZE - rv);
    t.d |= tmp2;
    t.a *= func2((char *)&t.c, sizeof(t.c));
  }
  return t.a * t.b + t.c + t.d;
}


int main() {
  double res = 1.541;
  double ref = 2.631;
  int i;

  for (i = 0; i < ITERATIONS; i++) {
    res = func1(res, &ref);
    res = fmod(res, ref);
  }
  printf("%lg %lg\n", res, ref);
  return 0;
}


After pasting, press the return key and then the EOF key (Ctrl+D in most terminals). If you don't get a command prompt back, press the EOF key again. If you still don't, press the interrupt key (Ctrl+C in most terminals). Then compile the program:

Code:
gcc -o cftest cftest.c -lm


If you don't get any output, everything went fine. Now it's important that the system is otherwise idle. Close all background programs like P2P software. The less system load you have, the more accurate the timing is. Even moving the mouse or typing on the keyboard costs CPU time. While the test program runs, don't do anything else. Don't even move the mouse.

Run the test program with:

Code:
time ./cftest


When it finishes, it prints two numbers and, below them, timing statistics. You can ignore the two numbers, but the times are important. The program is intentionally slow. If it takes less than 20 seconds, open the source file ~/cftest/cftest.c in your favorite editor and increase the value of ITERATIONS from 64 to something higher. If it takes very long (more than one minute, say), decrease it (you can abort the program with the interrupt key, Ctrl+C). Then compile it again and time it again using the two commands above.

Now what you really need is the user time. It should be nearly equal to the real time. If it isn't, then you have too much system load. Remember: don't even move the mouse while running the test program. In bash it looks something like this:

Code:
$ time ./cftest
0.802501 2.631

real    0m8.954s
user    0m8.700s
sys     0m0.022s
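
To put a number on "nearly equal", the user time can be divided by the real time. The following is a minimal sketch in POSIX shell with awk, not part of the original test. The helper names to_seconds and user_share are made up for illustration, and the values are the ones from the sample output above.

```shell
# Convert a value in the "XmY.YYYs" format printed by time into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print the user time as a fraction of the real time.
user_share() {
  awk -v u="$(to_seconds "$1")" -v r="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", u / r }'
}

user_share 0m8.700s 0m8.954s   # prints 0.97
```

A result close to 1.00 suggests the machine was mostly idle during the run. Where exactly to draw the line is a judgment call.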


Save the user time somewhere, like in a text file. When this is done, close the editor if it's still open and don't change the source code again.


2. Running the tests

What you will try here is to increase program speed and therefore decrease the time it takes to finish. Set some CFLAGS to test with the following command:

Code:
CFLAGS="<flags>"


Of course I don't mean the CFLAGS-line in your make.conf file. Just enter this command into your shell. Example:

Code:
CFLAGS="-O2 -fomit-frame-pointer"


Then type the following:

Code:
gcc $CFLAGS -o cftest cftest.c -lm && time ./cftest


Now compare the user time with the one you got earlier. You can experiment with other CFLAGS. Just set them with the command above and run the test again. Try to decrease the user time as much as possible. If a flag doesn't make any difference, don't use it (most flags slow down the compiler). Keep in mind that the timings vary a little from run to run. Take into account only the first one or two fractional digits of the user time, or run the test with the same flags multiple times.
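
Running the same binary several times can be scripted. This is a rough sketch, assuming a POSIX shell. The helper names user_seconds and best_of are hypothetical, and instead of parsing the output of time it reads the children's user time from the shell's times builtin. Taking the lowest of several runs is one common way to filter out disturbances, not the only one.

```shell
# Print the user CPU time, in seconds, used by one run of the given command.
# Inside the subshell, the second line printed by `times` holds the
# accumulated user and system time of its child processes.
user_seconds() {
  ( "$@" >/dev/null 2>&1; times ) |
    awk 'NR == 2 { split($1, p, /[ms]/); printf "%.3f\n", p[1] * 60 + p[2] }'
}

# Run the command N times and print the lowest user time seen.
best_of() {
  runs=$1
  shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    user_seconds "$@"
    i=$((i + 1))
  done | sort -n | head -n 1
}

# Example, after compiling with the flags under test:
#   best_of 3 ./cftest
```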

Here are some flags you can experiment with:


  • -O1, -O2, -O3 (only one of them)
  • -mmmx, -m3dnow (if your machine supports them; check the flags in /proc/cpuinfo)
  • -msse, -mfpmath=sse (if your machine supports SSE; check the flags in /proc/cpuinfo)
  • -march=<architecture> (example: -march=pentium3)
  • -fomit-frame-pointer (you will want to include this anyway)
  • -funroll-loops (will make a noticeable difference, but produces large binaries)
  • -maccumulate-outgoing-args (produces large binaries without a really noticeable difference in speed)
  • -ftracer (doesn't seem to do anything; it's a rather new flag - don't use it)
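
The /proc/cpuinfo check mentioned in the list can be done by eye, or with a small helper like the one below. This is only a sketch for Linux on x86. The function name has_cpu_flag and its optional second argument (an alternative file, handy for trying it out) are assumptions for illustration.

```shell
# Succeed if the named CPU flag appears on a "flags" line of the cpuinfo file.
# The second argument defaults to /proc/cpuinfo.
has_cpu_flag() {
  grep '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null |
    tr -s ' \t:' '\n' |
    grep -qx -- "$1"
}

for flag in mmx sse sse2 3dnow; do
  if has_cpu_flag "$flag"; then
    echo "$flag: listed"
  else
    echo "$flag: not listed"
  fi
done
```

Matching whole words matters here, because a plain substring search for sse would also hit sse2.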


The -pipe option does not produce faster code. It just speeds up the compilation process, so include it.


3. Even more flags (advanced users)

If you are an advanced user, you can find additional flags with descriptions in the GCC info pages with the following command:

Code:
info gcc


Position the cursor on the text "Invoking GCC::" and press the return key. Then scroll down, position the cursor on "Optimize Options::" and press the return key again. What you get then is a list of C compiler flags (CFLAGS) related to code optimization. If you want to go even further, you can set submodel specific options. Press the L key and then select "Submodel Options::". Choose your submodel there ("i386 and x86-64 Options::" for most users). You can leave the info program with the Q key.

More advanced users (like programmers) can even look for other options in the info pages. Some of them are specific to C++ (those about templates and classes). You will never want to set C++ specific flags globally in your make.conf. However, if you really know what you are doing and want to set them anyway, they belong in CXXFLAGS in your make.conf, not in CFLAGS. You cannot test C++ flags with this test program as it's a C program. If you wanted to set a C++ flag "-fsome-flag" in your make.conf, it should look like this:

Code:
CXXFLAGS="$CFLAGS -fsome-flag"


However, you should not set any compiler options besides optimization flags and submodel options. If you have found a set of flags that produces fast code on your architecture, put them into your make.conf file.


4. Applying the new CFLAGS

After changing your make.conf, your system will still use the old flags until you recompile it. If you don't want to recompile the entire system, my suggestion is to recompile the following packages (in that order):


  1. gcc
  2. glibc
  3. binutils
  4. coreutils
  5. xorg-x11 (if you use it)
  6. gtk+ (if you use it)
  7. qt (if you use it)


Then recompile the programs you care most about, like your window manager and the programs you work with. Another way to apply the new CFLAGS to the most important packages:

Code:
emerge -e system


This will take some time. If you are even more patient, you can recompile your entire system:

Code:
emerge -e world


If you have lots of packages installed, you will be waiting a few days; especially when you have lots of KDE/Qt programs installed.

5. Removing the test program

When everything is done, delete the directory of the test program:

Code:
cd ..
rm cftest/cftest*
rmdir cftest


6. Enjoy your new CFLAGS
_________________
Knowledge is Power.

adaptr
Watchman

Joined: 06 Oct 2002
Posts: 6730
Location: Rotterdam, Netherlands

Posted: Fri Feb 25, 2005 9:07 pm

Okay, I've given it a go, and have some troubling data to report.

First, some info:
Pentium-3 1000MHz / 256MB PC133 / ASUS CUSL2-C
gcc 3.4.3 NPTL / kernel 2.6.10-r6

My portage flags are -O2 -march=pentium3 -fomit-frame-pointer -pipe; the entire system was built with them (stage 1 on 3).

All tests were compiled with gcc -v to check the flags in use.
Each test was run 3 times to eliminate momentary differences.

I started out with no flags, got 14.5 seconds user time - NOTE gcc defaults to -mtune=pentiumpro, probably because CHOST=686 and it copies that, but I'm not sure.

Next I put in my make.conf flags, which reduced it to 10.2 seconds user time - almost 30% less, amazing really.

Next I thought, okay, but can I reduce it further ?
So I put in -O2 and ran it 3 times again - still 10.2 seconds...
Okay, nudge it up to -O3 ... nope, still 10.2 seconds, all with minor variations of less than 0.05 seconds.

This was strange, so I decided to test it more thoroughly.
First, strip all flags again, and this time force it to build for i386 - back up to 14.5 seconds, the same as for pentium-3.
Add the -O flag back in, test again - 10.2 seconds again!
Exactly the same test time, without any optimisations above the 386!

Getting weirder now... I started putting in extra flags that, according to the gcc man page, should half break my system and half actually increase the run time: -mfpmath=sse, -fforce-addr, -funroll-all-loops, and some more stuff I forgot...

Shortening a confusing story considerably, my conclusions are worrying to say the least:

- no matter what extra optimisation flags I set, the run time is 30% less with -O than without it.
- increasing the optimisations to -O2 or -O3 has a negligible effect compared to that 30% speed increase.
- I saw no difference at all from any processor-specific optimisations, nor did enabling mmx or sse give any noticeable performance gains

This I find most puzzling, especially since a good part of this test code is mathematical.

The only sensible explanation I can find for these results is one you won't like one bit: the test is biased or flawed to such an extent that you cannot actually use it to test optimisations beyond the basic on or off, -O1 or nothing.

Even so, nice to see that just adding that flag gave me a 30% boost - but again, probably only for this highly selective piece of code...

Comments ?
_________________
>>> emerge (3 of 7) mcse/70-293 to /
Essential tools: gentoolkit eix profuse screen

Rufy
n00b

Joined: 21 Aug 2003
Posts: 17

Posted: Fri Feb 25, 2005 10:15 pm

I've had similar results here with similar hardware (Pentium 3, 1 GHz, 512 MB). In fact, so far my tests have shown the best combination is "-O1 -march=pentium3 -fomit-frame-pointer", and adding additional flags has done no better.

I've run tests like this before, but with real applications like gzip and povray, and have seen very different results. So, without a deeper analysis of the code above, I'd say it's simply an isolated case of "less is more".

Also, optimizing applications like GCC and X involves more than testing for speedy math code. With GCC you're doing lexical analysis, table lookups, and other fun stuff. With X you've got tons of hardware-level calls going on, all dependent on how well supported your driver happens to be (and those of us running nvidia are stuck with their CFLAGS for the most part anyway, so not much to improve there).

A better approach to CFLAGS testing would be to build a test-bed that compiles and runs a predefined set of applications using various CFLAGS settings. Include apps that you would normally use in any given session for more accurate results.
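
A very small version of such a test-bed could look like the sketch below, assuming a POSIX shell and a single C source file per run. The function name bench_flag_sets, the CC variable defaulting to gcc, and the default binary name ./cftest.bin are assumptions made for this illustration. A real test-bed would loop over several applications and repeat each run.

```shell
# Read one set of flags per line from stdin, compile the source file with each
# set, run the result once and print its user time next to the flags.
bench_flag_sets() {
  src=$1
  bin=${2:-./cftest.bin}
  while IFS= read -r flags; do
    # $flags is left unquoted on purpose so that it splits into separate options.
    if ${CC:-gcc} $flags -o "$bin" "$src" -lm </dev/null; then
      # In the subshell, line 2 of `times` holds the child's user time.
      t=$( ( "$bin" </dev/null >/dev/null 2>&1; times ) | awk 'NR == 2 { print $1 }' )
      printf '%s\t%s\n' "$t" "$flags"
    else
      printf 'FAILED\t%s\n' "$flags"
    fi
  done
}

# Example (the flag sets are only illustrations):
#   printf '%s\n' '-O1' '-O2' '-O2 -fomit-frame-pointer' | bench_flag_sets cftest.c
```

The compiler and the binary get their input from /dev/null so that they cannot swallow the list of flag sets the loop is reading.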

adaptr
Watchman

Posted: Fri Feb 25, 2005 10:26 pm

Shouldn't gcc itself have a benchmark suite - or at least a conformity testing suite ?

ADDENDUM:

After thinking about this some more, this isn't actually all that simple to achieve...
Since to give a true benchmarkable solution one would have to re-flag every single dependency as well - including glibc! - it will be non-trivial to set up and execute.

Even if every library were built with the same set of flags, this would only be representative for that target system, since other systems might be built another way.

So either one would have to build one monolithic suite that is not dependent on any libraries - almost impossible I think - or a certain set of flags would be requisite for the pre-existing system to test against.

This would almost certainly require a distribution built specifically for this purpose, to enable people to fire up a CD and start testing their hardware with a wide range of flags.

Another option would be to incorporate the timing in the test suite itself; the flags the libraries were built with become less of an issue if just the internal portions of the test suite code are benchmarked.

_never_
Apprentice

Posted: Sat Feb 26, 2005 4:07 am

Well, that's what I meant. Mostly you won't notice any difference. The code is written in a way that almost any compiler flag can be tested, but unfortunately most flags make little difference. -O1 and -O2 are just simple aliases for a whole lot of flags, which you could also enable by hand. If you test them individually, the speed gain is very low. The only flag that causes a considerable speed boost is -funroll-loops, but it has the drawback of increasing code size a lot.

As far as I've seen, the -mmmx and -m3dnow flags aren't really used, until you write your code to do so (see "vector variables" in the gcc info pages). -msse is similar, but you can force SSE to be used for floating point arithmetic, if you set -mfpmath=sse as well.

Submodel specific optimization (like the difference between pentium3 and i386) is really only there to reduce cache misses, but the speed gain is not really noticeable. You might notice it a bit if you use -fno-inline-functions and then test -march=i386 vs. -march=pentium3. Here with my Duron, it makes a difference of less than 50 ms.

Well, this experiment should make clear that insane optimization does not make such a difference. For example, in Debian most packages are compiled with just -O2 and -fomit-frame-pointer, and this is about the highest optimization level you can get without having binaries of twice the size. I used to use -funroll-loops, because it really makes code faster, but then I realized that my whole system runs faster without it. Now I enable it for speed-critical apps only, like mplayer or games.

_never_
Apprentice

Posted: Sat Feb 26, 2005 4:20 am

Rufy wrote:
Also, optimizing applications like GCC and X involves more than testing for speedy math code. With GCC you're doing lexical analysis, table lookups, and other fun stuff. With X you've got tons of hardware-level calls going on, all dependent on how well supported your driver happens to be (and those of us running nvidia are stuck with their CFLAGS for the most part anyway, so not much to improve there).


You are absolutely right. And I have tried to write code that does most of those things. Table lookups are mostly just loops with string or integer comparison and so is lexical analysis. It sure has a lot more overhead than this little program, but it's essentially the same and you will get no different results. At least I didn't.

Maybe the best solution would be for the maintainer of a package to set individual optimization CFLAGS for every package. Then the user would not set them anymore, but only set submodel options (like -march or the like). If you still felt like having your own global optimization flags, you could still set them via make.conf/CFLAGS and they would override the maintainer's flags.

adaptr wrote:
Since to give a true benchmarkable solution one would have to re-flag every single dependency as well - including glibc! - it will be non-trivial to set up and execute.


This is not true. If you don't use any external functions (and I didn't - the compiler inlines the math functions), then glibc isn't even touched by this program. You will still need to have glibc (or another libc), but the code itself runs without it.

adaptr
Watchman

Posted: Sat Feb 26, 2005 7:33 pm

That's obvious nonsense - if your code does not need the math library then why does the linker want it ?

_never_
Apprentice

Posted: Sun Feb 27, 2005 10:35 pm

No, it is not nonsense. The glibc is used before entering main() and after leaving it (same as exit() - in fact, it's also the same technically). The math library is used, if you do not use -ffast-math. If I hadn't used math functions, then the actual test loop would run without the math library at all, but still not without glibc, because it is needed outside of the test loop.

adaptr
Watchman

Posted: Mon Feb 28, 2005 12:28 pm

Okay.

I'm not disputing what you say (I'm sure it's correct) but take a moment to think about this:

- any meaningful benchmark for CFLAGS must take into account real-world conditions, i.e. consist of the kind of code that people actually run on their machine every day.
- any non-trivial, not-optimised-for-benchmarks program will spend up to 50% of its time in either one or more of the system libraries or in kernel calls - that's what they're for.

The key word in the above is meaningful - people tend to be impressed by raw figures of how fast your system can compute pi to a gazillion decimal places, but as a measure of your system's overall speed it's pretty useless.

Not benchmarking glibc and the kernel with your chosen CFLAGS is therefore unrealistic to say the least (I'm not gonna say silly ;-))

So yes, to effectively benchmark a given, known system against a set of CFLAGS you will have to rebuild the system.

Anything less just isn't a representative benchmark.

(As an aside, I understand that this would be prohibitively costly to set up and run, especially when you want to test out multiple flag combinations in sequence... sort of rules it out for your idea of a quick'n'dirty CFLAG tester, eh ? ;-))

_never_
Apprentice

Posted: Mon Feb 28, 2005 12:49 pm

adaptr wrote:
(As an aside, I understand that this would be prohibitively costly to set up and run, especially when you want to test out multiple flag combinations in sequence... sort of rules it out for your idea of a quick'n'dirty CFLAG tester, eh ? ;-))


And that's the purpose. =)

Sure, it can't replace a system benchmark, but this test gives users an idea of how much a specific flag does change things. The stuff that gets optimized, is algorithms and iterations and so this is the most realistic quick'n'dirty test.

Well, the whole story about the CFLAGS - it's just hype. Even worse than the hype around antivirus software and firewalls under Windows. I had enough of nonsense like "Use -funroll-loops, it makes your system a lot faster", so I wanted to show what CFLAGS actually change. You cannot have much more than 25% speed gain over the unoptimized state and this is a maximum. In the real world, you won't even get that much.

adaptr
Watchman

Posted: Mon Feb 28, 2005 3:17 pm

I'm not so sure of that - or, at least, I would start reading the gcc mailing archives for discussions about this, of which I can assure you there are plenty...

I'm having trouble accepting that the difference between -O0 and -O1 always gives the same improvement - regardless of any other optimisation flags.
I'm even more skeptical when I see from your code that using -O2 or -O3 actually decreases performance - every single time, again regardless of other flags.

They might as well not have bothered then, and yet the gcc code base is - like so many things GNU - jointly developed by hundreds of people.

I have a hard time believing nobody would have noticed this before now.

NewBlackDak
Guru

Joined: 02 Nov 2003
Posts: 512
Location: Utah County, UT

Posted: Mon Feb 28, 2005 6:50 pm

This is on the Athlon system clocked at default (cleaned my watercooling out, so it's on air until I get it back in this weekend)
CFLAGS=""
1.6531 2.631

real 0m6.832s
user 0m6.823s
sys 0m0.008s

CFLAGS="-march=athlon-xp -O3 -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"
0.48502 2.631

real 0m5.282s
user 0m5.275s
sys 0m0.006s

CFLAGS="-march=athlon-xp -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"
0.156335 2.631

real 0m6.842s
user 0m6.806s
sys 0m0.008s

CFLAGS="-march=athlon-xp -O2 -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"
0.711392 2.631

real 0m5.373s
user 0m5.328s
sys 0m0.010s
_________________
Gentoo systems.
X2 4200+@2.6 - Athy
X2 3600+ - Myth
UltraSparc5 440 - sparcy

_never_
Apprentice

Posted: Tue Mar 01, 2005 12:33 pm

adaptr wrote:
I'm not so sure of that - or, at least, I would start reading the gcc mailing archives for discussions about this, of which I can assure you there are plenty...

I'm having trouble accepting that the difference between -O0 and -O1 always gives the same improvement - regardless of any other optimisation flags.
I'm even more skeptical when I see from your code that using -O2 or -O3 actually decreases performance - every single time, again regardless of other flags.

They might as well not have bothered then, and yet the gcc code base is - like so many things GNU - jointly developed by hundreds of people.

I have a hard time believing nobody would have noticed this before now.


I have never said that -O3 actually increases code speed. I just suggested experimenting with the flags. And yes, depending on your architecture setting (-march) it may even decrease code speed. This might be a reason why most distributions use -O2 instead. One thing I fully agree with you on is that there can't be a single program to test compiler flags with. But this test gives users a brief idea of what has what influence to the code size and speed. And that is the purpose.

NewBlackDak wrote:
CFLAGS="-march=athlon-xp -O3 -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"


You would never want to use -ffast-math globally. It may cause some programs to fail. Read the GCC info pages for more information.

adaptr
Watchman

Posted: Tue Mar 01, 2005 12:41 pm

Quote:
But this test gives users a brief idea of what has what influence to the code size and speed

Yes, and the one and only conclusion every test seems to show is that -O1 gives 25-30% gains over -O0.
On any platform, regardless of other optimisation.

So you could just have ended your test article with saying: use -O and don't bother with anything else ;-)

_never_
Apprentice

Posted: Tue Mar 01, 2005 1:38 pm

adaptr wrote:
Quote:
But this test gives users a brief idea of what has what influence to the code size and speed

Yes, and the one and only conclusion every test seems to show is that -O1 gives 25-30% gains over -O0.
On any platform, regardless of other optimisation.

So you could just have ended your test article with saying: use -O and don't bother with anything else ;-)


Expressed in other words, actually. =)

Well, in fact I would recommend -O2 instead of -O. It's a good deal between code size and speed.

adaptr
Watchman

Posted: Tue Mar 01, 2005 1:56 pm

Well, I wouldn't know about that - as I said, my tests indicate that there is a zero difference between -O2 and -O3.
However, the difference between -O1 and -O2 is on the order of -5%.

_never_
Apprentice

Posted: Tue Mar 01, 2005 2:00 pm

adaptr wrote:
Well, I wouldn't know about that - as I said, my tests indicate that there is a zero difference between -O2 and -O3.
However, the difference between -O1 and -O2 is on the order of -5%.


Yes, but -O3 increases code size a lot because of function inlining, so I wouldn't recommend using it. It made code faster on older processors, but current processors handle function calls very well, so there isn't much need for inlining.

adaptr
Watchman

Posted: Tue Mar 01, 2005 2:23 pm

You're still missing my point : -O2 is consistently 5% slower than -O1 in every combination.
So why enable it ?

_never_
Apprentice

Posted: Tue Mar 01, 2005 2:44 pm

adaptr wrote:
You're still missing my point : -O2 is consistently 5% slower than -O1 in every combination.
So why enable it ?


Even without your -march setting?

adaptr
Watchman

Posted: Tue Mar 01, 2005 3:35 pm

For any arch; I've tested it - as shown above - with pentium3, i686 and i386 - no difference whatsoever.

thebigslide
l33t

Joined: 23 Dec 2004
Posts: 790
Location: under a car or on top of a keyboard

Posted: Tue Mar 01, 2005 6:27 pm

Bumping -march won't make a difference. MMX was more of a marketing thing than anything else, SSE isn't used by that code. 3dnow isn't either. The only processor-specific optimization gcc can make on that code is alignment (minimal). There aren't enough loops and recursive functions and other things that are optimized by the tweaky flags. Packages work the same way, though. I don't think it's that simple to make a benchmark that will tell you the best flags to use with any particular package, because each package is written a little differently and will take different optimizations in different ways. Perhaps a good way would be a script that takes different combinations of CFLAGS, repeatedly compiles a given package's source files and times a subsequent execution to determine this on a case by case basis? Just a thought.
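
The "different combinations" part of such a script could be generated along these lines. It is only a sketch in POSIX shell with awk. The function name flag_combinations is made up, and the flags in the example are placeholders, not recommendations.

```shell
# Print every combination of one -O level with any subset of the extra flags,
# one combination per line.
flag_combinations() {
  levels=$1
  extras=$2
  for level in $levels; do
    sets=$level
    for extra in $extras; do
      # Every set collected so far is kept once without and once with the flag.
      sets=$(printf '%s\n' "$sets" | awk -v e="$extra" '{ print; print $0 " " e }')
    done
    printf '%s\n' "$sets"
  done
}

flag_combinations "-O1 -O2" "-fomit-frame-pointer -funroll-loops"
```

The number of lines doubles with every extra flag (here 2 levels times 4 subsets, so 8 lines), which is why testing every combination quickly becomes expensive. Each line could then be used as one CFLAGS value for a compile-and-time run.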

My specs: Athlon-XP 2.4GHz w/ 256k of L2 cache
Iterations = 1024
GCC 3.4.3-r1

Code:
CFLAGS="-march=pentium-mmx -O1 -pipe";gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest                                                         
0.0250746 2.631

real    1m18.980s
user    1m18.973s
sys     0m0.008s


and optimized:
Code:
CFLAGS="-march=athlon -O3 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse"; gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest
0.636471 2.631

real    1m17.489s
user    1m17.480s
sys     0m0.008s


now again w/o SSE:
Code:
CFLAGS="-march=athlon -O3 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387" gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest
1.61745 2.631

real    1m17.519s
user    1m17.512s
sys     0m0.006s

and the difference there is negligible, but repeatably apparent. Now to see if the -O level does anything for us:
Code:
CFLAGS="-march=athlon -O1 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse";gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest
0.619176 2.631

real    1m18.379s
user    1m18.372s
sys     0m0.007s


I think those results speak for themselves.
I think it would use the SSE registers more if one did some math with matrices of floats.

In summation, I get about 2% optimization max.
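
For reference, the percentages can be worked out from the user times above (converted to seconds) with a small awk helper. This is a sketch, and the name percent_saved is made up for illustration.

```shell
# Print how much shorter the second time is than the first, in percent.
percent_saved() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f%%\n", (before - after) / before * 100 }'
}

percent_saved 78.973 77.480   # pentium-mmx -O1 vs. athlon -O3 with SSE: prints 1.9%
percent_saved 78.973 78.372   # pentium-mmx -O1 vs. athlon -O1 with SSE: prints 0.8%
```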


Last edited by thebigslide on Wed Mar 02, 2005 2:05 am; edited 2 times in total

adaptr
Watchman

Posted: Tue Mar 01, 2005 6:55 pm

Erm.. according to the gcc manual using -mfpmath=sse forces it to use SSE for anything that floats...
_________________
>>> emerge (3 of 7) mcse/70-293 to /
Essential tools: gentoolkit eix profuse screen
Back to top
View user's profile Send private message
_never_
Apprentice


Joined: 10 Jun 2004
Posts: 285
Location: BW, Germany

PostPosted: Tue Mar 01, 2005 9:25 pm    Post subject: Reply with quote

adaptr wrote:
For any arch; I've tested it - as shown above - with pentium3, i686 and i386 - no difference whatsoever.


I own a Duron 1.6 GHz. This isn't true for me. On my system it improves code speed, but just a few milliseconds. It also reduces code size a bit (not only for this program, but for a few others as well), so I set -O2. Might be different for other architectures.

thebigslide wrote:
MMX was more of a marketing thing than anything else, SSE isn't used by that code. 3dnow isn't either.


MMX doesn't make any difference. Even with a little assembler code I wrote, which makes use of saturated addition (an MMX feature) on large blocks, it didn't increase code speed at all. I guess Intel implemented MMX just for advertising purposes. I have tested this on many architectures, even non-Intel ones.

SSE is used as soon as you add -mfpmath=sse or -mfpmath=sse,387. However, it doesn't make much difference either. 3dnow isn't used, that's right, but it might be used in future versions of GCC.

Currently MMX and 3dnow are only used if you work with vector variables, which this code intentionally doesn't use, since there are very few packages that use them.

thebigslide wrote:
There aren't enough loops and recursive functions and other things that are optimized by the tweaky flags.


There are enough loops to make the flags optimize (or deoptimize) noticeably. The code optimizer doesn't handle recursion specifically.

And, thebigslide, you made one mistake when performing the test. Your commands just run gcc with empty CFLAGS. Test this command sequence:

Code:
ABC="test1"
ABC="test2" echo $ABC
ABC="test3" echo $ABC


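You'll get "test1" both times, because the shell expands $ABC before the temporary assignment takes effect, and that assignment only applies to the environment of the command it precedes. If I tell you to set the CFLAGS variable separately, I have some good reason to do so.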
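
To see the three cases side by side, here is a small sketch. The variable names before, child and after are just for illustration.

```shell
ABC="test1"

# $ABC is expanded before the prefixed assignment applies,
# so echo still receives the old value.
before=$(ABC="test2" echo "$ABC")

# The prefixed assignment does reach the environment of the child
# process, which expands $ABC itself.
child=$(ABC="test2" sh -c 'echo "$ABC"')

# A separate statement changes the shell variable for what follows.
ABC="test3"
after=$ABC

echo "$before $child $after"   # prints: test1 test2 test3
```

The same applies to gcc: with CFLAGS="..." on the same line and no semicolon, $CFLAGS on the gcc command line is whatever the shell had before, and gcc itself does not read CFLAGS from its environment.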
_________________
Knowledge is Power.
thebigslide
l33t


Joined: 23 Dec 2004
Posts: 790
Location: under a car or on top of a keyboard

PostPosted: Wed Mar 02, 2005 1:22 am    Post subject: Reply with quote

adaptr wrote:
Erm.. according to the gcc manual using -mfpmath=sse forces it to use SSE for anything that floats...


True, it will use the registers, but the benefit of SSE is doing math on several values packed into a register simultaneously. This won't happen with the above code. All our results show this; otherwise turning on SSE would give a 20-25% boost in performance (my off-the-cuff estimate). Try recompiling mencoder with and without SSE and benchmarking it, and you'll see what I mean. SSE is a HUGE boost in float applications like encoding audio and video. Not sure how lame would react.
thebigslide
l33t


Joined: 23 Dec 2004
Posts: 790
Location: under a car or on top of a keyboard

PostPosted: Wed Mar 02, 2005 2:36 am    Post subject: Reply with quote

It turns out Lame doesn't react to CFLAGS much (note that I axed -ffast-math, which is a default to favour more intensive use of float registers). The configure script actually hard-codes the flags in the makefiles, but this is easily overridden:

CFLAGS="-march=athlon-xp -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse"
time lame -mj -t -S -h -cbr -b 64 -Y mt\ -\ i\ don\'t\ wanna\ hear\ it.wav mt\ -\ i\ don\'t\ wanna\ hear\ it.mp3
LAME version 3.96.1 (http://lame.sourceforge.net/)
CPU features: MMX (ASM used), 3DNow! (ASM used), SSE
Resampling: input 44.1 kHz output 24 kHz
Using polyphase lowpass filter, transition band: 10935 Hz - 11226 Hz
Encoding mt - i don't wanna hear it.wav to mt - i don't wanna hear it.mp3
Encoding as 24 kHz 64 kbps j-stereo MPEG-2 Layer III (12x) qval=2

real 0m4.151s
user 0m4.100s (+/-.015 over 10 trials)
sys 0m0.030s

CFLAGS="-march=athlon-xp -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387"
LAME version 3.96.1 (http://lame.sourceforge.net/)
CPU features: MMX (ASM used), 3DNow! (ASM used), SSE
Resampling: input 44.1 kHz output 24 kHz
Using polyphase lowpass filter, transition band: 10935 Hz - 11226 Hz
Encoding mt - i don't wanna hear it.wav to mt - i don't wanna hear it.mp3
Encoding as 24 kHz 64 kbps j-stereo MPEG-2 Layer III (12x) qval=2

real 0m4.148s
user 0m4.100s (+/-.025 over 10 trials)
sys 0m0.031s

No difference

CFLAGS="-march=pentium-mmx -O1 -s -w -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387"
LAME version 3.96.1 (http://lame.sourceforge.net/)
CPU features: MMX (ASM used), 3DNow! (ASM used), SSE
Resampling: input 44.1 kHz output 24 kHz
Using polyphase lowpass filter, transition band: 10935 Hz - 11226 Hz
Encoding mt - i don't wanna hear it.wav to mt - i don't wanna hear it.mp3
Encoding as 24 kHz 64 kbps j-stereo MPEG-2 Layer III (12x) qval=2

real 0m4.350s
user 0m4.282s (+/-.020! over 10 trials)
sys 0m0.031s

Not sure why it's still saying "ASM used" for 3DNow!, but it seems to have been affected by the CFLAGS setting this time.

EDIT: I botched the first run through this by not doing a make clean after the first test :oops:
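
For anyone who wants to produce "+/- over N trials" figures like the ones above, here is one way to summarize a list of timings. It is only a sketch: summarize is a made-up name, the spread is assumed to be half the min-max range (the post doesn't say how it was computed), and the three sample values are invented for the demo, not measurements.

```shell
# Read timings in seconds, one per line, and print the mean together
# with half of the min-max range as the spread.
summarize() {
  awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END {
      if (NR) printf "%.3f +/- %.3f over %d trials\n",
                     sum / NR, (max - min) / 2, NR
    }
  '
}

# Invented sample values, just to show the output format.
printf '%s\n' 4.085 4.100 4.115 | summarize
# prints: 4.100 +/- 0.015 over 3 trials
```

Feed it the "user" values from your own runs, and remember to rebuild from clean between flag sets.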


Last edited by thebigslide on Wed Mar 02, 2005 4:01 am; edited 1 time in total