Gentoo Forums
-O3 Optimizations - Implications on portage:

nelsonwcf
Tux's lil' helper

Joined: 31 Oct 2012
Posts: 112

Posted: Thu Mar 16, 2017 3:48 am    Post subject: -O3 Optimizations - Implications on portage:

Hi,

I'm looking for a list of packages that work/don't work with -O3 in the Gentoo documentation, but I couldn't find anything. The only reference I could find was in the Gentoo Handbook, which mentions that using -O3 is not a good idea as some packages will have problems. However, on my ARM Banana Pi, my CFLAGS have included -O3 for more than a year without any problems. Are there any information sources available on this subject that are specific to Gentoo?

Thank you,
Nelson

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 7127
Location: almost Mile High in the USA

Posted: Thu Mar 16, 2017 5:34 am

Technically, programs that compile incorrectly with -O3 are a gcc bug.
However, as a lot of the -O3 optimizations are experimental, they can change from version to version, and once they prove stable, these optimizations will go to -O2.

I'd treat using -O3 as the equivalent of using ~arch ... Assumed unstable but likely will work.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?

Akkara
Administrator

Joined: 28 Mar 2006
Posts: 6691
Location: &akkara

Posted: Thu Mar 16, 2017 8:10 am

I don't recommend -O3 globally.

It often causes immense code-bloat as it unrolls every loop it can and vectorizes everything it can -- while at the same time keeping a copy of the scalar code and dynamically picking which one to use each time through, because it generally can't prove that the vectors will be aligned properly or that the iteration count will be an even multiple of the vector length.

... and topping it off, usually the only loops that are hyper-optimized in this way are initialization loops. Those tend to be the only ones simple enough for its heuristics to find something.

In the end, it often makes things slower because the reduced effectiveness of the cache swamps any performance benefit it might have otherwise achieved.

However, it can be an excellent flag to use on a per-file basis: after you've profiled the code, found the hot-spots; improved the algorithms as much as possible; reduced the data inter-dependencies as much as possible; peppered your argument lists with "restrict"s to indicate which pointers never alias with any other; attached __attribute__((...)) to indicate buffer alignments (and allocated the buffers for maximally friendly alignment); sprinkled whatever #pragmas further assist in conveying your intentions ... after all that, -O3, used when compiling the files thusly blessed (and only those files), can be an invaluable asset to getting a nice performance boost.

Try it for yourself: Use 'objdump' to look at the '.o's after the compilation stage has finished, or, better yet, pass -S as part of CFLAGS, inspect the generated assembler, and compare the -O, -O2 and -O3 versions. (But don't expect emerge to complete successfully if you add -S to make.conf :) )
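
The comparison suggested above can be tried outside of emerge with a few lines of shell. This is only a rough sketch: the file name saxpy.c and the function in it are made up for illustration, and line counts and section sizes are crude proxies for code volume. The results depend on the gcc version and the target CPU, so no particular outcome is implied.

```shell
#!/bin/sh
# Compare what gcc emits for one small loop at -O2 and at -O3.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical test function: a simple loop whose pointers are marked
# "restrict", i.e. the caller promises they never alias.
cat > saxpy.c <<'EOF'
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
EOF

if command -v gcc >/dev/null 2>&1; then
    for level in O2 O3; do
        # -S stops after generating assembler; -c builds an object file.
        gcc -std=c99 "-$level" -S -o "saxpy-$level.s" saxpy.c
        gcc -std=c99 "-$level" -c -o "saxpy-$level.o" saxpy.c
    done
    # Line counts of the assembler output, one line per optimization level.
    wc -l saxpy-O2.s saxpy-O3.s
    # Section sizes of each object, if binutils' size(1) is installed.
    if command -v size >/dev/null 2>&1; then
        size saxpy-O2.o saxpy-O3.o
    fi
else
    echo "gcc not found; skipping comparison" >&2
fi
```

Reading saxpy-O2.s and saxpy-O3.s side by side (or running objdump -d on the two .o files) shows whether the higher level kept a scalar copy of the loop next to a vectorized one.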
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43178
Location: 56N 3W

Posted: Thu Mar 16, 2017 9:40 am

nelsonwcf,

-O3, if it works, can produce slower code than -O2 or -Os.
The compiler makes the code bigger to eliminate instructions, particularly branch instructions that add nothing to solving the problem.
This bigger code no longer fits into the CPU cache which increases cache evictions, cache misses and fetches from much slower main memory.
As a result of this 'cache thrashing' execution slows down.

ARM CPUs are not noted for huge CPU caches, so a global -O3 is probably counterproductive.
A few apps may benefit but the only way to find out is to compare -O2, -O3 and -Os.

Donald Knuth wrote:
Premature Optimization

_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

nelsonwcf
Tux's lil' helper

Joined: 31 Oct 2012
Posts: 112

Posted: Thu Mar 16, 2017 12:47 pm

Very nice answers, guys. Thank you very much. Will test changing the -O3 to -O2 in my ARMv7.

As an additional but related question, is it worth changing from gcc to icc in Gentoo (not for ARM, obviously)? Since the main benefit from Gentoo is to have the packages optimized to your system, I'm guessing that it would be possible to get an additional boost by using icc. Is my assumption correct?

Thank you again!

Yamakuzure
Advocate

Joined: 21 Jun 2006
Posts: 2273
Location: Bardowick, Germany

Posted: Thu Mar 16, 2017 3:34 pm

According to https://software.intel.com/en-us/forums/intel-c-compiler/topic/327585 the answer is no.

Generally speaking, you can get real speed gains using icc if, and only if, the source code is written in the right way.

However, here are some real numbers:
http://insights.dice.com/2013/11/04/speed-test-comparing-intel-c-gnu-c-and-llvm-clang-compilers/
_________________
Important German:
  1. "Aha" - German reaction to pretend that you are really interested while giving no f*ck.
  2. "Tja" - German reaction to the apocalypse, nuclear war, an alien invasion or no bread in the house.

nelsonwcf
Tux's lil' helper

Joined: 31 Oct 2012
Posts: 112

Posted: Thu Mar 16, 2017 6:50 pm

Hi Yamakuzure,

I saw these posts as well, but they are old and consider only specific applications. I'm looking for some insight on more current versions and on using it as the general compiler for Portage in Gentoo. Obviously, not all packages can be compiled with icc due to different "dialects" of C (this is especially true for GLIBC and GCC).

In fact, if it were possible to set icc as the general compiler but force Portage to use gcc on a per-package basis (or the other way around), that would be a great solution. However, I don't know if there is any simple way to do that in Gentoo, which is why I'm looking for an up-to-date list of Gentoo packages that are known to work/not work with icc. Newer benchmarks would also be useful, but I couldn't find any.
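
For reference, Portage does have a per-package override mechanism: an env file under /etc/portage/env, mapped to package atoms in /etc/portage/package.env. The sketch below writes such a pair into a scratch directory rather than the live configuration. The file name gcc-o3.conf, the confdir variable and the example atom are all arbitrary choices for illustration, and whether a given ebuild honours CC/CXX (or builds with icc at all) is not something this shows.

```shell
#!/bin/sh
# Sketch: per-package compiler and flag overrides, written to a scratch
# directory. Point confdir at /etc/portage only after reviewing the result.
set -eu

confdir=${confdir:-$(mktemp -d)}
mkdir -p "$confdir/env"

# Env file (arbitrary name): variables applied only to the mapped packages.
cat > "$confdir/env/gcc-o3.conf" <<'EOF'
CC="gcc"
CXX="g++"
CFLAGS="-march=native -O3 -pipe"
CXXFLAGS="-march=native -O3 -pipe"
EOF

# package.env maps package atoms to env files, one mapping per line.
# The atom below is only an example.
cat >> "$confdir/package.env" <<'EOF'
media-libs/x264 gcc-o3.conf
EOF

echo "wrote overrides under $confdir"
```

The same layout works in the other direction: keep the global settings in make.conf and list only the exceptions in package.env.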

Thank you again.

Drone4four
Apprentice

Joined: 09 May 2006
Posts: 247

Posted: Sun Mar 19, 2017 2:04 am

Now if only we had a compiler which used GPUs instead of CPUs, then we could put to good use the 3584 CUDA cores potentially at our disposal. I wonder how long a GPU-based compiler would take to build the Linux kernel or the GNOME DE.

What an unrealistic fantasy! har har
_________________
My rig:
IBM Personal System/2 Model 30-286 - - Intel 80286 (16 bit) 10 Mhz - - 1MB DRAM - - Integrated VGA Display adapter
1.44MB capacity Floppy Disk - - PS/2 keyboard (no mouse)

axl
Guru

Joined: 11 Oct 2002
Posts: 536
Location: Romania

Posted: Sun Mar 19, 2017 2:35 am

-O3 is not as experimental as it used to be back when gcc was age 2.

these stories, it's funny. it's like the chinese whispers game. in my country it's called the telephone without the wire game.

https://en.wikipedia.org/wiki/Chinese_whispers

some packages will not compile with -O3. right now, chromium is the only one i know.

and yes, binaries will be phater. bigger. and it's a really bad idea for an arm platform.

it only makes sense when you have a fast / or /usr storage. like an m2 ssd.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 7127
Location: almost Mile High in the USA

Posted: Sun Mar 19, 2017 6:44 am

Well, if the compilation (or runtime speed) breaks with -O3, wouldn't that mean it's not quite prime time and thus "experimental"? Not only the cache size hit, -O3 may generate very slow code sequences for x86 too; no way to tell without trying (or knowing what your code is and what gcc does with the code).

Until the day gcc can automatically tell what optimizations are best during static code analysis and always generate the fastest/smallest/code compatible with anything, the optimizations in -O3 are just experimental - experiment with it, it could go one way or the other, and badly.

To be safe for most cases, simply use -O2 - where the gcc developers deem the optimizations tend to not cause worst case behavior. YMMV.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?

Akkara
Administrator

Joined: 28 Mar 2006
Posts: 6691
Location: &akkara

Posted: Sun Mar 19, 2017 8:06 am

Drone4four wrote:
Now if only we had a compiler which used GPUs instead of CPUs, then we could put to good use the 3584 CUDA cores potentially at our disposal. I wonder how long a GPU-based compiler would take to build the Linux kernel or the GNOME DE.

I don't think it will make much of a difference. Compilation isn't usually the bottleneck.

At work, I compile on a 16-core (32-threads) xeon-based server with more than enough RAM. And it just doesn't seem to make much of a difference for most packages, compared to a 4-core laptop (that has a strong fan blowing at it during the deed). Factor of 2 faster... maybe.

Some things are fast. Kernels take 30-35 seconds give or take. Emerging GCC itself runs in about 15-20 minutes, if I recall.

But for most packages, the vast majority of the time seems to be spent in autoconf and related tools. And those are woefully serial:
    Checking for fabs... ok
    Checking for fstat... ok
    Checking that fstat works... ok
    ...
It goes on and on and on, multiple hundreds of such questions, asked and answered at a rate of a handful per second. Every package asks nearly the same set of questions, and they all (hopefully!) receive the same set of answers. Then there's a blip of compilation and a moment later that's finished, and then libtool starts up doing its thing, serially, followed by emerge itself, serially installing what has been built. (This last one likely needs to be serial.)

I've even tried giving preposterous --jobs= numbers to emerge. Ran one with --jobs=300 or similar silliness not that long ago, trying to accelerate an emerge -e @world. It starts off well: doing ~30 or so in parallel. But it soon hits these long strings of serial dependencies and it's back to one at a time again, with an occasional break where it might find 2 or 3 to do at once.

I've often wondered whether there's some way of caching that. Not like compiler-cache, but an autoconf-cache, one that works across packages. It won't be easy: it needs to be smart enough to know to clear out and re-do the checks for things provided by the package that was just merged. But we'd need to somehow break the autoconf bottleneck before there's a serious reduction in the end-to-end merge time.
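
As a point of reference, autoconf itself has the raw ingredients for this: a generated configure script sources the file named by the CONFIG_SITE environment variable, and that file may set cache_file. The sketch below only writes such a site file to a scratch directory; the sitedir variable and the cache path are arbitrary choices. It does nothing about the invalidation problem described above, so stale answers can leak between packages, and it is meant for a hand-run ./configure rather than as a Portage setting.

```shell
#!/bin/sh
# Sketch: share one configure cache between runs via an autoconf site file.
set -eu

sitedir=${sitedir:-$(mktemp -d)}

# configure sources the file named by CONFIG_SITE before running its checks.
# Redirect the cache only when none was chosen (configure's default is
# /dev/null, meaning "no cache").
cat > "$sitedir/config.site" <<'EOF'
if test "$cache_file" = /dev/null; then
    cache_file="${TMPDIR:-/tmp}/shared-config.cache"
fi
EOF

echo "To try it: CONFIG_SITE=$sitedir/config.site ./configure"
```

For a single package, ./configure -C (short for --config-cache) gives the same effect with a config.cache file in the build directory.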

eccerr0r wrote:
Well, if the compilation (or runtime speed) breaks with -O3, wouldn't that mean it's not quite prime time and thus "experimental"? Not only the cache size hit, -O3 may generate very slow code sequences for x86 too; no way to tell without trying (or knowing what your code is and what gcc does with the code).

I don't think experimental is the right word. -O3 generally works and does what it says in the manual. It just happens that what it does isn't usually applicable or appropriate to do without thinking. It is an excellent flag to use within a package's makefile, where the developer has measured and peppered it at just the right places. It is a bad flag to use globally, because 90+% of the code out there is either run-once initialization, or debugging printfs, both of which benefit more from space optimization than from speed.

What you're asking for is for the compiler to somehow know what transformations to apply where. Would be nice if it could. Maybe it gets there someday. With automated profiling and similar tools it might be possible to come up with something. But even then, it'll be up to you to give it relevant test-cases, so that it profiles and optimizes the things that actually matter. And is coming up with relevant test cases significantly easier than picking flags according to your intuition and seeing how they do? Not an easy problem.
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.

Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 5761

Posted: Sun Mar 19, 2017 6:24 pm

Autoconf is pretty awful. Is confcache still maintained nowadays? I don't see it in portage any more.

Mind you, Portage itself can be just as bad at times... that "resolving dependencies" spinner is often 50% of the time spent installing single packages for me.

Roman_Gruber
Advocate

Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Sun Mar 19, 2017 6:35 pm

Don't some ebuilds strip out bad optimizations anyway?

I have used this for quite a while:
Quote:
CFLAGS="-march=native -O2 -pipe -fomit-frame-pointer"


In the old days, playing with those flags mattered to me.

Since moving to an Ivy Bridge i7 + SSD + 16GB RAM + tmpfs for building my stuff, it does not really matter anymore.

Shaving 2 minutes off the libreoffice build time is not really worth the tinkering.

What matters these days: smaller + regular full system backups

frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Sun Mar 19, 2017 8:37 pm

-march=native pretty much eliminated the need for custom CFLAGS

used to be you had to look up the correct safe cflags for your processor, now the compiler does it for you. yay.

-O3 makes for slower binaries, sometimes. I once had a broken system like that.
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Installing Gentoo All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum