Gentoo Forums
Making full use of cpu registers in CFLAGS
Gnufsh
Guru
Joined: 28 Dec 2002
Posts: 400
Location: Portland, OR

Posted: Mon Sep 27, 2004 1:22 pm

I'm using simply:
CFLAGS="-march=athlon-xp -O2 -fomit-frame-pointer -pipe"
right now on my athlon-xp with gcc-3.4.2-r2.
Nate_S
Guru
Joined: 18 Mar 2004
Posts: 414

Posted: Mon Sep 27, 2004 8:43 pm

For those of you with athlon-xps: I've heard that -mfpmath=sse can actually slow things down. The reason is that AMD's sse implementation is kind of weak compared to intel's (it's there so that it can run precompiled sse code); however, it has one monster of a 387 coprocessor. (I have not tried both of them at once.)

-Nate
ktm
Tux's lil' helper
Joined: 13 Aug 2004
Posts: 144
Location: Denmark

Posted: Mon Sep 27, 2004 10:36 pm

The only way to find out which cflags are best for your cpu is to do some kind of benchmarking. In the big cflags thread, CFLAGS Central https://forums.gentoo.org/viewtopic.php?t=5717&start=775&postdays=0&postorder=asc, someone called blackcat4 made a script to benchmark different cflags: http://blackcat.ca/dist/bench_gcc

I used the script to benchmark a Pentium II 366MHz (old IBM Thinkpad 570) with lame (the mp3 encoder). The script compiles lame using one set of cflags, then runs lame three times converting a wav file to mp3. After this, it does the same again, just using different cflags.
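
The compile-then-time loop described here can be sketched in a few lines. This is not the bench_gcc script itself (which is not reproduced in this thread), just a minimal Python illustration of the timing half; the names `time_command` and `summarize`, and the stand-in workload, are assumptions made for the example.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (best, mean); the best run is the one least disturbed by noise."""
    return min(timings), sum(timings) / len(timings)


if __name__ == "__main__":
    # Stand-in workload. In a real comparison this would be the encoder built
    # with one set of cflags, e.g. ["./lame", "test.wav", "out.mp3"]
    # (hypothetical paths), rebuilt and re-timed for each flag set.
    cmd = [sys.executable, "-c", "sum(range(100000))"]
    best, mean = summarize(time_command(cmd))
    print(f"best {best:.3f}s  mean {mean:.3f}s")
```

Comparing the best of three runs per flag set, rather than a single run, makes it less likely that a difference of a few percent is just background noise.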

I tried about 15 different cflags, and to my surprise almost all of them just made the run time slower, including -pipe and -funroll-loops. I found that the fastest cflags for my old Pentium II are:

Code:
 -O3 -march=pentium2 -fomit-frame-pointer -ffast-math


Some say that -ffast-math is a risky option, but so far everything has worked just fine.

I also tested with the -mmmx that TheCoop claims to be faster, but on my system it made lame run about 3% slower.

I'm sure most of the cflags people suggest are faster, but only on newer cpus. I'm gonna benchmark my p4 system soon to find out.
chickaroo
Tux's lil' helper
Joined: 21 Sep 2004
Posts: 102
Location: #!/usr/bin/girl

Posted: Sat Oct 02, 2004 5:23 am

wow i've read every post here and now i'm even more confused as to what i should use lol. i've read the gcc manual about 5 times... and also other sources. my latest CFLAGS have been:

Code:
CFLAGS="-O3 -march=athlon-xp -m3dnow -msse -mmmx -mfpmath=sse,387 -funroll-loops -fforce-addr -ffast-math -fprefetch-loop-arrays -pipe -ftracer -fomit-frame-pointer -finline-limit=800"


and

Code:
CFLAGS="-O3 -march=athlon-xp -m3dnow -msse -mmmx -mfpmath=sse -funroll-loops -fpeel-loops -funit-at-a-time -fforce-addr -ffast-math -fprefetch-loop-arrays -pipe -ftracer -fomit-frame-pointer -finline-limit=1200"


i've been curious about -mfpmath. that's actually the keyword that brought me here. I was wondering if the Athlon XP has separate execution units for sse and 387

it says that -mfpmath=sse should produce considerably faster code, and sse,387 uses BOTH. however, does that mean that some of the code will not be optimized for sse and some only for 387? i have no idea.

i haven't done much benchmarking yet, but i tend to agree with whoever said that the athlon xp has a hell of a 387 fpu, but i don't see how sse could slow stuff down (maybe the athlon xp has a poor sse?). i might try -mfpmath=387 (or is that default? shouldn't hurt even if it is) or maybe it would be better if it uses both...

i'm also thinking of dropping the -funroll-loops (and maybe some others?) because i do think that the major bottleneck in systems is the I/O so actually larger code may be slower. my monster of a cpu (Athlon XP 2600+ mobile barton overclocked to 2.6GHz) should be able to handle slightly less optimized code.

i'm just all confused now, especially with -falign-functions, -falign-jumps and -falign-loops.

not much info in man gcc about those. maybe someone can clear some things up for me? or give me some opinions? i think i'm gonna wait before changing cflags and doing emerge -e world. that takes about 40 hours.
_________________
Registered Linux user #364515 (Jun, 2004)
Warped_Dragon
Tux's lil' helper
Joined: 16 Sep 2004
Posts: 149
Location: Canada Eh?

Posted: Sun Oct 03, 2004 4:56 am

I clicked this thread looking for some ways to further optimize my CFLAGS and/or LDFLAGS. Now I'm just confused :/ Is there a list anywhere, listing CPU types and suggested flags to be used with them?

For instance, mine are:

Code:

CFLAGS="-march=i686 -pipe -O3 -fomit-frame-pointer"
LDFLAGS="-Wl,-z,now -Wl,-O1 -Wl,--relax -Wl,--enable-new-dtags -Wl,--sort-common -s"


My chip is an amd Duron. Is i686 what I should be using for march? I've seen athlon used, and athlon-xp, but can't find out what a duron should be. Oh, and to get back to the 'original' topic, my cat /proc/cpuinfo output:

Code:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 7
model name      : AMD Duron(tm) processor
stepping        : 0
cpu MHz         : 1002.275
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 1998.84


So, assuming I am supposed to be using i686, I should add "-mmx -sse -mmxext -3dnowext -3dnow" to my CFLAGS?
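
A rough way to answer the "what does my CPU support" half of this question is to read the flags line programmatically. The sketch below is an illustration only: the function names are made up for the example, and it merely reports what the kernel lists without choosing any CFLAGS. Note that GCC spells the corresponding switches -mmmx, -msse and -m3dnow.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature names on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_summary(flags):
    """Map each SIMD extension of interest to whether the CPU reports it."""
    wanted = ("mmx", "mmxext", "sse", "sse2", "3dnow", "3dnowext")
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    for name, present in simd_summary(cpu_flags(text)).items():
        print(f"{name:9s} {'yes' if present else 'no'}")
```

Fed the cpuinfo output quoted in this post, the summary would report sse as present and sse2 as absent.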
chickaroo
Tux's lil' helper
Joined: 21 Sep 2004
Posts: 102
Location: #!/usr/bin/girl

Posted: Sun Oct 03, 2004 5:49 am

well, there are different types of durons... some based on the athlon and some based on the athlon-xp. that's a 1GHz so i would be guessing it's an athlon. so i'd suggest -march=athlon
_________________
Registered Linux user #364515 (Jun, 2004)
Warped_Dragon
Tux's lil' helper
Joined: 16 Sep 2004
Posts: 149
Location: Canada Eh?

Posted: Sun Oct 03, 2004 5:50 am

Ah. I didn't even know what it was based on :/ Thanks :)
Nate_S
Guru
Joined: 18 Mar 2004
Posts: 414

Posted: Sun Oct 03, 2004 10:24 am

chickaroo:

AFAIK there are separate execution units for 387 code and sse code. Also AFAIK, the athlon does have poor sse compared to a pentium of the same class (and its 387 is better than the same pentium's). However, which one will actually be faster really, really depends on the code in question. For a P4, it's safe to always set it to sse, whereas for an athlon, some code may do better with it set to 387 or vice versa.

My understanding (and I could be way off) is that using both 387 and sse optimizes the code for both, such that it can execute them in parallel, effectively doubling your floating point processing power. However, this often causes very weird results, and some code will crash and burn with this.

I'd advise against dropping -funroll-loops: your startup times may get faster, since there's less for the app to load into memory, but running the app will most likely get slower, as once it's in memory you want free registers and quick execution.

Man I wish I could find the thread I was given this advice in, as the guy there states it far more eloquently than me, but search hates me. :(

-Nate
LockeAverame
Tux's lil' helper
Joined: 14 Jul 2003
Posts: 108

Posted: Sun Oct 03, 2004 11:29 pm

some people never learn:
first: there are no cflags which are perfect for every program, because the compiler mostly uses heuristics for optimization decisions and gcc doesn't have a very good register allocator.

second: your duron is based on the morgan core (you can tell from the cpu frequency and the sse support), so use -march=athlon-xp.

third: binaries mostly get at most 5-10% bigger going from -O2 to -O3, which for most binaries means an increase of about 100KB. believe me, it takes longer for the hdd to find the chunks on the drive than to actually load them (hdds today read about 30-40MB per second, but seeking and non-sequential read/write are quite slow).
the problem with bigger binaries mostly lies in higher cache usage, and cache is quite limited (even though 256KB to 512KB is quite a lot).

fourth: -march activates mmx, sse and 3dnow where appropriate, so don't bother with these flags unless you don't use -march.

fifth: in nbench and freebench i get a performance increase of 3-4% with -mfpmath=sse on an athlon-xp compared to i387 (which is the default). sse only handles single precision (32-bit), so it can only be used for float, not for double (which is what's mostly used), so you mostly don't see a big improvement, especially as gcc-3.3 and 3.4 don't use sse in vectorized mode (gcc-4.0 will). using i387 and sse in parallel doesn't gain much (nearly nothing) and is very risky, i wouldn't use it. sse2 handles double precision (64-bit values in 128-bit registers); it gives more predictable results than i387 with its intermediates of up to 80 bits, and has fewer problems with exceptions. gcc devels mostly prefer to use sse2 as i remember, but some sse2 instructions are buggy, so hope for a fix.
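
The float versus double distinction in the fifth point is easy to see numerically. The sketch below is only an illustration of how much precision a single-precision float keeps compared to a double; it says nothing about speed, and `to_float32` is a helper name invented for the example.

```python
import struct


def to_float32(x):
    """Round a Python float (an IEEE double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


if __name__ == "__main__":
    for value in (0.5, 0.1, 1.0 / 3.0):
        single = to_float32(value)
        print(f"double {value!r}  single {single!r}  "
              f"error {abs(single - value):.3e}")
```

Values such as 0.5 survive the round trip unchanged, while 0.1 picks up an error in roughly the ninth significant digit, which is why code written around double cannot simply be moved onto single-precision hardware.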
Warped_Dragon
Tux's lil' helper
Joined: 16 Sep 2004
Posts: 149
Location: Canada Eh?

Posted: Mon Oct 04, 2004 5:02 am

1) I'm well aware of that. I wasn't asking for a 'perfect' set of flags that would give every program I run a 500% speedboost. Or even a 10% speedboost. Or 0.2%. *reads over his first post* Wait a tic, I wasn't asking for a 'perfect' set of CFLAGS at all, just if my current ones were taking full advantage of the different optimization options supported by my CPU. I'll assume your first comment wasn't directed at me. Many apologies :)

2) Thank you. This is the only AMD system I've ever used, all my experience is with Intel. Now I've got one suggestion for -march=athlon and one for -march=athlon-xp. Can you point me to some documentation so I can confirm/deny this myself? In fact, I'll start at the AMD site.

EDIT - Found something. Quote is from http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_1260_1202%5E1018,00.html#9947

Quote:

Q: How is the AMD Duron processor different from the AMD Athlon™ processor? Is the AMD Duron processor merely a stripped down version of the AMD Athlon processor?

A: The AMD Duron processor is a derivative of the state-of-the-art AMD Athlon processor. Although the two processors are related, there are key differences in the CPUs and the platforms designed to support them, reflecting the requirements of their target markets. Specifically, the AMD Athlon processor is available for users who demand the highest level of application performance and features more full-speed, on-chip cache memory. The AMD Duron processor was designed to consume less power than the AMD Athlon processor, thereby enabling lower cost systems. Additionally, AMD Duron processor-based PCs are likely to employ lower cost memory and graphics solutions, including low cost DDR memory and Unified Memory Architecture (UMA) graphics.


I'm still looking for the actual tech specs to back up their claim. This system is about three years old (jan 2001). Were athlon-xp's being sold then?

EDIT #2: Found specs. No mention of a Duron being based on the athlon-xp though :/ Athlon is mentioned a few times, but it doesn't say that it's based on an athlon either. Link is http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24310.pdf if anyone cares. I'm inclined to say screw it and just use -march=athlon

3) So the increase in binary size more than offsets the speed increase in most cases?

4) Ok, so I can drop those flags and still have them activated by march. That's good.

5) 3-4% is better than nothing at all. I'll add that flag if I confirm that my duron should use -march=athlon-xp. Would it result in a similar increase on a regular athlon?
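
On point 3, the figures quoted earlier in the thread (about 100 KB of growth, 30-40 MB/s sustained reads) can be turned into a quick back-of-the-envelope estimate. This is a sketch, not a measurement: the 10 ms seek time is an assumption added for comparison and does not come from the thread, and `read_time_ms` is a made-up helper name.

```python
def read_time_ms(extra_bytes, throughput_bytes_per_s):
    """Milliseconds needed to stream extra_bytes at a sustained throughput."""
    return extra_bytes / throughput_bytes_per_s * 1000.0


if __name__ == "__main__":
    extra = 100 * 1024            # ~100 KB growth, as quoted in the thread
    slow = 30 * 1024 * 1024       # 30 MB/s, low end of the quoted range
    fast = 40 * 1024 * 1024       # 40 MB/s, high end of the quoted range
    seek_ms = 10.0                # ASSUMPTION: rough seek time, not from the thread
    print(f"extra read time: {read_time_ms(extra, fast):.2f}-"
          f"{read_time_ms(extra, slow):.2f} ms, "
          f"versus ~{seek_ms:.0f} ms for one assumed seek")
```

Under these assumptions the extra read costs only a few milliseconds per load, less than one assumed seek, which is consistent with the earlier remark that any cost of larger binaries shows up in cache usage rather than load time.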
Nate_S
Guru
Joined: 18 Mar 2004
Posts: 414

Posted: Mon Oct 04, 2004 6:38 am

While doing your research you'll want to look at what core your proc is based on. I believe Palomino is the first athlon xp core, that's what's in my athlon-xp 1800; if it's not that, it's prolly an athlon.

Yes, generally -O3 is considered by most to be faster than -Os or -O2. Also, I think -funroll-loops (though not -funroll-all-loops) is worth it.

I'd like to point out that redundant flags really don't hurt anything.

As I mentioned, whether 387 or sse is better on the athlon depends heavily on the type of app used; I don't know that a benchmark would give a good overall estimate.

Also, sse2 is only on P4s and 64 bit AMD chips, right? Or did some of the later athlon-xps have it?

-Nate
chickaroo
Tux's lil' helper
Joined: 21 Sep 2004
Posts: 102
Location: #!/usr/bin/girl

Posted: Mon Oct 04, 2004 10:57 am

Nate_S wrote:

Also, sse2 is only on P4s and 64 bit AMD chips, right? Or did some of the later athlon-xps have it?


the K8 (Athlon 64) chips were the first AMD chips to include SSE2 support.
_________________
Registered Linux user #364515 (Jun, 2004)
LockeAverame
Tux's lil' helper
Joined: 14 Jul 2003
Posts: 108

Posted: Mon Oct 04, 2004 5:16 pm

your duron has sse support, so it's a morgan core, which is derived from the palomino core but with less cache (64kb instead of 256kb L2 cache); that's the only difference (the older athlons didn't have sse, and neither did the durons of those days ^^). -march=athlon-xp works perfectly (i had this duron myself, so believe it). if you don't want to believe it then read the white papers for your stepping of the cpu (cat /proc/cpuinfo will tell you the stepping).
sse2 was never available on the athlon-xp and below; it was first integrated in the k8 (opteron/athlon64). anyone who says otherwise is talking bullshit.
well, no benchmark can test every possible code snippet. i used nbench and freebench and both give a 3-4% increase with -mfpmath=sse and nothing more with -mfpmath=sse,387.
only apps which use float benefit (float is rarely used in standard software, more often in games and codecs).
redundant flags don't hurt but don't do anything useful either, they only pollute your bash output during compilation ^^.
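
The rule of thumb used in this thread (SSE present means the Palomino/Morgan generation, so athlon-xp; no SSE means an older core) can be written down as a tiny helper. This is a sketch under that single assumption: `k7_march` is a hypothetical name, and the function does not verify that the CPU is an AMD K7 at all.

```python
def k7_march(flags):
    """Pick a -march value for a K7-family chip from its cpuinfo flags.

    Assumption: the caller already knows the CPU is an Athlon or Duron.
    The only test applied is the one from the thread: SSE support separates
    the Palomino/Morgan generation from the earlier cores.
    """
    return "athlon-xp" if "sse" in flags else "athlon-tbird"


if __name__ == "__main__":
    # Flags line of the Duron discussed in this thread.
    duron = ("fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat "
             "pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow").split()
    print(f"-march={k7_march(set(duron))}")
```

Both return values are -march names that appear elsewhere in this thread; anything finer-grained (plain athlon, athlon-4, athlon-mp) would need more than the flags line.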
headgap
n00b
Joined: 19 Nov 2004
Posts: 39
Location: between the ears

Posted: Sat Dec 18, 2004 12:42 am

ok, my two cents. from a developer. yes, i know it says n00b, but trust me,
i'm a developer :)

the nice thing about being a 'retired' developer, you get to find bugs, and
not be obliged to fix them...

here's the deal with speed optimizations:

there's an old saw in the industry: programs spend 90% of their time in 10%
of the code. there's no point trying to optimize the other 90% of the code, it
won't get you any advantage. developers typically take the stand-point:
simpler is better, and don't do any optimization until release-time.
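
The 90/10 argument is an instance of Amdahl's law, which can be worked through with a couple of numbers. A minimal sketch, assuming the 90%/10% split is taken literally; `overall_speedup` is a name chosen for this example.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speedup of the whole program when `fraction` of its
    run time is made `factor` times faster."""
    if not 0.0 <= fraction <= 1.0 or factor <= 0.0:
        raise ValueError("fraction must be in [0, 1] and factor positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Doubling the speed of the hot code (90% of the run time) ...
    print(f"hot path  x2: {overall_speedup(0.90, 2.0):.2f}x overall")
    # ... versus doubling the speed of everything else (10% of the run time).
    print(f"cold code x2: {overall_speedup(0.10, 2.0):.2f}x overall")
```

Making the cold 90% of the code twice as fast buys only about 5% overall, whereas the same effort on the hot 10% gives roughly 1.8x, which is the point of the old saw.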

when it comes to optimizing code, you can: a) use a better (faster) algorithm,
b) code bits in assembler, c) do both. i once wrote a message passing queue
based system entirely in pentium assembler (about 20k of asm, vs 120k of C).
it ran 10x faster than the portable C code did, and tripled the development
time.

if you take a really close look at the compile process for the sources, you'll
notice individual packages very selectively using additional compile flags,
such as glibc: -freorder-blocks. this is the developer's way of descending in
a portable manner partially to the assembler level. gaming code and real-time
instrumentation contain way more assembler, and device drivers are almost
pure assembler: because that's where speed counts.

any program requiring feedback from the user, or that needs to hit slow
storage (disk systems, networks) won't benefit significantly from speed optimization (meaning, more than 10-15% increase).

*unless*

it's development code, the bugs are still being worked out, the code base
hasn't been finalized yet, so it doesn't make sense to optimize, since it might
get thrown away eventually.

since gentoo is providing bleeding edge packages, using higher levels of
optimization will tend to a) give that 10% increase, and b) break things.

since you're not the developer of the package, and have no idea what sort of code is being used, you can't pick the one or two special flags that would
make that code really fly. the best you can hope for is apply 'everything', and cross your fingers. optimizations that work for some code, won't do a thing for the next package. but in general, you'll see some sort of
improvement, on average, since the underlying code isn't already optimized.

so,

-O2 for CPUs with large L1 cache, -Os for those earlier ones (Coppermine and lower) with less L1.

-march, if you don't need portable code, will select hardware-specific features appropriate to your cpu

-ffast-math at your own peril, if you do anything with double-precision
floating point (financials, spreadsheets, DV, imaging, mJpegTools, et al)

and, selectively, very specific individual flags based on the nature of the source, on a per-file basis, if you know what you're trying to achieve, and
need your optimization to be as platform transparent as possible.
_________________
If at any time you find yourself on the side of the majority, it is time to reconsider your position - Mark Twain
evilshenaniganz
Tux's lil' helper
Joined: 18 Dec 2003
Posts: 107
Location: /dev/random

Posted: Mon Jan 24, 2005 11:56 pm

I've been having a helluva time with a K6-2 of mine. I would really like to get as much optimization out of it as I can. I notice I often have builds fail when I use
Code:
-march=k6-2 -O3 -pipe
The error message is saying it's a hardware problem, OS-related... Segmentation faults are common as well. When I back it off to
Code:
-march=k6-2 -O2 -pipe
these problems go away. Here's the skinny on my CPU. (For readability I decided to just link to a page instead of messing up the tabs posting it here in the forum.)

As you can see from the link, the chip is a k6-2 "chomper". I googled around and didn't find much of anything. Does anybody have any suggestions as to how I can optimize the hell out of my CFLAGS for this chip? Here are the CFLAGS I've been using that are giving me errors, and like I said, backing it off to -O2 takes care of them:
Code:

CHOST="i586-pc-linux-gnu"
CFLAGS="-march=k6-2 -O3 -pipe -fomit-frame-pointer -mfpmath=387 -mmmx -m3dnow -m128bit-long-double"
CXXFLAGS="${CFLAGS}"


Any suggestions, feedback, or other places to look would be greatly appreciated! Thanks! :)
evilshenaniganz
Tux's lil' helper
Joined: 18 Dec 2003
Posts: 107
Location: /dev/random

Posted: Tue Jan 25, 2005 12:04 am

Hey, it's me again...

I figure if I'm coming with a problem, I better bring something else to the table. I noticed a lot of people are asking about resources for finding out more about CFLAGS. Well, apart from Ye Olde CFLAGS Central, I have also found a few pages that might help someone out. Here they are:

http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html
http://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

HTH :D
schiznik
n00b
Joined: 01 Apr 2005
Posts: 66
Location: Gloucester, UK

Posted: Tue Apr 19, 2005 1:50 pm

Wow, I'm surprised no-one has mentioned the gentoo-wiki page http://gentoo-wiki.com/Safe_Cflags yet.

I use
Code:
CFLAGS="-march=athlon-xp -mtune=athlon-xp -O3 -pipe -fomit-frame-pointer -msse -mmmx -m3dnow"

on my athlon 2000+ XP.

I'm in the process of installing a copy of my main gentoo install onto a friend's pc (Duron 800 iirc - early duron, NOT a morgan-core Duron). His CFLAGS will probably be:
Code:
CFLAGS="-march=athlon-tbird -mtune=athlon-tbird -O3 -pipe -fomit-frame-pointer -mmmx -m3dnow"

I might change -O3 to -Os to save a little space on his relatively small HD.

Note: I use -mtune= instead of -mcpu= as I use an ~x86 toolchain on an x86 install (i.e. these machines are using gcc 3.4.x instead of 3.3.x) - adjust back to -mcpu= if you use an x86 toolchain.
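
The version dependence in this note can be captured in a small helper. This is a sketch assuming the compiler version is already known as a tuple; `tune_flag` is a hypothetical name. It relies on the fact that GCC 3.4 renamed the x86 -mcpu option to -mtune.

```python
def tune_flag(gcc_version, cpu):
    """Return the tuning switch for an x86 target.

    gcc_version is a tuple such as (3, 4, 2). GCC 3.4 introduced -mtune
    and deprecated the old x86 meaning of -mcpu.
    """
    name = "-mtune" if tuple(gcc_version[:2]) >= (3, 4) else "-mcpu"
    return f"{name}={cpu}"


if __name__ == "__main__":
    print(tune_flag((3, 4, 2), "athlon-xp"))     # gcc 3.4.x toolchain
    print(tune_flag((3, 3, 5), "athlon-tbird"))  # gcc 3.3.x toolchain
```

Since -march already implies tuning for the same CPU, repeating it in -mtune or -mcpu is redundant, though harmless.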

Edit: Old Durons don't support sse, removed -msse (and am starting to recompile.. again...)
idahoduk
n00b
Joined: 08 Nov 2006
Posts: 1

Posted: Wed Nov 08, 2006 4:51 am

Gnufsh wrote:
I'm using simply:
CFLAGS="-march=athlon-xp -O2 -fomit-frame-pointer -pipe"
right now on my athlon-xp with gcc-3.4.2-r2.


I've been trying to figure out what to set my flags to for my CPU; I have an AMD X2 3800. I'm a little confused after reading the last seven pages. Will this be a good option for enabling the SRS, 3DNOW etc...

CHOST="i686-pc-linux-gnu"
CFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer"
CXXFLAGS="${CFLAGS}"

This is from the Safe CFLAGS section on the Gentoo site; there seems to be some debate as to what this enables on the CPU. I did change the -O2 to -O3 since it's a dual core.

Thanks for the help guys, I came over to Gentoo from slack and Suse, I've been more than impressed with the documentation and the community you guys have here. I'm looking forward to learning more and getting to know everyone. THANKS!!!
_________________
Sager 17inch - AMD X2 - Nvidia 7800GTX - DDR400 - RAID (disabled) - And it's ultra portable!!!
I'm just waiting to completely delete XP - I've been windows free since TBD.
ebfe
n00b
Joined: 29 Jan 2006
Posts: 20

Posted: Wed Nov 08, 2006 11:58 am

moderators should delete or at least close such threads, as ALL of what has been posted here about cflags is bogus information and gives no speed increase at all. Read this last sentence again.