Gentoo Forums
[SOLVED] grep is too slow

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Thu Apr 10, 2008 4:23 am    Post subject: [SOLVED] grep is too slow

The command grep in my gentoo system is extremely slow.
Code:

grep -w 10.0.0.1 log > log1

This command just picks the lines containing the IP address 10.0.0.1 out of log and writes them to another file, log1. The log file is about 69M.
On Ubuntu and some other systems this usually takes only a few seconds, but on my Gentoo system it takes 5 minutes!
I am pretty sure DMA is enabled for the hard disk. Here is the output of hdparm -tT /dev/sda:
Code:

/dev/sda:
 Timing cached reads:   2066 MB in  2.00 seconds = 1033.05 MB/sec
 Timing buffered disk reads:  174 MB in  3.01 seconds =  57.72 MB/sec

I am pretty sure no other process is eating CPU, and everything else works normally.
Hardware: IBM T60. (I think that's fast enough for such a simple job.)

Is this a problem with my system, or is something wrong with grep in Gentoo?


Last edited by kidoln on Mon Apr 14, 2008 12:04 am; edited 3 times in total

alex.blackbit
Advocate
Joined: 26 Jul 2005
Posts: 2397

Posted: Thu Apr 10, 2008 5:37 am

on my aging firewall i tested what you want to do on my cron.log file, the hdd is not the fastest :lol:
Code:
wall ~ # hdparm -tT /dev/hda

/dev/hda:
 Timing cached reads:   276 MB in  2.01 seconds = 137.20 MB/sec
 Timing buffered disk reads:   38 MB in  3.15 seconds =  12.05 MB/sec
wall ~ #

Code:
mydate=`date`; grep -w 8995 /var/log/cron.log; echo $mydate; date

and it takes about 7 seconds when the stuff is not in the cache.
so you really seem to have some kind of problem, but i think it is very unlikely that it's grep's fault.
please describe your hardware setup.
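
For reference, the date-based timing above can be made to print the elapsed seconds directly. This is only a sketch: it assumes a `date` that supports `+%s` (GNU and BSD versions do), and the sample file and its contents are made up so that it runs anywhere.

```shell
# Throwaway sample file standing in for a real log (contents are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' 'job 8995 started' 'job 18995 started' 'job 8995 finished' > "$tmpdir/cron.log"

# Record the time in seconds since the epoch before and after the grep run.
start=$(date +%s)
count=$(grep -c -w 8995 "$tmpdir/cron.log")
end=$(date +%s)

echo "matching lines: $count, elapsed: $((end - start)) s"
rm -rf "$tmpdir"
```

Whole seconds are coarse. In bash, prefixing the command with the `time` keyword gives separate real, user and sys figures instead.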

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Thu Apr 10, 2008 6:38 pm

What do you mean by hardware setup?
IBM T60:
CPU Core Duo 1.6G,
RAM 1G,
Integrated Video Card.
Harddisk 40G (I am not sure what's the brand)

I just installed Gentoo by following the install manual, without much special configuration. The WM is Xfce with an xrandr extended screen.

Even when I don't start X, grep still takes that long.
I checked the CPU load with "top": no process is eating CPU and the hard disk is also idle.

I will download the grep 1.4 source code from the GNU website and see whether it is still slow.
My concern is: if it is not a grep problem, what else could cause this weird problem?

Is there any USE configuration that would affect grep's behaviour?

alex.blackbit
Advocate
Joined: 26 Jul 2005
Posts: 2397

Posted: Fri Apr 11, 2008 1:41 pm

kidoln wrote:
Is there any USE configuration would impact the grep behaviour?

i don't think so.
are there any other tools that work slower than normally?

marduk
Retired Dev
Joined: 20 Sep 2002
Posts: 74
Location: Raleigh, NC, USA

Posted: Fri Apr 11, 2008 2:29 pm    Post subject: Re: grep is too slow

kidoln wrote:
The command grep in my gentoo system is extremely slow.
Code:

grep -w 10.0.0.1 log > log1



If you are searching for IP addresses you should get faster results by using "grep -F" (or fgrep).
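
A minimal sketch of that suggestion, using a throwaway sample file (the file and its contents are made up for illustration). With `-F` the pattern is taken as a fixed string, so the dots are literal, and `-w` still restricts matches to whole words.

```shell
# Build a tiny sample log in a temporary directory (contents are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' \
    '1207783936.943767 Inf: 10.0.0.1 2059 4 12' \
    '1207783936.944374 Inf: 10.0.0.3 1539 43 64' \
    '1207783936.950000 Inf: 10a0b0c1 1 1 1' > "$tmpdir/log"

# As a regular expression, each "." matches any single character.
regex_count=$(grep -c -w '10.0.0.1' "$tmpdir/log")

# With -F the pattern is a literal string, so only the real address matches.
fixed_count=$(grep -c -F -w '10.0.0.1' "$tmpdir/log")

echo "regex: $regex_count, fixed: $fixed_count"
rm -rf "$tmpdir"
```

On this sample the regular-expression form also counts the `10a0b0c1` line, while the fixed-string form does not.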
_________________
http://starship.python.net/crew/marduk/

John R. Graham
Administrator
Joined: 08 Mar 2005
Posts: 10342
Location: Somewhere over Atlanta, Georgia

Posted: Fri Apr 11, 2008 3:38 pm

It's probably not a problem with grep. Here's timing on a lightweight Dell desktop, although the log file is only about 19MiB:
Code:
vesta log # time grep '192\.133\.199\.40' messages >/dev/null

real    0m1.903s
user    0m1.896s
sys     0m0.008s
And here's how it is built on my system:
Code:
vesta log # emerge -1vp grep

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   R   ] sys-apps/grep-2.5.1a-r1  USE="nls pcre -static" 0 kB

Total: 1 package (1 reinstall), Size of downloads: 0 kB
Could you just time the copying of that file with "cp" as a performance test?

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Fri Apr 11, 2008 11:59 pm

I think there are some problems with my system.

The "hdparm -tT /dev/sda" output
Code:
/dev/sda:
 Timing cached reads:   2004 MB in  2.00 seconds = 1002.78 MB/sec
 Timing buffered disk reads:  100 MB in  3.05 seconds =  32.74 MB/sec

However, "hdparm /dev/sda" reports some errors.
Code:
/dev/sda:
 IO_support    =  0 (default 16-bit)
 HDIO_GET_UNMASKINTR failed: Inappropriate ioctl for device
 HDIO_GET_DMA failed: Inappropriate ioctl for device
 HDIO_GET_KEEPSETTINGS failed: Inappropriate ioctl for device
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 4864/255/63, sectors = 78140160, start = 0

Does this mean DMA is not enabled? I tried to measure the performance by copying a file.
"time cp log tmp", where log is 69M, reports:
Code:
real   0m4.120s
user   0m0.006s
sys   0m0.368s

That seems a little slow. I am really confused: why does hdparm report a relatively good speed while cp gets only a fair result?

Even so, there is no reason for grep to be as slow as this.
Code:
 time grep -w 10.0.0.1 log > tmp
real   4m29.960s
user   4m29.423s
sys   0m0.368s

the log file is 69M, and output file is 4.7M. Too slow....

John R. Graham
Administrator
Joined: 08 Mar 2005
Posts: 10342
Location: Somewhere over Atlanta, Georgia

Posted: Sat Apr 12, 2008 12:25 am

Those errors are normal when you use hdparm (designed for IDE drives) with a SATA drive. Check out sdparm. The file copy time doesn't seem too long, though. Also, please note that the "." character is a regular expression metacharacter and must be "escaped" if you want it to be matched literally. Try
Code:
grep '10\.0\.0\.1' log >tmp
Might just be that you're inadvertently matching a lot. :)

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

i92guboj
Bodhisattva
Joined: 30 Nov 2004
Posts: 10306
Location: Córdoba (Spain)

Posted: Sat Apr 12, 2008 2:14 am

I can't give exact results, since my logs never get that big (ever heard of logrotate? ;) ). I don't really know right now whether those times are too long or not.

Is the log in the ubuntu machine that big? If not, then the tests are invalid.

kidoln wrote:
I think there is some problems in my system.

The "hdparm -tT /dev/sda" output


As someone said above, hdparm is for IDE drives. You shouldn't be using it on SATA ones.

Quote:
It seems a little bit slower. I really confused why the hdparm report a relative good speed, but the cp got only a fair result?


Of course. Hdparm doesn't take the filesystem into account, while cp copies actual files, which might be fragmented and which ARE in a filesystem. In fact, hdparm -tT doesn't even need a filesystem. You can test block devices without having formatted them.

Quote:

the log file is 69M, and output file is 4.7M. Too slow....


That's not what I call a small log file, though it depends on what you are logging. The size of the results doesn't have anything to do with the timings. Most time is spent processing, I bet.

john_r_graham wrote:
Those errors are normal when you use hdparm (designed for IDE drives) with a SATA drive. Check out sdparm. The file copy time doesn't seem too long, though. Also, please note that the "." character is a regular expression metacharacter and must be "escaped" if you want it to be matched literally. Try
Code:
grep '10\.0\.0\.1' log >tmp
Might just be that you're inadvertently matching a lot. :)

- John


This is one of the main reasons for slowness: incorrect usage. It is not only that the output is bigger. It's that matching concrete strings is easier on your CPU. So, if you escape the dots as shown in the post above, you will probably get better results and you will get them faster. I really don't have such a long log file to test, so I can't tell whether those times are too long or not...
_________________
Gentoo Handbook | My website

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Sat Apr 12, 2008 4:39 am

john_r_graham wrote:
Those errors are normal when you use hdparm (designed for IDE drives) with a SATA drive. Check out sdparm. The file copy time doesn't seem too long, though. Also, please note that the "." character is a regular expression metacharacter and must be "escaped" if you want it to be matched literally. Try
Code:
grep '10\.0\.0\.1' log >tmp
Might just be that you're inadvertently matching a lot. :)

- John


The thing is, I processed the same log file on both an IBM T42 running Ubuntu and a Dell desktop running Ubuntu. Both finish the job in a few seconds. This is really abnormal.

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Sat Apr 12, 2008 5:05 am

I found a workaround, but the problem is still there.
I used awk to do the same job on the same log file on the same laptop.

It takes only 17s instead of 5 minutes! That is a reasonable time.

Here is the output from processing the same log file.
Code:
time awk '/10.0.0.1 /' log > tmp
real   0m17.806s
user   0m17.707s
sys   0m0.076s

I think it is very clear that the grep program does have a problem, at least on my system.

This is my package information:
Code:
sudo emerge -plv grep

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   R   ] sys-apps/grep-2.5.1a-r1  USE="nls pcre -static" 0 kB

Total: 1 package (1 reinstall), Size of downloads: 0 kB

John R. Graham
Administrator
Joined: 08 Mar 2005
Posts: 10342
Location: Somewhere over Atlanta, Georgia

Posted: Sat Apr 12, 2008 11:56 am

Okay, two things.
  • The comments on regular expression metacharacters still apply to awk. The regular expression "10.0.0.1" matches "10000001", "10102031", "10B0A0B1", and many other examples that I think you don't want to match. It's also a performance issue. Betcha awk is faster if you escape the periods.
  • I think you've demonstrated that there is a problem with grep. You're running the exact same version as I am, by the way. Please post your "emerge --info".
I also have one other suggestion, which I'll defer until I've seen the "emerge --info" output.
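
The first point can be checked in isolation. The sketch below feeds the example strings from that bullet through grep and awk; the variable names are made up for illustration and no log file is needed.

```shell
# Candidate strings, including the examples mentioned above.
samples=$(printf '%s\n' '10.0.0.1' '10000001' '10102031' '10B0A0B1')

# Unescaped: every "." is a wildcard, so all four lines match.
loose=$(printf '%s\n' "$samples" | grep -c '10.0.0.1')

# Escaped: only the literal dotted address matches.
strict=$(printf '%s\n' "$samples" | grep -c '10\.0\.0\.1')

# The same escaping works inside an awk regular expression.
strict_awk=$(printf '%s\n' "$samples" | awk '/10\.0\.0\.1/ { n++ } END { print n + 0 }')

echo "unescaped: $loose, escaped: $strict, awk escaped: $strict_awk"
```

This only shows the difference in what is matched. Whether escaping also changes the run time on a large file has to be measured on the real log.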

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

Akkara
Administrator
Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Sat Apr 12, 2008 12:42 pm

Some possibilities that come to mind, though I don't know whether they apply to your situation:


Is there an environment variable set that causes grep to operate in line-buffered or unbuffered mode?

Are you using a locale that perhaps isn't supported as efficiently as the others?

Did "grep" happen to be set to be an alias for something else?
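
A rough sketch of how these three questions could be checked from the affected shell. It assumes GNU grep of that era, which reads default options from the `GREP_OPTIONS` environment variable. The output differs from system to system, so none is shown.

```shell
# Report how the current shell resolves "grep": alias, function, or file on PATH.
grep_kind=$(type grep)
echo "$grep_kind"

# Show locale-related variables and GREP_OPTIONS, if any of them are set.
env | grep -E '^(LANG|LC_[A-Z_]+|GREP_OPTIONS)=' || echo "no locale variables or GREP_OPTIONS set"
```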

(Edit: retracting the following after noticing you tested the same expression with awk, which means this can't be it:
Quote:
Do you have /var in a separate partition? Is it very fragmented? Portage's what's-installed database is kept in /var and consists of many small files that get updated frequently. I imagine a 65MB log file that is incrementally appended to would probably not be very happy in the mix.
)

i92guboj
Bodhisattva
Joined: 30 Nov 2004
Posts: 10306
Location: Córdoba (Spain)

Posted: Sat Apr 12, 2008 1:39 pm

You can use strace to see in a generic fashion what grep is doing and what files it is trying to reach. Maybe there's some problem there. By now it's clear that something is wrong.

It could be a filesystem issue as well. I don't think that the problem is in grep itself, since it's a very well tested and widely used tool.
_________________
Gentoo Handbook | My website

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Sat Apr 12, 2008 4:04 pm

First of all, here is a piece of the log file. As you can see, the file is pretty simple, just a lot of lines.

Code:
1207783936.943767 Inf: 10.0.0.1 2059 4 12 1207783936.927028
1207783936.944374 Inf: 10.0.0.3 1539 43 64 1207783936.905640
1207783936.947331 Inf: 10.0.0.1 2059 5 12 1207783936.927219
1207783936.948472 Inf: 10.0.0.3 1539 45 64 1207783936.905935
1207783936.951397 Inf: 10.0.0.1 2059 6 12 1207783936.927405
1207783937.36383 Inf: 10.0.0.6 124 11 14 1207783937.21950
1207783937.576009 Inf: 10.0.0.3 1540 16 65 1207783937.548766
....................................................


Quote:

I think you've demonstrated that there is a problem with grep. You're running the exact same version as I am, by the way. Please post your "emerge --info".


Code:
 sudo emerge --info
Portage 2.1.4.4 (default-linux/x86/2007.0/desktop, gcc-4.1.2, glibc-2.6.1-r0, 2.6.24-gentoo-r4 i686)
=================================================================
System uname: 2.6.24-gentoo-r4 i686 Genuine Intel(R) CPU T2400 @ 1.83GHz
Timestamp of tree: Wed, 02 Apr 2008 23:45:01 +0000
app-shells/bash:     3.2_p17-r1
dev-lang/python:     2.4.4-r9
dev-python/pycrypto: 2.0.1-r6
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox:    1.2.18.1-r2
sys-devel/autoconf:  2.13, 2.61-r1
sys-devel/automake:  1.7.9-r1, 1.9.6-r2, 1.10
sys-devel/binutils:  2.18-r1
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool:   1.5.26
virtual/os-headers:  2.6.23-r3
ACCEPT_KEYWORDS="x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=prescott -O2 -pipe -msse3 -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c /etc/udev/rules.d"
CXXFLAGS="-march=prescott -O2 -pipe -msse3 -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="distlocks metadata-transfer sandbox sfperms strict unmerge-orphans userfetch"
GENTOO_MIRRORS="http://gentoo.chem.wisc.edu/gentoo/ http://lug.mtu.edu/gentoo/ "
LANG="en_US.UTF-8"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/layman/gentoo-china"
SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage"
USE="X acl acpi alsa arts bash-completion berkdb cairo cdr cli context cracklib crypt cscope cups dbus dri dvd dvdr dvdread eds emboss encode esd evo fam ffmpeg firefox fortran gdbm gif gpm graphics gstreamer gtk hal iconv isdnlog jpeg kerberos latex ldap mad madwifi midi mikmod mmx mp3 mpeg mudflap ncurses nls nptl nptlonly ogg opengl openmp oss pam pcre pdf perl plotutils png pppd python qt3support quicktime readline reflection science sdl session spell spl sse ssl startup-notification svg tcpd tetex tiff truetype unicode v4l v4l2 vim-syntax vorbis win32codecs x264 x86 xinerama xml xorg xrandr xscreensaver xv zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="keyboard mouse synaptics" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="i810"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

Quote:

Is there an environment variable set that causes grep to operate in line-buffered or unbuffered mode?


Code:
 env
LC_PAPER=en_US.UTF-8
MANPATH=/usr/local/share/man:/usr/share/man:/usr/share/binutils-data/i686-pc-linux-gnu/2.18/man:
/usr/share/gcc-data/i686-pc-linux-gnu/4.1.2/man:/usr/qt/3/doc/man
LC_ADDRESS=en_US.UTF-8
SSH_AGENT_PID=6246
LC_MONETARY=en_US.UTF-8
SHELL=/bin/bash
TERM=xterm
XDG_SESSION_COOKIE=
HUSHLOGIN=FALSE
WINDOWID=27263006
LC_NUMERIC=en_US.UTF-8
XIM_PROGRAM=fcitx
QTDIR=/usr/qt/3
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;
33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:su=37;41:sg=30;43:tw=30;42:
ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:
*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:
*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:
*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.mng=01;35:*.pcx=01;35:
*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:
*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:
*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.pdf=00;32:*.ps=00;32:*.txt=00;32:*.patch=00;32:
*.diff=00;32:*.log=00;32:*.tex=00;32:*.doc=00;32:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:
*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:
LC_TELEPHONE=en_US.UTF-8
GUILE_LOAD_PATH=/usr/share/guile/1.8
GDK_USE_XFT=1
SSH_AUTH_SOCK=/tmp/ssh-XCftSm6244/agent.6244
SESSION_MANAGER=local/apollo:/tmp/.ICE-unix/6254
PAGER=/usr/bin/less
CONFIG_PROTECT_MASK=/etc/udev/rules.d /etc/fonts/fonts.conf /etc/terminfo /etc/texmf/web2c /etc/revdep-rebuild
MAIL=/var/mail/nanli
PATH=.:/sbin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2:/usr/qt/3/bin
LC_MESSAGES=en_US.UTF-8
XIM=fcitx
LC_COLLATE=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
XMODIFIERS=@im=fcitx
EDITOR=vi
LANG=en_US.UTF-8
QMAKESPEC=linux-g++
LC_MEASUREMENT=en_US.UTF-8
HISTCONTROL=ignoreboth
SHLVL=4
PYTHONPATH=/usr/lib/portage/pym
LESS=-R -M --shift 5
CVS_RSH=ssh
GCC_SPECS=
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-1MQ24uXqBV,guid=7fb576563ec33d5c15e008f64800d46e
XDG_DATA_DIRS=/usr/local/share:/usr/share:/usr/share
LC_CTYPE=zh_CN.UTF-8
PKG_CONFIG_PATH=/usr/qt/3/lib/pkgconfig
LESSOPEN=|lesspipe.sh %s
INFOPATH=/usr/share/info:/usr/share/binutils-data/i686-pc-linux-gnu/2.18/info:/usr/share/gcc-data/i686-pc-linux-gnu/4.1.2/info
WINDOWPATH=7
DISPLAY=:0.0
OPENGL_PROFILE=xorg-x11
LC_TIME=en_US.UTF-8
G_BROKEN_FILENAMES=1
LC_NAME=en_US.UTF-8
COLORTERM=Terminal
_=/usr/bin/env


I don't think any of these environment variables affects buffering.

Quote:

Are you are using a locale that perhaps isn't supported as efficiently as the others?


Yes. See above.

Code:
LC_CTYPE=zh_CN.UTF-8


But I don't think this would affect grep. And the current problem is 17s vs. 5 minutes; the difference is too large.

Quote:
Did "grep" happen to be set to be an alias for something else?

Yes
Code:
alias -p
alias grep='grep --colour=auto'
alias l='ls -CF'
alias la='ls -A'
alias ll='ls -lh'
alias ls='ls --color=auto'
alias up='cd ..'
alias up2='cd ../..'
alias up3='cd ../../..'
alias up4='cd ../../../..'

I did try
Code:
/bin/grep -w 10.0.0.1 log > tmp

Nothing different.
Quote:
It could be a filesystem issue as well. I don't think that the problem is in grep itself, since it's a very well tested and widely used tool.

I think awk has shown that the filesystem is not the problem in this case. And I agree that the regular expression affects efficiency, but considering that my log is so simple and the time difference between grep and awk is so large, I don't think the regular expression is the issue in my situation.

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Sat Apr 12, 2008 4:10 pm

I did find the problem. I grabbed the grep binary from Ubuntu 7.10, which is also grep 2.5.1, the same as in Gentoo.

I processed the same log file. See the output:

Code:
 time ./grep -w 10.0.0.1 log > tmp
real   0m13.526s
user   0m13.429s
sys   0m0.093s


I tried to emerge grep again, but nothing changed. So I guess the compile environment is what gives the grep program such low performance.

BTW: the grep from Ubuntu is only 94k, while the grep in Gentoo is 178k.

John R. Graham
Administrator
Joined: 08 Mar 2005
Posts: 10342
Location: Somewhere over Atlanta, Georgia

Posted: Sat Apr 12, 2008 4:23 pm

The one thing I noticed is that you've got the pcre USE flag set, which causes the grep to be built with "Perl compatible regular expression" support. This is a different regular expression engine than what grep uses normally and may have an effect on execution time. Try
Code:
USE="-pcre" emerge -1v grep
This will make a grep that is probably closer to what is in Ubuntu. That also explains the size difference.

You're still not properly escaping your metacharacters. Did you ever try using proper escaping to see what the performance difference is?

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

alex.blackbit
Advocate
Joined: 26 Jul 2005
Posts: 2397

Posted: Sat Apr 12, 2008 4:27 pm

my grep executable has a size of 136964 bytes, which seems to be a much lower value than in your case.
on a ubuntu installation i have access to it is 96228 bytes, which seems to be the value that you got.
there must be a reason for that. at first i thought maybe you emerged with the static use flag, but you already posted that you didn't.
what size do you get when you compile grep by hand?

alex.blackbit
Advocate
Joined: 26 Jul 2005
Posts: 2397

Posted: Sat Apr 12, 2008 4:45 pm

i tried it by hand and got 190192 bytes before stripping, 72268 afterwards (--strip-unneeded).
that's smaller than i expected, because the version of portage is stripped too, and if configure is honest, then i compiled pcre and nls support in. strange.
by the way... the grep we are all using is most likely 2.5.1a, just like on ubuntu 7.10.
there is a masked version (2.5.3) in the portage tree. have you already tried that?

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Sat Apr 12, 2008 6:11 pm

I emerged grep again with "USE="-pcre" emerge -1v grep".

The new grep is 71300 bytes. However, it is still just as slow.

Quote:
You're still not properly escaping your metacharacters. Did you ever try using proper escaping to see what the performance difference is?


Thanks for your suggestion, but I think that is a separate topic. The grep from Ubuntu does perform well. I am just trying to keep the problem simple.
I guess the current question is how to compile grep so that it works like the grep from Ubuntu.

Quote:
what size do you get when you compile grep by hand?


What does "by hand" mean? I will try downloading the source code from the GNU site and see what's different.

Quote:
by the way... the grep we are all using is most likely 2.5.1a, just like on ubuntu 7.10.


Yes. My grep version is also 2.5.1a.

Quote:
there is a masked version (2.5.3) in the portage tree. have you already tried that?


I will try this if there is no other choice. I still believe there is no reason for 2.5.1a not to work on my system.

alex.blackbit
Advocate
Joined: 26 Jul 2005
Posts: 2397

Posted: Sun Apr 13, 2008 3:19 pm

with "by hand" i just meant to do ./configure; make on the console and not let portage do it.

neysx
Retired Dev
Joined: 27 Jan 2003
Posts: 795

Posted: Sun Apr 13, 2008 4:13 pm

Use LC_ALL=C grep ...
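
A minimal sketch of how that prefix behaves, on a made-up sample file: a variable assignment written in front of a command is placed in the environment of that one command only, so the calling shell's own locale settings are not changed.

```shell
# Sample data (illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' 'Inf: 10.0.0.1 2059' 'Inf: 10.0.0.3 1539' > "$tmpdir/log"

# Remember the caller's LC_ALL (possibly unset) to show that it is left alone.
before=${LC_ALL-unset}

# The assignment prefix applies to this single grep invocation only.
matches=$(LC_ALL=C grep -c -w '10\.0\.0\.1' "$tmpdir/log")

after=${LC_ALL-unset}
echo "matches: $matches, LC_ALL before: $before, after: $after"
rm -rf "$tmpdir"
```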

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Mon Apr 14, 2008 12:04 am

The problem is solved, with an amazing result.

Thanks for this solution.
neysx wrote:
Use LC_ALL=C grep ...


And this solution is also helpful in ubuntu. See the output in ubuntu
Code:
time grep -w '10.0.0.1' corrected.log > tmp

real   0m9.561s
user   0m9.473s
sys   0m0.088s

time LC_ALL=C grep -w '10.0.0.1' corrected.log > tmp

real   0m0.537s
user   0m0.468s
sys   0m0.068s


See the result in gentoo
Code:
time grep -w '10.0.0.1' corrected.log > tmp

real   4m29.185s
user   4m28.836s
sys   0m0.206s

time LC_ALL=C grep -w '10.0.0.1' corrected.log > tmp

real   0m1.043s
user   0m0.962s
sys   0m0.081s


Would you briefly explain why this improves the performance so much? Is it going to affect any other programs? I tried to find the answer on Google, but found no useful information.

colo
Apprentice
Joined: 21 Mar 2004
Posts: 160
Location: Austria

Posted: Mon Apr 14, 2008 7:44 am

Depending on your locale, a . (dot) in RegEx can mean an abundance of things, including loads of multibyte characters - all of which the parser has to branch for (depending on how the automaton working on your RegEx is built, this comes at a very high cost in terms of performance).
_________________
Free Software. Free Society. Better Lives.

kidoln
n00b
Joined: 03 Apr 2008
Posts: 15

Posted: Mon Apr 14, 2008 2:27 pm

colo wrote:
Depending on your locale, a . (dot) in RegEx can mean an abundance of things, including loads of multibyte characters - all of which the parser has to branch for (depending on how the automaton working on your RegEx is built, this comes at a very high cost in terms of performance).


Is there any way to make grep always run with LC_ALL=C? I cannot put this in an init script, because that would affect my other programs. Adding a new alias does not work for grep in a shell script.
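
One possible approach, sketched here under assumptions rather than as a tested recommendation: define a small shell function in the script itself. The name `cgrep` is made up for illustration. Unlike an alias, a function is available in non-interactive scripts, and the C locale applies only to the grep process it starts.

```shell
# Hypothetical helper: run grep in the C locale without touching the caller's environment.
cgrep() {
    LC_ALL=C grep "$@"
}

# Usage on a throwaway sample file (contents are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' 'Inf: 10.0.0.1 2059' 'Inf: 10.0.0.3 1539' > "$tmpdir/log"
hits=$(cgrep -c -w '10\.0\.0\.1' "$tmpdir/log")
echo "hits: $hits"
rm -rf "$tmpdir"
```

A separate name is used instead of redefining `grep`, so that callers which need locale-aware matching keep the normal behaviour.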

All times are GMT
Page 1 of 2