Gentoo Forums
[SOLVED] grep is too slow
John R. Graham
Administrator
Administrator


Joined: 08 Mar 2005
Posts: 10387
Location: Somewhere over Atlanta, Georgia

Posted: Mon Apr 14, 2008 3:31 pm

Have you heard this old joke? A man goes to see his doctor, and says, "Doc, it hurts when I move my arm like this." The doctor replies, "Well, don't do that!" Which brings me to a question. Were you really trying to search for the following?
  • Literal string "10", followed by
  • Any (possibly multibyte) character, followed by
  • Literal string "0" followed by
  • Any (possibly multibyte) character, followed by
  • Literal string "0" followed by
  • Any (possibly multibyte) character, followed by
  • Literal string "1"
I didn't think you were. marduk didn't think you were. He suggested that you use "grep -F" or "fgrep". I suggested that you escape the metacharacters that you didn't really want to be metacharacters. Now, in light of all of this, I think your most recent question devolves into
Quote:
How do I Band-Aid my system so that I can continue to use this tool badly and not suffer the performance penalty?
My advice to you is, don't do that! :? It'll eventually bite you in an unexpected way. In fact, it just did.
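
To make the difference concrete, here is a minimal sketch of what the unescaped pattern matches compared with a literal search. The two sample lines are made up for illustration and are not taken from the real log.

```shell
# Hypothetical sample data: only the first line contains the literal address.
sample=$(mktemp)
printf '%s\n' '10.0.0.1 GET /index' '10a0b0c1 GET /other' > "$sample"

# Unescaped dots are regex metacharacters and match any character,
# so both lines are counted.
regex_hits=$(grep -c '10.0.0.1' "$sample")

# -F treats the pattern as a fixed string, so only the literal address matches.
fixed_hits=$(grep -cF '10.0.0.1' "$sample")

# Escaping the dots gives the same result as -F.
escaped_hits=$(grep -c '10\.0\.0\.1' "$sample")

echo "regex=$regex_hits fixed=$fixed_hits escaped=$escaped_hits"
rm -f "$sample"
```

On this sample the unescaped pattern counts both lines, while the -F and escaped forms count only the first.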

Now, if a properly formed search pattern is shown to have low performance with grep or a literal search has low performance in fgrep, then I'm still very interested in helping you figure out why. :wink:

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
kidoln
n00b
n00b


Joined: 03 Apr 2008
Posts: 15

Posted: Tue Apr 15, 2008 12:55 am

john_r_graham wrote:
Have you heard this old joke? A man goes to see his doctor, and says, "Doc, it hurts when I move my arm like this." The doctor replies, "Well, don't do that!" Which brings me to a question. Were you really trying to search for the following?
  • Literal string "10", followed by
  • Any (possibly multibyte) character, followed by
  • Literal string "0" followed by
  • Any (possibly multibyte) character, followed by
  • Literal string "0" followed by
  • Any (possibly multibyte) character, followed by
  • Literal string "1"
I didn't think you were. marduk didn't think you were. He suggested that you use "grep -F" or "fgrep". I suggested that you escape the metacharacters that you didn't really want to be metacharacters. Now, in light of all of this, I think your most recent question devolves into
Quote:
How do I Band-Aid my system so that I can continue to use this tool badly and not suffer the performance penalty?
My advice to you is, don't do that! :? It'll eventually bite you in an unexpected way. In fact, it just did.

Now, if a properly formed search pattern is shown to have low performance with grep or a literal search has low performance in fgrep, then I'm still very interested in helping you figure out why. :wink:

- John


I completely agree with you. But look at the output:
Code:
time /bin/grep '10\.0\.0\.1 ' log > tmp

real   3m42.600s
user   3m42.381s
sys   0m0.108s



There is no big difference: 3 min vs. 5 min. Apparently there is still some problem with my current grep. Consider the better performance of grep from ubuntu without LC_ALL=C.

Even when I filter the string the bad way, adding LC_ALL=C makes grep finish the job in only 1 second.

The reason I keep using "grep -w 10.0.0.1 log" is that I want to compare the performance of the different solutions. In my real work, the script now uses your suggestion. Thank you. Here is your suggestion with LC_ALL=C:
Code:
time grep '10\.0\.0\.1 ' log > tmp

real   0m0.252s
user   0m0.188s
sys   0m0.064s
Akkara
Administrator
Administrator


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Tue Apr 15, 2008 10:08 am

Quote:
Consider the better performance of grep from ubuntu without LC_ALL=C


Are all the locale settings identical in gentoo and ubuntu? If not, that could explain the speed difference. If they are identical, it might point to a problem with how gentoo processes your locale.
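
One way to check is to run `locale` on both machines and compare the output. As a rough sketch of which variable ends up deciding character handling, the helper below follows the usual precedence (LC_ALL, then LC_CTYPE, then LANG). The function name effective_ctype is made up for this example, and the "C" fallback is an assumption, since the default with nothing set is implementation-defined.

```shell
# Report which setting wins for character handling.
# Precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
# An empty value counts the same as unset.
effective_ctype() {
    # Falling back to "C" is an assumption; the real default is up to the system.
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective_ctype
```

If this prints a UTF-8 locale on one machine and C or an ISO locale on the other, the settings are not identical.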
kidoln
n00b
n00b


Joined: 03 Apr 2008
Posts: 15

Posted: Tue Apr 15, 2008 7:03 pm

Akkara wrote:
Quote:
Consider the better performance of grep from ubuntu without LC_ALL=C


Are all the locale settings identical in gentoo and ubuntu? If not, that could explain the speed difference. If they are identical, it might point to a problem with how gentoo processes your locale.


I believe that they are the same.
Zucca
Veteran
Veteran


Joined: 14 Jun 2007
Posts: 1579
Location: KUUSANKOSKI, Finland

Posted: Thu May 15, 2008 1:36 pm    Post subject: Thanks

I had the same problem too, and over a remote connection (ssh) the results were faster.
The problem was that I use UTF and ISO locales in different situations.
I would never have believed that locale could cause such slow grep processing.

Now instead of 2 minutes my log stat script runs through all the processes in 15 seconds. :)

Thanks again!
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
colo
Apprentice
Apprentice


Joined: 21 Mar 2004
Posts: 160
Location: Austria

Posted: Fri Aug 29, 2008 7:51 pm

I've been hit by this once again today, and the thing that REALLY startles me is that my Ubuntu 8.04 machine does not suffer from the speed decrease when using a multibyte locale (en_US.utf8). My gentoo version of `grep` does, no matter if version 2.5.1 or 2.5.3, or if compiled with PCRE support or not...

A quick survey on IRC suggested the same for other fellow Gentoo users. Anyone more clever than I here who can explain that to me?
_________________
Free Software. Free Society. Better Lives.
colo
Apprentice
Apprentice


Joined: 21 Mar 2004
Posts: 160
Location: Austria

Posted: Sat Aug 30, 2008 7:52 am

By the way, if I tell grep with -P to interpret my regex (just a string literal in my test, actually) as a PCRE (using libpcre for matching in turn, I guess), my locale does not have this abhorrent impact on performance. Is there some defect in glibc's regex(3) functions that I'm not aware of?
_________________
Free Software. Free Society. Better Lives.
Akkara
Administrator
Administrator


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Mon Sep 28, 2009 5:51 am    Post subject: Why is egrep so much slower than sed? [Solved]

I have a ~11MB text file, regular ascii 7-bit. I want to extract all lines that begin with "have" or "want".

Using egrep is really slow:
Code:
$ time egrep '^have|^want' file.txt >/dev/null
real   1m5.023s
user   1m4.886s
sys   0m0.093s

$ time egrep '^(have|want)' file.txt >/dev/null
real   1m5.469s
user   1m5.346s
sys   0m0.070s


Using sed is fast:
Code:
$ time sed -n -e '/^have/p' -e '/^want/p' file.txt >/dev/null
real   0m0.264s
user   0m0.263s
sys   0m0.000s

Why is there such a large disparity?


Last edited by Akkara on Mon Sep 28, 2009 11:40 am; edited 1 time in total
truc
Advocate
Advocate


Joined: 25 Jul 2005
Posts: 3199

Posted: Mon Sep 28, 2009 8:21 am

I don't know if that's really the reason, since I can't test on such a big file, but you're using extended regular expressions with egrep (grep -E) and BASIC regular expressions with sed. (and you're not even using the same regexp in both cases)

Could you actually time the following and report back:
Code:
sed -nr '/^have|^want/p' big_file


PS:
Quote:
In addition, two variant programs egrep and fgrep are available. egrep is the same as grep -E. fgrep is the same as grep -F. Direct invocation as either egrep or fgrep is deprecated, but is provided to allow historical applications that rely on them to run unmodified.
;)
_________________
The End of the Internet!
Akkara
Administrator
Administrator


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Mon Sep 28, 2009 9:46 am

I don't have the original file (it was debugging output that's changing with every build).

So I re-ran all of the benchmarks on the current version of the file, which is now 22.5 MB. Of 1050374 lines in that file, 262144 match. I had also piped them through 'md5sum' to make sure all tests produce the same output. They do.
Code:
$ time egrep '^have|^want' file.txt >/dev/null
real   2m10.062s
user   2m9.678s
sys   0m0.160s

$ time sed -n -e '/^have/p' -e '/^want/p' file.txt >/dev/null
real   0m0.479s
user   0m0.460s
sys   0m0.013s

$ time sed -nr '/^have|^want/p' file.txt >/dev/null
real   0m0.384s
user   0m0.363s
sys   0m0.010s

$ time grep '^[hw]a[vn][et]' file.txt >/dev/null
real   2m9.849s
user   2m9.575s
sys   0m0.163s

Your suggestion is even faster than my sed equivalent.

I also tried regular grep with a modified expression in case it was the extended expressions that are causing problems. This isn't equivalent to the others, although in this file the same lines were matched. It was just as slow as the other greps.
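
For anyone wanting to repeat the md5sum check, it can be sketched roughly like this. The sample lines are a made-up stand-in for file.txt, `grep -E` is used in place of egrep, and `sed -nE` in place of `sed -nr` (the two sed spellings select the same extended syntax in GNU sed).

```shell
# Hypothetical stand-in for file.txt; three of the five lines should match.
sample=$(mktemp)
printf '%s\n' 'have one' 'i have two' 'want three' 'haven' 'wan' > "$sample"

# Checksum the output of each variant; identical sums mean identical output.
sum_grep=$(LC_ALL=C grep -E '^have|^want' "$sample" | md5sum)
sum_sed=$(sed -n -e '/^have/p' -e '/^want/p' "$sample" | md5sum)
sum_sed_ere=$(sed -nE '/^have|^want/p' "$sample" | md5sum)
sum_class=$(LC_ALL=C grep '^[hw]a[vn][et]' "$sample" | md5sum)

if [ "$sum_grep" = "$sum_sed" ] && [ "$sum_grep" = "$sum_sed_ere" ] \
    && [ "$sum_grep" = "$sum_class" ]; then
    echo "all outputs match"
else
    echo "outputs differ"
fi
rm -f "$sample"
```

The bracket expression only agrees here because the sample has no lines such as "wave" or "hant", which it would also match.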

I'm starting to wonder whether my grep is broken. Isn't there some kind of regular-expressions library that these sorts of apps all use? Going to try to re-emerge it and see what happens.

Edit: re-emerge of sys-apps/grep-2.5.4-r1 complete. I'm still getting similar slow times.
Bill Cosby
Guru
Guru


Joined: 22 Jan 2005
Posts: 430
Location: Aachen, Germany

Posted: Mon Sep 28, 2009 10:15 am

Hm, how long does this script take to execute for you:

Code:
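#!/usr/bin/perl
use strict;
use warnings;

# First argument: a Perl expression tested against each line.
# It is evaluated as Perl code, so a match needs delimiters, e.g. '/^have|^want/'.
my $op   = shift;
# Second argument: the file to scan.
my $file = shift;

open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    # Print the line when the expression evaluates to true for it.
    print "$_\n" if (eval $op);
}
close($fh);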


Start it like
Code:
scriptname 'regex' file

_________________
The Creature from Jekyll Island.
Genone
Retired Dev
Retired Dev


Joined: 14 Mar 2003
Posts: 9236
Location: beyond the rim

Posted: Mon Sep 28, 2009 11:01 am

grep is apparently heavily affected by locale settings (e.g. bug 283149), so try running it with LC_ALL=C to see if that changes anything.
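
A minimal sketch of what that looks like, assuming a tiny made-up input instead of the real file. Putting the assignment in front of the command changes the locale for that single grep invocation only, so the rest of the shell session keeps its settings.

```shell
# Hypothetical three-line input; two lines match.
sample=$(mktemp)
printf '%s\n' 'have' 'want' 'other' > "$sample"

# Remember the shell's own LC_ALL before and after the call.
before=${LC_ALL-unset}
# The LC_ALL=C prefix applies to this one grep process only.
matches=$(LC_ALL=C grep -cE '^have|^want' "$sample")
after=${LC_ALL-unset}

echo "matches=$matches, LC_ALL before=$before after=$after"
rm -f "$sample"
```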
Akkara
Administrator
Administrator


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Mon Sep 28, 2009 11:26 am

Genone wrote:
try running it with LC_ALL=C to see if that changes anything.

That's the problem!
Code:
$ time LC_ALL=C egrep '^have|^want' file.txt >/dev/null
real   0m0.143s
user   0m0.143s
sys   0m0.000s

Chalk up another one to locale silliness.

Any idea what a good interim solution is?

This particular use is pure ascii so setting LC_ALL works. What's the recommended way of doing this in a makefile?

But at the same time, I like 8-bit stuff parsed as utf8, since I'm tired of things like song name tags getting munged if I'm not super-careful if I happen to start a music app while in my coding environment, and then go edit a tag.

But I don't want the other silliness that comes with locales. In fact, I use LC_COLLATE=C globally. I hate-hate-hate the language-specific sortings, especially the fact that it seems to ignore space and punctuation and puts things like 'ab.x' and 'ab x' ahead of 'abc' - what kind of twisted thinking was going on to propose that, and how did it ever get through committee?
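
For comparison, a small sketch of the plain byte-order sort on those three names. It uses LC_ALL=C rather than LC_COLLATE=C only so that the result does not depend on whatever LC_ALL happens to be set to in the environment.

```shell
# In the C locale, sort compares raw bytes: space (0x20) < '.' (0x2e) < 'c' (0x63).
sorted=$(printf '%s\n' 'abc' 'ab.x' 'ab x' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

This prints 'ab x', then 'ab.x', then 'abc'. Under a language-specific collation the order can come out differently, as described above.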

The ideal for me would be some kind of LC setting that effectively says, "parse the raw bytes as utf8 into their entity-integers, then sort/match/etc. against those integers in regular integer order, and display the results (converted back to utf8)." Is there such a thing?
Mike Hunt
Watchman
Watchman


Joined: 19 Jul 2009
Posts: 5287

Posted: Mon Sep 28, 2009 11:35 am

you could alias egrep and temporarily disable it when you need to
Code:
alias egrep='LC_ALL=C egrep'

and when needed run egrep unaliased:
Code:
\egrep something
desultory
Administrator
Administrator


Joined: 04 Nov 2005
Posts: 9290

Posted: Tue Oct 20, 2009 10:48 am

Mike Hunt wrote:
you could alias egrep and temporarily disable it when you need to
Code:
alias egrep='LC_ALL=C egrep'

and when needed run egrep unaliased:
Code:
\egrep something
Another approach would be to create a more permanent alias, perhaps c_egrep, and use egrep normally. Though the underlying problem still remains.
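
A rough sketch of that idea, written as a shell function rather than an alias, since aliases are normally not expanded in non-interactive scripts. It calls `grep -E` instead of the deprecated egrep, and the name c_egrep is only the suggestion from above.

```shell
# Run extended-regex grep in the C locale, passing all arguments through.
c_egrep() {
    LC_ALL=C grep -E "$@"
}

# Example use on made-up input: count the lines starting with "have" or "want".
count=$(printf '%s\n' 'have' 'want' 'other' | c_egrep -c '^have|^want')
echo "count=$count"
```

Plain egrep stays untouched, so anything that needs the normal locale behaviour keeps working.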

Merged the preceding seven posts.