LinuxQuestions.org
Old 10-02-2007, 11:57 AM   #1
zer0x333
Member
 
Registered: Oct 2007
Posts: 31

Rep: Reputation: 16
Unexpected output from grep


Hi all,

I have a bit of a problem with grep, and was hoping someone might be able to point me in the right direction!

example cmd line:
grep -oH --file=needles.txt *.txt

example needles.txt:
\( AB1234\|^AB1234\)
\( AB1235\|^AB1235\)
\( AB1236\|^AB1236\)

example .txt file (1 of many in current dir, for example haystack.txt):
dsfkljsdf AB1234 sdflkds
gklsd AB1234 AB1235 sdfkls
AB1236 dgf

I would expect grep's output in this example to be:
haystack.txt: AB1234
haystack.txt: AB1234
haystack.txt: AB1235
haystack.txt:AB1236

However, having both AB1234 and AB1235 on one line in haystack.txt seems to cause this:
haystack.txt: AB1234
haystack.txt: AB1234
AB1235
haystack.txt:AB1236

I don't understand why grep is not prefixing 'haystack.txt:' on 'AB1235'. I would be most grateful if someone could enlighten me.
 
Old 10-02-2007, 12:08 PM   #2
kmellzey
LQ Newbie
 
Registered: Jan 2007
Distribution: Fedora5 & 6, Red Hat WS4
Posts: 4

Rep: Reputation: 0
that's the way grep works

This is exactly the correct output for grep. As you see, you have 2 hits on AB1234 and AB1235. Grep splits these two fields with a newline. If you want the output for both strings to be on the same line, you'll have to do some horsing around. I'm too lazy to do it...

Regards,
Mark
 
Old 10-02-2007, 12:36 PM   #3
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 729
My guess:

grep goes line by line looking for one of the two matches. For each line that it parses, it applies the filename tag. Where it finds both matches, it still lists only one entry (because they are on the same line).
Note that if you remove the "-o" option, the behavior is more like you would expect.

In short, the answer is: "that's the way it is."
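
For anyone who needs the prefix on every match anyway, here is a minimal sketch of a workaround that does not depend on how a particular grep version combines -o and -H: run grep on one file at a time, without asking it for the file name, and add the prefix in the shell. The name prefix_matches is made up for illustration, and -o is assumed to be available (GNU and BSD grep have it, but it is not POSIX).

```shell
# Hypothetical helper: print "file:match" for every match, one per line.
# Usage: prefix_matches PATTERN_FILE FILE...
prefix_matches() {
    patterns=$1
    shift
    for f in "$@"; do
        # With a single file operand grep prints no file name of its own,
        # so every line read below is exactly one match.
        grep -o -f "$patterns" -- "$f" |
        while IFS= read -r match; do
            printf '%s:%s\n' "$f" "$match"
        done
    done
}
```

Called as `prefix_matches needles.txt *.txt`, this should give one "file:match" line per hit, at the cost of starting one grep process per file.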
 
Old 10-02-2007, 12:52 PM   #4
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
It's easy in Perl:
Code:
perl -ne 'while ( s/\b(AB123[456])// ) { print "$ARGV:$1\n"; }' example.txt
The s/\b(AB123[456])// says,
Quote:
substitute (s/) the pattern which comes after a word boundary (denoted by \b, meaning the point between a word character and a non-word character, such as just after a newline, space or tab), with an empty string.
The pattern is in (brackets), which means the contents of the matched string are assigned to $1, which is printed with the current input file name. The expression with the s/...// evaluates to true so long as a match was found and replaced, which means it works multiple times if you have several matches on the same line.

Perl can be a bit read-only, but it's amazingly good for this sort of thing. For sure you can do the same sort of thing with sed, too, which is a little lighter, or in awk. I'm sure some people will post those solutions too.
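
As a rough grep-side counterpart to the \b anchor, here is a small sketch assuming a grep that supports -o and -w (GNU and BSD grep do; neither option is POSIX). Note that -w is stricter than a leading \b: it wants a non-word character, or the line edge, on both sides of the match. The name whole_word_matches is hypothetical.

```shell
# Hypothetical helper: print each whole-word match of an extended regex,
# one per line, reading from the given files or from standard input.
# Usage: whole_word_matches REGEX [FILE...]
whole_word_matches() {
    regex=$1
    shift
    # -o prints only the matched text, -w rejects matches that are glued
    # to other word characters (letters, digits, underscore).
    grep -E -o -w -e "$regex" -- "$@"
}
```

For example, feeding it the line "XAB1234 AB1235 AB1236Z" with the regex AB123[456] should report only AB1235, since the other two candidates touch a letter.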
 
Old 10-02-2007, 01:35 PM   #5
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
echo `grep -oH --file=needles.txt haystack.txt` | sed -e s/haystack.txt/\\nhaystack.txt/g

I got intrigued, and learned something at the same time! This may not be what you had in mind, but I am quite pleased LOL

output with my filenames looks like:

bash-3.2$ echo `grep -oH --file=needles.txt file.txt` | sed -e s/file.txt/\\nfile.txt/g

file.txt: AB1234
file.txt: AB1234 AB1235
file.txt:AB1236

But this is better, as it will work when there are multiple files in the folder:

bash-3.2$ for I in *.txt; do echo `grep -oH --file=needles.txt $I` | sed -e s/"$I"/"\n$I"/g | grep $I; done
file.txt: AB1234
file.txt: AB1234 AB1235
file.txt:AB1236
needles.txt: AB1234
needles.txt: AB1235
bash-3.2$

Last edited by GrapefruiTgirl; 10-02-2007 at 02:09 PM.
 
Old 10-03-2007, 10:03 AM   #6
zer0x333
Member
 
Registered: Oct 2007
Posts: 31

Original Poster
Rep: Reputation: 16
getting closer!

Thanks for the responses, all. I guess grep isn't the solution in this case!

Quote:
Originally Posted by matthewg42
It's easy in Perl:
Code:
perl -ne 'while ( s/\b(AB123[456])// ) { print "$ARGV:$1\n"; }' example.txt
This gives exactly the type of output I'm looking for: I want every individual pattern match on its own line along with the current filename.

The only issue I have is that my list of patterns is about 5000 lines, and there are a heck of a lot of text files to process..

I can use a very loose regex and *.txt wildcard in the above example and then grep that output for the list of patterns (to remove incorrectly matched lines), but that just seems silly!?

Does anyone know of a way I can use the list of patterns instead of the one regex in the above or similar example?
 
Old 10-03-2007, 10:17 AM   #7
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
Code:
awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /AB123[456]/) {
            print FILENAME ":" $i
        }
    }
}' *txt
 
Old 10-04-2007, 06:52 AM   #8
zer0x333
Member
 
Registered: Oct 2007
Posts: 31

Original Poster
Rep: Reputation: 16
Still stuck!

Thanks for the awk suggestion ghostdog74, I really should have a play with awk! How would I emulate grep's '-o' option with this?

I still have the same problem though, how can I use a list of patterns in place of the one regex?
(AB123, AB124, AB125.. was a very oversimplified example).

I've not used perl or awk before, and I'm a bit stuck! I've been trying to put together a perl script that will do it, with little success.
The perl matthewg42 suggested is surprisingly quick! Along with the capability to read patterns from a file it would be ideal.
 
Old 10-04-2007, 08:38 AM   #9
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Quote:
Originally Posted by zer0x333
I've not used perl or awk before, and I'm a bit stuck! I've been trying to put together a perl script that will do it, with little success.
The perl matthewg42 suggested is surprisingly quick! Along with the capability to read patterns from a file it would be ideal.
Perl is very quick for a lot of tasks. The reason I started to use it over awk was that I once wrote a long-running report in awk, and just for kicks ran the awk program through a2p, which produces a Perl version of an awk program (for fairly simple programs). This auto-generated program ran in about half the time of the awk version, which blew me away.

Let's say you have your pattern list in a file called "patterns". Here's a perl version of the program which reads the list of patterns and then does the check as before with each pattern:

Code:
#!/usr/bin/perl 

use warnings;
use strict;

# name of file which contains patterns, one per line.
my $patterns_file = "patterns";

# array which will hold patterns
my @patterns;    

# open the patterns file for reading
open(P, "<", $patterns_file) || die "cannot open patterns file: $patterns_file : $!\n";

# slurp up the file into the array (one line per element).
@patterns = <P>;

# remove the trailing \n character from each element of @patterns.
chomp(@patterns);

close(P);

# Now for each line in each input file which is specified on the command line
# we will do our checks.  Note the <> operator does a lot automatically...
# it opens each input file in turn, and returns the lines in those files.
# Some variables are set automatically:  $_ is the current line, $ARGV is
# the current input file name.  Lots of functions in perl work on $_ by
# default if no other argument is passed, e.g. chomp.
while(<>) {
    foreach my $re (@patterns) {
        while(s/($re)//) {
                print "$ARGV:$1\n";
        }
    }
}
An example patterns file might look like this:
Code:
\bAB1230
\bAB1231
\bAB1232
\bAB1233
\bAB1234
\bAB1235
\bAB1236
\bAB1237
\bAB1238
\bAB1239
Note that this example file could actually be reduced to a single regular expression:
Code:
\bAB123\d
\d means "any digit, 0-9". Of course, you don't have to simplify your list of expressions. I can imagine if the list is auto-generated from another program, simplification might not be feasible.


Edit: One last thing... I didn't explain where the $1 comes from when the output is printed. When a regular expression matches, if any part of the expression is in round brackets (), the text matched by that part is assigned to $1. A second pair of brackets will go to $2 and so on. This is very useful for extracting parts of a complex pattern while at the same time validating or finding the whole pattern.

Last edited by matthewg42; 10-04-2007 at 08:42 AM. Reason: one last thing...
 
Old 10-04-2007, 09:58 AM   #10
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by zer0x333
Thanks for the awk suggestion ghostdog74, I really should have a play with awk!
you should.
Quote:
How would I emulate grep's '-o' option with this?
grep's -o just shows only the part of a matching line that matches the pattern. In the awk script I provided, it's the $i (strictly speaking, the whole whitespace-separated field that contains the match). I seldom use grep, so if it's not what you want, describe it more clearly with an example so I can understand.

Quote:
I still have the same problem though, how can I use a list of patterns in place of the one regex?
(AB123, AB124, AB125.. was a very oversimplified example).
Show a snippet of that "not so simplified" example.
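
If only the matched text is wanted rather than the whole field, one possible sketch uses awk's match() function, which sets RSTART and RLENGTH for the leftmost match. The name awk_only_matching is made up here. Two assumptions to be aware of: POSIX awk has no \b, so the regex is used unanchored, and because the regex is passed with -v any backslashes in it would need doubling. A regex anchored with ^ would also match again at the start of each remainder, so this is only meant for unanchored patterns.

```shell
# Hypothetical helper: a rough awk stand-in for "grep -oH".
# Usage: awk_only_matching REGEX FILE...
awk_only_matching() {
    regex=$1
    shift
    awk -v re="$regex" '{
        rest = $0
        # match() returns the start of the leftmost match in rest (0 if none)
        # and sets RSTART and RLENGTH; skip empty matches to avoid looping.
        while (match(rest, re) && RLENGTH > 0) {
            print FILENAME ":" substr(rest, RSTART, RLENGTH)
            # Keep searching in the text after the match just reported.
            rest = substr(rest, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Something like `awk_only_matching 'AB123[456]' *.txt` should then print one "file:match" line per hit, even when a field carries extra punctuation around the match.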
 
Old 10-04-2007, 10:20 AM   #11
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
This makes me wonder... does anyone know of an automatic regex reducer, e.g. takes "(abc|abd|abe)" and returns "(ab[cde])"? One optimised for fast execution would be ideal. If not, maybe it would be a nice project for someone...
 
Old 12-04-2007, 11:10 AM   #12
zer0x333
Member
 
Registered: Oct 2007
Posts: 31

Original Poster
Rep: Reputation: 16
Back at last!

Thanks again for all your help, everyone! I had forgotten about this for a while! I ended up going with the following solution; it works very well.

Code:
#!/usr/bin/perl
while(<>) {
        while(s/(\bACAD011|\bACAD015 ...(removed)... |\bZZZ042ET02)//) {
                print "$ARGV:$1\n";
        }
}
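
For reference, a single alternation like the one above can also be generated from a pattern file rather than typed by hand. This is only a rough sketch, not the script actually used: build_alternation and match_any are hypothetical names, the pattern file is assumed to hold one extended regex per line, and \b inside grep -E is a GNU extension.

```shell
# Hypothetical helper: join the non-empty lines of a pattern file with "|".
# Usage: build_alternation PATTERN_FILE
build_alternation() {
    grep -v '^$' -- "$1" | paste -s -d '|' -
}

# Hypothetical helper: print every match of any pattern, one per line.
# Usage: match_any PATTERN_FILE FILE...
match_any() {
    alternation=$(build_alternation "$1")
    shift
    # -E enables the "|" alternation, -o prints only the matched text.
    grep -E -o -e "$alternation" -- "$@"
}
```

With several thousand patterns the joined expression gets long, so it may be worth checking that it still fits within the system's command-line length limit.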
Quote:
Originally Posted by matthewg42
This makes me wonder... does anyone know of an automatic regex reducer, e.g. takes "(abc|abd|abe)" and returns "(ab[cde])"? One optimised for fast execution would be ideal. If not, maybe it would be a nice project for someone...
I came across this during my googling frenzy...

http://search.cpan.org/~dland/Regexp...32/Assemble.pm

Cheers,
zer0x
 
Old 12-04-2007, 11:42 AM   #13
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Quote:
Originally Posted by zer0x333
I came across this during my googling frenzy...

http://search.cpan.org/~dland/Regexp...32/Assemble.pm

Cheers,
zer0x
Thanks - that looks really interesting!
 
  

