LinuxQuestions.org

zer0x333 10-02-2007 10:57 AM

Unexpected output from grep
 
Hi all,

I have a bit of a problem with grep, and was hoping someone might be able to point me in the right direction!

example cmd line:
grep -oH --file=needles.txt *.txt

example needles.txt:
\( AB1234\|^AB1234\)
\( AB1235\|^AB1235\)
\( AB1236\|^AB1236\)

example .txt file (1 of many in current dir, for example haystack.txt):
dsfkljsdf AB1234 sdflkds
gklsd AB1234 AB1235 sdfkls
AB1236 dgf

I would expect grep's output in this example to be:
haystack.txt: AB1234
haystack.txt: AB1234
haystack.txt: AB1235
haystack.txt:AB1236

However, having both AB1234 and AB1235 on one line in haystack.txt seems to cause this:
haystack.txt: AB1234
haystack.txt: AB1234
AB1235
haystack.txt:AB1236

I don't understand why grep is not prefixing 'haystack.txt:' on 'AB1235'? I would be most grateful if someone could enlighten me :)

kmellzey 10-02-2007 11:08 AM

that's the way grep works
 
This is exactly the correct output for grep. As you see, you have 2 hits on AB1234 and AB1235. Grep splits these two fields with a newline. If you want the output for both strings to be on the same line, you'll have to do some horsing around. I'm too lazy to do it...

Regards,
Mark

pixellany 10-02-2007 11:36 AM

My guess:

GREP goes line by line looking for one of the two matches. For each line that it parses, it applies the filename tag. Where it finds both matches, it still lists only one entry (because they are on the same line)
Note that if you remove the "-o" option, the behavior is closer to what you would expect.

In short, the answer is: "that's the way it is."

matthewg42 10-02-2007 11:52 AM

It's easy in Perl:
Code:

perl -ne 'while ( s/\b(AB123[456])// ) { print "$ARGV:$1\n"; }' example.txt
The s/\b(AB123[456])// says,
Quote:

substitute (s/) the pattern which comes after a word boundary (denoted by \b, meaning the position where a word character, i.e. a letter, digit or underscore, meets a non-word character or the start or end of the line), with nothing (the empty string).
The pattern is in (brackets), which means the contents of the matched string are assigned to $1, which is printed with the current input file name. The expression with the s/...// evaluates to true so long as a match was found and replaced, which means it works multiple times if you have several matches on the same line.

Perl can be a bit read-only, but it's amazingly good for this sort of thing. For sure you can do the same sort of thing with sed, too, which is a little lighter, or in awk. I'm sure some people will post those solutions too.
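
For readers who don't use Perl, the same loop can be sketched in Python. This is only an illustration of the technique described above, not the poster's code: the function name `find_ids` is made up, the sample lines are copied from the first post, and `re.finditer` stands in for the repeated `s///` by walking over every non-overlapping match on a line.

```python
import re

# Same pattern as the Perl one-liner: a word boundary, then AB1234, AB1235 or AB1236.
PATTERN = re.compile(r"\b(AB123[456])")

def find_ids(filename, lines):
    """Return one 'filename:match' string per match, several per line if needed."""
    hits = []
    for line in lines:
        for match in PATTERN.finditer(line):
            hits.append(f"{filename}:{match.group(1)}")
    return hits

# Sample data copied from the haystack.txt example in the first post.
haystack = [
    "dsfkljsdf AB1234 sdflkds",
    "gklsd AB1234 AB1235 sdfkls",
    "AB1236 dgf",
]
results = find_ids("haystack.txt", haystack)
print("\n".join(results))
```

Every match gets its own line with the file name in front, including the second match on the middle line.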

GrapefruiTgirl 10-02-2007 12:35 PM

echo `grep -oH --file=needles.txt haystack.txt` | sed -e s/haystack.txt/\\nhaystack.txt/g

:D:D I got intrigued, and learned something at the same time! This may be not what you had in mind, but I am quite pleased :) LOL

output with my filenames looks like:

bash-3.2$ echo `grep -oH --file=needles.txt file.txt` | sed -e s/file.txt/\\nfile.txt/g

file.txt: AB1234
file.txt: AB1234 AB1235
file.txt:AB1236

But this is better, as it will work when there are multiple files in the folder:

bash-3.2$ for I in *.txt; do echo `grep -oH --file=needles.txt $I` | sed -e s/"$I"/"\n$I"/g | grep $I; done
file.txt: AB1234
file.txt: AB1234 AB1235
file.txt:AB1236
needles.txt: AB1234
needles.txt: AB1235
bash-3.2$

zer0x333 10-03-2007 09:03 AM

getting closer!
 
Thanks for the responses all :D I guess grep isn't the solution in this case!

Quote:

Originally Posted by matthewg42 (Post 2910685)
It's easy in Perl:
Code:

perl -ne 'while ( s/\b(AB123[456])// ) { print "$ARGV:$1\n"; }' example.txt

This gives exactly the type of output I'm looking for: I want every individual pattern match on its own line along with the current filename.

The only issue I have is that my list of patterns is about 5000 lines, and there are a heck of a lot of text files to process.

I can use a very loose regex and *.txt wildcard in the above example and then grep that output for the list of patterns (to remove incorrectly matched lines), but that just seems silly!?

Does anyone know of a way I can use the list of patterns instead of the one regex in the above or similar example?

ghostdog74 10-03-2007 09:17 AM

Code:

awk '{
      for(i=1;i<=NF;i++){
          if ($i ~ /AB123[456]/){
                print FILENAME ":" $i
          }
      }
}' *txt
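
A rough Python sketch of the difference between printing the whole field (what `$i` gives in the awk loop above) and printing only the matched text (what grep's -o is meant to show). The function names and the sample line are invented for illustration.

```python
import re

PATTERN = re.compile(r"AB123[456]")

def matching_fields(line):
    """Whole whitespace-separated fields that contain a match (like $i in awk)."""
    return [field for field in line.split() if PATTERN.search(field)]

def matching_parts(line):
    """Only the matched text itself (the grep -o style of output)."""
    return PATTERN.findall(line)

line = "see AB1234, and XAB1235z"
print(matching_fields(line))  # ['AB1234,', 'XAB1235z']
print(matching_parts(line))   # ['AB1234', 'AB1235']
```

The two only agree when every match happens to be a complete field on its own, as in the sample haystack.txt.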


zer0x333 10-04-2007 05:52 AM

Still stuck!
 
Thanks for the awk suggestion ghostdog74, I really should have a play with awk! How would I emulate grep's '-o' option with this?

I still have the same problem though, how can I use a list of patterns in place of the one regex?
(AB123, AB124, AB125.. was a very oversimplified example).

I've not used perl or awk before, and I'm a bit stuck! I've been trying to put together a perl script that will do it with little success :(
The perl matthewg42 suggested is surprisingly quick! Along with the capability to read patterns from a file it would be ideal :)

matthewg42 10-04-2007 07:38 AM

Quote:

Originally Posted by zer0x333 (Post 2912861)
I've not used perl or awk before, and I'm a bit stuck! I've been trying to put together a perl script that will do it with little success :(
The perl matthewg42 suggested is surprisingly quick! Along with the capability to read patterns from a file it would be ideal :)

Perl is very quick for a lot of tasks. The reason I started to use it over awk was that I once wrote a long-running report in awk, and just for kicks ran the awk program through a2p, which produces a Perl version of an awk program (for fairly simple programs). The auto-generated program ran in about half the time of the awk version, which blew me away.

Let's say you have your pattern list in a file called "patterns". Here's a perl version of the program which reads the list of patterns and then does the check as before with each pattern:

Code:

#!/usr/bin/perl

use warnings;
use strict;

# name of file which contains patterns, one per line.
my $patterns_file = "patterns";

# array which will hold patterns
my @patterns;   

# open the patterns file for reading (using the variable, not a hard-coded name)
open(P, "<", $patterns_file) || die "cannot open patterns file: $patterns_file : $!\n";

# slurp up the file into the array (one line per element).
@patterns = <P>;

# remove the trailing \n character from every element of @patterns.
chomp(@patterns);

close(P);

# Now for each line in each input file which is specified on the command line
# we will do our checks.  Note the <> operator does a lot automatically...
# it opens each input file in turn, and returns the lines in those files
# some variables are set automatically.  $_ is the current line, $ARGV is
# the current input file name.  Lots of functions in perl work on $_ by
# default if no other argument is passed, e.g. chomp.
while(<>) {
    foreach my $re (@patterns) {
        while(s/($re)//) {
                print "$ARGV:$1\n";
        }
    }
}

An example patterns file might look like this:
Code:

\bAB1230
\bAB1231
\bAB1232
\bAB1233
\bAB1234
\bAB1235
\bAB1236
\bAB1237
\bAB1238
\bAB1239

Note that this example file could actually be reduced to a single regular expression:
Code:

\bAB123\d
\d means "any digit, 0-9". Of course, you don't have to simplify your list of expressions. I can imagine if the list is auto-generated from another program, simplification might not be feasible.


Edit: One last thing... I didn't explain where the $1 comes from when the output is printed. When a regular expression matches, the text matched by any part of the expression in round brackets () is assigned to $1. The text matched by a second pair of brackets goes to $2 and so on. This is very useful for extracting parts of a complex pattern while at the same time validating or finding the whole pattern.
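
The bracket numbering works the same way in most regex flavours. A small Python sketch for comparison; the two-group pattern is a made-up example, not something from the real pattern list.

```python
import re

# Hypothetical pattern: two capital letters, then four digits, each in its own group.
pattern = re.compile(r"\b([A-Z]{2})(\d{4})\b")

match = pattern.search("gklsd AB1234 sdfkls")
whole = match.group(0)    # the entire match, like $& in Perl
prefix = match.group(1)   # first pair of brackets, like $1
number = match.group(2)   # second pair of brackets, like $2
print(whole, prefix, number)
```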

ghostdog74 10-04-2007 08:58 AM

Quote:

Originally Posted by zer0x333 (Post 2912861)
Thanks for the awk suggestion ghostdog74, I really should have a play with awk!

you should.
Quote:

How would I emulate grep's '-o' option with this?
grep's -o is just to show only the part of a matching line that matches the pattern. in the awk script i provided, it's the $i. I seldom use grep so if it's not what you want, describe more clearly with an example, so i could understand.

Quote:

I still have the same problem though, how can I use a list of patterns in place of the one regex?
(AB123, AB124, AB125.. was a very oversimplified example).
show a snippet of that "not so simplified" example

matthewg42 10-04-2007 09:20 AM

This makes me wonder... does anyone know of an automatic regex reducer, e.g. takes "(abc|abd|abe)" and returns "(ab[cde])"? One optimised for fast execution would be ideal. If not, maybe it would be a nice project for someone...
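
One way such a reducer could work is to put the strings into a trie and write the trie back out as a regex, so shared prefixes appear only once. The Python below is a minimal sketch of that idea under some loud assumptions: the inputs are literal strings rather than regexes (so a leading \b would have to be stripped and re-added by hand), the list is not empty, and nothing here is tuned for speed. All function names are invented, and it is not a port of any existing module.

```python
import re

def build_trie(words):
    """Store the words in a nested-dict trie; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment matching every suffix stored below it."""
    if set(node) == {""}:
        return ""                      # nothing left to match
    optional = "" in node              # a word may also end right here
    alternatives, single_chars = [], []
    for ch in sorted(key for key in node if key):
        rest = trie_to_regex(node[ch])
        if rest:
            alternatives.append(re.escape(ch) + rest)
        else:
            single_chars.append(ch)    # last character of a word
    if len(single_chars) == 1:
        alternatives.append(re.escape(single_chars[0]))
    elif single_chars:
        alternatives.append("[" + "".join(re.escape(c) for c in single_chars) + "]")
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    group = "(?:" + "|".join(alternatives) + ")"
    return group + "?" if optional else group

def reduce_alternation(words):
    """Return a single regex fragment that matches exactly the given literal words."""
    return trie_to_regex(build_trie(words))

reduced = reduce_alternation(["abc", "abd", "abe"])
print(reduced)  # ab[cde]
```

Whether the reduced form actually runs faster than the plain alternation depends on the regex engine, so it would need measuring on real data.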

zer0x333 12-04-2007 10:10 AM

Back at last!
 
Thanks again for all your help everyone! I had forgotten about this for a while! I ended up going with the following solution, and it works very well :)

Code:

#!/usr/bin/perl
while(<>) {
        while(s/(\bACAD011|\bACAD015 ...(removed)... |\bZZZ042ET02)//) {
                print "$ARGV:$1\n";
        }
}
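
An alternation like the one above does not have to be typed by hand; it can be generated from the patterns file. A minimal Python sketch, assuming one regex per line as in the earlier example patterns file. The names `compile_patterns` and `scan` are made up, the three patterns are placeholders for the real list, and the sample line is from the first post.

```python
import re

def compile_patterns(pattern_lines):
    """Join one-regex-per-line patterns into a single compiled alternation."""
    parts = [line.strip() for line in pattern_lines if line.strip()]
    # Wrap each pattern in a non-capturing group so the "|" cannot split it.
    return re.compile("|".join(f"(?:{part})" for part in parts))

def scan(filename, lines, combined):
    """Return 'filename:match' for every match of the combined regex."""
    return [
        f"{filename}:{match.group(0)}"
        for line in lines
        for match in combined.finditer(line)
    ]

# Placeholder patterns; a real run would read them with open("patterns").
patterns = [r"\bAB1234", r"\bAB1235", r"\bAB1236"]
combined = compile_patterns(patterns)
found = scan("haystack.txt", ["gklsd AB1234 AB1235 sdfkls"], combined)
print("\n".join(found))
```

With everything in one alternation, each line is scanned once instead of once per pattern, which is probably why this form copes better with a list of about 5000 patterns.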

Quote:

Originally Posted by matthewg42 (Post 2913052)
This makes me wonder... does anyone know of an automatic regex reducer, e.g. takes "(abc|abd|abe)" and returns "(ab[cde])"? One optimised for fast execution would be ideal. If not, maybe it would be a nice project for someone...

I came across this during my googling frenzy...

http://search.cpan.org/~dland/Regexp...32/Assemble.pm

Cheers,
zer0x

matthewg42 12-04-2007 10:42 AM

Quote:

Originally Posted by zer0x333 (Post 2979881)
I came across this during my googling frenzy...

http://search.cpan.org/~dland/Regexp...32/Assemble.pm

Cheers,
zer0x

Thanks - that looks really interesting!


All times are GMT -5.