Pattern matching in a text file

wtaicken · 12-08-2008, 11:00 AM

I need to do some scripting to read through a text file and find the last occurrence of a word in the file that corresponds to a lookup list. Once the line containing that word has been found, I need to extract the last numerical character from that line and substitute it for another character in another text file, which will then be appended to the first.

e.g. the original file will contain something like

ARCHIVE 1
store
begin
*********************************
* Retrieve interface by default into INTERFACE001
ARCHIVE 1
retrieve
begin
*********************************
CRITIC 1 2

What I want to do is find CRITIC, since that's the last occurrence of one of the words in my lookup list. I then need to extract the number 2 and substitute it for something like x in another text file. I guess I can do the last part using sed, but should I use awk or grep for the first bit?

W

TB0ne · 12-08-2008, 12:44 PM

I'd grep it, since if you're only looking for the CRITIC lines, it'll just return those. Doing "grep CRITIC <filename>" would work.

x_terminat_or_3 · 12-08-2008, 12:59 PM

. . . and to get the last occurrence from your grep output, pipe it to tail

like this:

grep CRITIC filename | tail -n 1

then pipe all that to sed/awk
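Put together, the grep/tail/awk chain suggested above might look like the sketch below. The temporary sample file and its contents are made up for illustration (loosely based on the example in the first post), and it assumes the wanted number is the last whitespace-separated field on the line.

```shell
# Build a small sample file (hypothetical data, for illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ARCHIVE 1
CRITIC 1 7
store
begin
CRITIC 1 2
EOF

# grep keeps only the CRITIC lines, tail keeps the last of them,
# and awk prints the last whitespace-separated field of that line.
last_value=$(grep 'CRITIC' "$sample" | tail -n 1 | awk '{ print $NF }')
echo "$last_value"

rm -f "$sample"
```

The captured value can then be handed to sed for the substitution step.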

Tinkster · 12-08-2008, 01:16 PM

Or in awk

Code:

awk '/CRITIC/{line=$0} END{$0=line; print $NF}' file

jan61 · 12-08-2008, 03:31 PM

Moin,

Quote:

Originally Posted by Tinkster

Or in awk

Code:

awk '/CRITIC/{line=$0} END{$0=line; print $NF}' file

You can probably save time by reversing the file first, because you can then stop analysing the file at the first match:

Code:

tac file | awk '/CRITIC/{print $NF; exit;}'

Jan

Tinkster · 12-08-2008, 03:36 PM

Good idea - would be worthwhile to time executions.

PTrenholme · 12-08-2008, 03:44 PM

You're all ignoring the "list of words in another file" part of the OP's problem.

Consider this possibility:

Code:

$ cat fields
ARCHIVE
CRITIC
$ cat comp_test
ARCHIVE 1
store
begin
*********************************
* Retrieve interface by default into INTERFACE001
ARCHIVE 1
retrieve
begin
*********************************
CRITIC 1 2
$ gawk -f comp.awk -v fields=fields comp_test
2

<edit>

Sorry. There's an error in this code. See my post below for commented corrected code.
</edit>
Using this code:

Code:

$ cat comp.awk
#!/bin/gawk -f
BEGIN {
  if (!fields) {
    print "Usage: gawk -v fields=list-of-words -f comp.awk file-to-search";
    exit 1;
  }
  # Join the words from the "fields" file into one alternation regex
  while ((getline < fields) > 0) {
    words = (words) ? words "|(" $0 ")" : "(" $0 ")";
  }
}

{
  if ($0 ~ words)  matched = $0;
}

END {
  if (matched) {
    # <edit> This is not correct: $NF here is the last field of the last line read, not of the matched line. </edit>
    printf $ NF "\n";
  }
  else {
    printf "No line in any input file matched any word in the field list.\n";
  }
}

Tinkster · 12-08-2008, 04:31 PM

Quote:

Originally Posted by PTrenholme

You're all ignoring the "list of words in another file" part of the OP's problem.

Not really ... he only asked for the extraction part.

Quote:

What I want to do is find CRITIC, since that's the last occurrence of one of the words in my lookup list. I then need to extract the number 2 and substitute it for something like x in another text file. I guess I can do the last part using sed, but should I use awk or grep for the first bit?

And he didn't mention any specifics whatsoever about what
the criteria for that replacement might be, either.

Cheers,
Tink

PTrenholme · 12-08-2008, 09:36 PM

Um, Tink, look in the last quote you posted: "... since that's one of the words in my lookup list." I think that's a fairly clear indication that the OP had a "list" of words, not just a single word, in mind. The "CRITIC" part was just an example of a match from the list. (That's why I used a two-word list in my example code.)

<edit>
And so I looked at my code and realized I was reporting $ NF, which is the last field in the last line of the file, not the matching line.

Here's a corrected version of the code with some added comments:

Code:

#!/bin/gawk -f
BEGIN {
  if (!fields) {
    print "Usage: gawk -v fields=list-of-words -f comp.awk file-to-search";
    skip = 1;
    exit 1;
  }
  # Build a regular expression that will match any word in the "fields" file
  # Note that the "words" in the "fields" file may, themselves, be regular expressions.
  while ((getline < fields) > 0) {  # "> 0" stops at end of file and on a read error
    words = (words) ? words "|(" $0 ")" : "(" $0 ")";
  }
}

# Read the input file and check each line for a match in the word list.
# No "skip" test is needed here: "exit" in BEGIN bypasses the input and goes straight to END.
{
  if (match($0, words, val)) { # Use the "match" function to extract the matched string
    matched = $0;              # Save the line containing the match, overwriting any prior value
    matched_str = val[0];      # Save the matching token
    matched_val = $NF;         # And the last field in the line. Other "values" could be selected by, e.g., $1, $2, etc.
  }
}

# All done. Report the matched information, if any.
END {
  if (matched) {
    print "\"" matched "\" contained \"" matched_str "\" and was the last line containing any word in the list. The last field in that string is:"
    # Placing the field value on the last output line for later use.
    print matched_val;
  }
  else if (!skip) {
    print "No line in any input file matched any word in the field list.";
  }
}
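For anyone without gawk, the same idea can be sketched with plain POSIX awk features. This is only an illustrative variant, not the script above: it reads the word list as the first file argument (the `NR == FNR` idiom) instead of via `-v fields=...`, it prints only the last field, and it assumes the word list is non-empty. The temporary file names are made up for the demonstration.

```shell
# Hypothetical sample data, mirroring the fields/comp_test example.
fields=$(mktemp)
input=$(mktemp)
printf '%s\n' ARCHIVE CRITIC > "$fields"
cat > "$input" <<'EOF'
ARCHIVE 1
store
begin
* Retrieve interface by default into INTERFACE001
ARCHIVE 1
retrieve
CRITIC 1 2
EOF

result=$(awk '
  # First file (the word list): join the non-empty lines into one alternation.
  NR == FNR { if ($0 != "") re = (re == "" ? "" : re "|") "(" $0 ")"; next }
  # Second file: remember the last field of every matching line.
  $0 ~ re   { last = $NF }
  END       { if (last != "") print last }
' "$fields" "$input")
echo "$result"

rm -f "$fields" "$input"
```

Because only the value is printed, nothing further is needed to pick it out of the output.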

wtaicken · 12-09-2008, 05:16 AM

Ok, thanks, that works a treat! I did mean a word from a lookup list... sorry if it was a bit vague to earlier posters.

Can I embed this within another parent script, and if so, what would the syntax be? The parent script cd's to a specific directory (supplied on the command line) and spools through all the files, performing various actions. The above is the first action, and its output will be used to substitute for characters in another block of text, which will ultimately be appended to the original file. Hope that's clear!

PTrenholme · 12-09-2008, 09:12 AM

As to the embedding, if you're using a bash shell, you are, in effect, already embedded. . .

Anyhow, the syntax is the same as it would be on a command line. Something like this:

Code:

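#!/bin/bash
word_list="$1"
file_name="$2"
# With pipefail the pipeline reports a gawk failure instead of tail's exit status
set -o pipefail
token=$(gawk -f comp.awk -v fields="$word_list" "$file_name" | tail -n 1)
[ $? -ne 0 ] && echo "error" >&2 && exit 1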

Note that the print . . . stuff in the final section of the sample code I provided can be simplified to just produce the output you want so you don't need the pipe into the tail command.
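As a rough sketch of the whole workflow described earlier (loop over the files in a directory, extract the token, substitute it for x in a block of text, append the result to the original file), something like the following might serve as a starting point. Everything specific here is an assumption for illustration: the function names `last_token` and `process_dir`, the inline awk program standing in for comp.awk, the template text, and the idea that the token is a plain number that is safe to paste into a sed expression.

```shell
#!/bin/bash
# Print the last field of the last line that matches any word in the list.
# Usage: last_token word-list-file file-to-search
last_token() {
  awk 'NR == FNR { if ($0 != "") re = (re == "" ? "" : re "|") "(" $0 ")"; next }
       $0 ~ re   { last = $NF }
       END       { if (last == "") exit 1; print last }' "$1" "$2"
}

# Append the template to every regular file in a directory, with the
# placeholder x replaced by the token found in that file.
# Usage: process_dir directory word-list-file template-file
process_dir() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    token=$(last_token "$2" "$f") || { echo "no match in $f" >&2; continue; }
    sed "s/x/$token/g" "$3" >> "$f"
  done
}

# Small demonstration with made-up data.
work=$(mktemp -d)
mkdir "$work/data"
printf '%s\n' ARCHIVE CRITIC > "$work/words"
printf '%s\n' 'LEVEL x' > "$work/template"
printf '%s\n' 'ARCHIVE 1' 'begin' 'CRITIC 1 2' > "$work/data/a.txt"

process_dir "$work/data" "$work/words" "$work/template"
appended=$(tail -n 1 "$work/data/a.txt")
echo "$appended"

rm -rf "$work"
```

Note that `s/x/.../g` replaces every x in the template, so a more distinctive placeholder is probably safer in a real template.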

wtaicken · 12-09-2008, 12:29 PM

Ok, that works. Ta v much

wtaicken · 12-15-2008, 03:49 AM

I need to ensure this awk script only carries out the matching process against the first word on the line. Currently it looks for the last occurrence of a word anywhere on the line, which is messing up my results.

The current syntax is

Code:

if (match($0, words, val)) { # Use the "match" function to extract the matched string

How can I mod this to look at the first word in the line? Will swapping $0 for $1 work?

Any help gratefully received

PTrenholme · 12-15-2008, 11:08 AM

Yes, substituting $1 for $0 in the call to the match function will match the regular expression in words to the first input field rather than the whole input line.
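One thing to watch: the regular expression is not anchored, so matching against $1 still succeeds when a list word appears anywhere inside the first field (CRITICAL would match CRITIC). If the first field has to be exactly one of the words, wrapping the expression in `^( ... )$` is one way to do it. The sketch below uses a hard-coded alternation and made-up sample lines purely to show the difference.

```shell
# Hypothetical sample lines for illustration.
input=$(mktemp)
cat > "$input" <<'EOF'
CRITIC 1 2
CRITICAL 9 9
note CRITIC 5
EOF
words='(ARCHIVE)|(CRITIC)'

# Unanchored: matches when a word appears anywhere inside the first field.
loose=$(awk -v re="$words" '$1 ~ re { last = $NF } END { print last }' "$input")

# Anchored: the first field must be exactly one of the words.
strict=$(awk -v re="^(${words})\$" '$1 ~ re { last = $NF } END { print last }' "$input")

echo "loose=$loose strict=$strict"
rm -f "$input"
```

In the gawk script the equivalent change would be to test `$1` against `"^(" words ")$"` rather than against `words` alone.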

wtaicken · 12-23-2008, 04:44 AM

I now note that the script will only pick up matches in the same case. If I wanted to look for matches in either upper or lower case, and the list to lookup against is in uppercase, do I have to add words in lowercase?

W
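One possible approach to the case question, offered as a sketch rather than a tested answer for this particular script: leave the list in uppercase and fold the text being tested to uppercase with awk's toupper() before matching, so no lowercase duplicates are needed. gawk also has an IGNORECASE variable that makes regex matching case-insensitive, but that is gawk-specific. The sample lines and the hard-coded alternation below are assumptions for illustration.

```shell
# Hypothetical mixed-case sample lines.
input=$(mktemp)
cat > "$input" <<'EOF'
archive 1
Critic 1 2
other 3
EOF

# The word list stays in uppercase; each first field is upper-cased
# before the comparison, so "Critic" and "CRITIC" both match.
value=$(awk -v re='^((ARCHIVE)|(CRITIC))$' \
  'toupper($1) ~ re { last = $NF } END { print last }' "$input")
echo "$value"

rm -f "$input"
```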