[SOLVED] specifying fields for printing in gawk from command line

David the H. · 08-02-2009, 11:34 AM

If I want to print out only specified lines (fields) from a file using gawk, I've found I can use a bash loop that looks something like this:

Code:

#!/bin/bash

for x; do

        gawk 'BEGIN{RS="\0"; FS="\n"}
        {print '$x' ": " $'$x'}
        ' <./inputfile.txt

done

===

$ ./script.sh 1 3 2

1: value of line1
3: value of line3
2: value of line2

But this is not particularly efficient, especially if the input file is very large, as gawk has to read in the entire file for each iteration of the loop. Also, I've read that using RS="\0" is not recommended as a way to tell it to treat the whole file as a single record.

I think it would be better to do this entirely from within gawk so it can print out all the wanted fields at one time, but I'm not sure how to do it. I've been studying awk/gawk tutorials for hours but I can't figure it out. Should I try to use a for loop, an array, or what? Can any awk experts help me out?

(Note, printing single lines is just for the example; the actual text I want to extract will be more complex, which is why I want to use awk instead of sed or other options.)

catkin · 08-02-2009, 11:47 AM

Hello David the H.

A loop with the next command in it to iterate over the lines ...

Best

Charles

David the H. · 08-02-2009, 11:54 AM

Actually, to explain what I want in more detail, I have a text file that contains several hundred sections/records, and I want to be able to print out the records that I specify.

Each record consists of about a dozen lines, but not in a completely uniform pattern, which is why I need something like awk to parse them out. The only thing that's consistent is the starting line. The general pattern looks like this:

Code:

#1#  This is record 1.

 some data
 some more data
 some more data

#2#  This is record 2.

 some data
 etc.

Edit: Sorry catkin, I don't quite follow your suggestion. I really need some specifics, because I'm completely confused here. How do I get the input from the command line into the loop? I suppose I could use gawk -v list="$@" or something, but then how do I loop through them once I have them?

catkin · 08-02-2009, 01:15 PM

Quote:

Originally Posted by David the H.

Edit: Sorry catkin, I don't quite follow your suggestion. I really need some specifics, because I'm completely confused here. How do I get the input from the command line into the loop? I suppose I could use gawk -v list="$@" or something, but then how do I loop through them once I have them?

I understand that the values you want to pass to gawk are in the arguments to the shell script that calls gawk.

That being the case, "$@" is good but will not work just like that, because bash expands "$@" to "$1" "$2" ... "$n", one separate word per argument (n is simply the number of arguments; it is not capped at 10). This feature is useful when there is whitespace in the arguments. Bash would thus expand the gawk command to

Code:

gawk -v list="$1" "$2" ... "$n" <stuff>

and "$2" etc would not end up in gawk variable "list". Assuming there is no whitespace in the arguments to the bash script then you could use gawk -v list="$*" which bash would expand to a single word of space separated values and this would end up in gawk variable "list".
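To make the list="$*" idea concrete, and to show one way of looping over the values once they are inside awk, here is a minimal sketch. The "set --" line only simulates calling the script as "./script.sh 1 3 2", and the names "want", "n" and "result" are made up for this illustration. It uses plain awk; gawk should behave the same way here.

```shell
#!/bin/bash
# Hypothetical arguments, standing in for "./script.sh 1 3 2".
set -- 1 3 2

# "$*" joins all positional parameters into one word, so awk receives the
# whole list through a single -v assignment.
result=$(awk -v list="$*" 'BEGIN {
    n = split(list, want, " ")      # want[1..n] holds the arguments in order
    for (i = 1; i <= n; i++)
        printf "arg %d is %s\n", i, want[i]
}')
printf '%s\n' "$result"
```

split() returns the number of pieces it found, so an ordinary for loop from 1 to n visits the arguments in the order they were given on the command line.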

Will the arguments to the bash script be the numbers that appear between the "#" characters in "#1# This is record 1." and will they be in the same order they appear in your text file?

If so, you could parse the first word out of "list" and set "list" to the remainder, start the outer loop and keep doing "next" statements until you match <n> in "#<n># This is record <n>.", when you could parse the next word out of "list" ready for the next match, start an inner loop doing "next" statements and printing each line until you find another "#<*># This is record <*>." when you break out of the inner loop and iterate the outer loop.

David the H. · 08-02-2009, 01:41 PM

Quote:

Originally Posted by catkin

Assuming there is no whitespace in the arguments to the bash script then you could use gawk -v list="$*" which bash would expand to a single word of space separated values and this would end up in gawk variable "list".

Yeah, after a bit of experimenting I've kinda gathered that. But the whole thing is still confusing me greatly.

Quote:

Will the arguments to the bash script be the numbers that appear between the "#" characters in "#1# This is record 1." and will they be in the same order they appear in your text file?

Yes, ideally this would be the case. It would match the number in the first line of the record, then print it and every line after until the start of the next record.

Or since the records are in numerical order, it could just as well print "record number n" from the file, if that would be easier.

I should be able to pass the arguments to the script in any order however, and the records should ideally be output in that same order.

Actually, I've already found a way to do it with sed, but I have to pipe it through the command twice for each record I want. I'm sure awk would do a better job of it, once I figure out how.

Quote:

If so, you could parse the first word out of "list" and set "list" to the remainder, start the outer loop and keep doing "next" statements until you match <n> in "#<n># This is record <n>.", when you could parse the next word out of "list" ready for the next match, start an inner loop doing "next" statements and printing each line until you find another "#<*># This is record <*>." when you break out of the inner loop and iterate the outer loop.

Would you mind posting some code for this? I think I get the concept (well, maybe), but I don't comprehend at all how to go about implementing it. I've been trying to bend my head around variables and arrays and loops in awk for half a day now, and I still can't really grasp how any of it is supposed to work. Nothing I've tried so far has come anywhere close to giving me a usable output, or even anything other than an error most of the time.
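One possible reading of the approach catkin describes (not code posted by anyone in this thread) is sketched below. It assumes the record numbers are passed in the same order as they appear in the file. The sample data, the "set --" line and the names "want", "k" and "printing" are all made up for the illustration. awk's own line-by-line loop plays the part of the "outer loop", so no explicit "next" statements are needed.

```shell
#!/bin/bash
# Hypothetical sample data and arguments, for illustration only.
infile=$(mktemp)
cat > "$infile" <<'EOF'
#1#  This is record 1.

 some data

#2#  This is record 2.

 etc.

#3#  This is record 3.

 more data
EOF
set -- 1 3

result=$(awk -v list="$*" '
BEGIN { n = split(list, want, " "); k = 1 }   # k indexes the next wanted number
/^#[0-9]+#/ {
    split($0, part, "#")                      # part[2] is the number between the #s
    printing = (k <= n && part[2] == want[k])
    if (printing) k++                         # move on to the next wanted number
}
printing                                      # print lines while inside a wanted record
' "$infile")
rm -f "$infile"
printf '%s\n' "$result"
```

Because the list is consumed front to back, a number given out of file order is never matched, and neither is any number listed after it. That limitation is the reason the ordering question matters.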

catkin · 08-02-2009, 03:50 PM

Quote:

Originally Posted by David the H.

I should be able to pass the arguments to the script in any order however, and the records should ideally be output in that same order.

That shifts the goal posts! awk essentially runs through the file line by line, looking for patterns and, when it matches one, does actions. If you move away from that sequential approach then you have to bend awk. Fortunately it's flexible and powerful so what you have now asked for is possible but requires a different approach -- reading the whole file into awk variable(s) -- to allow printing lines in a sequence different from the one in the input file.
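A minimal sketch of that "read everything first" idea follows. It is an illustration under assumptions, not a tested solution for the real file: the sample data, the "set --" line and the names "rec", "cur" and "want" are invented here, and it assumes every record starts with a "#<n>#" header line.

```shell
#!/bin/bash
# Hypothetical sample data and arguments, for illustration only.
infile=$(mktemp)
cat > "$infile" <<'EOF'
#1#  This is record 1.

 some data

#2#  This is record 2.

 etc.

#3#  This is record 3.

 more data
EOF
set -- 3 1

result=$(awk -v list="$*" '
/^#[0-9]+#/ {                 # header line: the record number sits between the #s
    split($0, part, "#")
    cur = part[2]
}
cur != "" {                   # append every line to the text of the current record
    rec[cur] = rec[cur] $0 "\n"
}
END {                         # print the wanted records in command-line order
    n = split(list, want, " ")
    for (i = 1; i <= n; i++)
        if (want[i] in rec)
            printf "%s", rec[want[i]]
}' "$infile")
rm -f "$infile"
printf '%s\n' "$result"
```

The file is still read only once, at the cost of holding every record in memory until the END block runs. With "3 1" as arguments, record 3 is printed before record 1.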

ghostdog74 · 08-02-2009, 08:27 PM

here's an approach not using RS.

Code:

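#!/bin/bash
# Usage: ./test.sh N [N ...]   (prints the Nth record(s) of "file", in file order)
args="$*"
awk  -v args="$args" 'BEGIN{
    # split up the space-separated args and store them in array a (m = how many)
    m=split(args,a," ")
}
/^#/{
    ++c   # a line starting with # begins the next record; c is its number
    f=1   # from here on we are inside a record
}
f{
    # g becomes 1 if the current record number is one of the requested ones
    g=0
    for(i=1;i<=m;i++){
        if(a[i] == c){
            g=1
        }
    }
    if (g){  print }
}' file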

output

Code:

# more file
#1#  This is record 1.

 some data
 some more data
 some more data

#2#  This is record 2.

 some data
 etc.

#3#  This is record 3.

 some data
 some more data
 some more data

#4#  This is record 4.

 some data
 etc.
 last ...4

# ./test.sh 1 3
#1#  This is record 1.

 some data
 some more data
 some more data

#3#  This is record 3.

 some data
 some more data
 some more data

# ./test.sh 1 2 4
#1#  This is record 1.

 some data
 some more data
 some more data

#2#  This is record 2.

 some data
 etc.

#4#  This is record 4.

 some data
 etc.
 last ...4

catkin · 08-03-2009, 11:52 AM

Hello David

Quote:

Originally Posted by David the H.

Would you mind posting some code for this? I think I get the concept (well, maybe), but I don't comprehend at all how to go about implementing it. I've been trying to bend my head around variables and arrays and loops in awk for half a day now, and I still can't really grasp how any of it is supposed to work. Nothing I've tried so far has come anywhere close to giving me a usable output, or even anything other than an error most of the time.

Could do, although it would not be easy because I haven't used awk in a non-trivial way for a while. I did want to get clear on your requirements first, though, especially as your last requirements would mean a very different overall algorithm from the first.

Best

Charles

David the H. · 08-04-2009, 03:32 PM

Sorry to be late replying. I had a tiring couple of days.

Ghostdog74, Thank you so much. It works perfectly. Now I just need to go through it to understand exactly what it's doing.

Of course I wasn't married to using RS or anything. I just didn't know of any other way to go about it.

And Catkin, no, I don't absolutely NEED the output to be in the same order as the input, but it seems to me that a script should generally process things in the order that they're given. And having the output in a different order from the input can be a bit confusing sometimes. In any case, the code above does just what I want.