[SOLVED] specifying fields for printing in gawk from command line
If I want to print out only specified lines (fields) from a file using gawk, I've found I can use a bash loop that looks something like this:
Code:
#!/bin/bash
for x; do
gawk 'BEGIN{RS="\0"; FS="\n"}
{print '$x' ": " $'$x'}
' <./inputfile.txt
done
===
$ ./script.sh 1 3 2
1: value of line1
3: value of line3
2: value of line2
But this is not particularly efficient, especially if the input file is very large, as gawk has to read in the entire file for each iteration of the loop. Also, I've read that using RS="\0" is not recommended as a way to tell it to treat the whole file as a single record.
I think it would be better to do this entirely from within gawk so it can print out all the wanted fields at one time, but I'm not sure how to do it. I've been studying awk/gawk tutorials for hours but I can't figure it out. Should I try to use a for loop, an array, or what? Can any awk experts help me out?
(Note, printing single lines is just for the example; the actual text I want to extract will be more complex, which is why I want to use awk instead of sed or other options.)
Actually, to explain what I want in more detail, I have a text file that contains several hundred sections/records, and I want to be able to print out the records that I specify.
Each record consists of about a dozen lines, but not in a completely uniform pattern, which is why I need something like awk to parse them out. The only thing that's consistent is the starting line. The general pattern looks like this:
Code:
#1# This is record 1.
some data
some more data
some more data
#2# This is record 2.
some data
etc.
Edit: Sorry catkin, I don't quite follow your suggestion. I really need some specifics, because I'm completely confused here. How do I get the input from the command line into the loop? I suppose I could use gawk -v list="$@" or something, but then how do I loop through them once I have them?
Last edited by David the H.; 08-02-2009 at 11:58 AM.
Quote:
Edit: Sorry catkin, I don't quite follow your suggestion. I really need some specifics, because I'm completely confused here. How do I get the input from the command line into the loop? I suppose I could use gawk -v list="$@" or something, but then how do I loop through them once I have them?
I understand that the values you want to pass to gawk are in the arguments to the shell script that calls gawk.
That being the case, "$@" is the right idea but will not work just like that, because bash expands "$@" to "$1" "$2" ... "$n", one separate word per argument (n is simply the number of arguments; there is no limit of 10, though parameters past $9 must be written ${10}, ${11} and so on). This feature is useful when there is whitespace in the arguments. Bash would thus expand the gawk command to
Code:
gawk -v list="$1" "$2" ... "$n" <stuff>
and "$2" etc would not end up in gawk variable "list". Assuming there is no whitespace in the arguments to the bash script then you could use gawk -v list="$*" which bash would expand to a single word of space separated values and this would end up in gawk variable "list".
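The difference between the two expansions can be checked with a small sketch. The helper names (count_words, show_at, show_star, list_in_awk) are made up for this illustration and are not part of anyone's script in this thread:

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# count_words prints how many separate words it received.
count_words() { echo "$#"; }

# With "$@" every argument stays a separate word, so list="$@"
# becomes list=$1 followed by $2 ... $n as extra words.
show_at() { count_words list="$@"; }

# With "$*" all arguments are joined with spaces into ONE word.
show_star() { count_words list="$*"; }

# Pass the joined list to awk and split it back into an array.
list_in_awk() {
    awk -v list="$*" 'BEGIN {
        n = split(list, a, " ")
        for (i = 1; i <= n; i++)
            printf "%s%s", a[i], (i < n ? "," : "\n")
    }'
}

show_at 1 3 2       # prints 3
show_star 1 3 2     # prints 1
list_in_awk 1 3 2   # prints 1,3,2
```

Once the list is inside awk as a single string, split() turns it back into an array that can be looped over, which is probably the missing piece in the question above.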
Will the arguments to the bash script be the numbers that appear between the "#" characters in "#1# This is record 1." and will they be in the same order they appear in your text file?
If so, you could parse the first word out of "list" and set "list" to the remainder, start the outer loop and keep doing "next" statements until you match <n> in "#<n># This is record <n>.", when you could parse the next word out of "list" ready for the next match, start an inner loop doing "next" statements and printing each line until you find another "#<*># This is record <*>." when you break out of the inner loop and iterate the outer loop.
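A rough sketch of that idea follows. It assumes the requested numbers are given in the same order as the records appear in the file, and it uses a simple flag in place of the nested loops described above, since awk already loops over the input lines by itself. The function name print_in_file_order is hypothetical:

```shell
#!/bin/bash
# Sketch: print requested records, assuming the requested numbers are
# given in the same order as the records appear in the input file.
# Usage (hypothetical): print_in_file_order FILE N1 N2 ...
print_in_file_order() {
    local file=$1
    shift
    awk -v list="$*" '
    BEGIN { n = split(list, want, " "); k = 1 }  # want[k] is the next number to find
    /^#[0-9]+#/ {                                # header line of a record
        split($0, part, "#")                     # part[2] is the number between the # characters
        if (k <= n && part[2] == want[k]) {
            printing = 1                         # this record was asked for
            k++                                  # move on to the next wanted number
        } else {
            printing = 0
        }
    }
    printing                                     # print lines while inside a wanted record
    ' "$file"
}
```

Called as print_in_file_order inputfile.txt 1 3, this should print records 1 and 3 and skip everything else. Numbers given out of file order are silently missed, which is the limitation of the sequential approach.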
Quote:
Assuming there is no whitespace in the arguments to the bash script then you could use gawk -v list="$*" which bash would expand to a single word of space separated values and this would end up in gawk variable "list".
Yeah, after a bit of experimenting I've kinda gathered that. But the whole thing is still confusing me greatly.
Quote:
Will the arguments to the bash script be the numbers that appear between the "#" characters in "#1# This is record 1." and will they be in the same order they appear in your text file?
Yes, ideally this would be the case. It would match the number in the first line of the record, then print it and every line after until the start of the next record.
Or since the records are in numerical order, it could just as well print "record number n" from the file, if that would be easier.
I should be able to pass the arguments to the script in any order however, and the records should ideally be output in that same order.
Actually, I've already found a way to do it with sed, but I have to pipe it through the command twice for each record I want. I'm sure awk would do a better job of it, once I figure out how.
Quote:
If so, you could parse the first word out of "list" and set "list" to the remainder, start the outer loop and keep doing "next" statements until you match <n> in "#<n># This is record <n>.", when you could parse the next word out of "list" ready for the next match, start an inner loop doing "next" statements and printing each line until you find another "#<*># This is record <*>." when you break out of the inner loop and iterate the outer loop.
Would you mind posting some code for this? I think I get the concept (well, maybe), but I don't comprehend at all how to go about implementing it. I've been trying to bend my head around variables and arrays and loops in awk for half a day now, and I still can't really grasp how any of it is supposed to work. Nothing I've tried so far has come anywhere close to giving me a usable output, or even anything other than an error most of the time.
Last edited by David the H.; 08-02-2009 at 01:47 PM.
Quote:
I should be able to pass the arguments to the script in any order however, and the records should ideally be output in that same order.
That shifts the goal posts! awk essentially runs through the file line by line, looking for patterns and, when it matches one, does actions. If you move away from that sequential approach then you have to bend awk. Fortunately it's flexible and powerful so what you have now asked for is possible but requires a different approach -- reading the whole file into awk variable(s) -- to allow printing lines in a sequence different from the one in the input file.
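As a minimal sketch of that "read everything first" approach (the function name print_in_arg_order is hypothetical, and the record number is assumed to be the digits between the # characters on the header line), each record can be stored in an awk array and printed in the END block in whatever order the arguments were given:

```shell
#!/bin/bash
# Sketch: print records in the order the numbers are given on the
# command line, by first storing every record in an awk array.
# Usage (hypothetical): print_in_arg_order FILE N1 N2 ...
print_in_arg_order() {
    local file=$1
    shift
    awk -v list="$*" '
    /^#[0-9]+#/ {                      # header line: start a new record
        split($0, part, "#")
        cur = part[2] + 0              # number between the # characters
        have = 1                       # we are inside some record now
    }
    have {                             # append the line to the current record
        rec[cur] = rec[cur] $0 "\n"
    }
    END {
        n = split(list, want, " ")
        for (i = 1; i <= n; i++)       # walk the arguments in the order given
            printf "%s", rec[want[i] + 0]
    }
    ' "$file"
}
```

The trade-off is memory: the whole file is held in the rec array before anything is printed, which should be fine for a few hundred short records but may not suit very large inputs.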
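Code:
#!/bin/bash
# Print the records whose positions in the file are given as arguments.
args="$*"
awk -v args="$args" 'BEGIN{
    # split up the args and store in array a; m is the number of args
    m=split(args,a," ")
}
f && /^#/{f=0}          # a new header line ends the previous record
/^#/{
    ++c                 # set counter whenever the line starts with #
    f=1                 # now inside record number c
}
f{
    g=0
    for(i=1;i<=m;i++){
        if(a[i] == c){  # record c is one of the requested ones
            g=1
        }
    }
    if (g){ print }
}' file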
output
Code:
# more file
#1# This is record 1.
some data
some more data
some more data
#2# This is record 2.
some data
etc.
#3# This is record 3.
some data
some more data
some more data
#4# This is record 4.
some data
etc.
last ...4
# ./test.sh 1 3
#1# This is record 1.
some data
some more data
some more data
#3# This is record 3.
some data
some more data
some more data
# ./test.sh 1 2 4
#1# This is record 1.
some data
some more data
some more data
#2# This is record 2.
some data
etc.
#4# This is record 4.
some data
etc.
last ...4
Quote:
Would you mind posting some code for this? I think I get the concept (well, maybe), but I don't comprehend at all how to go about implementing it. I've been trying to bend my head around variables and arrays and loops in awk for half a day now, and I still can't really grasp how any of it is supposed to work. Nothing I've tried so far has come anywhere close to giving me a usable output, or even anything other than an error most of the time.
Could do, although it would not be easy because I haven't used awk in a non-trivial way for a while. I did want to get clear on your requirements first, though, especially as your last requirements would mean a very different overall algorithm from the first.
Sorry to be late replying. I had a tiring couple of days.
Ghostdog74, thank you so much. It works perfectly. Now I just need to go through it to understand exactly what it's doing. Of course I wasn't married to using RS or anything. I just didn't know of any other way to go about it.
And Catkin, no, I don't absolutely NEED the output to be in the same order as the input, but it seems to me that a script should generally process things in the order that they're given. And having the output in a different order from the input can be a bit confusing sometimes. In any case, the code above does just what I want.