LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   feeding grep output to awk (https://www.linuxquestions.org/questions/programming-9/feeding-grep-output-to-awk-4175440173/)

JohnyDRipper 12-05-2012 02:01 PM

feeding grep output to awk
 
Hi everybody,


I'm trying to filter a number of titles from an .html file using grep and awk. I want to feed the output of grep to awk and then store the results in an array.

Unfortunately, grep doesn't seem to understand the | symbol. The loop itself runs fine, but it looks like awk never gets executed at all.

This is the output that I get from grep:
grep: |: No such file or directory
grep: 'NR: No such file or directory
grep: ==: No such file or directory
grep: 10: No such file or directory
grep: {print}': No such file or directory

And this is my code (which, admittedly, probably looks like noobcode to you guys - but hey, at least I'm trying to learn :) )

Code:

declare -a TITLES
ArrayCounter=0
LineCounter=1
while [ $LineCounter -le ${#CNNUMMERS} ]; do
    commando="grep ^$Dag..*$Jaar..* $FILE_TEMP | awk 'NR == $LineCounter {print}'"
    echo The loop runs for the $LineCounter th time.
    echo The command is $commando
  TITLES[$ArrayCounter]=`$commando`
    echo We have added $LineCounter titles to the array, the latest one is: ${TITLES[$ArrayCounter]}
    ((ArrayCounter=$ArrayCounter+1))
    ((LineCounter=$LineCounter+1))
done


millgates 12-05-2012 04:00 PM

Code:

commando="grep ^$Dag..*$Jaar..* $FILE_TEMP | awk 'NR == $LineCounter {print}'"
TITLES[$ArrayCounter]=`$commando`

1) You really shouldn't store complex commands like that in a variable. This makes the shell think "|" is an argument for grep, rather than a pipe.
2) You should quote the pattern and your variables

I am not sure what you are trying to do, exactly, but something like this might do about the same thing:

Code:

#!/bin/bash
IFS='
'
TITLES=( $(grep "^$Dag..*$Jaar..*" "$FILE_TEMP") )


JohnyDRipper 12-05-2012 05:42 PM

Quote:

Originally Posted by millgates (Post 4843443)
1) You really shouldn't store complex commands like that in a variable. This makes the shell think "|" is an argument for grep, rather than a pipe.
2) You should quote the pattern and your variables

I am not sure what you are trying to do, exactly, but something like this might do about the same thing:

Code:

#!/bin/bash
IFS='
'
TITLES=( $(grep "^$Dag..*$Jaar..*" "$FILE_TEMP") )


Hey, thank you, this works for me! :)
I'll keep the tips in mind.

David the H. 12-05-2012 10:27 PM

See here for more detail on what happens when you try to put a command in a variable:

I'm trying to put a command in a variable, but the complex cases always fail!
http://mywiki.wooledge.org/BashFAQ/050

In general, if you have a complex command, you should set it up as a function instead.
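For example, the grep-plus-awk pipeline from the first post could live in a function instead; the "|" then stays real shell syntax. The function name and its arguments here are just illustrative, not from the original script:

```shell
# Store the pipeline as a function instead of a string variable.
# filter_titles, pattern, linenum and file are made-up names.
filter_titles() {
    pattern=$1 linenum=$2 file=$3
    grep -E "$pattern" "$file" | awk -v ln="$linenum" 'NR == ln'
}

# Usage: print the 2nd line that matches the pattern.
# title=$( filter_titles '^foo' 2 inputfile )
```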

Now for grep. By default it uses basic regular expressions, which means that most of the more advanced syntax is disabled. You should use grep -E to get extended regex. See the man page for details.

To search for all cases of "foo" or "bar" in a file:
Code:

grep -E 'foo|bar' inputfile
Next, you shouldn't need to use grep anyway. awk is a full-fledged text parsing language, and can do all the pattern matching internally.

Code:

awk '/foo|bar/ { print }'
Also be aware that awk variables are not bash variables. You need to import the shell value into awk if you want to use it.

If the line contains "foo" or "bar", and the line number matches the shell value, print it:
Code:

awk -v ln="$linenumber" '(/foo|bar/ && NR == ln) { print }'
Here are a few useful awk references:
http://www.grymoire.com/Unix/Awk.html
http://www.gnu.org/software/gawk/man...ode/index.html
http://www.pement.org/awk/awk1line.txt
http://www.catonmat.net/blog/awk-one...ined-part-one/

Next, to safely load lines from a command or file into an array there's usually no need to use a counter.

Code:

while IFS='' read -r line || [[ -n $line ]]; do
        array+=( "$line" )
done < <( command )

Setting IFS to null first keeps read from stripping leading and trailing whitespace, the -r keeps backslashes in the text intact, and the "||" (or) test is there to process the last line if the text doesn't end with a newline character (this isn't actually a problem with command or process substitution, but it should be done when the input is a text file).

Note also that if the individual entries could themselves contain newline characters, then you'll probably have to use null separators instead. See the links below. In awk you'd use the printf command to insert them into the output.
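Here's a minimal sketch of the null-separator approach, with invented sample data (requires bash for read -d ''):

```shell
# Emit NUL-terminated records from awk ("%c" with the numeric value 0
# prints a NUL byte in gawk), then read them back NUL-delimited.
# The two entries are made up; the first contains an embedded newline.
array=()
while IFS= read -r -d '' entry; do
    array+=( "$entry" )
done < <( awk 'BEGIN { printf "first\nentry%c", 0; printf "second entry%c", 0 }' )
```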

From bash v.4+, you can also just use the mapfile built-in, as long as it's just simple newline-delimited text.

Code:

mapfile array < <( command )
How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
http://mywiki.wooledge.org/BashFAQ/001

How can I use array variables?
http://mywiki.wooledge.org/BashFAQ/005


As a side note, $(..) is highly recommended over `..`
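One concrete reason: $(..) nests cleanly, while backticks need escaping inside backticks.

```shell
# With $() the inner substitution needs no escaping:
parent=$( basename "$( dirname /usr/share/doc )" )
# The backtick form of the same command needs escaped inner backticks:
parent2=`basename \`dirname /usr/share/doc\``
```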


Finally, if you're trying to extract values from html, then line- and regex-based tools like awk are not perfectly reliable. You should really use a tool with a true parser, like xmlstarlet. If you'd supply an example of the input, perhaps I could help you work out a solution.

JohnyDRipper 12-25-2012 09:40 AM

Quote:

Originally Posted by David the H. (Post 4843574)
Finally, if you're trying to extract values from html, then line- and regex-based tools like awk are not perfectly reliable. You should really use a tool with a true parser, like xmlstarlet. If you'd supply an example of the input, perhaps I could help you work out a solution.

Hi David the H., thanks for all of the info there.

Actually, my intention is to convert texts like these (Belgian law, government website, no copyright) into something more readable, and to save the result in a mainstream file type like pdf, so that printing is straightforward. I was thinking of parsing the .html files with awk and converting them into a TeX document. TeX is a proven text markup language and can easily be converted to pdf.

Would you recommend xmlstarlet over awk for this? I saw that the development of xmlstarlet has stalled a little, with some bugs still open.

Do you think publishing markup like footers/page numbers/watermarks etc. is possible in html/css? I'm quite hesitant to start learning an ancient markup language like TeX - I just don't know a better alternative. I used to script a little, but that was almost 10 years ago, I'm afraid…


Anyway, thanks for the advice!
greetz,
JohnyD

David the H. 12-26-2012 12:00 PM

xmlstarlet is a relatively mature application and I'm sure it's more than capable of handling most extraction jobs as it stands. Probably most of the outstanding bugs only affect complex jobs.

I just got done posting a long example of its use in a similar thread, although the requirements there are possibly a bit different from what you're doing.

But it doesn't have to be xmlstarlet. That's just the program I'm most familiar with. What matters is that in the long run something with true xml/html parsing ability will generally be safer, and probably easier, than regex-based tools.

As I mentioned in the other thread, the xml-html-utils are also quite handy, and often don't require much advanced knowledge to use.

What exactly would you want to extract from the above page? I'm sure we could work out some way to extract it safely, but I'd need to know what that something is first. It's a rather long page, after all, and I don't understand Dutch.


For example, I was easily able to extract all the .pdf links from that page with a single expression (loading the results into an array):

Code:

$ mapfile -t links < <( xmlstarlet fo -H -Q -R ejustice.html | xmlstarlet sel -T -t -m '//a[contains(@href,".pdf")]' -v '@href' -n )

$ printf '[%s]\n' "${links[@]}"
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=5&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2010/04/12_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=19&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2012/07/25_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=61&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2011/09/07_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=22&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2011/05/06_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=31&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2011/03/09_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=6&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2010/09/28_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=4&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2010/07/01_1.pdf]

Other jobs may be easier, or harder, depending on the exacts.

On the other hand, if you just want to convert the whole page itself to pdf, there are also tools that can do that directly. html2ps would turn it into postscript, for example, which could then be converted to pdf. openoffice/libreoffice also have scriptable batch-mode conversion abilities. Or use html2text to extract just the text content as a whole. There are probably many other options available.

JohnyDRipper 09-30-2013 10:15 AM

Hi David,


I realize it has been a full 9 months since your last reply, and it wasn't very nice of me to keep you in the dark for so long. I can honestly say, though, that the past months have been the busiest of my life: we had a baby, moved to a new apartment, and then I got married. It has been quite a ride :-)

Anyway, I'm picking this little project back up. The ultimate goal is to extract a portion of the text, apply formatting, and save it as an .odt file. I've dropped the ambition to work in TeX; I think .odt has more to offer.

What I want to do specifically is to scan the "Inhoudstafel" (=table of contents) section of the .html for headings. Anything starting with "HOOFDSTUK" (=chapter) would be a level 1 heading, "Afdeling" (=section) would be level 2 heading, etc.
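Something like this awk sketch (untested, with made-up sample lines and a placeholder "H1"/"H2" output format) is roughly what I have in mind:

```shell
# Tag table-of-contents lines by heading level: HOOFDSTUK -> level 1,
# Afdeling -> level 2. The sample input lines are invented.
headings=$( printf 'HOOFDSTUK I. Algemeen\nAfdeling 1. Begrippen\ngewone tekst\n' |
    awk '/^HOOFDSTUK/ { print "H1\t" $0; next }
         /^Afdeling/  { print "H2\t" $0 }' )
```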

The next block in the html file is the actual text that I am interested in. There, I would want to search for the headings that were extracted in the first section, and apply formatting to them.


All in all, this won't be an easy task imo :-) Do you think xmlstarlet is a good approach for this kind of work?

I already have a webpage with some php code from which I call a bash script to automatically download the right text. This script also calls tidy to tidy up the html code, and I've written some code already to create an .odt file too from within the bash script. The next step is to extract the information from the .html file and to feed it to the .odt file.

