12-05-2012, 02:01 PM   #1
JohnyDRipper (LQ Newbie)
feeding grep output to awk


Hi everybody,


I'm trying to filter a number of titles from an .html file using grep and awk. I want to feed the output of grep to awk and then store the results in an array.

Unfortunately, grep doesn't seem to understand the | symbol... The loop runs just fine though, except that it doesn't seem like awk is being executed at all.

This is the output that I get from grep:
grep: |: No such file or directory
grep: 'NR: No such file or directory
grep: ==: No such file or directory
grep: 10: No such file or directory
grep: {print}': No such file or directory

And this is my code (which, admittedly, probably looks like noob code to you guys - but hey, at least I'm trying to learn).

Code:
declare -a TITLES
ArrayCounter=0
LineCounter=1
while [ $LineCounter -le ${#CNNUMMERS} ]; do
    commando="grep ^$Dag..*$Jaar..* $FILE_TEMP | awk 'NR == $LineCounter {print}'"
    echo The loop runs for the $LineCounter th time.
    echo The command is $commando
   TITLES[$ArrayCounter]=`$commando`
    echo We have added $LineCounter titles to the array, the latest one is: ${TITLES[$ArrayCounter]}
    ((ArrayCounter=$ArrayCounter+1))
    ((LineCounter=$LineCounter+1))
done
 
12-05-2012, 04:00 PM   #2
millgates (Member)
Code:
commando="grep ^$Dag..*$Jaar..* $FILE_TEMP | awk 'NR == $LineCounter {print}'"
TITLES[$ArrayCounter]=`$commando`
1) You really shouldn't store complex commands like that in a variable. This makes the shell think "|" is an argument for grep, rather than a pipe.
2) You should quote the pattern and your variables.

I am not sure what you are trying to do, exactly, but something like this might do about the same thing:

Code:
#!/bin/bash
IFS='
'
TITLES=( $(grep "^$Dag..*$Jaar..*" "$FILE_TEMP") )
 
12-05-2012, 05:42 PM   #3
JohnyDRipper (LQ Newbie, Original Poster)
Quote:
Originally Posted by millgates
1) You really shouldn't store complex commands like that in a variable. This makes the shell think "|" is an argument for grep, rather than a pipe.
2) You should quote the pattern and your variables

I am not sure what you are trying to do, exactly, but something like this might do about the same thing:

Code:
#!/bin/bash
IFS='
'
TITLES=( $(grep "^$Dag..*$Jaar..*" "$FILE_TEMP") )
Hey, thank you, this works for me!
I'll keep the tips in mind.
 
12-05-2012, 10:27 PM   #4
David the H. (Bash Guru)
See here for more detail on what happens when you try to put a command in a variable:

I'm trying to put a command in a variable, but the complex cases always fail!
http://mywiki.wooledge.org/BashFAQ/050

In general, if you have a complex command, you should set it up as a function instead.
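
For example, a minimal sketch of the function approach using the variable names from your script:

Code:
# The pipe is now parsed as a real pipe, and the variables
# are quoted and expanded normally.
get_title() {
	grep "^$Dag..*$Jaar..*" "$FILE_TEMP" | awk -v n="$1" 'NR == n'
}

title=$( get_title "$LineCounter" )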

Now for grep. By default it uses basic regular expressions, which means that most of the more advanced syntax is disabled. You should use grep -E to get extended regex. See the man page for details.

To search for all cases of "foo" or "bar" in a file:
Code:
grep -E 'foo|bar' inputfile
Next, you shouldn't need to use grep anyway. awk is a full-fledged text parsing language, and can do all the pattern matching internally.

Code:
awk '/foo|bar/ { print }'
Also be aware that awk variables are not bash variables. You need to import the shell value into awk if you want to use it.

If the line contains "foo" or "bar", and the line number matches the shell value, print it:
Code:
awk -v ln="$linenumber" '(/foo|bar/ && NR == ln) { print }'
Here are a few useful awk references:
http://www.grymoire.com/Unix/Awk.html
http://www.gnu.org/software/gawk/man...ode/index.html
http://www.pement.org/awk/awk1line.txt
http://www.catonmat.net/blog/awk-one...ined-part-one/

Next, to safely load lines from a command or file into an array there's usually no need to use a counter.

Code:
while IFS='' read -r line || [[ -n $line ]]; do
	array+=( "$line" )
done < <( command )
Setting IFS to null first keeps read from stripping leading and trailing whitespace, the -r keeps backslashes in the text safe, and the "||" (or) test is there to process the last line if the text doesn't terminate with a newline character (this is not actually a problem with command or process substitution, but should be done when the input is a text file).

Note also that if the individual entries could themselves contain newline characters, then you'll probably have to use null separators instead. See the links below. In awk you'd use the printf command to insert them into the output.
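
For example, a rough sketch of a null-delimited version (assuming GNU awk; the pattern and input file are placeholders):

Code:
# awk emits each matching line terminated by a NUL byte;
# read -d '' consumes entries up to each NUL.
while IFS='' read -r -d '' entry; do
	array+=( "$entry" )
done < <( awk '/foo|bar/ { printf "%s\0", $0 }' inputfile )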

In bash 4+, you can also just use the mapfile built-in, as long as it's just simple newline-delimited text.

Code:
mapfile array < <( command )
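Note that without -t each array element keeps its trailing newline, so you'll usually want:

Code:
mapfile -t array < <( command )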
How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
http://mywiki.wooledge.org/BashFAQ/001

How can I use array variables?
http://mywiki.wooledge.org/BashFAQ/005


As a side note, $(..) is highly recommended over `..`
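
For example, unlike backticks, it nests without any extra escaping:

Code:
# dirname runs inside basename with no backslash-escaping needed
outer=$( basename "$( dirname /usr/local/bin )" )   # -> local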


Finally, if you're trying to extract values from html, then line- and regex-based tools like awk are not perfectly reliable. You should really use a tool with a true parser, like xmlstarlet. If you'd supply an example of the input, perhaps I could help you work out a solution.

 
12-25-2012, 09:40 AM   #5
JohnyDRipper (LQ Newbie, Original Poster)
Quote:
Originally Posted by David the H.
Finally, if you're trying to extract values from html, then line- and regex-based tools like awk are not perfectly reliable. You should really use a tool with a true parser, like xmlstarlet. If you'd supply an example of the input, perhaps I could help you work out a solution.
Hi David the H., thanks for all of the info there.

Actually, my intention is to convert texts like these (Belgian law, government website, no copyright) into something more readable, and I want to save the result in a mainstream file type, like pdf, so that printing is intuitive. I was thinking of parsing the .html files with awk and converting them into a TeX document. TeX is a proven text markup language and can easily be converted into pdf.

Would you recommend xmlstarlet over awk for this? I saw that the development of xmlstarlet has stalled a little, with some bugs still open.

Do you think publishing markup like footers/page numbers/watermarks etc. is possible in html/css? I'm quite hesitant about starting to learn an ancient markup language like TeX - I just don't know a better alternative. I used to script a little, but that was almost 10 years ago now, I'm afraid…


Anyway, thanks for the advice!
greetz,
JohnyD
 
12-26-2012, 12:00 PM   #6
David the H. (Bash Guru)
xmlstarlet is a relatively mature application and I'm sure it's more than capable of handling most extraction jobs as it stands. Probably most of the outstanding bugs only affect complex jobs.

I just got done posting a long example of its use in a similar thread, although the requirements there are possibly a bit different from what you're doing.

But it doesn't have to be xmlstarlet. That's just the program I'm most familiar with. What matters is that in the long run something with true xml/html parsing ability will generally be safer, and probably easier, than regex-based tools.

As I mentioned in the other thread, the xml-html-utils are also quite handy, and often don't require much advanced knowledge to use.

What exactly would you want to extract from the above page? I'm sure we could work out some way to extract it safely, but I'd need to know what that something is first. It's a rather long page, after all, and I don't understand Dutch.


For example, I was easily able to extract all the .pdf links from that page with a single expression (loading the results into an array):

Code:
$ mapfile -t links < <( xmlstarlet fo -H -Q -R ejustice.html | xmlstarlet sel -T -t -m '//a[contains(@href,".pdf")]' -v '@href' -n )

$ printf '[%s]\n' "${links[@]}"
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=5&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2010/04/12_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=19&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2012/07/25_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=61&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2011/09/07_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=22&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2011/05/06_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=31&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2011/03/09_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=6&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2010/09/28_1.pdf]
[/cgi_loi/loi_a.pl?ddda=2010&sql=dt+contains++'WET'+and+dd+=+date'2010-04-06'and+actif+=+'Y'&language=nl&cn=2010040603&caller=image_a1&tri=dd+AS+RANK+&fromtab=wet_all&pdf_page=4&pdf_file=http://www.ejustice.just.fgov.be/mopdf/2010/07/01_1.pdf]
Other jobs may be easier, or harder, depending on the exact requirements.

On the other hand, if you just want to convert the whole page itself to pdf, there are also tools that can do that directly. html2ps would put it into postscript, for example, which could then be converted to pdf. openoffice/libreoffice also have scriptable batch-mode conversion ability. Or use html2text to extract just the text content as a whole. There are probably many other options available.
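
For example, a minimal sketch of that pipeline (assuming ghostscript's ps2pdf is installed; the file names are placeholders):

Code:
# Render the page to postscript, then convert to pdf
html2ps page.html > page.ps
ps2pdf page.ps page.pdf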
 
09-30-2013, 10:15 AM   #7
JohnyDRipper (LQ Newbie, Original Poster)
Hi David,


I realize it has been a full 9 months since your last reply, and it was not nice of me to keep you in the dark for so long. I can frankly say, though, that the past months have been the busiest of my life: we had a baby, moved to a new apartment, and then I got married. It has been quite a ride :-)

Anyway, I'm picking this little thing back up. The ultimate goal is to extract a portion of the text, apply formatting, and save it as an .odt file. I've dropped the ambition to work in TeX; I think .odt has more to offer.

What I want to do specifically is to scan the "Inhoudstafel" (= table of contents) section of the .html for headings. Anything starting with "HOOFDSTUK" (= chapter) would be a level 1 heading, "Afdeling" (= section) would be a level 2 heading, etc.

The next block in the html file is the actual text that I am interested in. There, I would want to search for the headings that were extracted in the first section, and apply formatting to them.
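
Just to make the idea concrete, this is roughly what I have in mind (a rough sketch only; it assumes tidy leaves each table-of-contents entry on its own line, which I still need to verify):

Code:
# Tag TOC lines with their heading level (hypothetical input file)
awk '/^HOOFDSTUK/ { print "H1\t" $0; next }
     /^Afdeling/  { print "H2\t" $0 }' inhoudstafel.txt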


All in all, this won't be an easy task imo :-) Do you think xmlstarlet is a good approach for this kind of work?

I already have a webpage with some php code from which I call a bash script to automatically download the right text. This script also calls tidy to clean up the html code, and I've also written some code to create an .odt file from within the bash script. The next step is to extract the information from the .html file and feed it to the .odt file.

 
  

