feeding grep output to awk
Hi everybody,
I'm trying to filter a number of titles from an .html file using grep and awk. I want to feed the output of grep to awk and then store the results in an array. Unfortunately, grep doesn't seem to understand the | symbol... The loop runs just fine, except that awk doesn't seem to be executed at all.

This is the output that I get from grep:

grep: |: No such file or directory
grep: 'NR: No such file or directory
grep: ==: No such file or directory
grep: 10: No such file or directory
grep: {print}': No such file or directory

And this is my code (which, admittedly, probably looks like noobcode to you guys - but hey, at least I'm trying to learn :) ) Code:
declare -a TITLES
Code:
commando="grep ^$Dag..*$Jaar..* $FILE_TEMP | awk 'NR == $LineCounter {print}'"

2) You should quote the pattern and your variables.

I am not sure what you are trying to do, exactly, but something like this might do about the same thing: Code:
#!/bin/bash
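One way to follow the "use a function instead of a string variable" advice for this particular pipeline is a minimal sketch like the following. The function name (find_title) and the argument order are my own invention; the pattern and the NR test follow the original post.

```shell
#!/bin/bash
# Sketch: run the grep | awk pipeline from a function instead of
# storing it in a string variable. find_title is a made-up name.
find_title() {
    # $1 = day pattern, $2 = year pattern, $3 = input file, $4 = match number
    grep -E "^$1.*$2" "$3" | awk -v ln="$4" 'NR == ln'
}
```

It would then be called as, e.g., title=$( find_title "$Dag" "$Jaar" "$FILE_TEMP" "$LineCounter" ), with every variable quoted.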
Quote:
I'll keep the tips in mind.
See here for more detail on what happens when you try to put a command in a variable:
I'm trying to put a command in a variable, but the complex cases always fail!
http://mywiki.wooledge.org/BashFAQ/050

In general, if you have a complex command, you should set it up as a function instead.

Now for grep. By default it uses basic regular expressions, which means that most of the more advanced syntax is disabled. You should use grep -E to get extended regexes. See the man page for details.

To search for all cases of "foo" or "bar" in a file: Code:
grep -E 'foo|bar' inputfile

Code:
awk '/foo|bar/ { print }'

If the line contains "foo" or "bar", and the line number matches the shell value, print it: Code:
awk -v ln="$linenumber" '(/foo|bar/ && NR == ln) { print }'

http://www.grymoire.com/Unix/Awk.html
http://www.gnu.org/software/gawk/man...ode/index.html
http://www.pement.org/awk/awk1line.txt
http://www.catonmat.net/blog/awk-one...ined-part-one/

Next, to safely load lines from a command or file into an array, there's usually no need to use a counter. Code:
while IFS='' read -r line || [[ -n $line ]]; do
    array+=( "$line" )
done < <( command )

Note also that if the individual entries could themselves contain newline characters, then you'll probably have to use null separators instead. See the links below. In awk you'd use the printf command to insert them into the output.

From bash v.4+, you can also just use the mapfile built-in, as long as it's just simple newline-delimited text. Code:
mapfile array < <( command )

http://mywiki.wooledge.org/BashFAQ/001

How can I use array variables?
http://mywiki.wooledge.org/BashFAQ/005

As a side note, $(..) is highly recommended over `..`

Finally, if you're trying to extract values from html, then line- and regex-based tools like awk are not perfectly reliable. You should really use a tool with a true parser, like xmlstarlet. If you'd supply an example of the input, perhaps I could help you work out a solution.
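For the null-separator case mentioned above, the read loop might look like this. This is only a sketch: the awk program and the file name "inputfile" are illustrative, and the %c trick is just one way to get awk to emit a NUL byte.

```shell
#!/bin/bash
# Sketch: read null-delimited records into an array, for entries that
# may contain embedded newlines. The awk program re-emits each input
# line terminated by a NUL byte (%c with 0); "inputfile" is a
# placeholder name.
titles=()
while IFS='' read -r -d '' entry; do
    titles+=( "$entry" )
done < <( awk '{ printf "%s%c", $0, 0 }' inputfile )
```

From bash 4.4 onward, mapfile -d '' can read the same null-delimited stream in one line.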
Quote:
Actually, my intention is to convert texts like these (Belgian law, government website, no copyright) into something more readable, and I want to save the result in a mainstream file type, like pdf, so that printing would be intuitive. I was thinking to parse the .html files through awk and convert them into a tex document. Tex is a proven text markup language and can easily be converted into pdf.

Would you recommend xmlstarlet over awk for this? I saw that the development of xmlstarlet has stalled a little, with some bugs still open. Do you think publishing markup like footers/page numbers/watermarks etc. is possible in html/css? I'm quite hesitant to start learning an ancient markup language like tex - I just don't know any better alternative. I used to script a little, but that's been almost 10 years ago now, I'm afraid…

Anyway, thanks for the advice!

greetz,
JohnyD
xmlstarlet is a relatively mature application and I'm sure it's more than capable of handling most extraction jobs as it stands. Probably most of the outstanding bugs only affect complex jobs.
I just got done posting a long example of its use in a similar thread, although the requirements there are possibly a bit different from what you're doing.

But it doesn't have to be xmlstarlet. That's just the program I'm most familiar with. What matters is that in the long run something with true xml/html parsing ability will generally be safer, and probably easier, than regex-based tools. As I mentioned in the other thread, the xml-html-utils are also quite handy, and often don't require much advanced knowledge to use.

What exactly would you want to extract from the above page? I'm sure we could work out some way to extract it safely, but I'd need to know what that something is first. It's a rather long page, after all, and I don't understand Dutch.

For example, I was easily able to extract all the .pdf links from that page with a single expression (loading the results into an array): Code:
$ mapfile -t links < <( xmlstarlet fo -H -Q -R ejustice.html | xmlstarlet sel -T -t -m '//a[contains(@href,".pdf")]' -v '@href' -n )

On the other hand, if you just want to convert the whole page itself to pdf, there are also tools that can do that directly. html2ps would put it into postscript, for example, which could then be converted to pdf. openoffice/libreoffice also have scriptable batch-mode conversion ability. Or use html2txt to extract just the text content as a whole. There are probably many other options available.
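As an aside on the libreoffice route: its batch mode can convert html straight to pdf from a script. A minimal sketch, where "page.html" and "out/" are placeholder names:

```shell
# Sketch: convert an HTML page straight to PDF with LibreOffice's
# headless batch mode. The resulting page.pdf is written into the
# directory given by --outdir.
libreoffice --headless --convert-to pdf --outdir out/ page.html
```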
Hi David,
I realize it has been a full 9 months since your last reply, and it is not so nice of me to keep you in the dark for so long. I can frankly say though that the past months have been the busiest of my life: we had a baby, moved to a new apartment, and then I got married. It has been quite a ride :-)

Anyway, I'm picking this little thing back up. The ultimate goal is to extract a portion of the text, apply formatting, and save it as an .odt file. I've dropped the ambition to work in Tex; I think .odt has more to offer.

What I want to do specifically is to scan the "Inhoudstafel" (= table of contents) section of the .html for headings. Anything starting with "HOOFDSTUK" (= chapter) would be a level 1 heading, "Afdeling" (= section) a level 2 heading, etc. The next block in the html file is the actual text that I am interested in. There, I would want to search for the headings that were extracted in the first section and apply formatting to them. All in all, this won't be an easy task imo :-)

Do you think xmlstarlet is a good approach for this kind of work? I already have a webpage with some php code from which I call a bash script to automatically download the right text. This script also calls tidy to tidy up the html code, and I've written some code already to create an .odt file from within the bash script. The next step is to extract the information from the .html file and feed it to the .odt file.
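The keyword-to-level mapping described above is easy to prototype in plain bash once the headings are extracted. A sketch: the keywords HOOFDSTUK and Afdeling come from the post, but the function name heading_level and the convention "0 = not a heading" are my own assumptions.

```shell
#!/bin/bash
# Sketch: map a table-of-contents entry to an outline level based on
# its leading keyword. Keywords are from the post above; the level
# numbers and the 0-for-body-text convention are assumptions.
heading_level() {
    case $1 in
        HOOFDSTUK*) echo 1 ;;   # chapter  -> level 1 heading
        Afdeling*)  echo 2 ;;   # section  -> level 2 heading
        *)          echo 0 ;;   # anything else: body text
    esac
}
```

For example, heading_level "HOOFDSTUK I. Algemene bepaling" prints 1.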