regular expression for parsing html tags
I have a file of HTML for which I'd like to return only the <IMG> tag attributes.
I've tried this: grep -h -i "<[^>.?img.*?src.*^<]/>" plainhtmlfile.txt > imagetags.txt which I meant as "give me the text between a left and a right angle bracket (an opening and closing HTML tag) with img and src appearing somewhere in there". It's giving me all the other crap in the file too, though, since grep prints the whole matching line and the match has to be greedy (the file could contain many attributes between img and src). I'd like to return only the attributes inside the <IMG> tag. Gaaah! Does anyone have any ideas?
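A minimal sketch of one way to narrow the output, assuming a grep that supports -o (GNU and BSD grep do; it is not in POSIX). Note that everything between [ and ] in the attempt above is read as a single character class, so it cannot express "img, then src". The function name extract_img_tags is hypothetical, and the pattern assumes each tag sits on one line with no > inside an attribute value.

```shell
# Hypothetical helper: read HTML on stdin and print each <img ...> tag
# on its own line instead of the whole matching line.
#   -o  print only the part of the line that matched
#   -i  match IMG, img, Img, ...
# [^>]* stops at the first '>', so the match cannot run on into later tags.
extract_img_tags() {
    grep -o -i '<img[^>]*>'
}
```

Usage would be along the lines of: extract_img_tags < plainhtmlfile.txt > imagetags.txt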
Well, here's how I did it (they use another OS at work btw):
Code:
# suggested usage: |
Hi Bert,
Here's how I did it using sed:
# sed -n 's/.*\(img.src\)\=\([^[:space:]]*\).*/\2/p' plainhtmlfile.txt > imagetags.txt
and without the surrounding " characters:
# sed -n 's/.*\(img.src\)\=\"\([^[:space:]]*\)\".*/\2/p' plainhtmlfile.txt > imagetags.txt
Hey vladkrack, thanks. That does it pretty nicely too.
I've found, though, that doing this with a stream editor sometimes returns the path as well, and it appears to struggle with long and funky filenames. The output can look like this in places:
...
...
"/img/calcutta.jpg"
"9884_claudius.gif"
"/img/WWIKaiserWilhelmII.jpg"
"/img/charlemagneinpomp.jpg"
"/img/Chartism
"/img/Pankhurst,
"/img/castlescotland.jpg"
...
...
Of course this has nothing to do with the efficiency of your regex, but with the shoddy quality of the htmltags.txt files, which were put together by end users who <b> think <u> nothing </b> of </u> nesting tags and using narratives instead of file naming conventions.jpg! The advantage of doing it with perl is that it uses a built-in HTML parser (which is almost certainly cheating ...) :D
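The cut-off entries such as "/img/Chartism appear to come from [^[:space:]]* stopping at the first blank inside the filename. A rough sketch of a variant that reads up to the closing quote instead, assuming every src value is wrapped in double quotes and each tag sits on one line. The function name extract_img_src is hypothetical, and grep -o is again a GNU/BSD extension.

```shell
# Hypothetical helper: read HTML on stdin and print the src value of each
# <img> tag, one per line, keeping any spaces inside the quotes.
# Step 1: isolate each <img ...> tag on its own line.
# Step 2: capture everything between src=" and the next double quote.
#         The [[:space:]] before src keeps attributes that merely end in
#         "src" (data-src, for instance) from matching.
extract_img_src() {
    grep -o -i '<img[^>]*>' |
        sed -n 's/.*[[:space:]][Ss][Rr][Cc]="\([^"]*\)".*/\1/p'
}
```

Unquoted src values are silently skipped by this sketch, and markup as tangled as the nested tags described above is still better served by a real HTML parser.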