01-07-2011, 02:54 PM
|
#1
|
Member
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120
Rep:
|
[bash] count nested HTML tags
I need to search through HTML files to count the number of <li> tags nested within the first <ul> tag:
Code:
<ul>
<li>text 1</li>
<li>text 2</li>
</ul>
...
<ul>
<li>text 3</li>
<li>text 4</li>
<li>text 5</li>
</ul>
I tried:
Code:
tr "\n" " " < file.html | \
grep -E -o '<ul>.*</ul>' | \
grep -F -o '<li>' | \
wc -l
Unfortunately, the first grep is greedy, swallowing everything up to the last </ul> closing tag. (The desired result is 2.)
Any ideas for a bash-based solution are welcome. Speed is an issue as I will be searching through 350,000 files.
|
|
|
01-07-2011, 03:06 PM
|
#2
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
The most common way to counteract greediness is to use a negated character class, so that the match only runs for as long as it does not hit a certain character.
Code:
grep -E -o '<ul>[^<]*</ul>'
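Note that this only works when the excluded character cannot occur between the delimiters; here the <li> tags inside the list contain "<", so this exact pattern will not match your sample. And when it comes to nesting, there's really no simple regex solution that can catch all cases. You really need to switch to a dedicated html/xml parser for that.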
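To make the greediness issue concrete, here is a small Python sketch. Python's re module is used only because, unlike grep -E, it has a lazy quantifier; the sample string and variable names are made up for illustration. It compares the greedy pattern, a lazy variant, and the negated character class on the flattened sample from the first post.

```python
import re

# The sample from the first post, flattened to one line as `tr "\n" " "` would do.
html = ("<ul> <li>text 1</li> <li>text 2</li> </ul> ... "
        "<ul> <li>text 3</li> <li>text 4</li> <li>text 5</li> </ul>")

# Greedy: .* runs to the LAST </ul>, so both lists end up in one match.
greedy = re.search(r"<ul>.*</ul>", html).group(0)

# Lazy: .*? stops at the FIRST </ul> after the opening tag.
lazy = re.search(r"<ul>.*?</ul>", html).group(0)

# Negated class: [^<]* cannot step over the "<" of "<li>", so nothing matches.
negated = re.search(r"<ul>[^<]*</ul>", html)

greedy_count = greedy.count("<li>")
lazy_count = lazy.count("<li>")
print(greedy_count, lazy_count, negated)
```

The lazy variant gets the sample right, but it would still cut a list short at the first inner </ul> if the list contained a nested list, which is why a parser is the safer route.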
|
|
|
01-07-2011, 08:31 PM
|
#3
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
I agree with David the H., you really should consider using a scripting language with XML support instead.
However, if you are sure the first unordered list is not nested, i.e. that it does not contain other unordered lists, you can do this with tr, sed and grep. First, collapse all whitespace (including newlines) into single spaces with tr, then add a newline before each tag with sed. Next, delete everything from the first closing </ul> tag onwards, and everything up to and including the first opening <ul> tag. Then you can just count the list items:
Code:
#!/bin/bash
cat "$@" \
| tr -s '\t\n\v\f\r ' ' ' \
| sed -e 's|<|\n<|g' \
| sed -e '/^<\/[Uu][Ll][> ]/,$ d' -e '1,/<[Uu][Ll][> ]/ d' \
| grep -ce '^<[Ll][Ii][> ]'
As an example, here's how you could use the command-line PHP interpreter and the Tidy library (built-in with PHP 5 and later) to do it properly. You could of course use an XML parser, but typically HTML code is not XML, so Tidy works much better (than SimpleXML, for example). Also, this code is not intended to be efficient, just robust and easy to understand.
Code:
#!/usr/bin/env php
<?PHP
/* Recursive function to find the desired node in $root.
*/
function findNode($root, $name) {
    if (strcasecmp($root->name, $name) === 0)
        return $root;
    if ($root->hasChildren())
        foreach ($root->child as $node) {
            $recurse = findNode($node, $name);
            if ($recurse !== FALSE)
                return $recurse;
        }
    return FALSE;
}

/* Main program, command-line argument loop.
*/
for ($arg = 1; $arg < $argc; $arg++)
    if (file_exists($argv[$arg])) {
        $tidy = new tidy();
        if (@$tidy->parseFile($argv[$arg])) {
            if ($tidy->cleanRepair() === FALSE)
                fprintf(STDERR, "%s: Warning: Tidy failed to repair the file.\n", $argv[$arg]);
            $root = $tidy->root();
            $list = findNode($root, "ul");
            $items = 0;
            if ($list !== FALSE && $list->hasChildren())
                foreach ($list->child as $node)
                    if (strcasecmp($node->name, "li") === 0) $items++;
            if ($list === FALSE || $items === 0)
                fprintf(STDOUT, "%s: No items in the first unordered list, or no unordered list.\n", $argv[$arg]);
            else
                fprintf(STDOUT, "%s: %d list items in the first unordered list.\n", $argv[$arg], $items);
        } else
            fprintf(STDERR, "%s: Cannot parse XML file.\n", $argv[$arg]);
        unset($tidy);
    } else
        fprintf(STDERR, "%s: No such file.\n", $argv[$arg]);
?>
The PHP script takes the names of the HTML/XML files on the command line, and reports the number of list items in the first unordered list, ignoring any nested sub-lists, to standard output. Errors are output to standard error.
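For comparison, roughly the same count can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a port of the script above: html.parser is a tolerant tokenizer rather than a repairing tree builder like Tidy, and the names FirstListCounter and count_first_list_items are invented for this example. It tracks only <ul> depth, so an <li> inside an <ol> nested in the first list would still be counted.

```python
from html.parser import HTMLParser


class FirstListCounter(HTMLParser):
    """Count <li> start tags seen while only the first <ul> is open."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open <ul> tags
        self.items = 0      # <li> start tags seen at <ul> depth 1
        self.done = False   # set once the first top-level <ul> has closed

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <UL> and <ul> look the same here.
        if self.done:
            return
        if tag == "ul":
            self.depth += 1
        elif tag == "li" and self.depth == 1:
            self.items += 1

    def handle_endtag(self, tag):
        if self.done or tag != "ul" or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True


def count_first_list_items(html_text):
    """Return the number of items counted in the first unordered list."""
    parser = FirstListCounter()
    parser.feed(html_text)
    parser.close()
    return parser.items


sample = ("<ul>\n<li>text 1</li>\n<li>text 2</li>\n</ul>\n...\n"
          "<ul>\n<li>text 3</li>\n<li>text 4</li>\n<li>text 5</li>\n</ul>\n")
print(count_first_list_items(sample))
```

Because the tokenizer reports comments separately from tags, a list that appears inside an HTML comment is not counted.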
Hope this helps, Nominal Animal
Last edited by Nominal Animal; 03-21-2011 at 01:43 AM.
|
|
1 members found this post helpful.
|
01-08-2011, 01:12 AM
|
#4
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
Assuming you may also have nested li's within the first ul this should work ok:
Code:
awk '/<ul>/{x++}x && /<li>/{y++}/<\/ul>/{x--;if(!x){print "number of li = "y;exit}}' file
|
|
1 members found this post helpful.
|
01-08-2011, 05:49 AM
|
#5
|
Member
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120
Original Poster
Rep:
|
Nominal Animal, grail - your solutions work equally well. It will be interesting to compare performance.
I agree that the analysis of the html should be done with an appropriate tool. In fact, I am writing Python/BeautifulSoup scripts to extract information in tab-separated format. However, before writing this script I wanted to do a quickie to confirm whether my assumption was correct that all files have exactly 2 <LI> tags.
|
|
|
01-08-2011, 06:02 AM
|
#6
|
Senior Member
Registered: May 2005
Posts: 4,481
|
Quote:
Originally Posted by hashbang#!
... It will be interesting to compare performance. ...
|
First correctness, then performance.
|
|
|
01-08-2011, 06:19 AM
|
#7
|
Member
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120
Original Poster
Rep:
|
Quote:
Originally Posted by hashbang#!
Nominal Animal, grail - your solutions work equally well. It will be interesting to compare performance.
|
On 1100 files, the sed/grep solution took 6.7s, the awk command 1s.
|
|
|
01-08-2011, 06:59 PM
|
#8
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
Quote:
Originally Posted by grail
Assuming you may also have nested li's within the first ul this should work ok:
Code:
awk '/<ul>/{x++}x && /<li>/{y++}/<\/ul>/{x--;if(!x){print "number of li = "y;exit}}' file
|
Grail, that's excellent! However, it will return an incorrect result if there is more than one list item on a line; it also includes any nested list items in the count. (x==1&&/<li>/{y+=gsub(/<li>/,"")} fixes both.)
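The difference between counting matching lines and counting occurrences is easy to see in a tiny Python sketch (the three sample lines are made up for illustration):

```python
lines = ["<ul>", "<li>text 1</li><li>text 2</li>", "</ul>"]

# Like `/<li>/{y++}`: add one for every line that contains the tag.
matching_lines = sum(1 for line in lines if "<li>" in line)

# Like `y += gsub(/<li>/, "")`: add the number of occurrences on each line.
occurrences = sum(line.count("<li>") for line in lines)

print(matching_lines, occurrences)
```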
Based on grail's solution, using "<" (instead of a newline) as the record separator, I propose the following GNU awk script.
It supports nested lists (only outputs the number of list items immediately in the first unordered list), upper- and lowercase tag names, and attributes:
Code:
gawk -v "RS=<" '/^[Uu][Ll][\t\n\v\f\r >]/ { d++ }
/^[Ll][Ii][\t\n\v\f\r >]/ && d==1 { n++ }
/^\/[Uu][Ll][\t\n\v\f\r >]/ { d--; if (!d) { print "number of li = ", n; exit } }' file
"[\t\n\v\f\r >]" means a single ASCII whitespace or ">" character. I use d for depth, n for number of items.
Because the record separator is not part of the record itself, "^" always implies "<".
Based on the GNU Awk user manual this should work on all awk implementations, but I have not tested it that well.
This will still give the wrong answer if you use XHTML namespaces, or have an unordered list or list items within a comment or CDATA block. Nominal Animal
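For readers less familiar with awk, the record-separator idea can be mirrored in Python as a rough sketch. The function name count_items_by_records is invented here, and the code only imitates the logic of the script above; it is not a replacement for it. Splitting on "<" plays the role of RS="<", so each record starts right after a "<" and re.match (which anchors at the start) corresponds to "^".

```python
import re

WS_OR_GT = r"[\t\n\v\f\r >]"   # a single ASCII whitespace or ">" character


def count_items_by_records(text):
    """Count <li> records at depth 1 of the first <ul>, awk-style."""
    depth = 0   # d in the awk script
    items = 0   # n in the awk script
    for record in text.split("<"):
        if re.match(r"ul" + WS_OR_GT, record, re.IGNORECASE):
            depth += 1
        elif re.match(r"li" + WS_OR_GT, record, re.IGNORECASE) and depth == 1:
            items += 1
        elif re.match(r"/ul" + WS_OR_GT, record, re.IGNORECASE):
            depth -= 1
            if depth == 0:
                return items
    return None   # no complete top-level <ul> was found


nested = "<ul><li>a</li><li>b<ul><li>c</li></ul></li></ul><ul><li>d</li></ul>"
commented = "<!-- <ul><li>x</li></ul> --><ul><li>y</li><li>z</li></ul>"
print(count_items_by_records(nested))
print(count_items_by_records(commented))
```

The second call shows the comment caveat: the list inside the comment is taken for the first list, so the function reports 1 instead of 2.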
Last edited by Nominal Animal; 03-21-2011 at 01:43 AM.
|
|
1 members found this post helpful.
|
01-08-2011, 11:31 PM
|
#9
|
Senior Member
Registered: May 2005
Posts: 4,481
|
Quote:
Originally Posted by Nominal Animal
Grail, that's excellent! However, it will return an incorrect result if there is more than one list item on a line; it also includes any nested list items in the count. (x==1&&/<li>/{y+=gsub(/<li>/,"")} fixes both.)
Based on grail's solution, using "<" (instead of a newline) as the record separator, I propose the following GNU awk script.
It supports nested lists (only outputs the number of list items immediately in the first unordered list), upper- and lowercase tag names, and attributes:
Code:
gawk -v "RS=<" '/^[Uu][Ll][\t\n\v\f\r >]/ { d++ }
/^[Ll][Ii][\t\n\v\f\r >]/ && d==1 { n++ }
/^\/[Uu][Ll][\t\n\v\f\r >]/ { d--; if (!d) { print "number of li = ", n; exit } }' file
"[\t\n\v\f\r >]" means a single ASCII whitespace or ">" character. I use d for depth, n for number of items.
Because the record separator is not part of the record itself, "^" always implies "<".
Based on the GNU Awk user manual this should work on all awk implementations, but I have not tested it that well.
This will still give the wrong answer if you use XHTML namespaces, or have an unordered list or list items within a comment or CDATA block. Nominal Animal
|
Despite my age and experience, I'm still amazed at the effort people put into writing wrong-by-construction parsers.
Take any off the shelf compliant XML parser and use it.
|
|
|
01-09-2011, 12:11 AM
|
#10
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
Quote:
Originally Posted by Sergei Steshenko
Despite my age and experience, I'm still amazed at the effort people put into writing wrong-by-construction parsers.
Take any off the shelf compliant XML parser and use it.
|
This is HTML, not XML. You need an HTML or a "tag soup" parser.
I wanted to point out a problem in a viable, albeit limited, solution.
And I do believe I listed a simple PHP script using Tidy as an example solution based on recommended methods.
Did you not bother reading any of the preceding posts?
I'm rather disappointed at how arrogant your reply was, Sergei.
Perhaps your age and experience are not as extensive as you believe. Nominal Animal
Last edited by Nominal Animal; 03-21-2011 at 01:40 AM.
|
|
|
01-09-2011, 12:30 AM
|
#11
|
Senior Member
Registered: May 2005
Posts: 4,481
|
Quote:
Originally Posted by Nominal Animal
This is HTML, not XML. You need an HTML or a "tag soup" parser.
I wanted to point out a problem in a viable, albeit limited, solution.
And I do believe I listed a simple PHP script using Tidy as an example solution based on recommended methods.
Did you not bother reading any of the preceding posts?
I'm rather disappointed at how arrogant your reply was, Sergei.
Perhaps your age and experience are not as extensive as you believe. Nominal Animal
|
In this context, HTML versus XML doesn't matter: both are tag-oriented, not line-oriented, formats.
Limited solution == wrong solution - the parser is either compliant or not.
|
|
|
01-09-2011, 01:19 AM
|
#12
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
Quote:
Originally Posted by Sergei Steshenko
Limited solution == wrong solution
|
Be glad you're not a physicist; all solutions you'd have would be wrong.
Off-the-shelf compliant XML parsers are unable to process HTML, precisely because it is not XML. In fact, HTML (up to 4.01) is defined as an application of a different, older standard, SGML. XML tools cannot read HTML, but HTML is transformable to XML (XHTML 1.0) using an SGML parser. Of the HTML family, XHTML is the only one which can be processed using XML tools. Unfortunately, SGML is a very complex specification, and most SGML tools are too cumbersome to use with HTML.
This has resulted in a number of HTML-only or "tag soup" parsers, which are neither XML nor SGML parsers: they are specific to HTML and HTML-like markup, but typically also support XHTML/XML. The PHP example I supplied uses Tidy, which is bundled with PHP 5. (PHP 5 also has SimpleXML, which is an XML parser.) Many Python developers use BeautifulSoup. Neither Tidy nor BeautifulSoup, nor any other HTML tag soup parser, is exact; they're designed to be robust, not perfect.
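A short Python sketch shows the practical difference, using only standard-library modules. The fragment and the TagCollector class are made up for illustration. The fragment is legal HTML 4 (the </li> end tags are optional and <br> has no end tag), yet a conforming XML parser must reject it, while the tolerant html.parser tokenizer reports the tags without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Legal HTML 4, but not well-formed XML: unclosed <li> and <br> elements.
html = "<ul><li>one<br><li>two</ul>"

# A conforming XML parser stops at the first well-formedness error.
try:
    ET.fromstring(html)
    xml_ok = True
except ET.ParseError:
    xml_ok = False


class TagCollector(HTMLParser):
    """Record every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


collector = TagCollector()
collector.feed(html)
collector.close()

print(xml_ok)                 # the XML parser rejected the fragment
print(collector.start_tags)   # the HTML tokenizer still saw every tag
```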
If you are going to suggest an off the shelf solution, please at least point at the working solutions. Waving your proverbial hand towards a group of unsuitable solutions while claiming superior experience is just annoying.
Nominal Animal
Last edited by Nominal Animal; 03-21-2011 at 02:01 AM.
|
|
|
01-09-2011, 05:45 AM
|
#13
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,042
|
Well I am not going to enter into yet another Sergei argument, but thought Nominal might appreciate the following:
Code:
awk 'BEGIN{ RS="<"; IGNORECASE=1 }
/^ul[^>]*>/ { d++ }
/^li[^>]*>/ && d==1 { n++ }
/^\/ul[^>]*>/ { d--; if (!d) { print "number of li = ", n; exit } }' file
|
|
1 members found this post helpful.
|
01-09-2011, 09:18 AM
|
#14
|
Senior Member
Registered: May 2005
Posts: 4,481
|
Quote:
Originally Posted by Nominal Animal
Be glad you're not a physicist; all solutions you'd have would be wrong. ...
|
I am a physicist, and I definitely know all physical solutions are wrong.
Back to our argument - are the limited parsers in this thread fully compliant with the language definition?
|
|
|
01-09-2011, 01:05 PM
|
#15
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
@grail: That is elegant, but the list item pattern also matches a link element (<link ...>), since [^>]* accepts any characters after "li"; I wouldn't omit the whitespace alternatives. I'm in the habit of writing out the character alternatives literally, because I often have to use non-POSIX/non-English locales; perhaps that's why I didn't think of using case insensitivity. Thanks!
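The link-element problem can be checked quickly with Python's re module. This is only a sketch: the sample records are invented, and each one stands for the text that follows a "<" when "<" is the record separator.

```python
import re

# Text following a "<", as the awk scripts see it with RS="<".
records = [
    'li>item',
    'li class="x">item',
    'link rel="stylesheet" href="s.css">',
    'LI\n>item',
]

# Pattern style from grail's script: anything up to the next ">".
loose = re.compile(r"li[^>]*>", re.IGNORECASE)

# With the explicit whitespace-or-">" alternatives after the tag name.
strict = re.compile(r"li[\t\n\v\f\r >]", re.IGNORECASE)

loose_hits = [bool(loose.match(r)) for r in records]
strict_hits = [bool(strict.match(r)) for r in records]
print(loose_hits)
print(strict_hits)
```

Only the stricter pattern rejects the <link> record while still accepting all three list-item forms.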
@Sergei Steshenko: Parsers in this thread are not fully compliant with the language definition. They don't need to be. As long as the user is aware of their limitations, they are useful.
Very, very few HTML pages are fully compliant with the language definition. The main difference between HTML and XML parsers is that HTML parsers are always approximate, while XML parsers usually just fail for erroneous input. For PHP, Tidy is very reliable even with bad input. For Python, BeautifulSoup 3.0.x is quite good, and is still being developed. BeautifulSoup 3.1 has been abandoned, because it performed significantly worse for erroneous input.
I think I understand your point -- just see my first post in this thread (the third one from the top) -- but I disagree.
I know limited solutions are extremely useful, if the user is aware of their limitations.
The reason I took such strong exception to your post is that I have seen that attitude lead, time and time again, to rote application of known results with minimal real understanding. Coupled with an arrogant, patronizing quibble, I find it extremely distasteful.
Nominal Animal
Last edited by Nominal Animal; 03-21-2011 at 02:00 AM.
|
|
|
All times are GMT -5. The time now is 07:39 PM.