06-13-2017, 05:33 AM
#1
Member
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: Arch Linux && OpenBSD 7.4 && Pop!_OS && Kali && Qubes-Os
Posts: 824
[perl] making syntax highlight for html in a script.
Hello to everyone on the programming forum.
I have been coding an HTTP sniffer in Perl. It works reasonably well, but the one part I would like to improve is the syntax highlighting for JavaScript and HTML. I wrote that part mainly as regex practice.
Code:
#!/usr/bin/perl
use strict;
use warnings;            # replaces the -w switch
use Term::ANSIColor;     # exports colored() by default
my @data = <<'EOL';
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>TechSpot : Tech Enthusiasts, Power Users, Gamers</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="environment" content="production" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="viewport" content="width=device-width,maximum-scale=1,initial-scale=1,user-scalable=no" />
<meta name="title" content="TechSpot" />
<nav id="main-menu" class="menu">
<div class="wrapper">
<ul id="dk_menu" class="">
<li class="withSubmenuT">
<a href="/trending/" data-source="trending">Trending</a>
<div class='nav-submenu'>
<ul class="SubHeader">
<li class="DefArrow"><a href="/category/hardware/">Hardware</a></li>
<li class="GArrow"><a href="/category/web/">The Web</a></li>
<li class="PArrow"><a href="/category/culture/">Culture</a></li>
<li class="DefArrow"><a href="/category/mobile/">Mobile</a></li>
<li class="OArrow"><a href="/category/gaming/">Gaming</a></li>
<li class="DefArrow"><a href="/category/apple/">Apple</a></li>
<li class="DefArrow"><a href="/category/microsoft/">Microsoft</a></li>
<li class="DefArrow"><a href="/category/google/">Google</a></li>
</ul>
<div class="loadAnim">
<img src="/images/loading_blue2.gif" alt="Loading GIF">
</div>
<div class="wrapper">
</div>
</div>
</li>
<li class="withSubmenuT">
<div class="article-category">
<a href="http://www.techspot.com/category/hardware/">Hardware</a>
<a href="http://www.techspot.com/drivers/" class="cat2">Drivers</a>
</div>
<h2>
<a href="http://www.techspot.com/downloads/drivers/essentials/amd-crimson-hotfix/">Crimson Hotfix driver improves DiRT 4 and Prey performance</a>
</h2>
<div class="intro"><p>This new AMD driver offers a considerable performance boost of up to 30% when using 8xMSAA in DiRT 4 and Prey. Theres also a long list of fixed issues among which are a Virtual Super Resolution failure and flickering in some Radeon RX 400 and RX 500 products and enabling HDR on high resolution displays. Go to <a href="http://www.techspot.com/downloads/drivers/essentials/amd-crimson-hotfix/">our drivers section</a> for complete release notes and downloads for all platforms.</p>
</div>
<div class="byline">
By Erik Orejuela, <time datetime="2017-06-09 18:00:00-0500" itemprop="datePublished"><span title="2017-06-09 18:00:00">June 9, 2017, 6:00 PM</span></time>
</div>
</div> <!-- /.article-content -->
<div class="clearfix"></div>
</div><!-- /.article-img -->
<div class="article-content ">
<div class="article-category">
<a href="http://www.techspot.com/category/apple/">Apple</a>
<a href="http://www.techspot.com/category/security/" class="cat2">Security</a>
</div>
<h2>
<a href="/news/69652-employees-chinese-apple-suppliers-arrested-selling-customer-data.html">Employees of Chinese Apple suppliers arrested for selling customer data</a>
</h2>
<div class="intro">Apple’s long had a turbulent relationship with China, from trouble with regulators over iTunes Movies and the iBooks store, to Apple News censorship, to numerous patent disputes. Now, the firm is facing more problems in the Asian nation after Chinese…</div>
<div class="byline">
By Rob Thubron, <time datetime="2017-06-09 11:15:00-0500" itemprop="datePublished"><span title="2017-06-09 11:15:00">June 9, 2017, 11:15 AM</span></time>
<em>
<a href="/news/69652-employees-chinese-apple-suppliers-arrested-selling-customer-data.html#commentsOffset" class="comment-count"><span class="highlight ">3 comments</span></a>
</em>
</div>
</div><!-- /.article-content -->
<div class="clearfix"></div>
</article>
<div class="article-content ">
<div class="article-category">
<a href="http://www.techspot.com/category/gaming/">Gaming</a>
</div>
<h2>
<a href="/news/69642-project-cars-2-trailer-pre-order-details-revealed.html">Project Cars 2 trailer, pre-order details revealed</a>
</h2>
<div class="intro">Distributor Bandai Namco and Slightly Mad Studios, the developer and publisher behind Project Cars, announced on Thursday the second installment of the highly successful racing game. Accompanying the announcement is a first-look trailer which we’ve embedded above.As a huge fan…</div>
<div class="byline">
By Shawn Knight, <time datetime="2017-06-09 07:00:00-0500" itemprop="datePublished"><span title="2017-06-09 07:00:00">June 9, 2017, 7:00 AM</span></time>
</div>
</div><!-- /.article-content -->
<div class="clearfix"></div>
</article>
<article>
<div class="article-img">
<a href="/news/69649-intel-hints-microsoft-qualcomm-windows-10arm-x86-emulation.html">
<img class="b-lazy"
src=data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data-src="/images2/news/ts3_thumbs/2017/06/2017-06-09-ts3_thumbs-300-small.jpg"
/>
</a>
</div>< !-- /.article-img -->
</div> <!-- /.article-content -->
<article>
<div class="article-content featured no_image">
<h2>
<a href="http://www.techspot.com/review/1405-corsair-glaive-mouse/">Corsair Glaive RGB Gaming Mouse Review</a>
</h2>
<div class="byline">
<a href="/category/techspot/"><span style="color:#c00; text-decoration:none; font-weight:600; padding: 0 8px 0 0;">FEATURE</span></a> By Tim Schiesser, <time datetime="2017-06-09 00:00:00-0500" itemprop="datePublished"><span title="2017-06-09 00:00:00">June 9, 2017, 12:00 AM</span></time>
<em>
<a href="http://www.techspot.com/category/techspot/">TechSpot</a>
<a href="http://www.techspot.com/category/hardware/" class="cat2">Hardware</a>
<form action="http://techspot.us1.list-manage.com/subscribe/post?u=e4cfda23de0688b6339e986ae&id=087c9b1822" method="post" id="mc-embedded-subscribe-form" name="mc-embedded-subscribe-form" class="validate" target="_blank">
<div class="copyright">
<div class="wrapper">
<p>© 2017 TechSpot, Inc. All Rights Reserved.</p>
<p>TechSpot is a registered trademark. <a rel="nofollow" href="/terms.html">Terms of Use</a> <a rel="nofollow" href="/privacy.html">Privacy Policy</a> <a rel="nofollow" href="https://wrightsmedia.com/sites/techspot/" target="_blank">Licensing</a> <a rel="nofollow" href="http://www.techspot.com/advertising/">Advertise</a></p>
<p>International Editions: <a rel="alternate" hreflang="en-US" href="http://www.techspot.com">US / UK</a> <a rel="alternate" hreflang="en-IN" href="http://www.in.techspot.com">India</a></p>
</div>
</div>
</footer>
</div> <!-- end GlobalWrapper -->
<script type="text/javascript">
$(document).ready(function() {
var _sf_startpt=(new Date()).getTime();
TSTopMenu();
TSAlerts();
TSMainMenuInit();
showPrettyDates();
});
</script>
<div id="c-mask" class="c-mask"></div>
</body>
</html>
EOL
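source_check(@data);    # @data holds one element: the whole heredoc
sub source_check
{
my $data = $_[0];
# Consume one character per iteration; the \G patterns below then try to
# match a whole construct at the current position (/c keeps pos() on failure).
while ($data =~ m!((?:\S|\s))!g) {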
print colored ("$1", "dark yellow");
if ($data =~ m!\G([</]+)([\w\s]+)(>)!gci) { # <script> , </li>
print colored ("$1", "white");
print colored ("$2", "magenta");
print colored ("$3", "white");
}
elsif ($data =~ m!\G(<)([\w\s]+)([=\s]{1,})([\w\-0-9]+)([>]{1})!gci) { # <script type=en>
print colored ("$1", "white");
print colored ("$2", "magenta");
print colored ("$3", "white");
print colored ("$4", "blue");
print colored ("$5", "white");
}
elsif ($data =~ m!\G(<)([\w\s]+)([=\s]{1,})!gci) { # <script type=
print colored ("$1", "white");
print colored ("$2", "magenta");
print colored ("$3", "white");
}
elsif ($data =~ m!\G(<)([/\w\s]+)([=])([\w-]+)(>)!gci) { # <html lang=en>
print colored ("$1", "white"); # <meta charset=utf-8>
print colored ("$2", "magenta");
print colored ("$3", "white");
print colored ("$4", "magenta");
print colored ("$5", "white");
}
elsif ($data =~ m!\G(\<\!)(.*?)([\-\]]{2}\>)!gci) { # <!-- matches comments --> <![CDATA[ ]]>
print colored ("$1", "dark green");
print colored ("$2", "dark green");
print colored ("$3", "dark green");
}
elsif ($data =~ m!\G(['"]{1})([\w\s\-\:]+)(['"\s>]{1,})!gci) { # "list-of-entities__item-link"
print colored ("$1", "white");
print colored ("$2", "blue");
print colored ("$3", "white");
}
elsif ($data =~ m!\G("?[a-z]{3,10}\:[/]{0,2})((\w+(?:\.|\_|\-|\@|\%|\$))+)([a-z]{2,6})((/)?(.*?)(")?)!gci) { # "http://foo.foo.com/(.*?)"
print colored ("$1", "red");
print colored ("$2", "red");
print colored ("$4", "red");
print colored ("$5", "red");
}
}
};
That is the Perl code I have been using for the syntax-highlighting part of my program. To me it looks ugly, and I would be grateful for code examples showing how to do it differently. I'm not asking anyone to write it for me, but advice on another approach would be nice.
06-13-2017, 05:44 AM
#2
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,306
Well, I would not run it as root.
About the Perl code itself, there are already modules for parsing HTML which you can use to extract elements and highlight them. Start with a look at either of CPAN's HTML parsing modules: HTML::Parser or HTML::TokeParser.
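To give a rough idea of what the parser hands back, here is a minimal sketch using HTML::TokeParser. The helper name token_lines and the sample markup are made up for illustration. Each token is an array reference whose first element is the type (S, E, T, C, D or PI).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, part of the HTML-Parser distribution

# Hypothetical helper: return one "TYPE: source text" line per token.
sub token_lines {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "cannot parse input";
    my @lines;
    while ( my $token = $p->get_token ) {
        my $type = $token->[0];    # S, E, T, C, D or PI
        # Text tokens keep their text in slot 1; every other type keeps
        # the original source text in its last slot.
        my $text = $type eq 'T' ? $token->[1] : $token->[-1];
        push @lines, "$type: $text";
    }
    return @lines;
}

print "$_\n" for token_lines('<p class="x">Hi<!-- note --></p>');
```

Once the input is split into tokens like this, highlighting becomes a matter of choosing a color per token type instead of writing one regex per tag shape.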
06-13-2017, 10:00 AM
#3
Member
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508
Code isn't always necessarily "pretty."
I ran your program and it looks quite good to me, though I'm not that familiar with colorizing terminal output.
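If the repetition is what bothers you, one possible way to tidy the regex version is to make it table-driven: a list of pattern/color rules and a single loop. This is only a sketch using the core module Term::ANSIColor. The rule set, the colors and the sub names tokenize and highlight are my own assumptions, and it does not track whether it is inside a tag, so quotes in ordinary text are colored like attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Term::ANSIColor;    # core module; exports colored()

# Each rule is [ pattern, color ]. Order matters: the first match wins.
my @RULES = (
    [ qr{<!--.*?-->}s,         'green'   ],    # comments
    [ qr{</?[A-Za-z][\w:-]*},  'magenta' ],    # "<tag" or "</tag"
    [ qr{"[^"]*"|'[^']*'},     'blue'    ],    # quoted strings
    [ qr{/?>},                 'white'   ],    # end of a tag
    [ qr{[^<>"']+},            'yellow'  ],    # runs of anything else
    [ qr{.}s,                  'yellow'  ],    # fallback: always consume one character
);

# Split the source into [ color, text ] pairs without losing any input.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while ( pos($src) < length $src ) {
        for my $rule (@RULES) {
            my ( $re, $color ) = @$rule;
            # \G anchors at the current position; /c keeps pos() on failure.
            next unless $src =~ /\G($re)/gc;
            push @tokens, [ $color, $1 ];
            last;
        }
    }
    return @tokens;
}

# Join the colored pieces back into one string for printing.
sub highlight {
    my ($src) = @_;
    return join '', map { colored( $_->[1], $_->[0] ) } tokenize($src);
}

print highlight(qq{<a href="/x">link</a><!-- note -->\n});
```

Adding a new construct then means adding one line to the table rather than another elsif branch with its own run of print statements.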
Last edited by Laserbeak; 06-14-2017 at 03:46 PM .
06-16-2017, 01:47 AM
#4
Member
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: Arch Linux && OpenBSD 7.4 && Pop!_OS && Kali && Qubes-Os
Posts: 824
Original Poster
Quote:
Originally Posted by
Turbocapitalist
Well, I would not run it as root.
About the perl code itself, there are already modules for parsing HTML which you can use to extract elements and highlight them. Start with a look at either of CPAN's HTML parsing modules : HTML::Parser or HTML::TokeParser
Thanks for the tip.
I have been practicing writing regexes, this time using HTML::TokeParser::Simple:
Code:
#!/usr/bin/perl
use strict;
use warnings;            # replaces the -w switch
use Term::ANSIColor;     # exports colored() by default
use HTML::TokeParser::Simple;
my $html = join '', <DATA>;
source_check($html);
sub source_check
{
my $data = $_[0];
my $ignore=0;
my $p = HTML::TokeParser::Simple->new(string => $data);
while ( my $token = $p->get_token ) { # List of all HTML tags
if ( $token->is_start_tag(qr/^(?:a|abbr|acronym|address|applet|area|article|aside
|audio|b|base|basefont|bdi|bdo|bgsound|big|blink
|blockquote|body|br|button|canvas|caption|center
|cite|code|col|colgroup|command|content|data
|datalist|dd|del|details|dfn|dialog|dir|div|dl
|dt|element|em|embed|fieldset|figcaption|figure
|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6
|head|header|hgroup|hr|html|i|iframe|image|img
|input|ins|isindex|kbd|keygen|label|legend|li
|link|listing|main|map|mark|marquee|menu|menuitem
|meta|meter|multicol|nav|nobr|noembed|noframes
|noscript|object|ol|optgroup|option|output|p
|param|picture|plaintext|pre|progress|q|rp|rt
|rtc|ruby|s|samp|script|section|select|shadow
|slot|small|source|spacer|span|strike|strong
|style|sub|summary|sup|table|tbody|td|template
|textarea|tfoot|th|thead|time|title|tr|track|tt
|u|ul|var|video|wbr|xmp)$/ix) ) {
if ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([>]{1})!gcix ) {
# <h1> # works
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white");
next;
}
# <meta charset="utf-8" /> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+[\w]{4})([='"]+)([/\w\d\-\_\.]+)(['"\s/>]{3,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white");
next;
}
# <meta property="og:site_name" content="TEMPLATED" /> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]{2,}[\w]{1,})([='"]+)([\w\d\-\_\.\:]{2,})([='"]+)([\s\w]+)([='"]{2,})([\w\d]+)(['"\s/>]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <meta http-equiv="X-UA-Compatible" content="IE=edge" />
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]{2,}[\w\-]{2,})([='"]+)([\w\d\-\_\.\:]{2,})([='"]+)([\s\w]+)([='"]{2,})([\w\d\=]+)(['"\s/>]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <a href="https://twitter.com/share">Tweet</a> # works
elsif ( $token->as_is =~ m!\G([<]{1})(\w\s[\w]{4})([='"]+)([/\w\-\_\.]+)(.*?)(['"]+)!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "red"),print( colored "$5", "red"),print( colored "$6", "white");
next;
}
# <a href # works
elsif ( $token->as_is =~ m!\G([<]{1})(\w\s[\w]{4})([='"]+)([/\w\-\_\.]+)(.*?)(['"]+)!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "red"),print( colored "$5", "red"),print( colored "$6", "white");
next;
}
# <meta property="twitter:site" content="@templatedco" /> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]{2,}[\w]{1,})([='"]+)([\w\d\-\_\.\:]{2,})([='"]+)([\s\w]+)([='"]{2,})([\@\w\d]+)(['"\s/>]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <script type="text/javascript" src="/uutiset/public/custom_components/modernizr.min.js">
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+[\w]{4})([='"]{2,})([\w\/\w]+)(['"]+)([\w]+)([='"]{2,})([\w\/\-\_\.]+)([\s'">]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "red"),print( colored "$9", "white");
next;
}
# <script type="text/javascript"> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+[\w]{4})([='"]{2,})([\w\/\w]+)(['">]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white");
next;
}
# <script src="/advertisement.ad.js" async="async"> # works
elsif ( $token->as_is =~ m!\G([<]{1})([a-zA-Z\s]+)([='"]{2})([\/\.a-zA-Z]+)(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z]+)(['">]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "red"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s\-]+)([=]{1})(['"]+)([\w\/\-\;]+)(['"]+)([\w\d\s\-]+)([='"]{2})([\w\/\-]+)([;]+)([\w\d\s]+)([=]+)([\w\d\/\-]+)(['"]{1})([\s\/>]+)!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "white"),print( colored "$5", "blue"),print( colored "$6", "white"),print( colored "$7", "blue"),print( colored "$8", "white"),print( colored "$9", "blue"),print( colored "$10", "white"),print( colored "$11", "blue"),print( colored "$12", "white"),print( colored "$13", "blue"),print( colored "$14", "white"),print( colored "$15", "white");
next;
}
# <meta name="robots" content="noindex" /> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([='"]{2,})([a-zA-Z]{6})(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z]+)(['"\s\/>]{3,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <meta name="description" content="Find latest news coverage of breaking news events, trending topics, and compelling articles, photos and videos of US and international news stories."/> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([='"]{2,})([a-zA-Z]{11,})(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z0-9\s\,\.\s]+)(['"\s\/>]{3,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <meta name="Description" content="the glories of HTML::TokeParser::Simple" /> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([='"]{2,})([a-zA-Z]{8,})(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z0-9\s\:].*?)(['"\s\/>]{4,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
# <img alt="image alt text" src="my.gif"> # works
elsif ( $token->as_is =~ m!\G([<]{1})([\w]{3}[\s\w]{3,})([='"\s]{2,})([\w\s]+)(['"]+\s)([\w]+)([="]+)([\w\.\w]+)(['">]{2,})!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
next;
}
}
if ( $token->is_end_tag(qr/^(?:a|abbr|acronym|address|applet|area|article|aside
|audio|b|base|basefont|bdi|bdo|bgsound|big|blink
|blockquote|body|br|button|canvas|caption|center
|cite|code|col|colgroup|command|content|data
|datalist|dd|del|details|dfn|dialog|dir|div|dl
|dt|element|em|embed|fieldset|figcaption|figure
|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6
|head|header|hgroup|hr|html|i|iframe|image|img
|input|ins|isindex|kbd|keygen|label|legend|li
|link|listing|main|map|mark|marquee|menu|menuitem
|meta|meter|multicol|nav|nobr|noembed|noframes
|noscript|object|ol|optgroup|option|output|p
|param|picture|plaintext|pre|progress|q|rp|rt
|rtc|ruby|s|samp|script|section|select|shadow
|slot|small|source|spacer|span|strike|strong
|style|sub|summary|sup|table|tbody|td|template
|textarea|tfoot|th|thead|time|title|tr|track|tt
|u|ul|var|video|wbr|xmp)$/ix) ) {
if ( $token->as_is =~ m!\G([</]+)([\w]+)(>)!gcix ) {
print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white");
next;
}
}
if ($ignore) {
#Everything inside the script tag. Here you can ignore or print as is
if ($token->as_is) {
print $token->as_is;
}
}
else
{
#Everything excluding scripts falls here handle as appropriate
next unless $token->is_text;
print $token->as_is;
}
}
}
__DATA__
<!doctype html>
<!--[if lt IE 7 ]> <html lang="fi" class="ie ie6"> <![endif]-->
<!--[if IE 7 ]> <html lang="fi" class="ie ie7"> <![endif]-->
<!--[if IE 8 ]> <html lang="fi" class="ie ie8"> <![endif]-->
<!--[if IE 9 ]> <html lang="fi" class="ie ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--><html lang="fi"><!--<![endif]-->
<html>
<head>
<title>//////'s test page</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="Description" content="the glories of HTML::TokeParser::Simple" />
<meta name="keywords" content="one two three four five six seven eight nine ten" />
<meta name="robots" content="noindex" />
<link rel="stylesheet" type="text/css" href="cwi.css" />
</head>
<body>
<h1>header one</h1>
<h2>header two</h2>
<h3>header three</h3>
<h4>header four</h4>
<h5>header five</h5>
<h6>header siz</h6>
<p>p tag paragraph</p>
<p>p tag containing <u>underline</u> and <b>bold</b> and a <a href="http://test.foo.com/link.html">link</a></p>
<p>p tag containing <u>underline</u> and <b>bold</b> and a <a href="http://foo.com/bar/link.html">link</a></p>
<img alt="image alt text" src="my.gif">
</body>
</html>
I just don't know how to parse comments such as these:
Code:
<!-- comment foobar -->
<![CDATA[ blah blah ]]>
06-16-2017, 02:19 AM
#5
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,306
For the comments, there is the method is_comment.
For the regular HTML elements, I'd say it's neither necessary nor helpful to try to enumerate them. Just call is_start_tag without passing any arguments to it.
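A minimal sketch of both points, assuming HTML::TokeParser::Simple is installed. The helper name classify, the color choices and the sample markup are only for illustration. CDATA sections are not covered here; how the parser reports them would need checking against its documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Term::ANSIColor;             # core module; exports colored()
use HTML::TokeParser::Simple;    # CPAN module, as used earlier in the thread

# Hypothetical helper: return [ kind, source text ] pairs for each token.
sub classify {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );
    my @out;
    while ( my $token = $p->get_token ) {
        my $kind = $token->is_comment   ? 'comment'
                 : $token->is_start_tag ? 'start'     # no tag list needed
                 : $token->is_end_tag   ? 'end'
                 :                        'other';
        push @out, [ $kind, $token->as_is ];
    }
    return @out;
}

my %color = ( comment => 'green', start => 'magenta', end => 'magenta', other => 'white' );
print colored( $_->[1], $color{ $_->[0] } )
    for classify("<!-- comment foobar --><h1>header one</h1>\n");
```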
06-16-2017, 04:03 AM
#6
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,306
I'd make use of the parser's methods something like this:
Code:
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag ) {
my $element = $token->[1];
my @attrs = @{$token->[3]};
# print Dumper( \@attrs ),qq(\n\n);    # debugging aid; needs "use Data::Dumper;"
print qq(<),colored($element,'magenta');
foreach my $attr ( @attrs ) {
if ( $attr ne '/' && $attr ne 'script' ) {
print qq( ),colored($attr, 'cyan');
if ( defined ( $token->[2]{$attr} ) ) {
print qq(="),colored($token->[2]{$attr},'green'),qq(");
}
} else {
print qq( $attr);
}
}
print qq(>);
} elsif ( $token->is_text ) {
print colored($token->[1],'white');
} elsif ( $token->is_end_tag ) {
print qq(</),colored( $token->[1],'magenta'),qq(>);
} elsif ( $token->is_comment ) {
print colored( $token->[1],'white');
} elsif ( $token->is_declaration ) {
print colored( $token->[1],'bright_white');
} else
{
# Only processing instructions reach this branch; print them unchanged.
print $token->as_is;
}
}
If you want to examine the tokens in a more generic way, the core module Data::Dumper provides the Dumper function, which prints arbitrary data structures.
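Something along these lines would do for a quick look. The helper name dump_tokens and the sample markup are placeholders, and the two Data::Dumper settings are only there to make the output easier to read.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;                # core module
use HTML::TokeParser::Simple;    # CPAN module, as used earlier in the thread

$Data::Dumper::Indent   = 1;     # compact nesting
$Data::Dumper::Sortkeys = 1;     # stable hash key order

# Hypothetical helper: return the Dumper output for every token in $html.
sub dump_tokens {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );
    my $out = '';
    while ( my $token = $p->get_token ) {
        $out .= Dumper($token);
    }
    return $out;
}

print dump_tokens('<a href="/x" class="y">link</a>');
```

For a start tag, the dump shows a blessed array holding the type, the element name, a hash of attributes and the attribute order, which is where the $token->[1], $token->[2] and $token->[3] lookups in the loop above come from.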
06-16-2017, 06:36 AM
#7
Member
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: Arch Linux && OpenBSD 7.4 && Pop!_OS && Kali && Qubes-Os
Posts: 824
Original Poster
That is an awesome script.
Thanks a lot for your input.
Now I can start looking for other bugs in my script.