LinuxQuestions.org
Old 06-13-2017, 05:33 AM   #1
//////
Member
 
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: win 10 | OpenBSD 6.1 | Fedora 26
Posts: 241

Rep: Reputation: 72
[perl] making syntax highlight for html in a script.


Hello to everyone on the programming forum.

I have been coding an HTTP sniffer in Perl, which works reasonably well. The one thing I want to improve is the syntax highlighting for JavaScript / HTML. This part of the program has been regex practice for me.

Code:
#!/usr/bin/perl -w
use strict;
use warnings;

use Term::ANSIColor qw(:constants);
use Term::ANSIColor;

my @data =  <<'EOL';
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>TechSpot : Tech Enthusiasts, Power Users, Gamers</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="environment" content="production" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="viewport" content="width=device-width,maximum-scale=1,initial-scale=1,user-scalable=no" /> 
<meta name="title" content="TechSpot" />
	<nav id="main-menu" class="menu">
		<div class="wrapper">
			<ul id="dk_menu" class="">
				<li class="withSubmenuT">
					<a href="/trending/" data-source="trending">Trending</a>
					<div class='nav-submenu'>
						<ul class="SubHeader">
							<li class="DefArrow"><a href="/category/hardware/">Hardware</a></li>
							<li class="GArrow"><a href="/category/web/">The Web</a></li>
							<li class="PArrow"><a href="/category/culture/">Culture</a></li>
							<li class="DefArrow"><a href="/category/mobile/">Mobile</a></li>
							<li class="OArrow"><a href="/category/gaming/">Gaming</a></li>
							<li class="DefArrow"><a href="/category/apple/">Apple</a></li>
							<li class="DefArrow"><a href="/category/microsoft/">Microsoft</a></li>
							<li class="DefArrow"><a href="/category/google/">Google</a></li>
						</ul>
						<div class="loadAnim">
								<img src="/images/loading_blue2.gif" alt="Loading GIF">
							</div>
						<div class="wrapper">
						</div>
					</div>
				</li>
				<li class="withSubmenuT">


					<div class="article-category">
				<a href="http://www.techspot.com/category/hardware/">Hardware</a>
									<a href="http://www.techspot.com/drivers/" class="cat2">Drivers</a>
							</div>
		
		<h2>
							<a href="http://www.techspot.com/downloads/drivers/essentials/amd-crimson-hotfix/">Crimson Hotfix driver improves DiRT 4 and Prey&nbsp;performance</a>
					</h2>

				

		<div class="intro"><p>This new AMD driver offers a considerable performance boost of up to 30% when using 8xMSAA in DiRT 4 and Prey. Theres also a long list of fixed issues among which are a Virtual Super Resolution failure and flickering in some Radeon RX 400 and RX 500 products and enabling HDR on high resolution displays. Go to <a href="http://www.techspot.com/downloads/drivers/essentials/amd-crimson-hotfix/">our drivers section</a> for complete release notes and downloads for all platforms.</p>
</div>


					<div class="byline">
																	By Erik Orejuela, <time datetime="2017-06-09 18:00:00-0500" itemprop="datePublished"><span title="2017-06-09 18:00:00">June 9, 2017, 6:00 PM</span></time>
																	</div>
		
	</div>    <!-- /.article-content -->
	
	<div class="clearfix"></div>
	</div><!-- /.article-img -->
	
	<div class="article-content  ">

					<div class="article-category">
				<a href="http://www.techspot.com/category/apple/">Apple</a>
									<a href="http://www.techspot.com/category/security/" class="cat2">Security</a>
							</div>
		
		<h2>
							<a href="/news/69652-employees-chinese-apple-suppliers-arrested-selling-customer-data.html">Employees of Chinese Apple suppliers arrested for selling customer&nbsp;data</a>
					</h2>

				

		<div class="intro">Apple&rsquo;s long had a turbulent relationship with China, from trouble with regulators over iTunes Movies and the iBooks store, to Apple News censorship, to numerous patent disputes. Now, the firm is facing more problems in the Asian nation after Chinese&hellip;</div>


					<div class="byline">
																	By Rob Thubron, <time datetime="2017-06-09 11:15:00-0500" itemprop="datePublished"><span title="2017-06-09 11:15:00">June 9, 2017, 11:15 AM</span></time>
																									<em>
							<a href="/news/69652-employees-chinese-apple-suppliers-arrested-selling-customer-data.html#commentsOffset" class="comment-count"><span class="highlight ">3 comments</span></a>
						</em>
												</div>
		
	</div><!-- /.article-content -->
	
	<div class="clearfix"></div>
</article>

	<div class="article-content  ">

					<div class="article-category">
				<a href="http://www.techspot.com/category/gaming/">Gaming</a>
							</div>
		
		<h2>
							<a href="/news/69642-project-cars-2-trailer-pre-order-details-revealed.html">Project Cars 2 trailer, pre-order details&nbsp;revealed</a>
					</h2>

				

		<div class="intro">Distributor Bandai Namco and Slightly Mad Studios, the developer and publisher behind Project Cars, announced on Thursday the second installment of the highly successful racing game. Accompanying the announcement is a first-look trailer which we&rsquo;ve embedded above.As a huge fan&hellip;</div>


					<div class="byline">
																	By Shawn Knight, <time datetime="2017-06-09 07:00:00-0500" itemprop="datePublished"><span title="2017-06-09 07:00:00">June 9, 2017, 7:00 AM</span></time>
																										</div>
		
	</div><!-- /.article-content -->
	
	<div class="clearfix"></div>
</article>

<article>
		<div class="article-img">
		<a href="/news/69649-intel-hints-microsoft-qualcomm-windows-10arm-x86-emulation.html">
						<img class="b-lazy" 
         		src=
         		data-src="/images2/news/ts3_thumbs/2017/06/2017-06-09-ts3_thumbs-300-small.jpg"
			/>
					</a>
	</div><       !-- /.article-img -->
	
	
	</div>     <!-- /.article-content -->

<article>
	
	<div class="article-content featured no_image">

		
		<h2>
							<a href="http://www.techspot.com/review/1405-corsair-glaive-mouse/">Corsair Glaive RGB Gaming Mouse&nbsp;Review</a>
					</h2>

					<div class="byline">
												<a href="/category/techspot/"><span style="color:#c00; text-decoration:none; font-weight:600; padding: 0 8px 0 0;">FEATURE</span></a>				By Tim Schiesser, <time datetime="2017-06-09 00:00:00-0500" itemprop="datePublished"><span title="2017-06-09 00:00:00">June 9, 2017, 12:00 AM</span></time>
			
							<em>
					<a href="http://www.techspot.com/category/techspot/">TechSpot</a>
											<a href="http://www.techspot.com/category/hardware/" class="cat2">Hardware</a>

			
			<form action="http://techspot.us1.list-manage.com/subscribe/post?u=e4cfda23de0688b6339e986ae&amp;id=087c9b1822" method="post" id="mc-embedded-subscribe-form" name="mc-embedded-subscribe-form" class="validate" target="_blank">
	<div class="copyright">
		<div class="wrapper">
			<p>&copy; 2017 TechSpot, Inc. All Rights Reserved.</p>
							<p>TechSpot is a registered trademark. <a rel="nofollow" href="/terms.html">Terms of Use</a> <a rel="nofollow" href="/privacy.html">Privacy Policy</a> <a rel="nofollow" href="https://wrightsmedia.com/sites/techspot/" target="_blank">Licensing</a> <a rel="nofollow" href="http://www.techspot.com/advertising/">Advertise</a></p>
								<p>International Editions: <a rel="alternate" hreflang="en-US" href="http://www.techspot.com">US / UK</a> <a rel="alternate" hreflang="en-IN" href="http://www.in.techspot.com">India</a></p>
									</div>
	</div>
</footer>
</div>    <!-- end GlobalWrapper -->

<script type="text/javascript">
	$(document).ready(function() {
		var _sf_startpt=(new Date()).getTime();
		TSTopMenu();
		TSAlerts();
		TSMainMenuInit();
							showPrettyDates();
			});
</script>



<div id="c-mask" class="c-mask"></div>

</body>
</html>
EOL

source_check(@data);

sub source_check
{

my $data = $_[0];

	 while ($data =~ m!((?:\S|\s))!g) {
		print colored ("$1", "dark yellow");
		
		if ($data =~ m!\G([</]+)([\w\s]+)(>)!gci) { #  <script> , </li>		              
			print colored ("$1", "white");
			print colored ("$2", "magenta");
			print colored ("$3", "white");
		}
		elsif ($data =~ m!\G(<)([\w\s]+)([=\s]{1,})([\w\-0-9]+)([>]{1})!gci) {	# <script type=en>
			print colored ("$1", "white");
			print colored ("$2", "magenta");
			print colored ("$3", "white");
			print colored ("$4", "blue");
			print colored ("$5", "white");
			
		}
		elsif ($data =~ m!\G(<)([\w\s]+)([=\s]{1,})!gci) {	# <script type=
			print colored ("$1", "white");
			print colored ("$2", "magenta");
			print colored ("$3", "white");
		}
		elsif ($data =~ m!\G(<)([/\w\s]+)([=])([\w-]+)(>)!gci) {	# <html lang=en>
			print colored ("$1", "white");							# <meta charset=utf-8>
			print colored ("$2", "magenta");
			print colored ("$3", "white");
			print colored ("$4", "magenta");
			print colored ("$5", "white");
		}
		elsif ($data =~ m!\G(\<\!)(.*?)([\-\]]{2}\>)!gci) {	# <!-- matches comments --> <![CDATA[    ]]>
			print colored ("$1", "dark green");
			print colored ("$2", "dark green");
			print colored ("$3", "dark green");
		}
		elsif ($data =~ m!\G(['"]{1})([\w\s\-\:]+)(['"\s>]{1,})!gci) { # "list-of-entities__item-link"
			print colored ("$1", "white");
			print colored ("$2", "blue");
			print colored ("$3", "white");
		}
		elsif ($data =~ m!\G("?[a-z]{3,10}\:[/]{0,2})((\w+(?:\.|\_|\-|\@|\%|\$))+)([a-z]{2,6})((/)?(.*?)(")?)!gci) {		# "http://foo.foo.com/(.*?)"
			print colored ("$1", "red");
			print colored ("$2", "red");
			print colored ("$4", "red");
			print colored ("$5", "red");
		}
	}
};
That is the Perl code I have been using as the syntax highlighting part of my program. To me it looks ugly, and I would be grateful for code examples showing how to do it differently. I'm not asking you to write it completely, but it would be nice to get advice on a different approach.
 
Old 06-13-2017, 05:44 AM   #2
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Ubuntu, Devuan, OpenBSD
Posts: 2,372
Blog Entries: 3

Rep: Reputation: 1049
Well, I would not run it as root.

About the Perl code itself, there are already modules for parsing HTML which you can use to extract elements and highlight them. Start with a look at either of CPAN's HTML parsing modules: HTML::Parser or HTML::TokeParser.
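
To give a rough idea of what the parser hands back, here is a minimal sketch using HTML::TokeParser. The helper name start_tags and the sample markup are made up for illustration. Each token is an array reference whose first element is a type code and, for tags, whose second element is the tag name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper: return the names of all start tags in $html,
# in document order.
sub start_tags {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $p = HTML::TokeParser->new( \$html )
        or die "cannot create parser";

    my @names;
    while ( my $token = $p->get_token ) {
        # Element 0 is the token type: 'S' start tag, 'E' end tag,
        # 'T' text, 'C' comment, 'D' declaration.
        my ( $type, $name ) = @$token;
        push @names, $name if $type eq 'S';
    }
    return @names;
}

# Sample markup borrowed from the test data in the first post.
my $sample = '<ul id="dk_menu"><li class="GArrow">'
           . '<a href="/category/web/">The Web</a></li></ul>';
print join( ' ', start_tags($sample) ), "\n";
```

The point of the sketch is only that the tag name arrives already separated from its attributes, so no regex is needed to find it.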
 
Old 06-13-2017, 10:00 AM   #3
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 142
Code isn't always necessarily "pretty."

I ran your program and it looks quite good to me, though I'm not that familiar with colorizing terminal output.

Last edited by Laserbeak; 06-14-2017 at 03:46 PM.
 
Old 06-16-2017, 01:47 AM   #4
//////
Member
 
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: win 10 | OpenBSD 6.1 | Fedora 26
Posts: 241

Original Poster
Rep: Reputation: 72
Quote:
Originally Posted by Turbocapitalist
Well, I would not run it as root.

About the Perl code itself, there are already modules for parsing HTML which you can use to extract elements and highlight them. Start with a look at either of CPAN's HTML parsing modules: HTML::Parser or HTML::TokeParser.
Thanks for the tip.

I have been practicing writing regexes using HTML::TokeParser::Simple.

Code:
#!/usr/bin/perl -w
use strict;
use warnings;

use Term::ANSIColor qw(:constants);
use Term::ANSIColor;
use HTML::TokeParser::Simple;

my $html = join '', <DATA>;

source_check($html);

sub source_check
{

my $data = $_[0];

my $ignore=0;

my $p = HTML::TokeParser::Simple->new(string => $data);

while ( my $token = $p->get_token ) {                                    # List of all HTML tags
	if ( $token->is_start_tag(qr/^(?:a|abbr|acronym|address|applet|area|article|aside
									|audio|b|base|basefont|bdi|bdo|bgsound|big|blink
									|blockquote|body|br|button|canvas|caption|center
									|cite|code|col|colgroup|command|content|data
									|datalist|dd|del|details|dfn|dialog|dir|div|dl
									|dt|element|em|embed|fieldset|figcaption|figure
									|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6
									|head|header|hgroup|hr|html|i|iframe|image|img
									|input|ins|isindex|kbd|keygen|label|legend|li
									|link|listing|main|map|mark|marquee|menu|menuitem
									|meta|meter|multicol|nav|nobr|noembed|noframes
									|noscript|object|ol|optgroup|option|output|p
									|param|picture|plaintext|pre|progress|q|rp|rt
									|rtc|ruby|s|samp|script|section|select|shadow
									|slot|small|source|spacer|span|strike|strong
									|style|sub|summary|sup|table|tbody|td|template
									|textarea|tfoot|th|thead|time|title|tr|track|tt
									|u|ul|var|video|wbr|xmp)$/ix) ) {

		if ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([>]{1})!gcix ) {
			#  <h1>  # works
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"); 
			next;
		}
		#  <meta charset="utf-8" />  #  works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+[\w]{4})([='"]+)([/\w\d\-\_\.]+)(['"\s/>]{3,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white");
			next;
		}
        #  <meta property="og:site_name" content="TEMPLATED" />  #  works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]{2,}[\w]{1,})([='"]+)([\w\d\-\_\.\:]{2,})([='"]+)([\s\w]+)([='"]{2,})([\w\d]+)(['"\s/>]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;      
		}
		   # <meta http-equiv="X-UA-Compatible" content="IE=edge" />
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]{2,}[\w\-]{2,})([='"]+)([\w\d\-\_\.\:]{2,})([='"]+)([\s\w]+)([='"]{2,})([\w\d\=]+)(['"\s/>]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
		#  <a href="https://twitter.com/share"Tweet</a>     #  works
		elsif ( $token->as_is =~ m!\G([<]{1})(\w\s[\w]{4})([='"]+)([/\w\-\_\.]+)(.*?)(['"]+)!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "red"),print( colored "$5", "red"),print( colored "$6", "white");
			next;
		}
		#  <a href # works
		elsif ( $token->as_is =~ m!\G([<]{1})(\w\s[\w]{4})([='"]+)([/\w\-\_\.]+)(.*?)(['"]+)!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "red"),print( colored "$5", "red"),print( colored "$6", "white");
			next;
		}
		#  <meta property="twitter:site" content="@templatedco" />  # works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]{2,}[\w]{1,})([='"]+)([\w\d\-\_\.\:]{2,})([='"]+)([\s\w]+)([='"]{2,})([\@\w\d]+)(['"\s/>]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
		# <script type="text/javascript" src="/uutiset/public/custom_components/modernizr.min.js">
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+[\w]{4})([='"]{2,})([\w\/\w]+)(['"]+)([\w]+)([='"]{2,})([\w\/\-\_\.]+)([\s'">]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "red"),print( colored "$9", "white");
			next;
		} 
		# <script type="text/javascript"> # works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+[\w]{4})([='"]{2,})([\w\/\w]+)(['">]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white");
			next;
		}
		# <script src="/advertisement.ad.js" async="async"> # works
		elsif ( $token->as_is =~ m!\G([<]{1})([a-zA-Z\s]+)([='"]{2})([\/\.a-zA-z]+)(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z]+)(['">]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "red"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
		# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> # works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s\-]+)([=]{1})(['"]+)([\w\/\-\;]+)(['"]+)([\w\d\s\-]+)([='"]{2})([\w\/\-]+)([;]+)([\w\d\s]+)([=]+)([\w\d\/\-]+)(['"]{1})([\s\/>]+)!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "white"),print( colored "$5", "blue"),print( colored "$6", "white"),print( colored "$7", "blue"),print( colored "$8", "white"),print( colored "$9", "blue"),print( colored "$10", "white"),print( colored "$11", "blue"),print( colored "$12", "white"),print( colored "$13", "blue"),print( colored "$14", "white"),print( colored "$15", "white");
			next;
		}
		# <meta name="robots" content="noindex" /> # works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([='"]{2,})([a-zA-Z]{6})(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z]+)(['"\s\/>]{3,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
		# <meta name="description" content="Find latest news coverage of breaking news events, trending topics, and compelling articles, photos and videos of US and international news stories."/>  # works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([='"]{2,})([a-zA-Z]{11,})(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z0-9\s\,\.\s]+)(['"\s\/>]{3,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
		# <meta name="Description" content="the glories of HTML::TokeParser::Simple" />	 #  works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w\s]+)([='"]{2,})([a-zA-Z]{8,})(['"\s]+)([a-zA-Z]+)([='"]{2,})([a-zA-Z0-9\s\:].*?)(['"\s\/>]{4,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
		# <img alt="image alt text" src="my.gif"> # works
		elsif ( $token->as_is =~ m!\G([<]{1})([\w]{3}[\s\w]{3,})([='"\s]{2,})([\w\s]+)(['"]+\s)([\w]+)([="]+)([\w\.\w]+)(['">]{2,})!gcix ) {
			print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white"),print( colored "$4", "blue"),print( colored "$5", "white"),print( colored "$6", "blue"),print( colored "$7", "white"),print( colored "$8", "blue"),print( colored "$9", "white");
			next;
		}
	}
	
	
	if ( $token->is_end_tag(qr/^(?:a|abbr|acronym|address|applet|area|article|aside
									|audio|b|base|basefont|bdi|bdo|bgsound|big|blink
									|blockquote|body|br|button|canvas|caption|center
									|cite|code|col|colgroup|command|content|data
									|datalist|dd|del|details|dfn|dialog|dir|div|dl
									|dt|element|em|embed|fieldset|figcaption|figure
									|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6
									|head|header|hgroup|hr|html|i|iframe|image|img
									|input|ins|isindex|kbd|keygen|label|legend|li
									|link|listing|main|map|mark|marquee|menu|menuitem
									|meta|meter|multicol|nav|nobr|noembed|noframes
									|noscript|object|ol|optgroup|option|output|p
									|param|picture|plaintext|pre|progress|q|rp|rt
									|rtc|ruby|s|samp|script|section|select|shadow
									|slot|small|source|spacer|span|strike|strong
									|style|sub|summary|sup|table|tbody|td|template
									|textarea|tfoot|th|thead|time|title|tr|track|tt
									|u|ul|var|video|wbr|xmp)$/ix) ) {


	if ( $token->as_is =~ m!\G([</]+)([\w]+)(>)!gcix ) {

		print( colored "$1", "white"),print( colored "$2", "magenta"),print( colored "$3", "white");
		next;
	}
		

	}
	if ($ignore) {
		#Everything inside the script tag. Here you can ignore or print as is

	if ($token->as_is) {
		print $token->as_is;
	}	


	}
	else
	{  
	#Everything excluding scripts falls here handle as appropriate
	next unless $token->is_text;
	print $token->as_is;
	}
}
}

__DATA__
<!doctype html>
<!--[if lt IE 7 ]> <html lang="fi" class="ie ie6"> <![endif]-->
<!--[if IE 7 ]> <html lang="fi" class="ie ie7"> <![endif]-->
<!--[if IE 8 ]> <html lang="fi" class="ie ie8"> <![endif]-->
<!--[if IE 9 ]> <html lang="fi" class="ie ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--><html lang="fi"><!--<![endif]-->
<html>
<head>
<title>//////'s test page</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="Description" content="the glories of HTML::TokeParser::Simple" />
<meta name="keywords" content="one two three four five six seven eight nine ten" />
<meta name="robots" content="noindex" />
<link rel="stylesheet" type="text/css" href="cwi.css" />
</head>

<body>
<h1>header one</h1>
<h2>header two</h2>
<h3>header three</h3>
<h4>header four</h4>
<h5>header five</h5>
<h6>header siz</h6>

<p>p tag paragraph</p>
<p>p tag containing <u>underline</u> and <b>bold</b> and a <a href="http://test.foo.com/link.html">link</a></p>
<p>p tag containing <u>underline</u> and <b>bold</b> and a <a href="http://foo.com/bar/link.html">link</a></p>

<img alt="image alt text" src="my.gif">

</body>
</html>
I just don't know how to parse comments.

Code:
<!-- comment foobar -->

<![CDATA[ blah blah ]]>
 
Old 06-16-2017, 02:19 AM   #5
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Ubuntu, Devuan, OpenBSD
Posts: 2,372
Blog Entries: 3

Rep: Reputation: 1049
For the comments, there is the method is_comment.

For the regular HTML elements, I'd say it's not necessary, or even helpful, to try to enumerate them. Just use is_start_tag without passing any arguments to it.
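
As a minimal sketch of both suggestions, assuming HTML::TokeParser::Simple is installed: the function name highlight and the color choices below are only illustrative. Comments are picked out with is_comment, and every tag is matched by calling is_start_tag and is_end_tag with no arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Term::ANSIColor qw(colored);
use HTML::TokeParser::Simple;

# Hypothetical helper: return a colorized copy of $html,
# using one color per token type.
sub highlight {
    my ($html) = @_;
    my $p   = HTML::TokeParser::Simple->new( string => $html );
    my $out = '';

    while ( my $token = $p->get_token ) {
        if ( $token->is_comment ) {
            # The whole comment, delimiters included, in one color.
            $out .= colored( $token->as_is, 'green' );
        }
        elsif ( $token->is_start_tag || $token->is_end_tag ) {
            # No tag list needed: without arguments these match any tag.
            $out .= colored( $token->as_is, 'magenta' );
        }
        else {
            # Text, declarations and anything else pass through unchanged.
            $out .= $token->as_is;
        }
    }
    return $out;
}

print highlight(qq{<!-- comment foobar -->\n<p class="x">text</p>\n});
```

Because every branch appends as_is, stripping the color codes from the result should give back the original markup, which makes it easy to check that nothing was dropped.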
 
Old 06-16-2017, 04:03 AM   #6
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Ubuntu, Devuan, OpenBSD
Posts: 2,372
Blog Entries: 3

Rep: Reputation: 1049
I'd make use of the parser's methods something like this:

Code:
    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag ) {
            my $element = $token->[1];
            my @attrs = @{$token->[3]};
            # print Dumper( \@attrs ),qq(\n\n);
            print qq(<),colored($element,'magenta');
            foreach my $attr ( @attrs ) {
                if ( $attr ne '/' && $attr ne 'script' ) {
                    print qq( ),colored($attr, 'cyan');
                    if ( defined (  $token->[2]{$attr} ) ) {
                        print qq(="),colored($token->[2]{$attr},'green'),qq(");
                    }
                } else {
                    print qq( $attr);
                }
            }
            print qq(>);

        } elsif ( $token->is_text ) {
            print colored($token->[1],'white');

        } elsif ( $token->is_end_tag ) {
            print qq(</),colored( $token->[1],'magenta'),qq(>);

        } elsif ( $token->is_comment ) {
            print colored( $token->[1],'white');

        } elsif ( $token->is_declaration ) {
            print colored( $token->[1],'bright_white');

        } else {
            # Anything not handled above (e.g. processing instructions): print unchanged
            print $token->as_is;
        }
    }
If you want to examine the tokens in a more generic way, the Data::Dumper module provides the Dumper function, which prints arbitrary data structures.
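
A small sketch of that, using a hand-built array in the shape of a start-tag token instead of a real parser token (that stand-in is an assumption, so the example runs with core modules only). Tokens returned by get_token are blessed objects, so their dump would additionally show a bless(...) wrapper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

# Hand-built stand-in for a start-tag token (illustration only):
# type code, tag name, attribute hash, attribute order, original text.
my $token = [
    'S',
    'a',
    { href => '/trending/' },
    ['href'],
    '<a href="/trending/">',
];

$Data::Dumper::Sortkeys = 1;    # stable hash key order
$Data::Dumper::Indent   = 1;    # compact indentation

# Pass a reference so the whole structure is dumped as one $VAR1.
my $dump = Dumper($token);
print $dump;
```

Passing a reference matters: Dumper(@list) dumps each element as a separate $VAR, which is harder to read than a single nested structure.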
 
Old 06-16-2017, 06:36 AM   #7
//////
Member
 
Registered: Nov 2005
Location: Land of Linux :: Finland
Distribution: win 10 | OpenBSD 6.1 | Fedora 26
Posts: 241

Original Poster
Rep: Reputation: 72
That is an awesome script.

Thanks a lot for your input.
Now I can start looking for other bugs in my script.
 
  

