Old 12-01-2010, 05:21 AM   #1
Raptorialis
LQ Newbie
 
Registered: Apr 2008
Posts: 12

Rep: Reputation: 0
Advice: BASH Codes for html tag analysis


Hello, I would like to use the Linux Bash shell for HTML analysis of meta tags in websites, because I cannot afford to pay for Windows SEO tools and, to be honest, I cannot stand Windows. I also want to feel like I am programming the internet rather than using GUI tools. Don't ask me why, it's just the way I am right now.

But anyway, my question is: does anyone have some Bash scripts I could follow for HTML tag analysis, or any advice? Basically, I am looking to get into SEO and want to be able to build my own SEO Bash toolset.

Thank you for any advice.

~Rap

Last edited by Raptorialis; 12-01-2010 at 05:23 AM.
 
Old 12-01-2010, 06:44 AM   #2
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,207

Rep: Reputation: 1799
From other posts I have read it appears Perl has pretty good html parsing abilities.
 
Old 12-01-2010, 07:57 AM   #3
Raptorialis
LQ Newbie
 
Registered: Apr 2008
Posts: 12

Original Poster
Rep: Reputation: 0
Yes, I heard that as well, thank you. I have to say that I do not want to learn Perl, because right now it would be too complicated for me. If I can do everything with wget, sed, awk and grep, I would be much happier. Simply scanning an HTML page for meta tags, reading them into variables and displaying them would be a great place to start for me. If anyone knows Bash and has an example of accomplishing this, please let me know; then maybe I can extend the script as I learn more.

Thank you
~Rap
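
A minimal sketch of that starting point, using only grep, sed and head. The helper names `get_title` and `get_meta` and the file name `page.html` are made up for illustration, and the sketch assumes each tag sits on a single line with double-quoted attribute values, which real pages often violate.

```shell
#!/usr/bin/env bash
# Sketch only: assumes each tag sits on one line and attribute values
# are double-quoted. get_title and get_meta are hypothetical helper names.

# get_title FILE: print the text of the first <title> element.
get_title() {
    grep -io '<title[^>]*>[^<]*</title>' "$1" | head -n 1 | sed 's/<[^>]*>//g'
}

# get_meta NAME FILE: print the content="..." value of <meta name="NAME" ...>.
get_meta() {
    grep -io "<meta[^>]*name=\"$1\"[^>]*>" "$2" | head -n 1 |
        grep -io 'content="[^"]*"' | sed -e 's/^[^"]*"//' -e 's/"$//'
}

# Example use (page.html is a placeholder for a page saved beforehand):
#   title=$(get_title page.html)
#   description=$(get_meta description page.html)
#   echo "Site Title = \"$title\""
#   echo "Site Description = \"$description\""
```

The page can be saved first with something like `wget -q -O page.html http://www.mydomain.com`, so the helpers only ever read a local file.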
 
Old 12-01-2010, 08:21 AM   #4
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,207

Rep: Reputation: 1799
Well, the tools you have listed do have their abilities, but as has been said to me many times, the main issue with (ht|x)ml is that the structures can be spread over many lines, and hence these tools can be less up to the job in some cases.

That being said, perhaps if you gave a demo of the input and the desired output it would make the job of helping you a great deal easier?

Here are some links to help you on your way with some of the items you listed:

bash - http://tldp.org/LDP/abs/html/
awk - http://www.gnu.org/manual/gawk/html_node/index.html
sed - http://www.grymoire.com/Unix/Sed.html

Try some of your own solutions using these and let us know where you get stuck.
 
Old 12-01-2010, 10:00 AM   #5
Raptorialis
LQ Newbie
 
Registered: Apr 2008
Posts: 12

Original Poster
Rep: Reputation: 0
Thank you grail.

The input
run getseotag.sc <www.mydomain.com> <keyword="hello world">

The output
Site Title = "Hello World"
Site Description = "This is a site about hello world"
H1 = "Welcome to the hello world site"
H2 = "Situated in the heart of hello world"
H3 = "Great places to eat in hello world"
print: The keyword "hello world" for <www.mydomain.com> was found in 5 tags.

So... this would be a simple SEO HTML tag analysis Bash script, which I called getseotag.sc for this example.

I will take a good look at the links you sent and see if I can work it out.

Thanks for any advice.

~Rap
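
The report described above could be roughed out as follows. This is a sketch under assumptions, not a finished getseotag.sc: `tag_text` and `report` are hypothetical function names, the page is assumed to be saved locally already, each element is assumed to sit on one line with no nested markup inside it, and only title, h1, h2 and h3 are checked (the description meta tag is left out).

```shell
#!/usr/bin/env bash
# Sketch only: single-line elements, no nested tags inside them.
# tag_text and report are hypothetical names, not part of any tool.

# tag_text TAG FILE: print the text of every <TAG>...</TAG>, one per line.
tag_text() {
    grep -io "<$1[^>]*>[^<]*</$1>" "$2" | sed 's/<[^>]*>//g'
}

# report FILE KEYWORD: list the tag texts and count those containing KEYWORD.
report() {
    local file="$1" keyword="$2" hits=0 tag texts n
    for tag in title h1 h2 h3; do
        texts=$(tag_text "$tag" "$file") || true
        [ -n "$texts" ] || continue
        # Print each text as: tag = "text"
        printf '%s\n' "$texts" | sed "s/.*/$tag = \"&\"/"
        # Count lines containing the keyword (case-insensitive, literal match).
        n=$(printf '%s\n' "$texts" | grep -ciF -- "$keyword" || true)
        hits=$((hits + n))
    done
    printf 'The keyword "%s" was found in %d tags.\n' "$keyword" "$hits"
}

# Example use (page.html is a placeholder file name):
#   report page.html "hello world"
```

Counting is done per matching element, so a keyword that appears twice inside one heading still counts once.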
 
Old 12-01-2010, 04:48 PM   #6
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,391
Blog Entries: 2

Rep: Reputation: 900
This project sounds somewhat ambitious, and if it is going to be successful, requires proper parsing of HTML. Writing a robust XML/HTML parser in bash would be something akin to building a house, but using a rock as a hammer. It may be possible, but will be frustrating and even if it works, will always be substandard in terms of performance and maintainability. The time spent to learn Perl and take advantage of any of the several existing and supported HTML parsers will be trivial compared to the time wasted trying to do this in bash. Use the right tool for the job. If you already know grep/sed/awk/bash, then learning Perl (or other higher level language, like Python) will not be so difficult.

Been there, done that, had to sell the T-shirt to recover the losses.

--- rod.
 
Old 12-02-2010, 04:31 AM   #7
Raptorialis
LQ Newbie
 
Registered: Apr 2008
Posts: 12

Original Poster
Rep: Reputation: 0
Thank you kindly for your advice, theNbomr. I very much respect what you say.
I will consider it closely, because I have other plans for the Unix command line.
For instance, I want to scan a list of my website domains and then pull back all comments made on my sites so I can identify spam and the like.

So I guess Perl is the answer in these cases.

The only concern I have is whether a Linux VPS has a Perl interpreter installed by default. What I don't want to get into is having too much to install on Linux every time I want to write scripts. With Bash, it is there with the default VPS install. I am not sure if Perl is there by default.

I will have to trade off ease of programming against the amount of time it would take to get Perl installed and working on any Linux-based VPS.

I would be grateful for further input if possible.

cheers

~rap
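
Whether a given VPS already has Perl (or any of the other tools mentioned in this thread) can be checked from the shell before committing either way. A small sketch; `have_cmd` is a hypothetical helper name, and the tool list is only an example:

```shell
#!/usr/bin/env bash
# have_cmd NAME: succeed if NAME resolves to a command in the current shell.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the tools discussed in this thread are available.
for tool in bash perl wget sed awk grep; do
    if have_cmd "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: missing"
    fi
done
```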
 
Old 12-02-2010, 06:15 AM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,207

Rep: Reputation: 1799
Well, I still agree that it may become bigger than Ben-Hur, but I like a challenge.

So your output makes sense to me, but you will need to deliver a proper HTML page that we can work on, so we can assess what we need to battle for record separation. It doesn't have to be real per se, but we will need the exact tags to look for and search between.
 
Old 12-18-2010, 09:07 AM   #9
Raptorialis
LQ Newbie
 
Registered: Apr 2008
Posts: 12

Original Poster
Rep: Reputation: 0
Hi, I guess if we take the Google.com page and strip out all the meta tags and save them to a file, that will be a good start.

eg.
[Tags] [Tag Value]
Title - "Google Inc"
Description - "Google is the biggest search engine in the world"
[H1] - "Google"
[H2] - etc
[H3] = etc

The script should identify HTML or XML tags and then output the tag name and value.

Does that make sense?

Cheers

~Rap
 
Old 12-18-2010, 10:15 AM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,207

Rep: Reputation: 1799
Well, interestingly, I didn't get any of the details you have listed for the Google page, but maybe it is just how we access the data.

Probably sed or awk would be your best bets for searching ranges, something like:
Code:
sed -n '/<H1/,/<\/H1>/p' file
You will probably need to be more selective but this will give an idea.
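
One way to be more selective, and to cope with elements spread over several lines, is to flatten the file to a single line first and then pick out each element with `grep -o`. This is a different technique from the sed range above, sketched under the assumption that the element contains no nested tags; `flat_tag_text` is a hypothetical name.

```shell
#!/usr/bin/env bash
# flat_tag_text TAG FILE: print the text of every <TAG>...</TAG>, even when
# the element is split across lines. Assumes no nested tags inside it.
flat_tag_text() {
    # 1. Turn newlines into spaces and squeeze runs of spaces.
    # 2. Extract each whole element, matching the tag name case-insensitively.
    # 3. Strip the tags and trim leading/trailing spaces.
    tr -s '\n' ' ' < "$2" |
        grep -io "<$1[^>]*>[^<]*</$1>" |
        sed -e 's/<[^>]*>//g' -e 's/^ *//' -e 's/ *$//'
}

# Example use (file is a placeholder name, as in the sed command above):
#   flat_tag_text h1 file
```

Unlike the range form, this prints only the text inside the element rather than every whole line the element touches.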
 
Old 12-19-2010, 09:21 PM   #11
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
It sounds like you need something like HTML::TagParser.

Code:
#!/usr/bin/env perl
use strict;
use warnings;
use URI::Fetch;
use HTML::TagParser;

my $url = "http://gitref.org";
my $html = HTML::TagParser->new($url);

my $element = $html->getElementsByTagName("title");
print "title: ", $element->innerText(), "\n" if ref $element;

my @headings = $html->getElementsByTagName("h2");
foreach my $item (@headings) {
    print "heading: ", $item->innerText(), "\n";
}
Output:
Code:
telemachus ~ ❯❯ perl tagger
title: Git Reference
heading: Introduction to the Git Reference
heading: How to Think Like Git
Every Linux, BSD or Unix-like distro I know of has Perl installed out of the box. In order to make this code work, you need to install two Perl modules from CPAN: URI::Fetch and HTML::TagParser. Do yourself a favor: install cpanm first and use it to install the other modules.
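
Before running the script, it may help to check from the shell whether the two modules are already present. A small sketch; `have_perl_module` is a hypothetical helper name, and the hint it prints assumes cpanm has been installed as suggested above:

```shell
#!/usr/bin/env bash
# have_perl_module NAME: succeed if perl can load the module NAME.
# Also fails (quietly) when perl itself is not installed.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

for mod in URI::Fetch HTML::TagParser; do
    if have_perl_module "$mod"; then
        echo "$mod: installed"
    else
        echo "$mod: missing (try: cpanm $mod)"
    fi
done
```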

The documentation for HTML::TagParser is pretty clear, but it does assume that you're comfortable with Perl. If you're not, then obviously you'll want to study Perl basics first. I'm a fan of Learning Perl, but there are many good books on Perl.

Last edited by Telemachos; 12-19-2010 at 09:38 PM. Reason: Add output
 
Old 12-20-2010, 11:32 AM   #12
Raptorialis
LQ Newbie
 
Registered: Apr 2008
Posts: 12

Original Poster
Rep: Reputation: 0
Telemachos, thanks for your advice. Can I use Perl to post article content to my blog automatically? Does Perl have any way of passing my username and password to my blog and then posting the article?

thx

~Rap
 
  


Tags
analysis, bash, html, tags

