Hello, I would like to use the Linux Bash shell to analyse HTML meta tags in websites, because I cannot afford to pay for Windows SEO tools and, to be honest, I cannot stand Windows. I also want to feel like I am programming the internet rather than using GUI tools. Don't ask me why, it's just the way I am right now.
But anyway, my question is: does anyone have some Bash scripts I could follow for HTML tag analysis, or any advice? Basically I am looking to get into SEO and want to be able to build my own Bash SEO toolset.
Thank you for any advice.
~Rap
Last edited by Raptorialis; 12-01-2010 at 05:23 AM.
Yes, I heard that as well, thank you. I have to say that I do not want to learn Perl, because right now it would be too complicated for me. If I can do everything with wget, sed, awk and grep I would be much happier. Simply scanning an HTML page for meta tags, reading them into variables and displaying them would be a great place for me to start. If anyone knows Bash and has an example that does this, please let me know; then maybe I can extend the script as I learn more.
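A minimal sketch of that starting point, using only grep and sed. It assumes the page has already been fetched (for example with wget) and that each tag opens and closes on a single line with double-quoted attributes. The function names get_title and get_meta and the sample page are made up for illustration; this is not a robust parser.

```shell
#!/usr/bin/env bash
# Sketch only: read <title> and <meta name="..."> values into shell variables.
# Assumes lowercase <title>, one tag per line, double-quoted attributes.

# Hypothetical sample page; in practice this text would come from wget or curl.
html='<html><head>
<title>Hello World</title>
<meta name="description" content="This is a site about hello world">
<meta name="keywords" content="hello, world">
</head><body><h1>Welcome</h1></body></html>'

# Print the text of the first <title> element found on stdin.
get_title() {
    sed -n 's:.*<title>\(.*\)</title>.*:\1:p' | head -n 1
}

# Print the content attribute of the first <meta name="$1" ...> on stdin.
get_meta() {
    grep -i "<meta name=\"$1\"" |
        sed -n 's/.*content="\([^"]*\)".*/\1/p' | head -n 1
}

title=$(printf '%s\n' "$html" | get_title)
description=$(printf '%s\n' "$html" | get_meta description)
keywords=$(printf '%s\n' "$html" | get_meta keywords)

echo "Site Title = \"$title\""
echo "Site Description = \"$description\""
echo "Keywords = \"$keywords\""
```

Anything that deviates from those assumptions (attributes in a different order, single quotes, a tag split across lines) will slip past these patterns, which is the usual weakness of line-based tools on HTML.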
Well, the tools you have listed do have their abilities, but as has been said to me many times, the main issue with (ht|x)ml is that the structures can be spread over many lines, and hence these tools can be less up to the job in some cases.
That being said, perhaps if you gave a demo of the input and the desired output, it would make the job of helping you a great deal easier?
Here are some links to help you on your way with some of the items you listed:
The input
run getseotag.sc <www.mydomain.com> <keyword="hello world">
The output
Site Title = "Hello World"
Site Description = "This is a site about hello world"
H1 = "Welcome to the hello world site"
H2 = "Situated in the heart of hello world"
H3 = "Great places to eat in hello world"
print: The keyword "hello world" for <www.mydomain.com> was found in 5 tags.
So... this would be a simple SEO HTML tag analysis Bash script, which I called getseotag.sc for this example.
I will take a good look at the links you sent and see if I can work it out.
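A rough sketch of how a report like the one above might be produced in plain shell. It is not the getseotag.sc script itself: the function names extract and seo_report, the sample page and the output labels are all assumptions, and fetching the page is left out. Newlines are flattened to spaces first so that a tag split over several lines still matches, but nested tags (for example a link inside an h1) are skipped.

```shell
#!/usr/bin/env bash
# Sketch only: print title/description/h1-h3 text and count how many of
# those items contain a keyword. Not a real HTML parser.

# extract ITEM: read HTML on stdin and print the matching text, one per line.
# ITEM is "description" (the meta tag) or a tag name such as title or h1.
extract() {
    local flat
    flat=$(tr '\n' ' ')    # join lines so multi-line tags can match
    case $1 in
        description)
            printf '%s\n' "$flat" |
                grep -oi '<meta[^>]*name="description"[^>]*>' |
                sed -n 's/.*content="\([^"]*\)".*/\1/p' ;;
        *)
            printf '%s\n' "$flat" |
                grep -oi "<$1[^>]*>[^<]*</$1>" |
                sed 's/<[^>]*>//g' ;;
    esac
}

# seo_report HTML KEYWORD: print each item and a final keyword count.
seo_report() {
    local html="$1" keyword="$2" count=0 item texts text
    for item in title description h1 h2 h3; do
        texts=$(printf '%s\n' "$html" | extract "$item")
        while IFS= read -r text; do
            [ -n "$text" ] || continue
            printf '%s = "%s"\n' "$item" "$text"
            # Case-insensitive, fixed-string match of the keyword.
            if printf '%s\n' "$text" | grep -qiF -- "$keyword"; then
                count=$((count + 1))
            fi
        done <<EOF
$texts
EOF
    done
    printf 'The keyword "%s" was found in %d tags.\n' "$keyword" "$count"
}

# Hypothetical sample page with an h1 deliberately split over two lines.
sample='<html><head>
<title>Hello World</title>
<meta name="description" content="This site is about hello world">
</head><body>
<h1>Welcome to the
hello world site</h1>
<h2>Situated in the heart of hello world</h2>
<h3>Great places to eat</h3>
</body></html>'

seo_report "$sample" "hello world"
```

For this sample the last line should report 4 tags, since the h3 text does not contain the keyword. A real page could be passed in the same way, for instance by capturing the output of wget -q -O - into a variable first.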
This project sounds somewhat ambitious and, if it is going to be successful, requires proper parsing of HTML. Writing a robust XML/HTML parser in bash would be something akin to building a house using a rock as a hammer. It may be possible, but it will be frustrating and, even if it works, will always be substandard in terms of performance and maintainability. The time spent learning Perl and taking advantage of any of the several existing and supported HTML parsers will be trivial compared to the time wasted trying to do this in bash. Use the right tool for the job. If you already know grep/sed/awk/bash, then learning Perl (or another higher-level language, like Python) will not be so difficult.
Been there, done that, had to sell the T-shirt to recover the losses.
Thank you kindly for your advice, theNbomr. I very much respect what you say.
I will consider it closely, because I have other plans for the Unix command line.
For instance, I want to scan a list of my website domains and then pull back all the comments made on my sites, so I can identify spam and so on.
So I guess Perl is the answer in these cases.
The only concern I have is whether a Linux VPS has a Perl interpreter installed by default. What I don't want to get into is having too much to install on Linux every time I want to write scripts. Bash is there with the default VPS install. I am not sure if Perl is there by default.
I will have to trade off ease of programming against the time it would take to get Perl installed and working on any Linux-based VPS.
I would be grateful for further input if possible.
Well, I still agree that it may become bigger than Ben Hur, but I like a challenge.
So your output makes sense to me, but you will need to deliver a proper HTML page that we can work on, so we can assess what we need to battle for record separation. It doesn't have to be real per se, but we will need the exact tags to look for and search between.
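Code:
#!/usr/bin/env perl
use strict;
use warnings;
use URI::Fetch;        # HTML::TagParser needs this to fetch a remote page
use HTML::TagParser;

# Fetch and parse the page in one step.
my $url  = "http://gitref.org";
my $html = HTML::TagParser->new($url);

# In scalar context, getElementsByTagName returns only the first match.
my $element = $html->getElementsByTagName("title");
print "title: ", $element->innerText(), "\n" if ref $element;

# In list context, it returns every matching element.
my @headings = $html->getElementsByTagName("h2");
foreach my $item (@headings) {
    print "heading: ", $item->innerText(), "\n";
}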
Output:
Code:
telemachus ~ ❯❯ perl tagger
title: Git Reference
heading: Introduction to the Git Reference
heading: How to Think Like Git
Every Linux, BSD or Unix-like distro I know of has Perl installed out of the box. In order to make this code work, you need to install two Perl modules from CPAN: URI::Fetch and HTML::TagParser. Do yourself a favor: install cpanm first and use it to install the other modules.
The documentation for HTML::TagParser is pretty clear, but it does assume that you're comfortable with Perl. If you're not, then obviously you'll want to study Perl basics first. I'm a fan of Learning Perl, but there are many good books on Perl.
Last edited by Telemachos; 12-19-2010 at 09:38 PM.
Reason: Add output
Telemachos, thanks for your advice. Can I use Perl to post article content to my blog automatically, and does Perl have any way of passing my username and password to my blog and then posting the article?