LinuxQuestions.org
Old 05-19-2007, 02:58 PM   #1
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Rep: Reputation: 30
Interesting Question - Parsing a webpage


Hi,

This should be an interesting question, I believe (if not, I am sorry).

Many web pages contain action buttons; when a button is clicked, the action registered to it is executed. My question is: how can the manual process of finding a button in the page and clicking it be done through a script?

Basically, we have an internal page with many action buttons. Clicking one navigates to another page containing the vital information, which can be parsed (the HTML parsing itself is easy). But that page contains further action buttons, and based on the values parsed I need to click one or more of them.

Even if the values can be parsed, how do I script the job of clicking the right button?

Any clues would be much appreciated!

Thanks
 
Old 05-19-2007, 05:42 PM   #2
Proud
Senior Member
 
Registered: Dec 2002
Location: England
Distribution: Used to use Mandrake/Mandriva
Posts: 2,794

Rep: Reputation: 116
If you are parsing the HTML and not rendering the page, you can't just trigger a mouse click over the button's area. However, if the button calls some JavaScript you could perhaps analyse the function, or if it submits a form you can issue a POST or GET request to the same URL with the expected parameters. Hopefully each button has a name or a reliable location.
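
As a rough illustration of the form-submission route, here is a minimal Python sketch (Python is just a convenient choice; nothing in this thread prescribes it). The URL `http://intranet.example/report` and the field names `record_id` and `action` are made up for the example; the real ones have to be read from the page's HTML. The requests are only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: take the real URL and field names from the page's <form>.
FORM_URL = "http://intranet.example/report"
FIELDS = {"record_id": "42", "action": "approve"}


def build_request(url, fields, method="POST"):
    """Build (but do not send) the request a button click would trigger."""
    encoded = urlencode(fields)
    if method.upper() == "GET":
        # GET forms carry their fields in the query string.
        return Request(url + "?" + encoded, method="GET")
    # POST forms carry the same string in the request body.
    return Request(
        url,
        data=encoded.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


post_req = build_request(FORM_URL, FIELDS)
get_req = build_request(FORM_URL, FIELDS, method="GET")
print(post_req.get_method(), post_req.full_url, post_req.data)
print(get_req.get_method(), get_req.full_url)
```

Passing either object to `urllib.request.urlopen()` would actually send it; whether the server also expects cookies or hidden fields depends on the application.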
 
Old 05-20-2007, 05:19 AM   #3
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for the reply.

I am not rendering any HTML page.

I fetch the HTML page with wget and parse it to retrieve the contents.

Based on the values obtained from parsing, I need to trigger the button action.

Let me try it out !
 
Old 05-20-2007, 10:00 AM   #4
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1043
Just to elaborate a bit more on Proud's answer:

Usually information is sent through forms, and somehow the form is submitted. Often there is a submit button, but it can also be done by any event which triggers the JavaScript submit() method. Usually it is something like this.form.submit() or myformname.submit().

What you'd have to look for is something like
Code:
<FORM blabla action="next_page.html" bla bla>
.....
.....
</FORM>
So you would have to create a small parser which parses the <FORM> tag.

Additionally, within the form you will encounter form elements (input boxes, radio buttons, etc.) which can be given a value. Once you have figured that out, you can compose a POST string and pass it to wget so it can actually be posted.
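
To make that concrete, here is a small sketch in Python using only the standard library. It is an illustration under assumptions, not a finished tool: the sample HTML, the field names and the button values are invented, only `<input>` elements inside the first form are handled, and `FormParser` and `post_string` are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Made-up sample page; in practice this would be the HTML that wget saved.
SAMPLE_HTML = """
<form name="f1" action="next_page.html" method="post">
  <input type="text" name="user" value="kshkid">
  <input type="hidden" name="token" value="abc123">
  <input type="submit" name="go" value="Approve">
  <input type="submit" name="go" value="Reject">
</form>
"""


class FormParser(HTMLParser):
    """Collect action, method, fields and submit buttons of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"   # HTML default when the attribute is absent
        self.fields = {}      # name -> preset value of non-button inputs
        self.buttons = []     # (name, value) of each submit button
        self.in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.in_form and attrs.get("name"):
            value = attrs.get("value") or ""
            if (attrs.get("type") or "text").lower() == "submit":
                self.buttons.append((attrs["name"], value))
            else:
                self.fields[attrs["name"]] = value

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def post_string(form, button_value):
    """URL-encode the form fields plus the one button being 'clicked'."""
    data = dict(form.fields)
    for name, value in form.buttons:
        if value == button_value:
            data[name] = value
            break
    else:
        raise ValueError(f"no submit button with value {button_value!r}")
    return urlencode(data)


form = FormParser()
form.feed(SAMPLE_HTML)
body = post_string(form, "Approve")
print(form.method, form.action)
print(body)
```

The printed string is the kind of thing that can be handed to wget's `--post-data` option, with the form's action resolved against the URL of the page it came from.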

Interesting project, I have sometimes faced the same need, but lacked the time to implement it.

I think that this mechanism is also used by spammers who create spam bots to post on forums (hence the pattern you have to recognize before you can register). But maybe there is already something published about this subject.

jlinkels
 
Old 05-20-2007, 01:45 PM   #5
jiml8
Senior Member
 
Registered: Sep 2003
Posts: 3,171

Rep: Reputation: 116
My website actually watches for this sort of behavior, because it is typical spambot behavior.

If you do any of the things my site watches for, it'll blacklist your IP permanently, and send me an email. And I will contact your hosting service with a complaint.
 
Old 05-20-2007, 07:02 PM   #6
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1043
Jiml8,

It might be typical for spambots (that was the use I referred to), but there are valid reasons to use this.

I have had at least two fully legitimate applications where my server logs into a web page and does something with the forms presented.

I agree with you that spambots (and their owners) should be repelled as much as possible.

jlinkels
 
Old 05-20-2007, 11:45 PM   #7
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for the replies !

A big 'NO' to the question of whether this is for spamming.

Actually, it's part of a project in which we are trying to automate the process of clicking and copying the data.

Since the base application has been tested manually and works, we would like to test it with different kinds of data, which is not possible to do manually.

Hence, the need for automation.

I guarantee this is not for spamming.

 
Old 05-21-2007, 02:04 AM   #8
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
This is an interesting topic.
Just for interest's sake, I wrote a little Perl script that uses HTML::Parser to parse a web page that I grabbed with wget (this very page was the example I tested with). I was easily able to find all of the JavaScript elements. Some are inline JavaScript, and others are links. I don't know any JavaScript, but I couldn't see much hope of recognizing anything that looked like it was creating a button or handling a button press.
Can someone give an example or description of what to look for?

--- rod.
 
Old 05-21-2007, 02:13 AM   #9
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Ubuntu 12.04, Antix19.3
Posts: 3,794

Rep: Reputation: 282
I don't know if this helps, but you can try lynx (the command-line browser). It can accept a command file with keystrokes.
Read man lynx, specifically the options cmd_log and cmd_script.

And the source code for lynx is available, so you can always go through that to get your own simulation.
 
Old 05-21-2007, 02:24 AM   #10
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Have you looked at WWW::Mechanize from Perl?
http://search.cpan.org/~petdance/WWW...hanize/FAQ.pod
 
Old 05-21-2007, 03:58 AM   #11
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by theNbomr
This is an interesting topic.
Just for interest's sake, I wrote a little Perl script that uses HTML::Parser to parse a web page that I grabbed with wget (this very page was the example I tested with). I was easily able to find all of the JavaScript elements. Some are inline JavaScript, and others are links. I don't know any JavaScript, but I couldn't see much hope of recognizing anything that looked like it was creating a button or handling a button press.
Can someone give an example or description of what to look for?

--- rod.
If you don't mind, could you please post your parser?

Though I have implemented a parser myself, it's too complicated, as it doesn't use any packages such as HTML::Parser.

I would like to know how to make things simpler.

Thanks
 
Old 05-21-2007, 12:19 PM   #12
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Okay, here it is. It was never written with the intention of publishing it, so it isn't pretty. This was hacked together on the fly, so lots of useless code, but I don't want to break it by trying to pretty it up. A lot of this was just cribbed from the documentation page on CPAN.

Code:
#!/usr/bin/perl
#
#   jscriptParser.pl
#
#   Finds all javascript references in an HTML file
#
#   jscriptParser.pl  <filename.html>
#
use strict;
use warnings;
use HTML::Parser ();

my @scripts;
my $expectingJscript = 0;
 
sub start_handler{

my $tag = shift;
my $attr = shift;
my $text = shift;
my $self = shift;

    my %attrList = %{$attr};

    if( $tag =~ m/^script/i ){

        # print "Start Tag: $tag\n";
        # print "Text: $text\n";
        
        if( exists $attrList{ "type" } ){

            if( $attrList{ "type" } =~ m/text\/javascript/i ){

                #   print "\tAttributes:\n";
                #   foreach my $attr ( keys %attrList ){
                #       print "\t$attr = $attrList{ $attr }\n";
                #   }
                if( exists $attrList{ "src" } ){
                    push @scripts, $attrList{ "src" };
                }
                else{
                    # print "Expecting inline javascript\n";
                    $expectingJscript = 1;
                }
            }
        }
        else{
            $expectingJscript = 0;
            return;
        }
    }
    return;    

}

sub end_handler{

my $tag = shift;
my $text = shift;
my $self = shift;

    if( $tag =~ m/^script$/i ){    # anchored so </noscript> does not match

        # print "End Tag: $tag\n";
        $expectingJscript = 0;
    }
    return;    

}

sub comment_handler{

my $text = shift;
my $self = shift;

    if( $expectingJscript ){
        print "Javascript CommentText: $text\n";
    }
    else{
        # print "Non-jscript comment ignored\n";
    }
    return;    

}

sub text_handler{

my $text = shift;
my $self = shift;

    if( $expectingJscript ){
        print "\n\nInline javascript : \n";
        print     "===================\n";
        print "$text\n";
    }
    else{
        # print "Non-jscript Dtext ignored\n";
    }
    return;    

}


    my $p = HTML::Parser->new(api_version => 3);

    # 
    #  Assign handlers for various HTML element types
    #
    $p->handler( start => \&start_handler, "tagname,attr,text,self");
    $p->handler( end => \&end_handler, "tagname,text,self");
    $p->handler( comment => \&comment_handler, "text,self");
    $p->handler( text => \&text_handler, "text,self");
    
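    # Require a filename and report a useful message if it cannot be opened
    my $file = shift or die "Usage: $0 <filename.html>\n";
    $p->parse_file($file) or die "Cannot parse $file: $!\n";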

    #
    # Dump the list of found jscript references
    #
    print "\n\nExternal scripts named:\n",
          "===========================\n";
    foreach my $scriptSource ( @scripts ){
        print  $scriptSource,"\n";
    }
Just run the script with the filename of an HTML page as an argument.

chrism01's reference to WWW::Mechanize, which I'd completely forgotten about, suggests that this problem is by no means trivial to solve.

--- rod.

Last edited by theNbomr; 05-21-2007 at 12:20 PM.
 
Old 05-21-2007, 01:08 PM   #13
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Thanks for the script!

Let me go through it and try it out!
 
  

