LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 04-16-2017, 12:30 AM   #1
drewk
LQ Newbie
 
Registered: Apr 2017
Posts: 5

Rep: Reputation: Disabled
Help using lynx dump


Hey linux gurus,

Is there a way to dump the main article of a website with lynx, without also getting content from iframes, etc.?

I use lynx -dump -nolist "url" > filename.txt

If lynx cannot do this, what about links or elinks? Thx
 
Old 04-16-2017, 12:54 AM   #2
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,112
Blog Entries: 3

Rep: Reputation: 2013
Welcome.

The answer is "maybe". It depends on how the article is marked up, how badly the site abuses HTML, and whether it is infected with JavaScript in places.

It is most likely you will need other tools. Perl is quite good at extracting parts of HTML documents, see the CPAN modules HTML::TreeBuilder or HTML::TreeBuilder::XPath for that. You can then pass the extracted piece to lynx for rendering.
 
Old 04-16-2017, 02:52 AM   #3
drewk
LQ Newbie
 
Registered: Apr 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
I don't know Perl or those other modules you listed. I am using sed to print between lines, and I got the results I wanted. lynx -dump works best on very simple websites with no frames, tables, etc., but those are rare.
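A minimal sketch of that sed approach; the marker lines and file are made up, and the range pattern breaks as soon as the page layout changes:

```shell
# stand-in for real "lynx -dump" output; the marker lines are hypothetical
printf '%s\n' 'menu junk' 'Article start' 'line one' 'line two' 'Article end' 'footer junk' > /tmp/dump.txt

# -n suppresses default output; /start/,/end/ prints only the lines in that range
sed -n '/^Article start$/,/^Article end$/p' /tmp/dump.txt
```

With a real page this would be something like `lynx -dump -nolist "url" | sed -n '/first marker/,/last marker/p'`.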
 
Old 04-16-2017, 03:29 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3310
i second xpath.
personally i had good experience with xmllint (part of libxml2) and the html-xml-utils.
so, what i'd try:
- get the name, class and/or id of the element i want to extract by opening the page in a good browser, press F12 for developer tools etc.
- use one of the above mentioned xpath-capable utilities to extract that part, resulting in a partial html document
- give that to lynx, see what it can make of it. should work.

sed is not good for html & co.
just keep in mind that
- sed relies on line breaks
- html does not
it might work now, but as soon as the site changes its layout it will break, and you're SOL again... and again... until you decide to tackle the learning curve.
at least that's what's happened to me.

if you like you can describe your problem in more detail, we will work something out.
 
Old 04-16-2017, 05:37 AM   #5
drewk
LQ Newbie
 
Registered: Apr 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by ondoho View Post
- get the name, class and/or id of the element i want to extract by opening the page in a good browser, press F12 for developer tools etc.
- use one of the above mentioned xpath-capable utilities to extract that part, resulting in a partial html document
- give that to lynx, see what it can make of it. should work.
where is the xpath utility in the link?
How do you type the xpath expression and where? I don't understand. I need detailed instruction.

Last edited by drewk; 04-16-2017 at 05:43 AM.
 
Old 04-16-2017, 06:17 AM   #6
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,112
Blog Entries: 3

Rep: Reputation: 2013
You need a utility that can extract elements based on XPath expressions. You can use a pre-made one or use perl to write one using the modules mentioned.

The following reads a file name or takes data from stdin and parses it, extracting all TD elements of the class "first" and printing them to stdout.

Code:
#!/usr/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# XPath: all TD elements whose class attribute is "first"
my $xpath = qq(//td[\@class="first"]);

# read the named file, or stdin if no argument was given
my $file = shift || '/dev/stdin';

my $xhtmlroot = HTML::TreeBuilder::XPath->new;
$xhtmlroot->implicit_tags(1);
$xhtmlroot->parse_file( $file )
    or die( "Could not parse '$file': $!\n" );

for my $element ( $xhtmlroot->findnodes( $xpath ) ){
    print $element->as_HTML( undef, "  " );
    print qq(\n);
}

exit( 0 );
There aren't so many guides for XPath, but once you get the syntax, it's not so hard.
https://www.data2type.de/en/xml-xslt...-introduction/
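A few common XPath patterns, for orientation (the element names, classes, and ids here are just examples):

```
//td                       all td elements, anywhere in the document
//td[@class="first"]       only td elements whose class attribute is exactly "first"
//div[@id="main"]//p       p elements at any depth under the div with id "main"
/html/body/table           an absolute path from the document root
```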
 
Old 04-16-2017, 07:30 AM   #7
drewk
LQ Newbie
 
Registered: Apr 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
It didn't work for me. Probably my fault. I don't know what to type for the xpath expression.

I'm going to stick to the old copy-and-paste method. Turbocapitalist, I apologize for wasting your time.
 
Old 04-16-2017, 07:33 AM   #8
Turbocapitalist
Senior Member
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,112
Blog Entries: 3

Rep: Reputation: 2013
Quote:
Originally Posted by drewk View Post
I don't know what to type for xpath.
What can you describe about the part of the HTML document that you are trying to extract? Which element is it and what makes it unique? Does it have any attributes such as class or id? Is it a child of a particular element?
 
Old 04-16-2017, 02:49 PM   #9
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3310
Quote:
Originally Posted by drewk View Post
where is the xpath utility in the link?
How do you type the xpath expression and where? I don't understand. I need detailed instruction.
the link is the tutorial.
the utilities are in your package repositories, as stated in the next line:
Quote:
i had good experience with xmllint (part of libxml2) and the html-xml-utils.
like i said, there's a learning curve, but once you realize how unsuitable sed is for the html of (changing) websites, you're going to want to climb it.
xpath addresses the same document tree that a browser builds from the html.
it's the best tool for the job.

just one example:
there's a webpage that has a weather forecast in the form of a table. i want only that, nothing else.
the table's class is "meteogram" (and only that).
the table can be anywhere in the page.
this is the xpath expression to extract that table from the whole html page:
Code:
"//table[@class=\"meteogram\"]"
and with xmllint, the command is this:
Code:
xmllint --html --xpath "//table[@class=\"meteogram\"]" http://ilmatieteenlaitos.fi/saa/helsinki/ 2>/dev/null
2>/dev/null because xmllint throws a lot of warnings and errors, even when it's working nicely.

this is a working example straight from a web page; try it! - but you can also use a local file there.
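For reference, a self-contained sketch of the local-file variant; the mini page below is made up, but keeps the same table class (assumes xmllint from libxml2 is installed):

```shell
# tiny stand-in page containing a table with class "meteogram"
cat > /tmp/page.html <<'EOF'
<html><body>
<p>navigation junk</p>
<table class="meteogram"><tr><td>Mon</td><td>+5</td></tr></table>
<p>footer junk</p>
</body></html>
EOF

# same xpath expression, but run against the local file
xmllint --html --xpath '//table[@class="meteogram"]' /tmp/page.html 2>/dev/null
```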

maybe this helps...
 
Old 04-16-2017, 10:11 PM   #10
drewk
LQ Newbie
 
Registered: Apr 2017
Posts: 5

Original Poster
Rep: Reputation: Disabled
@ Turbocapitalist
@ ondoho

I hope I am not wasting both your time. I tried to understand xpath, but it can be difficult to write an expression when the content I want is buried deep in the html document.

I was looking on youtube.com for xpath tutorials and I found an easier way to do this by using an extension called Firebug for Firefox.

I just have to hover over an element I want and a snippet of code is selected in the Firebug window. I right-click on that snippet and select "copy xpath". Here is an example:

/html/body/div[8]/div[9]/div[1]/div[1]/div/div[1]/div[2]/div/div/div/article

There is no way I could write something like that on my own.

So I used that xpath code with xmllint and it did the job. Thank you both for your time and knowledge.
 
Old 04-17-2017, 10:54 AM   #11
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3310
Quote:
Originally Posted by drewk View Post
/html/body/div[8]/div[9]/div[1]/div[1]/div/div[1]/div[2]/div/div/div/article
that sure looks confusing, but "//article" would probably do the same, because article is a fairly unique element (it would, however, return all articles on the page).
the // says: match anywhere in the document. so you don't have to meticulously follow the winding path from the document root.

...the principle is actually really easy to grasp, but xpath is totally inflexible and unforgiving.

you should really experiment: look at a page's source code, and at the same time try to extract elements from it.
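A quick way to see that //article and a full absolute path can select the same node; the mini page is hypothetical, and the example assumes xmllint is installed:

```shell
# tiny test page with one article buried in nested divs
cat > /tmp/article.html <<'EOF'
<html><body><div><div><article><p>hello</p></article></div></div></body></html>
EOF

# the short "anywhere" form...
xmllint --html --xpath '//article' /tmp/article.html 2>/dev/null
# ...selects the same node as the full path from the document root
xmllint --html --xpath '/html/body/div/div/article' /tmp/article.html 2>/dev/null
```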
 