LinuxQuestions.org


drewk 04-16-2017 01:30 AM

Help using lynx dump
 
Hey linux gurus,

Is there a way to dump only the main article of a website with lynx, without getting content from iframes, etc.?

I use lynx -dump -nolist "url" > filename.txt

If lynx cannot do this, what about links or elinks? Thx

Turbocapitalist 04-16-2017 01:54 AM

Welcome.

The answer is "maybe". It depends on how the article is marked up, how badly the site abuses HTML, and whether it is infected with JavaScript in places.

Most likely you will need other tools. Perl is quite good at extracting parts of HTML documents; see the CPAN modules HTML::TreeBuilder or HTML::TreeBuilder::XPath for that. You can then pass the extracted piece to lynx for rendering.

drewk 04-16-2017 03:52 AM

I don't know Perl or the other methods you listed. I am using sed to print the text between two lines, and I got the results I wanted. lynx -dump works best on very simple websites with no frames, tables, etc., but those are rare.

ondoho 04-16-2017 04:29 AM

i second xpath.
personally i had good experience with xmllint (part of libxml2) and the html-xml-utils.
so, what i'd try:
- get the name, class and/or id of the element i want to extract by opening the page in a good browser, press F12 for developer tools etc.
- use one of the above mentioned xpath-capable utilities to extract that part, resulting in a partial html document
- give that to lynx, see what it can make of it. should work.
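
The three steps above might look like this in practice. This is only a sketch: the file names, the sample markup and the id "content" are made up for illustration, and it assumes xmllint and lynx are both installed.

```shell
#!/bin/sh
# Hypothetical sample page; in practice this would be the saved web page.
cat > page.html <<'EOF'
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="content"><h1>Main article</h1><p>Only this part is wanted.</p></div>
<iframe src="http://ads.example.com/"></iframe>
</body></html>
EOF

# Step 2: extract just the wanted element as a partial HTML document.
# 2>/dev/null hides the parser warnings xmllint prints for real-world HTML.
xmllint --html --xpath '//div[@id="content"]' page.html 2>/dev/null > article.html

# Step 3: let lynx render the extracted piece as plain text.
# -stdin reads the document from standard input; -force_html tells lynx
# to treat it as HTML even though there is no file name to go by.
lynx -dump -nolist -force_html -stdin < article.html > article.txt

cat article.txt
```

The navigation links and the iframe never reach lynx, because only the selected div is passed on.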

sed is not good for html & co.
just keep in mind that
- sed relies on line breaks
- html does not
it might work now, but as soon as the site changes its layout it will break, and you're SOL again... and again... until you decide to tackle the learning curve.
at least that's what's happened to me.
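
A small demonstration of the line-break problem, using two made-up layouts of the same page (the file names and markup are assumptions for illustration only):

```shell
#!/bin/sh
# The same hypothetical page in two layouts: one element per line, and
# everything on a single line (as minified sites often serve it).
cat > multi.html <<'EOF'
<html><body>
<div id="nav">Home</div>
<article>
<p>The article text.</p>
</article>
<div id="footer">Contact</div>
</body></html>
EOF
tr -d '\n' < multi.html > single.html
echo >> single.html   # add back a single trailing newline

# Print everything from the line containing <article> to the line
# containing </article>.
sed -n '/<article>/,/<\/article>/p' multi.html > from_multi.txt
sed -n '/<article>/,/<\/article>/p' single.html > from_single.txt

echo "multi-line layout:";  cat from_multi.txt
echo "single-line layout:"; cat from_single.txt
```

With the multi-line layout sed prints only the article. With the single-line layout the whole page is one line, so sed prints all of it, navigation and footer included.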

if you like you can describe your problem in more detail, we will work something out.

drewk 04-16-2017 06:37 AM

Quote:

Originally Posted by ondoho (Post 5697560)
- get the name, class and/or id of the element i want to extract by opening the page in a good browser, press F12 for developer tools etc.
- use one of the above mentioned xpath-capable utilities to extract that part, resulting in a partial html document
- give that to lynx, see what it can make of it. should work.

where is the xpath utility in the link?
How do you type the xpath expression and where? I don't understand. I need detailed instruction.

Turbocapitalist 04-16-2017 07:17 AM

You need a utility that can extract elements based on XPath expressions. You can use a pre-made one or use perl to write one using the modules mentioned.

The following script takes a file name as its argument, or reads from stdin if none is given. It parses the HTML, extracts all TD elements of the class "first", and prints them to stdout.

Code:

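#!/usr/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# XPath expression: every TD element whose class attribute is exactly "first".
my $xpath = qq(//td[\@class="first"]);

# Read the file named on the command line, or stdin if none is given.
my $file = shift || '/dev/stdin';

my $xhtmlroot = HTML::TreeBuilder::XPath->new;
$xhtmlroot->implicit_tags(1);
$xhtmlroot->parse_file( $file )
    or die( "Could not parse '$file' : $!\n");

# Print each matching element as indented HTML, one per line.
for my $element ( $xhtmlroot->findnodes( $xpath ) ){
    print $element->as_HTML( undef, "  " );
    print qq(\n);
}

# Exit status 0 signals success to the calling shell.
exit( 0 );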

There aren't so many guides for XPath, but once you get the syntax, it's not so hard.
https://www.data2type.de/en/xml-xslt...-introduction/

drewk 04-16-2017 08:30 AM

It didn't work for me. Probably my fault. I don't know what to type for xpath.

I'm going to stick to the old copy & paste method. Turbocapitalist, I apologize for wasting your time.

Turbocapitalist 04-16-2017 08:33 AM

Quote:

Originally Posted by drewk (Post 5697617)
I don't know what to type for xpath.

What can you describe about the part of the HTML document that you are trying to extract? Which element is it and what makes it unique? Does it have any attributes such as class or id? Is it a child of a particular element?
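
Each of those answers maps onto a different kind of XPath expression. A rough sketch, assuming xmllint is installed; the sample page, its ids and its class names are invented for illustration:

```shell
#!/bin/sh
# Hypothetical sample page.
cat > sample.html <<'EOF'
<html><body>
<div id="sidebar"><p>Links</p></div>
<div id="main">
  <p class="intro">Intro paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
EOF

# The element has a unique id:
xmllint --html --xpath '//div[@id="sidebar"]' sample.html 2>/dev/null > by_id.txt

# The element has a particular class:
xmllint --html --xpath '//p[@class="intro"]' sample.html 2>/dev/null > by_class.txt

# The element is a child of a particular element
# (every p directly inside the div with id "main"):
xmllint --html --xpath '//div[@id="main"]/p' sample.html 2>/dev/null > by_parent.txt

cat by_id.txt by_class.txt by_parent.txt
```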

ondoho 04-16-2017 03:49 PM

Quote:

Originally Posted by drewk (Post 5697590)
where is the xpath utility in the link?
How do you type the xpath expression and where? I don't understand. I need detailed instruction.

the link is the tutorial.
the utilities are in your package repositories, as stated in the next line:
Quote:

i had good experience with xmllint (part of libxml2) and the html-xml-utils.

like i said, there's a learning curve, but once you realize how unsuitable sed is for html in (changing) websites, you're going to want to tackle that learning curve.
xpath queries the same element tree that browsers build from a page.
it's the best tool for the job.

just one example:
there's a webpage that has a weather forecast in the form of a table. i want only that, nothing else.
the table's class is "meteogram" (and only that).
the table can be anywhere in the page.
this is the xpath expression to extract that table from the whole html page:
Code:

"//table[@class=\"meteogram\"]"

and with xmllint, the command is this:

Code:

xmllint --html --xpath "//table[@class=\"meteogram\"]" http://ilmatieteenlaitos.fi/saa/helsinki/ 2>/dev/null

2>/dev/null because xmllint throws a lot of warnings and errors, even when it's working nicely.

this is a working example straight from a web page; try it! - but you can also use a local file there.

maybe this helps...
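
The same expression against a local file, as a sketch with made-up data (the file name, the extra classes and the table contents are assumptions). It also shows what "and only that" means: @class="meteogram" compares the whole attribute value, so a table with class "meteogram wide" is skipped unless the expression tests for the class as one of several space-separated words.

```shell
#!/bin/sh
# Hypothetical local page with three tables.
cat > forecast.html <<'EOF'
<html><body>
<table class="meteogram"><tr><td>Mon</td><td>+5</td></tr></table>
<table class="meteogram wide"><tr><td>Tue</td><td>+7</td></tr></table>
<table class="other"><tr><td>Wed</td><td>+9</td></tr></table>
</body></html>
EOF

# Exact match: only the table whose class attribute is exactly "meteogram".
xmllint --html --xpath '//table[@class="meteogram"]' forecast.html 2>/dev/null > exact.txt

# Word match: any table that has "meteogram" among its space-separated classes.
xmllint --html \
  --xpath '//table[contains(concat(" ", normalize-space(@class), " "), " meteogram ")]' \
  forecast.html 2>/dev/null > token.txt

echo "exact:"; cat exact.txt
echo "token:"; cat token.txt
```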

drewk 04-16-2017 11:11 PM

@ Turbocapitalist
@ ondoho

I hope I am not wasting your time. I tried to understand XPath, but it can be difficult to write an expression if the content I want is buried deep in the HTML document.

I was looking on youtube.com for XPath tutorials and found an easier way to do this, using a Firefox extension called Firebug.

I just hover over the element I want and the matching snippet of code is selected in the Firebug window. Then I right-click on that snippet and select "Copy XPath". Here is an example:

/html/body/div[8]/div[9]/div[1]/div[1]/div/div[1]/div[2]/div/div/div/article

There is no way I could write something like that on my own.

So I used that xpath code with xmllint and it did the job. Thank you both for your time and knowledge. :)

ondoho 04-17-2017 11:54 AM

Quote:

Originally Posted by drewk (Post 5697892)
/html/body/div[8]/div[9]/div[1]/div[1]/div/div[1]/div[2]/div/div/div/article

that sure looks confusing, but "//article" would probably do the same, because article is a fairly distinctive element name (though it would return every article element on the page, if there are several).
the // says: anywhere in the document. so you don't have to meticulously follow the winding path from the document root.

...the principle is actually really easy to grasp, but xpath is totally inflexible and unforgiving.

you should really try and play; look at a page's source code, and at the same time try to extract elements from it.
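
One way to play with the difference between an absolute path and //, sketched on an invented page (file name and markup are assumptions; xmllint is assumed to be installed):

```shell
#!/bin/sh
# Hypothetical page with two articles at different depths.
cat > nested.html <<'EOF'
<html><body>
<div><div><article><p>First story.</p></article></div></div>
<div><article><p>Second story.</p></article></div>
</body></html>
EOF

# Absolute path: spell out every step from the document root.
xmllint --html --xpath '/html/body/div[1]/div/article' nested.html 2>/dev/null > absolute.txt

# // means "at any depth", so this finds every article on the page.
xmllint --html --xpath '//article' nested.html 2>/dev/null > anywhere.txt

# count() reports how many elements an expression matches.
xmllint --html --xpath 'count(//article)' nested.html 2>/dev/null > count.txt

cat absolute.txt anywhere.txt count.txt
```

The absolute path picks out only the first article; //article returns both. Checking count() first is a quick way to see whether a short expression is specific enough.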


All times are GMT -5.