LinuxQuestions.org > Forums > Linux - Newbie
04-16-2017, 01:30 AM | #1
drewk | LQ Newbie | Registered: Apr 2017 | Posts: 5
Help using lynx dump
Hey Linux gurus,
Is there a way to dump the main article of a website with lynx without also getting content from iframes, etc.?
I use lynx -dump -nolist "url" > filename.txt
If lynx cannot do this, what about links or elinks? Thx
04-16-2017, 01:54 AM | #2
Turbocapitalist | LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,632
Welcome.
The answer is "maybe". It depends on how the article is marked up, how badly the site abuses HTML, and whether it is infected with JavaScript in places.
It is most likely you will need other tools. Perl is quite good at extracting parts of HTML documents, see the CPAN modules HTML::TreeBuilder or HTML::TreeBuilder::XPath for that. You can then pass the extracted piece to lynx for rendering.
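For example, here is a minimal sketch of that pipeline. The markup and the id "story" are made up, and it assumes HTML::TreeBuilder::XPath is already installed from CPAN:

```shell
# made-up one-line page; the id "story" stands in for whatever marks the real article
printf '<html><body><div id="story"><p>article text</p></div><div>ads</div></body></html>' \
  | perl -MHTML::TreeBuilder::XPath -e '
      my $t = HTML::TreeBuilder::XPath->new;
      $t->parse_file(\*STDIN);                 # parse the page from stdin
      print $_->as_HTML(undef, " "), "\n"
          for $t->findnodes(q{//div[@id="story"]});   # keep only the XPath matches
    ' \
  | lynx -dump -stdin                          # render the fragment as plain text
```

Once the XPath expression gets longer, you would move the one-liner into a proper script.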
04-16-2017, 03:52 AM | #3
drewk | LQ Newbie | Registered: Apr 2017 | Posts: 5 | Original Poster
I don't know Perl or those other methods you listed. I am using sed to print between lines, and I got the results I wanted. lynx -dump works best on very simple websites with no frames, tables, etc., but those are rare.
04-16-2017, 04:29 AM | #4
ondoho | LQ Addict | Registered: Dec 2013 | Posts: 19,872
i second xpath.
personally i had good experience with xmllint (part of libxml2) and the html-xml-utils.
so, what i'd try:
- get the name, class and/or id of the element i want to extract by opening the page in a good browser, press F12 for developer tools etc.
- use one of the above mentioned xpath-capable utilities to extract that part, resulting in a partial html document
- give that to lynx, see what it can make of it. should work.
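a sketch of steps 2 and 3 with a made-up page - the id "content" stands in for whatever the dev tools showed you (needs xmllint and lynx installed):

```shell
# made-up page; in real life the id comes from the browser's dev tools (F12)
cat > /tmp/page.html <<'EOF'
<html><body>
<div id="nav">menu junk</div>
<div id="content"><h1>Title</h1><p>the article text</p></div>
</body></html>
EOF
# step 2: extract just that element, giving a partial html document
xmllint --html --xpath '//div[@id="content"]' /tmp/page.html 2>/dev/null > /tmp/part.html
# step 3: let lynx render the fragment as plain text
lynx -dump -stdin < /tmp/part.html
```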
sed is not good for html & co.
just keep in mind that
- sed relies on line breaks
- html does not
it might work now, but as soon as the site changes its layout it will break, and you're SOL again... and again... until you decide to tackle the learning curve.
at least that's what's happened to me.
if you like you can describe your problem in more detail, we will work something out.
04-16-2017, 06:37 AM | #5
drewk | LQ Newbie | Registered: Apr 2017 | Posts: 5 | Original Poster
Quote:
Originally Posted by ondoho
- get the name, class and/or id of the element i want to extract by opening the page in a good browser, press F12 for developer tools etc.
- use one of the above mentioned xpath-capable utilities to extract that part, resulting in a partial html document
- give that to lynx, see what it can make of it. should work.
Where is the xpath utility in the link?
How do you type the xpath expression, and where? I don't understand; I need detailed instructions.
Last edited by drewk; 04-16-2017 at 06:43 AM.
04-16-2017, 07:17 AM | #6
Turbocapitalist | LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,632
You need a utility that can extract elements based on XPath expressions. You can use a pre-made one or use perl to write one using the modules mentioned.
The following reads a file name or takes data from stdin and parses it, extracting all TD elements of the class "first" and printing them to stdout.
Code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# XPath expression: all TD elements with the class "first"
my $xpath = qq(//td[\@class="first"]);

# read the file named on the command line, or fall back to stdin
my $file = shift || '/dev/stdin';

my $xhtmlroot = HTML::TreeBuilder::XPath->new;
$xhtmlroot->implicit_tags(1);    # fill in missing html/head/body tags
$xhtmlroot->parse_file( $file )
    or die( "Could not parse '$file' : $!\n");

# print each matching element as indented HTML
for my $element ( $xhtmlroot->findnodes( $xpath ) ){
    print $element->as_HTML( undef, " " );
    print qq(\n);
}
exit( 0 );
There aren't so many guides for XPath, but once you get the syntax, it's not so hard.
https://www.data2type.de/en/xml-xslt...-introduction/
1 member found this post helpful.
04-16-2017, 08:30 AM | #7
drewk | LQ Newbie | Registered: Apr 2017 | Posts: 5 | Original Poster
It didn't work for me. Probably my fault; I don't know what to type for the xpath.
I'm going to stick to the old copy & paste method. Turbocapitalist, I apologize for wasting your time.
04-16-2017, 08:33 AM | #8
Turbocapitalist | LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,632
Quote:
Originally Posted by drewk
I don't know what to type for xpath.
What can you describe about the part of the HTML document that you are trying to extract? Which element is it and what makes it unique? Does it have any attributes such as class or id? Is it a child of a particular element?
04-16-2017, 03:49 PM | #9
ondoho | LQ Addict | Registered: Dec 2013 | Posts: 19,872
Quote:
Originally Posted by drewk
where is the xpath utility in the link?
How do you type the xpath expression and where? I don't understand. I need detailed instruction.
the link is the tutorial.
the utilities are in your package repositories, as stated in the next line:
Quote:
i had good experience with xmllint (part of libxml2) and the html-xml-utils.
like i said, there's a learning curve, but once you realize how unsuitable sed is for html on (changing) websites, you're going to want to take that learning curve.
xpath queries the same document tree that browsers build from the html.
it's the best tool for the job.
just one example:
there's a webpage that has a weather forecast in the form of a table. i want only that, nothing else.
the table's class is "meteogram" (and only that).
the table can be anywhere in the page.
this is the xpath expression to extract that table from the whole html page:
Code:
"//table[@class=\"meteogram\"]"
and with xmllint, the command is this:
Code:
xmllint --html --xpath "//table[@class=\"meteogram\"]" http://ilmatieteenlaitos.fi/saa/helsinki/ 2>/dev/null
2>/dev/null because xmllint throws a lot of warnings and errors, even when it's working nicely.
this is a working example straight from a web page; try it! - but you can also use a local file there.
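for offline testing, the same query against a made-up local file looks like this (the table content here is invented):

```shell
# made-up page with a weather table; only the class name matters for the query
cat > /tmp/saa.html <<'EOF'
<html><body>
<p>navigation, ads, whatever</p>
<table class="meteogram"><tr><td>Mon</td><td>+3</td></tr></table>
</body></html>
EOF
# extract the table, nothing else; stderr silenced as before
xmllint --html --xpath "//table[@class=\"meteogram\"]" /tmp/saa.html 2>/dev/null
```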
maybe this helps...
1 member found this post helpful.
04-16-2017, 11:11 PM | #10
drewk | LQ Newbie | Registered: Apr 2017 | Posts: 5 | Original Poster
@ Turbocapitalist
@ ondoho
I hope I am not wasting both your time. I tried to understand xpath, but it can be difficult to write an expression when the content I want is buried deep in the html document.
I was looking on youtube.com for xpath tutorials and I found an easier way to do this using a firefox extension called Firebug.
I just hover over the element I want and a snippet of code is highlighted in the Firebug window. Then I right-click on that snippet and select Copy XPath. Here is an example:
/html/body/div[8]/div[9]/div[1]/div[1]/div/div[1]/div[2]/div/div/div/article
There is no way I could write something like that on my own.
So I used that xpath code with xmllint and it did the job. Thank you both for your time and knowledge.
04-17-2017, 11:54 AM | #11
ondoho | LQ Addict | Registered: Dec 2013 | Posts: 19,872
Quote:
Originally Posted by drewk
/html/body/div[8]/div[9]/div[1]/div[1]/div/div[1]/div[2]/div/div/div/article
that sure looks confusing, but "//article" would probably do the same, because article is a pretty unique identifier (though it would return all article elements on the page, if there are several).
the // says: anywhere in the document. so you don't have to meticulously follow the winding path from the document root.
...the principle is actually really easy to grasp, but xpath is totally inflexible and unforgiving.
you should really try and play; look at a page's source code, and at the same time try to extract elements from it.
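a quick way to convince yourself, with a made-up page: both expressions should select the same node (assuming your xmllint copes with the html5 article tag; the warnings it throws about it are silenced as usual):

```shell
# toy page: one article nested a few divs deep
cat > /tmp/deep.html <<'EOF'
<html><body><div><div><div><article><p>the story</p></article></div></div></div></body></html>
EOF
# the full winding path from the document root...
xmllint --html --xpath '/html/body/div/div/div/article' /tmp/deep.html 2>/dev/null
# ...and the short form: "anywhere in the document"
xmllint --html --xpath '//article' /tmp/deep.html 2>/dev/null
```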