LinuxQuestions.org
Old 02-12-2013, 11:10 PM   #1
neopandid
Member
 
Registered: Aug 2011
Location: Russia
Distribution: Debian
Posts: 32

Rep: Reputation: Disabled
How to extract domains and links from one webpage


Hi,
I'd like to know how to extract domain names from one site, using the CLI or third-party programs.
I want to add these domains to my Squid list. That part is easy, but I have to compile the list first.
I could do this by hand, but it is very time-consuming.
For example:
The site has subfolders, one per category, and every page contains hundreds of links.

domainname.com/subfolder/page1
domainname.com/subfolder/page2
domainname.com/subfolder/page3.html

How can I do this?
I am using Debian but I am open to any suggestions.

Last edited by neopandid; 02-12-2013 at 11:14 PM. Reason: Info added
 
Old 02-12-2013, 11:51 PM   #2
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
You can use Perl's HTML::Tree: http://search.cpan.org/~cjm/HTML-Tre...b/HTML/Tree.pm

Code:
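#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Locally saved copy of the page to scan.
my $filename = "index.html";

# Parse the file into an HTML::Element tree.
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($filename)
    or die "Cannot parse $filename: $!\n";

# extract_links() returns a reference to an array of
# [link, element, attribute, tag] entries, one per link found.
for (@{ $tree->extract_links() }) {
    my ($link, $element, $attr, $tag) = @$_;
    print
        "Hey, there's a $tag that links to ",
        $link, ", in its $attr attribute, at ",
        $element->address(), ".\n";
}

# Finally, free the memory used by the tree:
$tree->delete;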
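
Building on the extract_links() loop above, here is a minimal sketch of how the links could be reduced to a unique list of host names, which is closer to what the original question asks for. It assumes the URI module is installed alongside HTML::TreeBuilder. The helper name extract_domains, the inline sample page, and the base URL http://domainname.com/subfolder/ are illustrative and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Return the sorted, de-duplicated host names linked from an HTML string.
# $base_url is used to resolve relative links. Both the function name and
# its arguments are illustrative assumptions.
sub extract_domains {
    my ($html, $base_url) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %seen;
    # Restrict the search to <a> elements; drop the argument to scan all tags.
    for my $entry (@{ $tree->extract_links('a') }) {
        my ($link) = @$entry;
        my $uri = URI->new_abs($link, $base_url);
        # Skip links whose scheme has no host part, such as mailto:.
        next unless $uri->can('host');
        my $host = $uri->host;
        next unless defined $host && length $host;
        $seen{ lc $host } = 1;
    }
    $tree->delete;
    return sort keys %seen;
}

# Small inline page so the sketch runs without network access.
my $html = <<'HTML';
<html><body>
<a href="http://www.example.org/a">one</a>
<a href="https://Example.com/b?x=1">two</a>
<a href="/subfolder/page1">relative</a>
<a href="mailto:someone@example.net">mail</a>
</body></html>
HTML

print "$_\n" for extract_domains($html, 'http://domainname.com/subfolder/');
```

The script prints one host name per line, so the output could be redirected to a file and reviewed before being added to a Squid domain list. Relative links resolve against the base URL, so the site's own domain appears in the output too; filter it out if only external domains are wanted.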
 
  

