LinuxQuestions.org
LinuxQuestions.org > Forums > Linux Forums > Linux - General
Old 04-01-2015, 02:52 PM   #1
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Rep: Reputation: 16
scraping content from webpage with lynx


I'm trying to scrape some content from a webpage so I can convert it into CSV format. However, the page has a bunch of tables (with rows and cells), so I think that instead of working with the source code of the page, it will be better to work with the rendered content; that's why I'm using lynx -dump. There's a portion of the page that contains a list, and every row begins with a number.

Code:
1 2/20/15 0 10 john tampa, fl
2 3/15/15 1 3  mike atlanta, ga
3...
4..
N..
How can I put every field into a CSV file? I was thinking of something along the lines of:
Code:
lynx -dump http://siteaddress.com/stats | "and some other pipes here"
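One possible shape for that pipeline, sketched here against a fake dump so it is self-contained (siteaddress.com is the placeholder above; the assumption is that only the data lines start with a digit and fields are space-separated):

```shell
# Stand-in for `lynx -dump http://siteaddress.com/stats` -- a real page and
# network access are assumed, so we fake the dump for the sketch.
printf '%s\n' \
  'Some header text before the list' \
  '1 2/20/15 0 10 john tampa, fl' \
  '2 3/15/15 1 3  mike atlanta, ga' > dump.txt

# Keep only the data lines (they start with a digit), then turn each run of
# spaces into a comma; tr -s also squeezes the comma already in "tampa, fl".
grep '^[0-9]' dump.txt | tr -s ' ' ',' > stats.csv
cat stats.csv
# -> 1,2/20/15,0,10,john,tampa,fl
# -> 2,3/15/15,1,3,mike,atlanta,ga
```

Note the caveat: the comma inside "tampa, fl" gets merged with the generated one, so city and state end up as two separate fields. With the real page, the `printf`/`dump.txt` part would be replaced by the `lynx -dump` command itself.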
 
Old 04-01-2015, 03:42 PM   #2
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Rep: Reputation: 142Reputation: 142
Hi
Just dump the page and use blank/space as delimiter when you upload it into *office?
 
Old 04-01-2015, 04:28 PM   #3
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by Pearlseattle View Post
Hi
Just dump the page and use blank/space as delimiter when you upload it into *office?
Actually, what I did was save the page to the desktop as HTML, then open it in Excel and import it. That worked great, but a bash script was how I originally wanted to do it. I found it quite cumbersome, but I'm still curious how I could do it.
 
Old 04-01-2015, 04:57 PM   #4
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Rep: Reputation: 142Reputation: 142
OK, easy - just search the Internet for how to replace a character (a space/blank in your case) with "sed" or "awk".
Example for such a search: "bash sed replace char"
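Following that suggestion, a minimal sed-only sketch on one line of the sample data: squash each run of spaces to a comma, then squash the double comma left behind by "atlanta, ga":

```shell
# POSIX sed, no GNU extensions: runs of spaces -> one comma, then
# runs of commas -> one comma.
echo '2 3/15/15 1 3  mike atlanta, ga' | sed 's/  */,/g; s/,,*/,/g'
# -> 2,3/15/15,1,3,mike,atlanta,ga
```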
 
Old 04-02-2015, 08:09 AM   #5
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by Pearlseattle View Post
OK, easy - just search the Internet for how to replace a character (a space/blank in your case) with "sed" or "awk".
Example for such a search: "bash sed replace char"
Ohh, I forgot to mention that there's a bunch of text before the lines I want to scrape from the webpage. How would I skip those lines and start getting input from the ones I want?
 
Old 04-02-2015, 08:15 AM   #6
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918
^ With the limited information the OP provides, I'll assume the lines that begin with a number are the ones they're interested in:
Code:
grep '^[0-9]' mia-tech.html | ...
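The trailing `| ...` could, for example, feed awk to pull out individual columns. A sketch against the OP's sample (mia-tech.txt is a hypothetical saved dump; the column numbers follow the sample posted above):

```shell
# Hypothetical saved dump; the header line shows why the grep is needed.
printf '%s\n' \
  'assorted page text' \
  '1 2/20/15 0 10 john tampa, fl' > mia-tech.txt

# Print the date (column 2) and name (column 5), comma-separated.
grep '^[0-9]' mia-tech.txt | awk -v OFS=',' '{ print $2, $5 }'
# -> 2/20/15,john
```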
 
1 member found this post helpful.
Old 04-03-2015, 08:12 AM   #7
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by schneidz View Post
^ With the limited information the OP provides, I'll assume the lines that begin with a number are the ones they're interested in:
Code:
grep '^[0-9]' mia-tech.html | ...
OK, I'm still trying to figure this out. I saved the dump from lynx to a text file, but for some reason the grep command doesn't work; my guess is that it's because the lines don't start with a number but with spaces. lynx outputs the page a bit weird. The output really looks like this:

Code:
   1 1/1/2014 Unknown 2 2 Norfolk, VA
   ^[14][1] ^[15][2]
   2 1/3/2014 Unknown 1 3 New York (Queens), NY
   ^[16][3] ^[17][4]
   3 1/4/2014 Leonard Frank Harris Jr 2 2 Rock Falls, IL
   ^[18][5] ^[19][6] ^[20][7]
   4 1/5/2014 Unknown 1 3 Erie, OH
I guess that's why grep wasn't working before.
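The `^[14][1]`-style lines are lynx's link numbering; if your lynx build supports them, the `-nolist` and `-nonumbers` options should suppress most of that (worth checking `man lynx`). Failing that, the junk lines can be filtered out, since only real rows start with spaces followed by a digit:

```shell
# Fake the messy dump shown above.
cat > dump.txt <<'EOF'
   1 1/1/2014 Unknown 2 2 Norfolk, VA
   ^[14][1] ^[15][2]
   2 1/3/2014 Unknown 1 3 New York (Queens), NY
EOF

# Keep only lines whose first non-blank character is a digit.
grep '^ *[0-9]' dump.txt
# ->    1 1/1/2014 Unknown 2 2 Norfolk, VA
# ->    2 1/3/2014 Unknown 1 3 New York (Queens), NY
```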
 
Old 04-03-2015, 08:50 AM   #8
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918
^ This might work:
Code:
grep '^   [0-9]' mia-tech.html
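Extending that grep into CSV on the same sample: strip the leading spaces, then commify. (Caveat of the space-as-delimiter approach: multi-word values like "New York (Queens)" get split into several fields too, so only the leading numeric columns come out fully reliable.)

```shell
# Recreate the indented dump shown earlier in the thread.
printf '%s\n' \
  '   1 1/1/2014 Unknown 2 2 Norfolk, VA' \
  '   ^[14][1] ^[15][2]' \
  '   2 1/3/2014 Unknown 1 3 New York (Queens), NY' > dump.txt

# Keep real rows, drop the leading indent, runs of spaces -> commas,
# runs of commas -> one comma.
grep '^ *[0-9]' dump.txt | sed 's/^  *//; s/  */,/g; s/,,*/,/g'
# -> 1,1/1/2014,Unknown,2,2,Norfolk,VA
# -> 2,1/3/2014,Unknown,1,3,New,York,(Queens),NY
```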
 
Old 04-03-2015, 01:43 PM   #9
BenCollver
Rogue Class
 
Registered: Sep 2006
Location: OR, USA
Distribution: Slackware64-15.0
Posts: 376
Blog Entries: 2

Rep: Reputation: 172Reputation: 172
I'd prefer a more programmatic approach. I've had good results with PHP's file_get_contents() and SimpleHtmlDom.

http://simplehtmldom.sourceforge.net/
 
2 members found this post helpful.
Old 04-05-2015, 02:19 PM   #10
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Rep: Reputation: 142Reputation: 142
Quote:
Originally Posted by BenCollver View Post
I'd prefer a more programmatic approach. I've had good results with PHP's file_get_contents() and SimpleHtmlDom.

http://simplehtmldom.sourceforge.net/
Puah, sounds great - I almost lost consciousness a few weeks back when I had to write ~50 regular expressions to parse some webpages - thanks!
 
  

