LinuxQuestions.org
LinuxQuestions.org > Forums > Linux Forums > Linux - General
Old 04-01-2015, 02:52 PM   #1
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Rep: Reputation: 16
scraping content from webpage with lynx


I'm trying to scrape some content from a webpage so I can convert it into CSV format. However, the page has a bunch of tables (with rows and cells), so I think that instead of working with the source code of the page, it will be better to work with the rendered content; that's why I'm using lynx -dump. There's a portion of the page that contains a list, and every row begins with a number.

Code:
1 2/20/15 0 10 john tampa, fl
2 3/15/15 1 3  mike atlanta, ga
3...
4..
N..
How can I put every field into a CSV file? I was thinking of something along the lines of:
Code:
lynx -dump http://siteaddress.com/stats | "and some other pipes here"
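One possible shape for that pipeline, sketched here against a fake dump so it is self-contained (siteaddress.com is the placeholder above; the assumption is that only the data lines start with a digit and fields are space-separated):

```shell
# Stand-in for `lynx -dump http://siteaddress.com/stats` -- a real page and
# network access are assumed, so we fake the dump for the sketch.
printf '%s\n' \
  'Some header text before the list' \
  '1 2/20/15 0 10 john tampa, fl' \
  '2 3/15/15 1 3  mike atlanta, ga' > dump.txt

# Keep only the data lines (they start with a digit), then turn each run of
# spaces into a comma; tr -s also squeezes the comma already in "tampa, fl".
grep '^[0-9]' dump.txt | tr -s ' ' ',' > stats.csv
cat stats.csv
# -> 1,2/20/15,0,10,john,tampa,fl
# -> 2,3/15/15,1,3,mike,atlanta,ga
```

Note the caveat: the comma inside "tampa, fl" gets merged with the generated one, so city and state end up as two separate fields. With the real page, the `printf`/`dump.txt` part would be replaced by the `lynx -dump` command itself.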
 
Old 04-01-2015, 03:42 PM   #2
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Rep: Reputation: 142Reputation: 142
Hi
Just dump the page and use blank/space as delimiter when you upload it into *office?
 
Old 04-01-2015, 04:28 PM   #3
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by Pearlseattle View Post
Hi
Just dump the page and use blank/space as delimiter when you upload it into *office?
Actually, what I did was save the page to the desktop as HTML, then open it in Excel and import it. That worked great, but a bash script was how I originally wanted to do it. I found it quite cumbersome, but I'm still curious how I could do it.
 
Old 04-01-2015, 04:57 PM   #4
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Rep: Reputation: 142Reputation: 142
OK, easy - just search the Internet for how to replace a character (a space/blank in your case) with "sed" or "awk".
Example for such a search: "bash sed replace char"
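Following that suggestion, a minimal sed-only sketch on one line of the sample data: squash each run of spaces to a comma, then squash the double comma left behind by "atlanta, ga":

```shell
# POSIX sed, no GNU extensions: runs of spaces -> one comma, then
# runs of commas -> one comma.
echo '2 3/15/15 1 3  mike atlanta, ga' | sed 's/  */,/g; s/,,*/,/g'
# -> 2,3/15/15,1,3,mike,atlanta,ga
```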
 
Old 04-02-2015, 08:09 AM   #5
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by Pearlseattle View Post
OK, easy - just search the Internet for how to replace a character (a space/blank in your case) with "sed" or "awk".
Example for such a search: "bash sed replace char"
Ohh, I forgot to mention that there's a bunch of text before the lines I want to scrape from the webpage. How would I skip those lines and start getting input from the ones I want?
 
Old 04-02-2015, 08:15 AM   #6
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918
^ With the limited information the OP provides, I'll assume the lines that begin with a number are the ones they're interested in:
Code:
grep '^[0-9]' mia-tech.html | ...
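The trailing `| ...` could, for example, feed awk to pull out individual columns. A sketch against the OP's sample (mia-tech.txt is a hypothetical saved dump; the column numbers follow the sample posted above):

```shell
# Hypothetical saved dump; the header line shows why the grep is needed.
printf '%s\n' \
  'assorted page text' \
  '1 2/20/15 0 10 john tampa, fl' > mia-tech.txt

# Print the date (column 2) and name (column 5), comma-separated.
grep '^[0-9]' mia-tech.txt | awk -v OFS=',' '{ print $2, $5 }'
# -> 2/20/15,john
```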
 
1 member found this post helpful.
Old 04-03-2015, 08:12 AM   #7
mia_tech
Member
 
Registered: Dec 2007
Location: FL, USA
Distribution: CentOS 5.3, Ubuntu 9.04
Posts: 245

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by schneidz View Post
^ With the limited information the OP provides, I'll assume the lines that begin with a number are the ones they're interested in:
Code:
grep '^[0-9]' mia-tech.html | ...
OK, I'm still trying to figure this out. I saved the dump from lynx to a text file, but for some reason the grep command doesn't work; my guess is that it's because the lines don't start with a number but with spaces. lynx outputs the page a bit weird. The output really looks like this:

Code:
   1 1/1/2014 Unknown 2 2 Norfolk, VA
   ^[14][1] ^[15][2]
   2 1/3/2014 Unknown 1 3 New York (Queens), NY
   ^[16][3] ^[17][4]
   3 1/4/2014 Leonard Frank Harris Jr 2 2 Rock Falls, IL
   ^[18][5] ^[19][6] ^[20][7]
   4 1/5/2014 Unknown 1 3 Erie, OH
I guess that's why grep wasn't working before.
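The `^[14][1]`-style lines are lynx's link numbering; if your lynx build supports them, the `-nolist` and `-nonumbers` options should suppress most of that (worth checking `man lynx`). Failing that, the junk lines can be filtered out, since only real rows start with spaces followed by a digit:

```shell
# Fake the messy dump shown above.
cat > dump.txt <<'EOF'
   1 1/1/2014 Unknown 2 2 Norfolk, VA
   ^[14][1] ^[15][2]
   2 1/3/2014 Unknown 1 3 New York (Queens), NY
EOF

# Keep only lines whose first non-blank character is a digit.
grep '^ *[0-9]' dump.txt
# ->    1 1/1/2014 Unknown 2 2 Norfolk, VA
# ->    2 1/3/2014 Unknown 1 3 New York (Queens), NY
```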
 
Old 04-03-2015, 08:50 AM   #8
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918Reputation: 918
^ This might work:
Code:
grep '^   [0-9]' mia-tech.html
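Extending that grep into CSV on the same sample: strip the leading spaces, then commify. (Caveat of the space-as-delimiter approach: multi-word values like "New York (Queens)" get split into several fields too, so only the leading numeric columns come out fully reliable.)

```shell
# Recreate the indented dump shown earlier in the thread.
printf '%s\n' \
  '   1 1/1/2014 Unknown 2 2 Norfolk, VA' \
  '   ^[14][1] ^[15][2]' \
  '   2 1/3/2014 Unknown 1 3 New York (Queens), NY' > dump.txt

# Keep real rows, drop the leading indent, runs of spaces -> commas,
# runs of commas -> one comma.
grep '^ *[0-9]' dump.txt | sed 's/^  *//; s/  */,/g; s/,,*/,/g'
# -> 1,1/1/2014,Unknown,2,2,Norfolk,VA
# -> 2,1/3/2014,Unknown,1,3,New,York,(Queens),NY
```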
 
Old 04-03-2015, 01:43 PM   #9
BenCollver
Rogue Class
 
Registered: Sep 2006
Location: OR, USA
Distribution: Slackware64-15.0
Posts: 376
Blog Entries: 2

Rep: Reputation: 172Reputation: 172
I'd prefer a more programmatic approach. I've had good results with PHP's file_get_contents() and SimpleHtmlDom.

http://simplehtmldom.sourceforge.net/
 
2 members found this post helpful.
Old 04-05-2015, 02:19 PM   #10
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Rep: Reputation: 142Reputation: 142
Quote:
Originally Posted by BenCollver View Post
I'd prefer a more programmatic approach. I've had good results with PHP's file_get_contents() and SimpleHtmlDom.

http://simplehtmldom.sourceforge.net/
Puah, sounds great - I almost lost consciousness a few weeks back when I had to write ~50 regular expressions to parse some webpages - thanks!
 
  

