Read and extract table data in HTML from unix

shridhar22 · 11-03-2014, 03:56 AM

I need to extract rows 48 to 53 (which have "-" in all 5 columns).
This page gets updated daily and I need the names in the second column. Currently they are MSNG SIBN MSRS RASP MTLRP MTLR
http://www.nkcbank.ru/viewCatalog.do?menuKey=254

I used the curl command to get the HTML code but don't know how to extract my required data. TIA

pan64 · 11-03-2014, 04:12 AM

looks like you need an HTML/XML parser, for example in perl.....

ondoho · 11-03-2014, 02:17 PM

try html-xml-utils
and
xmllint (part of libxml2)

shridhar22 · 11-04-2014, 02:01 AM

thanks ondoho and pan64, I did install html-xml-utils but still a bit confused how to extract the second column from rows marked with *

ondoho · 11-04-2014, 02:13 PM

well, i'm not going to click that link.
so if you want to post some html, explain what you want to extract, and show us what you tried so far, we'll be more than happy to assist.

TB0ne · 11-04-2014, 03:36 PM

Quote:

Originally Posted by shridhar22

thanks ondoho and pan64, I did install html-xml-utils but still a bit confused how to extract the second column from rows marked with *

Ok..so why don't you post what you have written, show us a sample of the input data, and what you're wanting as the output data, and we can try to help. But we're not going to write your code for you, or click bank-website links in Russia. Post your code and relevant details.

shridhar22 · 11-05-2014, 12:28 AM

Okay, sorry if I violated the forum norms. I tried to quote the HTML page source but it is too long (480122 characters). I need information from the 1st table => the name in the 2nd column => for rows which have all blanks (-).
I used the * special character (there are 7 * signs in total in the page source) and tried to work out where the name in the second column sits relative to it. I found that if * is at line number 3 then my required word is at line number 6, and so on.

I used the following command, which works correctly for me as of now, but I know this logic/regex only survives as long as my word stays 3 lines after the grepped * symbol.

Quote:

curl --silent 'http://www.nkcbank.com/viewCatalog.do?menuKey=254' | awk -v lines=3 '/\*/ {for(i=lines;i;--i)getline; print $0 }' | grep -Eo '\b[[:upper:]][[:upper:]][[:upper:]]+\b'

pan64 · 11-05-2014, 05:08 AM

awk | grep can be combined into one single awk script.
The script you wrote does not check for the 5 occurrences of -; it looks for a * instead, which is not the same thing at all.
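A minimal sketch of what a single awk script could look like, counting the "-" cells instead of relying on the *. The real page source was never posted in this thread, so the sample markup below is made up: it assumes lowercase `<tr>`/`<td>` tags, at most one table row starting per line, and the name in the second cell. The function names `sample_html` and `extract_dash_rows` are only for illustration.

```shell
#!/bin/sh
# Hypothetical sample standing in for the real page; the actual markup may differ.
sample_html() {
cat <<'EOF'
<table>
<tr>
<td>47</td>
<td>XXXX</td>
<td>1.0</td><td>2.0</td><td>3.0</td><td>4.0</td><td>5.0</td>
</tr>
<tr>
<td>48 *</td>
<td>MSNG</td>
<td>-</td><td>-</td><td>-</td><td>-</td><td>-</td>
</tr>
<tr>
<td>49 *</td>
<td>SIBN</td>
<td>-</td><td>-</td><td>-</td><td>-</td><td>-</td>
</tr>
</table>
EOF
}

# Collect each <tr>...</tr> into one string, split it into cells, and print
# the second cell when exactly five cells contain nothing but "-".
extract_dash_rows() {
  awk '
    /<tr[ >]/ { row = ""; in_row = 1 }      # a new row starts on this line
    in_row    { row = row " " $0 }          # accumulate the lines of the row
    /<\/tr>/ && in_row {
        in_row = 0
        n = split(row, cell, "<td[^>]*>")   # cell[1] is the text before the first <td>
        dashes = 0
        for (i = 2; i <= n; i++) {
            sub(/<\/td>.*/, "", cell[i])              # drop </td> and what follows
            gsub(/<[^>]*>/, "", cell[i])              # strip any nested tags
            gsub(/^[ \t]+|[ \t]+$/, "", cell[i])      # trim surrounding blanks
            if (cell[i] == "-") dashes++
        }
        if (dashes == 5 && n >= 3) print cell[3]      # cell[3] is the second column
    }
  '
}

sample_html | extract_dash_rows
```

Against the live page this would be used as `curl --silent 'URL' | extract_dash_rows`. It is still text matching rather than real HTML parsing, so it will break if the markup does not match the assumptions above.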

Here you can find additional information and tips:
http://stackoverflow.com/questions/1...ble-using-bash

ondoho · 11-05-2014, 12:24 PM

shridhar22, please believe me, in the long run you'll be happier using html-xml-utils, which contain some commands that parse html - something that you're now trying to re-implement from scratch.
xmllint is actually even better, but harder to use.
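For illustration, a rough sketch of how xmllint could express "rows with five - cells, give me the second cell" as an XPath query. The sample HTML is invented (the real page source is not shown in this thread), `dash_row_names` is a hypothetical helper name, and how xmllint separates the matched text nodes in its output may depend on the libxml2 version.

```shell
#!/bin/sh
# Hypothetical sample; the real page will have a different structure.
sample_page() {
cat <<'EOF'
<html><body>
<table>
<tr><td>47</td><td>XXXX</td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>
<tr><td>48</td><td>MSNG</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>49</td><td>SIBN</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
</table>
</body></html>
EOF
}

# Read HTML on stdin and print the text of the second cell of every row
# that has exactly five cells whose trimmed content is "-".
dash_row_names() {
  xmllint --html \
    --xpath '//tr[count(td[normalize-space(.)="-"])=5]/td[2]/text()' \
    - 2>/dev/null
}

# Usage (requires xmllint from libxml2):
#   sample_page | dash_row_names
```

The `2>/dev/null` only hides the HTML parser's warnings; drop it while debugging.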

it's probably easier to parse by css classes, so instead of looking for "the 1st table", you'd be looking for "a table that has the class xxxx"
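As a sketch of the class-based idea with html-xml-utils: hxnormalize turns the page into well-formed XML and hxselect picks elements by CSS selector. The class name `xxxx`, the sample HTML, and the helper name `second_column` are all placeholders, since the real page source is not shown here. Note that a CSS selector alone cannot test for "five - cells", so this only pulls out the whole second column.

```shell
#!/bin/sh
# Hypothetical sample; "xxxx" stands in for whatever class the real table has.
sample_doc() {
cat <<'EOF'
<html><body>
<table class="xxxx">
<tr><td>47</td><td>XXXX</td><td>1.0</td></tr>
<tr><td>48</td><td>MSNG</td><td>-</td></tr>
</table>
</body></html>
EOF
}

# $1: CSS class of the table to read (placeholder value in this sketch).
# Prints the content of every second cell, one match per line.
second_column() {
  hxnormalize -x | hxselect -c -s '\n' "table.$1 td:nth-child(2)"
}

# Usage (requires html-xml-utils):
#   sample_doc | second_column xxxx
```

Filtering down to the rows that contain only - would still need a further step on top of this.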

you can upload the html code of the whole page somewhere else, so interested helpers can use that to see what you're trying to achieve.

i'm not a good coder, but i once made a weather forecast script that uses above mentioned utilities, if you want you can take a peek here.