Parse out only specific characters from web page

centosser · 08-12-2014, 05:15 PM

Hi. I am using curl to download a specific web page. Basically, I need the CustName: field from the query run on whois.domaintools.com.

If you type an IP address, it returns the organization name. I need only this information saved in a plain text file. I tried using grep -E, but it gets messy because there are many &nbsp; entities after CustName. Also, the page comes back as one long line, so grepping for CustName returns that whole line. The information I need is followed directly by a line break tag, '<br>', and I need to stop grabbing text at that point.

So what I do is run

Code:

curl -s http://whois.domaintools.com/ip.addr.of.domain > file

Then I run

Code:

grep -E -o "CustName.{120}" file

The 120 is the number of characters to grab after CustName. Many of these are &nbsp; entities, so I use 120 to make sure I grab everything. Since the data is all on one line, a <br> comes right after the information I need. The Address section follows CustName, and that is not what I need. I would only like the information up to the <br>.

Here is an example of the output of the above command:

Code:

grep -E -o "CustName.{120}" file
242:CustName:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Akamai&nbspTechnologies<br/>Address:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs

As you can see, the only information I want is Akamai Technologies. How can I parse out this data in the most efficient way? Thank you for any help.

unSpawn · 08-12-2014, 06:04 PM

If you 'elinks -dump' the page, you get parsed output that you can feed to grep or awk. Or else pipe the curl output through a filter script like the one at http://www.devshed.com/c/a/apache/logging-in-apache/2/ (see "Listing 3-1. A Simple Script to Use As a Filter")?
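
Short of a real HTML parser, a rough sketch of the "stop at the <br>" idea is to match everything up to the next '<' and then clean up the &nbsp entities. The sample line below is modeled on the output posted above. The file name sample.html and the function name extract_custname are made up for illustration, and the pattern assumes the name itself never contains a '<'.

Code:

```shell
# Sample line modeled on the output posted above; in practice this would be
# the file saved by curl. (sample.html is a made-up name.)
printf '%s\n' 'junk CustName:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Akamai&nbspTechnologies<br/>Address:&nbsp;&nbsp;123<br/>' > sample.html

# extract_custname FILE (hypothetical helper):
# print the text between "CustName:" and the next tag.
extract_custname() {
    # [^<]* stops the match at the first "<", i.e. just before <br> or <br/>
    grep -o 'CustName:[^<]*' "$1" |
        sed -e 's/^CustName://' \
            -e 's/&nbsp;\{0,1\}/ /g' \
            -e 's/^ *//' -e 's/ *$//'
}

extract_custname sample.html
```

The second sed expression turns each &nbsp (with or without the trailing semicolon) into a space, and the last two trim the padding. Redirecting the function's output with > custname.txt would give the plain text file.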

centosser · 08-12-2014, 06:25 PM

Hi. I'm not sure what parser means here. I saved the output of the curl command to a file and just ran grep from there. Sorry, I'm new to parsing in apache.

norobro · 08-12-2014, 07:37 PM

Perhaps I'm missing something here, but why not use "whois"?

Code:

whois ip.addr.of.domain | grep OrgName
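
To keep only the value rather than the whole matching line, something like the awk filter below might do. This is only a sketch: it assumes the whois output uses "Label:   value" lines, the sample text is invented for the example, and org_name is a made-up helper name. It also accepts CustName, since that is the label mentioned in the first post.

Code:

```shell
# Invented sample in the "Label:   value" layout; real whois output may differ.
printf '%s\n' \
    'NetRange:       192.0.2.0 - 192.0.2.255' \
    'CustName:       Example Customer' \
    'Address:        1 Example Street' > whois_sample.txt

# org_name (hypothetical helper): read whois-style text on stdin and print
# the value of the first OrgName or CustName line.
org_name() {
    awk '/^(OrgName|CustName):/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

org_name < whois_sample.txt
```

Against a live lookup the same filter would be used as: whois ip.addr.of.domain | org_name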

centosser · 08-13-2014, 03:37 PM

The problem is that whois does not seem to work with every IP address.