LinuxQuestions.org - [SOLVED] Retrieve specific data from html (alternative who is)

Page 1 of 2

- - Retrieve specific data from html (alternative who is) (https://www.linuxquestions.org/questions/programming-9/retrieve-specific-data-from-html-alternative-who-is-4175639318/)

pedropt

09-27-2018 03:46 PM

Retrieve specific data from html (alternative who is)

I have been working with the whois tool for some time, but the whois database is sometimes inaccurate about the country of a specific IP.
The tcpiputils.com website looks very accurate, and we can retrieve an IP's data with wget by downloading the same HTML the browser displays.
But using grep on the downloaded data to retrieve the country is a mess, because for some IPs the HTML shows the city and then the country in the same field.

Here is an example, on tcpiputils, of an IP that has been trying to exploit my mail server:
https://dnslytics.com/ip/142.11.199.129

I can download the HTML with all that data using wget, just by running the command

Quote:

wget https://dnslytics.com/ip/142.11.199.129

If you download this HTML you will see that the field with the city, country, etc. is on line 134.
But that line is a mess to pick anything out of, or to find a unique reference in for grep. There is also another issue ahead: for other IPs the country could be on a different line, and the field could hold only the country with no city name before it.

Here is part of that line:

Quote:

</script></div></div><div class="col-xs-12 col-sm-6 col-md-7 col-lg-7"><p><a href="/ip">IPv4 root</a> -> <a href="/ip/142.0.0.0-142.255.255.255">142/8</a> -> <a href="/ip/142.11.192.0-142.11.255.255">142.11.192.0/18</a> -> 142.11.199.129</p><h2>IP information 142.11.199.129</h2><table class="table table-condensed table-hover table-striped"><tr><td>IP address</td><td>142.11.199.129</td></tr><tr><td>Location</td><td>Seattle, Washington, United States (US) <img src="/images/blank.gif" alt="us flag" class="flag flag-us"></td></tr><tr><td>Registry</td><td>arin</td></tr></table></div></div><div class="row"><div class.....

Does anyone have an idea how to start?

This one looks a bit hard to figure out, at least for me.
Thanks

individual

09-27-2018 04:01 PM

What specific information are you trying to get? Just the city? All of the location information?

pedropt

09-27-2018 04:06 PM

Hi, I just want the country.

individual

09-27-2018 04:11 PM

Do you want the full country name, or the two letter country code?

pedropt

09-27-2018 04:16 PM

Whichever is easier to get; I believe that is the two-letter country code.
Using the two-letter code I can look up the full country name in a country list I have here.

individual

09-27-2018 04:20 PM

Here you go. This was kind of a fun little script to write. It was easiest to use the PCRE (Perl Compatible Regular Expressions) mode of grep to search for the appropriate lines. I've included the option to select the full or short country name.
EDIT: Please let me know if anything isn't clear.
EDIT2: I'm sorry, full_country is actually the state name.
EDIT3: I updated it to get the full country name.

Code:

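#!/bin/bash
# Usage: pass the IP address as the first argument.

site="https://dnslytics.com/ip"
addr=$1

# Fetch the page quietly and keep the HTML in a variable.
html=$(wget -qO - "$site/$addr")

# Match a cell whose text is directly followed by an opening tag such as <img,
# as the Location cell is.
location_line=$(grep -m1 -oE '\btd><td>([^<]+)<\b' <<< "$html")

# Without -E, \( and \) are grouping, so this matches the bare two-letter code.
country=$(grep -m1 -o '\([A-Z][A-Z]\)' <<< "$location_line")

# Text after a ";" or ">" that runs up to "&nbsp;(" is the full country name.
full_country=$(grep -oP "(?<=;|>)\b([^&]+)(?=&nbsp;\()" <<< "$location_line")

echo "$country"
echo "$full_country"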
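
For anyone adapting the script above, here is a minimal sketch of pulling the two-letter code out of the Location cell without depending on which line of the page it sits on. It assumes the page keeps the `<td>Location</td><td>...` layout quoted in the first post; `location_cell` and `country_code` are hypothetical helper names, not part of the original script.

```shell
#!/bin/bash
# Sketch only: assumes the markup quoted in the first post, i.e.
#   <td>Location</td><td>City, State, Country (CC) <img ...
# location_cell and country_code are made-up helper names.

# Print the text of the table cell that follows "Location" (HTML on stdin).
location_cell() {
    grep -o '<td>Location</td><td>[^<]*' | head -n 1 | sed 's/.*<td>//'
}

# Print the two uppercase letters found inside parentheses (text on stdin).
country_code() {
    sed -n 's/.*(\([A-Z][A-Z]\)).*/\1/p'
}

# Usage (needs network access):
#   wget -qO - "https://dnslytics.com/ip/142.11.199.129" | location_cell | country_code
```

Anchoring on the "Location" label rather than on a line number should also cope with entries that list only a country and no city, as long as the label itself stays the same.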

scasey

09-27-2018 04:39 PM

An interesting site, and interesting script...but I don't see any real advantage over whois

Code:

whois 142.11.199.129 | grep -i Country

Country:        US

The website just appears to be reporting data from whois and is displaying exactly the same thing that whois returns on my server.

pedropt

09-27-2018 04:44 PM

Wow :hattip: .
Very good programming in such a short time.

Thank you very much for the code. In my script I already use whois, but this one will kick in if the whois output is unreliable or old.
Somehow, I believe many people will use this code you wrote in the future.

Thanks again

Edited

scasey,
The problem with the whois tool is that it is sometimes not very accurate, especially if the server is located in one country and the person who registered it is in another. In that case whois returns multiple countries; for one IP here I got three: US, CN (China) and SG (Singapore).
Also, whois is sometimes overloaded and the query can time out.

The code individual wrote gives you an alternative way to get an IP's country name without having to use whois. The dnslytics website can also reverse an IP to a hostname and shows a lot of other information that whois cannot provide.
Anyone who uses the script can adapt it to extract other values from the web page that you normally cannot get with whois.
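
The fallback idea described above could be sketched roughly as follows: accept the whois answer only when every country line agrees, and otherwise signal the caller to try the web lookup. `whois_country` is a hypothetical helper name, and the 10-second limit in the usage note is an arbitrary choice, not something from the thread.

```shell
#!/bin/bash
# Sketch only: whois_country is a made-up helper name.
# Reads whois output on stdin; prints the country code only when every
# "country:" line agrees, and returns non-zero otherwise.
whois_country() {
    local codes
    codes=$(grep -i '^country:' | awk '{print toupper($2)}' | sort -u)
    # Succeed only if exactly one distinct, non-empty code was found.
    if [ "$(printf '%s\n' "$codes" | grep -c .)" -eq 1 ]; then
        printf '%s\n' "$codes"
    else
        return 1
    fi
}

# Usage (needs network access):
#   timeout 10 whois 142.11.199.129 | whois_country || echo "fall back to the web lookup"
```

If whois times out or returns several different countries, the function prints nothing and fails, so the `||` branch is where the page-scraping script would be called.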

individual

09-27-2018 04:44 PM

Quote:

Originally Posted by scasey (Post 5908614)

An interesting site, and interesting script...but I don't see any real advantage over whois

Code:

whois 142.11.199.129 | grep -i Country

Country:        US

The website just appears to be reporting data from whois and is displaying exactly the same thing that whois returns on my server.

Since pedropt only wants the country, it is probably easier to use the whois program. But now he has options to choose from.

scasey

09-27-2018 04:55 PM

Quote:

Originally Posted by individual (Post 5908618)

Since pedropt only wants the country, it is probably easier to use the whois program. But now he has options to choose from.

Don't get me wrong..your script is most impressive...I just wanted to point out that the website is only displaying what whois returns.

I use whois to get the reporting address for spam reporting, and it works almost all the time. There are issues with KoreaNIC, and sometimes with JPNIC, and I have to go to the relevant web pages for those...sometimes.

Here's a script I use to pull the contact information:

Code:

#!/bin/bash
# gabuse: print the abuse-contact lines from the whois record of the given IP
whois "$1" | grep -i abuse

$ gabuse 142.11.199.129
OrgAbuseHandle: HAC3-ARIN
OrgAbuseName:  Hostwinds Abuse Center
OrgAbusePhone:  +1-206-886-0665
OrgAbuseEmail:  abuse@hostwinds.com
OrgAbuseRef:    https://rdap.arin.net/registry/entity/HAC3-ARIN

pedropt

09-27-2018 04:59 PM

OK scasey, here is another example.

Code:

whois 139.99.118.122 | grep -iE '^country' | awk '{print $2}'

I hope you understand what I mean about the whois tool.

individual

09-27-2018 05:01 PM

Quote:

Originally Posted by scasey (Post 5908622)

Don't get me wrong..your script is most impressive...I just wanted to point out that the website is only displaying what whois returns.

No offense taken. :)

Quote:

Originally Posted by scasey (Post 5908622)

I use whois to get the reporting address for spam reporting, and it works almost all the time. There are issues with KoreaNIC, and sometimes with JPNIC, and I have to go to the relevant web pages for those...sometimes.

Here's a script I use to pull the contact information:

Code:

#!/bin/bash
# gabuse: print the abuse-contact lines from the whois record of the given IP
whois "$1" | grep -i abuse

$ gabuse 142.11.199.129
OrgAbuseHandle: HAC3-ARIN
OrgAbuseName:  Hostwinds Abuse Center
OrgAbusePhone:  +1-206-886-0665
OrgAbuseEmail:  abuse@hostwinds.com
OrgAbuseRef:    https://rdap.arin.net/registry/entity/HAC3-ARIN

Thanks for sharing! I think it would be neat to combine the two scripts.
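
One way the two scripts might be combined, as a rough sketch: run whois once and print both the country codes and the abuse contacts from the same output. `ip_summary` is a hypothetical name, and the function reads whois text on stdin so it can be tried on saved output.

```shell
#!/bin/bash
# Sketch only: ip_summary is a made-up name. Reads whois output on stdin and
# prints the distinct country codes followed by the abuse-contact lines.
ip_summary() {
    local text
    text=$(cat)
    # Join all distinct country codes on one line, separated by commas.
    printf 'Country: %s\n' \
        "$(printf '%s\n' "$text" | grep -i '^country:' | awk '{print $2}' | sort -u | paste -sd, -)"
    # Same filter as the gabuse script.
    printf '%s\n' "$text" | grep -i abuse
}

# Usage (needs network access):
#   whois 142.11.199.129 | ip_summary
```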

scasey

09-27-2018 05:05 PM

Quote:

Originally Posted by pedropt (Post 5908617)

scasey,
The problem with the whois tool is that it is sometimes not very accurate, especially if the server is located in one country and the person who registered it is in another. In that case whois returns multiple countries; for one IP here I got three: US, CN (China) and SG (Singapore).

Yes, I've seen that. As said, I mostly use whois to identify the reporting address for the IP that delivered spam to my servers...in which case I only need to know the upstream provider/owner of that IP.
When that doesn't work, I go directly to the managing Network Information Center's website (although, I have also bookmarked https://dnslytics.com -- it certainly can be useful.)

pedropt

09-27-2018 05:23 PM

Quote:

Originally Posted by scasey (Post 5908629)

I personally don't contact the upstream provider, because my server receives a lot of exploitation attempts and denial-of-service traffic from the same subnet as a given IP.
My solution was to implement some rules to deal with DoS attacks, and then look in the logs at what a specific IP has been doing; depending on that, I can block it directly in the firewall with my script.
This way I don't have to worry about that IP again.
Sometimes a whole subnet is trying to hack the server. For example: one day I get a port scan from 192.168.1.30 and block it in the firewall; the next day I get another port scan or a DoS from 192.168.1.35, which gets the same treatment; then two days later I get an exploitation attempt or something else from another IP in the same subnet, like 192.168.1.50. The best approach here is to block that subnet as a whole. As far as I can tell, most attacks come from other websites that run public services, so I think either those sites were hacked and the attacker is using them as a remote shell, relaying the work through a different IP, or the site owner had nothing better to do and decided to hammer something on the web.

Definitely the best way is to block in the firewall; otherwise I would have to contact ISPs every day about abusive IPs on their networks, and I don't have time for that.
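
A minimal sketch of the "block the whole subnet" step, assuming that "the subnet" means the enclosing /24 (the post does not say which size is used) and that the firewall is iptables. `block_subnet_cmd` is a hypothetical helper name; it only prints the rule so it can be reviewed before being run as root.

```shell
#!/bin/bash
# Sketch only: assumes "the subnet" means the enclosing /24, which the post
# does not specify. block_subnet_cmd is a made-up helper name.

# Print (do not run) an iptables rule that drops traffic from the /24
# containing the given IPv4 address.
block_subnet_cmd() {
    local ip=$1
    if [[ $ip =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.[0-9]{1,3}$ ]]; then
        echo "iptables -I INPUT -s ${BASH_REMATCH[1]}.${BASH_REMATCH[2]}.${BASH_REMATCH[3]}.0/24 -j DROP"
    else
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
}

# Usage:
#   block_subnet_cmd 192.168.1.35
#   prints: iptables -I INPUT -s 192.168.1.0/24 -j DROP
```

Blocking a /24 also blocks every unrelated host in that range, so the real allocation size reported by whois may be a better guide than a fixed prefix length.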

scasey

09-27-2018 05:44 PM

I definitely agree about blocking netblocks, and do that routinely, but only for email, using ucspi-tcp, which can be configured to drop connections on port 25. We currently block about 75% of connection attempts from spamming providers, mostly in other countries.
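
For readers unfamiliar with ucspi-tcp, a rough sketch of what such a block list could look like in tcprules source format, where an address prefix ending in a dot covers the whole range. `tcprules_deny` is a hypothetical helper name, and the file paths in the usage note are assumptions about a typical setup, not the configuration described above.

```shell
#!/bin/bash
# Sketch only: tcprules_deny is a made-up helper name, and the file paths in
# the usage note are assumptions about a typical ucspi-tcp setup.

# Read address prefixes such as "192.168.1." (one per line) on stdin and print
# tcprules source lines that deny each one, ending with a catch-all allow.
tcprules_deny() {
    local prefix
    while read -r prefix; do
        # Skip blank lines in the input.
        [ -n "$prefix" ] && printf '%s:deny\n' "$prefix"
    done
    printf ':allow\n'
}

# Usage: compile the rules into the cdb file that tcpserver reads with -x:
#   tcprules_deny < blocked-prefixes.txt | tcprules /etc/tcp.smtp.cdb /etc/tcp.smtp.tmp
```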

We've automated reporting such that we can supply a Perl script with the reporting address, the delivering IP address, and the name of the Maildir file containing the spam. The script then composes and sends an email to the provider. In our experience, the vast majority of providers welcome the reports, as they allow them to address the source of the UCE and thereby avoid being blacklisted.
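
To illustrate the shape of that workflow (this is not the Perl script itself), here is a minimal shell sketch that takes the same three inputs and prints a report message. `compose_report` is a hypothetical name, and the subject and body wording are placeholders.

```shell
#!/bin/bash
# Sketch only: not the poster's Perl script. compose_report is a made-up name
# and the subject and body wording are placeholders.

# Print a plain-text report: headers, a short note, then the original message.
# Arguments: reporting address, delivering IP, path to the Maildir file.
compose_report() {
    local to=$1 ip=$2 spam_file=$3
    [ -r "$spam_file" ] || { echo "cannot read $spam_file" >&2; return 1; }
    printf 'To: %s\n' "$to"
    printf 'Subject: Spam received from %s\n\n' "$ip"
    printf 'The message below was delivered to our server from %s.\n\n' "$ip"
    cat -- "$spam_file"
}

# Usage (sendmail -t takes the recipient from the To: header):
#   compose_report abuse@example.com 192.0.2.10 /path/to/Maildir/file | sendmail -t
```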

We used to do all that automatically, but the program we'd found to do the contact lookups stopped working and was no longer maintained. Before that happened, we'd built the very effective block list mentioned above, tho...so we get relatively little UCE anymore...a small enough amount that we can manage doing the lookups and reporting manually.

All times are GMT -5.
