LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 09-27-2018, 03:46 PM   #1
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Rep: Reputation: Disabled
Retrieve specific data from html (alternative who is)


I have been working with the whois tool for some time, but sometimes the whois database is not very accurate about the country of a specific IP.
The tcpiputils.com website looks very accurate, and we can retrieve the HTML for an IP with wget, downloading the same page a browser displays.
But using grep on the downloaded data to retrieve the country is a mess, because for some IPs the HTML shows the city and then the country in the same field.

Here is an example, on tcpiputils, of an IP that has been trying to exploit my mail server:
https://dnslytics.com/ip/142.11.199.129

I can download the HTML with all that data using wget, just by sending the command

If you download this HTML, you will see that line 134 holds the field with the city, country, etc.
But that line is a mess: it is hard to pick anything out of it or to find a unique reference for grep to match. I also have another issue ahead, which is that for other IPs the country could be on a different line, and the field could hold only the country without a city name before it.

Here is part of that line:
Quote:
</script></div></div><div class="col-xs-12 col-sm-6 col-md-7 col-lg-7"><p><a href="/ip">IPv4 root</a> -> <a href="/ip/142.0.0.0-142.255.255.255">142/8</a> -> <a href="/ip/142.11.192.0-142.11.255.255">142.11.192.0/18</a> -> 142.11.199.129</p><h2>IP information 142.11.199.129</h2><table class="table table-condensed table-hover table-striped"><tr><td>IP address</td><td>142.11.199.129</td></tr><tr><td>Location</td><td>Seattle,&nbsp;Washington,&nbsp;United States&nbsp;(US) <img src="/images/blank.gif" alt="us flag" class="flag flag-us"></td></tr><tr><td>Registry</td><td>arin</td></tr></table></div></div><div class="row"><div class.....
Does anyone have an idea how to start?

This one looks a bit hard to figure out, at least for me.
Thanks
 
Old 09-27-2018, 04:01 PM   #2
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
What specific information are you trying to get? Just the city? All of the location information?
 
Old 09-27-2018, 04:06 PM   #3
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Original Poster
Rep: Reputation: Disabled
Hi, I just want the country.
 
Old 09-27-2018, 04:11 PM   #4
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Do you want the full country name, or the two letter country code?
 
Old 09-27-2018, 04:16 PM   #5
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Original Poster
Rep: Reputation: Disabled
Whichever is easier to get; I believe that is the two-letter country code.
With the two-letter code I can look up the full country name in a country list I have here.
 
Old 09-27-2018, 04:20 PM   #6
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Here you go. This was kind of a fun little script to write. It was easiest to use the PCRE (Perl Compatible Regular Expressions) mode of grep to search for the appropriate lines. I've included the option to select the full or short country name.
EDIT: Please let me know if anything isn't clear.
EDIT2: I'm sorry, full_country is actually the state name.
EDIT3: I updated it to get the full country name.
Code:
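#!/bin/bash
# Usage: pass the IP address as the first argument.

site="https://dnslytics.com/ip"
addr=$1
# Fetch the page quietly and write it to stdout.
html=$(wget -qO - "$site/$addr")
# In the sample page only the Location cell is followed by an <img> tag,
# which is what the trailing '<\b' relies on.
location_line=$(grep -m1 -oE '\btd><td>([^<]+)<\b' <<< "$html")
# Two-letter code inside literal parentheses, e.g. (US) gives US.
country=$(grep -m1 -oP '\(\K[A-Z]{2}(?=\))' <<< "$location_line")
# Text between the last ';' (or '>') and "&nbsp;(", e.g. United States.
full_country=$(grep -oP "(?<=;|>)\b([^&]+)(?=&nbsp;\()" <<< "$location_line")

echo "$country"
echo "$full_country"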

Last edited by individual; 09-28-2018 at 02:53 PM. Reason: Fixing silly typos.
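
For readers who want to test the parsing without hitting the website, here is a minimal sketch that reads the HTML on stdin and anchors on the "Location" label instead of the trailing image tag. The function names (location_cell, country_code, country_name) are made up for illustration, and the sketch assumes the row always looks like the sample quoted in post #1, that is `<td>Location</td><td>...&nbsp;(XX) <img ...`. It is not a replacement for the script above, just one way to handle the "country without a city" case.

```shell
#!/bin/bash
# Hypothetical helpers; they assume the dnslytics row format quoted in post #1.

# Print the raw text of the cell that follows <td>Location</td>.
location_cell() {
    sed -n -E 's#.*<td>Location</td><td>([^<]*).*#\1#p'
}

# Print the two-letter code found inside parentheses, e.g. (US) gives US.
country_code() {
    location_cell | sed -n -E 's/.*\(([A-Z]{2})\).*/\1/p'
}

# Print the last &nbsp;-separated field before the "(XX)" part.
# If the cell holds only the country, that single field is printed.
country_name() {
    location_cell | sed -E 's/&nbsp;\([A-Z]{2}\).*//; s/.*&nbsp;//'
}
```

It could be used as `wget -qO - "https://dnslytics.com/ip/142.11.199.129" | country_code`. If the site changes its markup, both this sketch and the grep version will need adjusting.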
 
Old 09-27-2018, 04:39 PM   #7
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,734

Rep: Reputation: 2212
An interesting site, and interesting script...but I don't see any real advantage over whois
Code:
whois 142.11.199.129 | grep -i Country
Country:        US
The website just appears to be reporting data from whois and is displaying exactly the same thing that whois returns on my server.
 
Old 09-27-2018, 04:44 PM   #8
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Original Poster
Rep: Reputation: Disabled
Wow.
Very good programming in such a short time.

Thank you very much for the code. My script already uses whois, but this one will kick in when the whois output is unreliable or outdated.
I believe many people will use the code you wrote in the future.

Thanks again

Edited

scasey
The problem with the whois tool is that it is sometimes not very accurate, especially if the server is located in one country and the person who registered it is in another. In that case whois returns multiple countries; for one IP here I got three countries: US, CN (China), and SG (Singapore).
Also, whois is sometimes overloaded and the query can time out.

The code "individual" wrote gives you an alternative way to get an IP's country name without having to use whois. The dnslytics website can also reverse an IP to a hostname and offers a lot of other information that whois is not able to provide.
Anyone who uses it can also pull other values from the web page that you normally cannot get with whois.

Last edited by pedropt; 09-27-2018 at 04:53 PM.
 
Old 09-27-2018, 04:44 PM   #9
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Quote:
Originally Posted by scasey
An interesting site, and interesting script...but I don't see any real advantage over whois
Code:
whois 142.11.199.129 | grep -i Country
Country:        US
The website just appears to be reporting data from whois and is displaying exactly the same thing that whois returns on my server.
Since pedropt only wants the country, it is probably easier to use the whois program. But now he has options to choose from.
 
Old 09-27-2018, 04:55 PM   #10
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,734

Rep: Reputation: 2212
Quote:
Originally Posted by individual
Since pedropt only wants the country, it is probably easier to use the whois program. But now he has options to choose from.
Don't get me wrong... your script is most impressive... I just wanted to point out that the website is only displaying what whois returns.

I use whois to get the reporting address for spam reporting, and it works almost all the time. There are issues with KoreaNIC, and sometimes with JPNIC, and I have to go to the relevant web pages for those...sometimes.

Here's a script I use to pull the contact information:
Code:
#!/bin/bash
# gabuse: print the abuse-contact lines from the whois record of the given IP
whois "$1" | grep -i abuse

$ gabuse 142.11.199.129
OrgAbuseHandle: HAC3-ARIN
OrgAbuseName:   Hostwinds Abuse Center
OrgAbusePhone:  +1-206-886-0665 
OrgAbuseEmail:  abuse@hostwinds.com
OrgAbuseRef:    https://rdap.arin.net/registry/entity/HAC3-ARIN
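
If only the reporting address is needed (for example to feed it to another script), the output above can be narrowed down further. This is a rough sketch, not part of gabuse: the function name abuse_emails is invented here, it reads whois text on stdin, and the e-mail pattern is a loose approximation rather than a full address validator.

```shell
#!/bin/bash
# Hypothetical helper: print the distinct e-mail addresses that appear
# on whois lines mentioning "abuse". Reads whois output on stdin.
abuse_emails() {
    grep -i 'abuse' \
        | grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
        | sort -u
}
```

Usage would be `whois 142.11.199.129 | abuse_emails`. Registries format their records differently, so some of them may need a different filter.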
 
Old 09-27-2018, 04:59 PM   #11
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Original Poster
Rep: Reputation: Disabled
OK scasey, here is another example.
Code:
whois 139.99.118.122 | grep -i '^country' | awk '{print $2}'
I hope you understand what I mean about the whois tool.
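
To make the "several countries in one whois answer" case easy to spot, the pipeline above can be wrapped so that it prints each distinct country once and reports whether the answer is unambiguous. This is only a sketch: the function names whois_countries and single_country are invented, both read whois text on stdin, and they assume the "country: XX" line layout used in the command above.

```shell
#!/bin/bash
# Hypothetical helpers; they read whois output on stdin.

# Print every distinct value of the "country:" fields, upper-cased, one per line.
whois_countries() {
    grep -i '^country' | awk '{print toupper($2)}' | sort -u
}

# Print the country and return 0 only if exactly one distinct value exists;
# otherwise print nothing and return 1, so the caller can fall back to
# another source such as the web page.
single_country() {
    countries=$(whois_countries)
    if [ -n "$countries" ] && [ "$(printf '%s\n' "$countries" | wc -l)" -eq 1 ]; then
        printf '%s\n' "$countries"
    else
        return 1
    fi
}
```

For example, `whois 139.99.118.122 | single_country || echo "ambiguous, use the website"` would trigger the fallback whenever whois reports more than one country or none at all.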
 
Old 09-27-2018, 05:01 PM   #12
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Quote:
Originally Posted by scasey
Don't get me wrong... your script is most impressive... I just wanted to point out that the website is only displaying what whois returns.
No offense taken.
Quote:
Originally Posted by scasey
I use whois to get the reporting address for spam reporting, and it works almost all the time. There are issues with KoreaNIC, and sometimes with JPNIC, and I have to go to the relevant web pages for those...sometimes.

Here's a script I use to pull the contact information:
Code:
#!/bin/bash
# gabuse: print the abuse-contact lines from the whois record of the given IP
whois "$1" | grep -i abuse

$ gabuse 142.11.199.129
OrgAbuseHandle: HAC3-ARIN
OrgAbuseName:   Hostwinds Abuse Center
OrgAbusePhone:  +1-206-886-0665 
OrgAbuseEmail:  abuse@hostwinds.com
OrgAbuseRef:    https://rdap.arin.net/registry/entity/HAC3-ARIN
Thanks for sharing! I think it would be neat to combine the two scripts.
 
Old 09-27-2018, 05:05 PM   #13
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,734

Rep: Reputation: 2212
Quote:
Originally Posted by pedropt
scasey
The problem with the whois tool is that it is sometimes not very accurate, especially if the server is located in one country and the person who registered it is in another. In that case whois returns multiple countries; for one IP here I got three countries: US, CN (China), and SG (Singapore).
Yes, I've seen that. As said, I mostly use whois to identify the reporting address for the IP that delivered spam to my servers...in which case I only need to know the upstream provider/owner of that IP.
When that doesn't work, I go directly to the managing Network Information Center's website (although, I have also bookmarked https://dnslytics.com -- it certainly can be useful.)
 
Old 09-27-2018, 05:23 PM   #14
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by scasey
Yes, I've seen that. As said, I mostly use whois to identify the reporting address for the IP that delivered spam to my servers...in which case I only need to know the upstream provider/owner of that IP.
When that doesn't work, I go directly to the managing Network Information Center's website (although, I have also bookmarked https://dnslytics.com -- it certainly can be useful.)
I personally don't contact the upstream provider, because my server receives a lot of exploitation attempts or denial-of-service attacks from the same subnet as a given IP.
My solution was to implement some rules to deal with DoS attacks, then look in the logs at what a specific IP has been doing; depending on that, I can block it directly in the firewall with my script.
This way I don't have to worry about that IP again.
Sometimes a whole subnet is trying to hack the server. For example: one day I get a port scan from 192.168.1.30 and block it in the firewall; the next day I get another port scan or a DoS from 192.168.1.35, which gets the same treatment; then two days later I get an exploitation attempt or something else from another IP in the same subnet, like 192.168.1.50. The best approach here is to block that subnet as a whole. As I notice here, most attacks come from other open websites that offer public services, so I think either those sites were hacked and the attacker is using the website as a remote shell, redirecting the job through a different IP, or the owner of the website had nothing better to do and decided to hammer something on the web.

Definitely the best way is to block in the firewall; otherwise I would have to contact ISPs every day about abusive IPs on their networks, and I don't have time for that.
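
As an illustration of blocking a subnet as a whole, here is a small sketch that turns one offending IPv4 address into a /24 rule. Several things here are assumptions and not from the posts above: the firewall is taken to be iptables, the "subnet" is guessed to be a /24 (the real allocation can be larger or smaller, as the 142.11.192.0/18 block in post #1 shows), and the function names subnet24 and block_rule are invented. The rule is only printed, not applied, so it can be reviewed first.

```shell
#!/bin/bash
# Hypothetical helpers; iptables and the /24 size are assumptions.

# 192.168.1.35 -> 192.168.1.0/24 (keeps the first three octets).
# Prints nothing if the argument does not have four dot-separated fields.
subnet24() {
    printf '%s\n' "$1" | awk -F. 'NF == 4 {print $1 "." $2 "." $3 ".0/24"}'
}

# Print (do not run) an iptables rule that drops traffic from the /24
# containing the given address.
block_rule() {
    printf 'iptables -I INPUT -s %s -j DROP\n' "$(subnet24 "$1")"
}
```

Calling `block_rule 192.168.1.50` prints `iptables -I INPUT -s 192.168.1.0/24 -j DROP`, which can then be run as root or added to an existing firewall script.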
 
Old 09-27-2018, 05:44 PM   #15
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,734

Rep: Reputation: 2212
I definitely agree about blocking netblocks, and do that routinely, but only for email, using ucspi-tcp, which can be configured to drop connections on port 25. We currently block about 75% of connection attempts from spamming providers, mostly in other countries.

We've automated reporting such that we can supply a perl script with the reporting address, the delivering IP address, and the name of the Maildir file containing the spam. The script then composes and sends an email to the provider. In our experience, the vast majority of providers welcome the reports, as they allow them to address the source of the UCE and thereby avoid being blacklisted.

We used to do all that automatically, but the program we'd found to do the contact lookups stopped working and was no longer maintained. Before that happened, we'd built the very effective block list mentioned above, tho...so we get relatively little UCE anymore...a small enough amount that we can manage doing the lookups and reporting manually.
 
  

