LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 06-29-2019, 09:18 PM   #1
anon033
Member
 
Registered: Mar 2019
Posts: 188

Rep: Reputation: 13
Heavy Parsing of Text File


Hello everyone! I am very excited, as I have finally started working on some much-needed scripts. The one I am currently working on configures /etc/hosts to block what is basically trashware. I was recommended some amazing lists over in the security part of LQ and am currently working on the script part. This is the script I have so far:

Code:
#!/bin/sh

# download the hosts lists
wget http://winhelp2002.mvps.org/hosts.txt
wget https://hosts-file.net/grm.txt

wget https://hosts-file.net/hphosts-partial.txt
wget https://reddestdream.github.io/Projects/MinimalHosts/etc/MinimalHostsBlocker/minimalhosts

wget https://raw.githubusercontent.com/StevenBlack/hosts/master/data/KADhosts/hosts
wget https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.Spam/hosts

wget https://v.firebog.net/hosts/static/w3kbl.txt
wget https://raw.githubusercontent.com/notracking/hosts-blocklists/master/hostnames.txt

wget https://adaway.org/hosts.txt
wget https://v.firebog.net/hosts/AdguardDNS.txt

wget https://raw.githubusercontent.com/anudeepND/blacklist/master/adservers.txt
wget https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt

wget https://hosts-file.net/ad_servers.txt
wget https://v.firebog.net/hosts/Easylist.txt

wget 'http://pgl.yoyo.org/as/serverlist.php?hostformat=hosts&mimetype=plaintext'
wget https://raw.githubusercontent.com/StevenBlack/hosts/master/data/UncheckyAds/hosts

wget https://www.squidblacklist.org/downloads/dg-ads.acl
wget https://v.firebog.net/hosts/Easyprivacy.txt

wget https://v.firebog.net/hosts/Prigent-Ads.txt
wget https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt

wget https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts
wget https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt

wget https://s3.amazonaws.com/lists.disconnect.me/simple_malvertising.txt
wget https://mirror1.malwaredomains.com/files/justdomains

wget https://hosts-file.net/exp.txt
wget https://hosts-file.net/emd.txt

wget https://hosts-file.net/psh.txt
wget https://mirror.cedia.org.ec/malwaredomains/immortal_domains.txt

wget https://www.malwaredomainlist.com/hostslist/hosts.txt
wget https://bitbucket.org/ethanr/dns-blacklists/raw/8575c9f96e5b4a1308f2f12394abd86d0927a4a0/bad_lists/Mandiant_APT1_Report_Appendix_D.txt

wget https://v.firebog.net/hosts/Prigent-Malware.txt
wget https://v.firebog.net/hosts/Prigent-Phishing.txt

wget https://phishing.army/download/phishing_army_blocklist_extended.txt
wget https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-malware.txt

wget https://ransomwaretracker.abuse.ch/downloads/RW_DOMBL.txt
wget https://ransomwaretracker.abuse.ch/downloads/CW_C2_DOMBL.txt

wget https://ransomwaretracker.abuse.ch/downloads/LY_C2_DOMBL.txt
wget https://ransomwaretracker.abuse.ch/downloads/TC_C2_DOMBL.txt

wget https://ransomwaretracker.abuse.ch/downloads/TL_C2_DOMBL.txt
wget 'https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist'

wget https://v.firebog.net/hosts/Shalla-mal.txt
wget https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.Risk/hosts

wget https://www.squidblacklist.org/downloads/dg-malicious.acl
wget https://zerodot1.gitlab.io/CoinBlockerLists/hosts

wget https://raw.githubusercontent.com/jmdugan/blocklists/master/corporations/facebook/all
wget https://hosts-file.net/ad_servers.txt

wget https://hosts-file.net/emd.txt
wget https://hosts-file.net/exp.txt

wget https://hosts-file.net/fsa.txt
wget https://hosts-file.net/grm.txt

wget https://hosts-file.net/hfs.txt
wget https://hosts-file.net/hjk.txt

wget https://hosts-file.net/mmt.txt
wget https://hosts-file.net/psh.txt

# merge all files into a temporary text file
find "$PWD" -type f | xargs cat > tmp.txt

# remove all duplicate hosts
sort -u tmp.txt > hosts.tmp

# add formatting header
sed '1 s/^/<ip-address>    <host-name>\n/' hosts.tmp > tmp.txt
It's a lot, I know. This is all working 100% fine (it takes a bit, but I expected that). Now comes the hard part. Just a recap of what I have done so far:

1) First I gathered all the lists I need

2) I merge all these files into a temporary file called "tmp.txt"

3) I sorted tmp.txt and removed all the duplicates and put the results into "hosts.tmp"

4) I add a comment for formatting to the top of the file and save it back into "tmp.txt"

What I need to do next is go into hosts and:

1) remove all # comments and everything that isn't a URL (this will be done before I add the comment)

2) under the ip-address column insert:

Code:
0.0.0.0
3) move all the URLs under host-name

I am having a hard time doing this part and need some help. If anyone has any docs, samples, or even a way to do one of these, I would love some help. Thank you so much! I am doing this in generic UNIX shell script, so things like sed -i don't work, since sed -i behaves differently in GNU sed than in, well, every other sed.

I got it all working:

Code:
#!/bin/sh

echo "configuring /etc/hosts..."

# download the hosts lists
torsocks wget -q --show-progress -i host-list.txt

# merge all files into a temporary text file
cat * >> tmp.txt

# remove all duplicate hosts and comments
cat tmp.txt | uniq -iu > hosts.tmp

# replace all ips with 0.0.0.0
cat tmp.txt | awk '$1="0.0.0.0"' > hosts.tmp


# add formatting header
sed '1 s/^/#<ip-address>    <host-name>/' hosts.tmp > tmp.txt

# move to /etc/hosts
mv tmp.txt /etc/hosts

echo "/etc/hosts configured"
Thank you for the pointers

Last edited by anon033; 06-30-2019 at 12:48 PM.
 
Old 06-30-2019, 12:57 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
I would say the first thing to do is remove all the duplication from your script:
Code:
wget https://hosts-file.net/emd.txt
I would also suggest making a temporary directory to download all your files into, so removal is easy and the current directory doesn't get clogged up.

If sed and other commands are an issue, you may want to simply loop through the file and make all the necessary changes in one pass.
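A minimal sketch of both suggestions together (host-list.txt and the wget loop are placeholders; local files stand in for the downloads so the sketch runs offline):

```shell
#!/bin/sh
# scratch directory for the downloads; the trap removes it even on early exit
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

# the real script would loop over a URL list instead of repeating wget:
#   while IFS= read -r url; do wget -q -P "$tmpdir" "$url"; done < host-list.txt
# stand-in downloads so this sketch is self-contained:
printf 'ads.example.com\n' > "$tmpdir/list1"
printf 'tracker.example.net\nads.example.com\n' > "$tmpdir/list2"

# merge and de-duplicate; only hosts.tmp is left in the current directory
cat "$tmpdir"/* | sort -u > hosts.tmp
```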
 
1 members found this post helpful.
Old 06-30-2019, 01:26 AM   #3
tshikose
Member
 
Registered: Apr 2010
Location: Kinshasa, Democratic Republic of Congo
Distribution: RHEL, Fedora, CentOS
Posts: 525

Rep: Reputation: 95
Hi,

I suggest you use the commands below, in regard to your numbered tasks.
I am doing this from memory and have not tested them, so make sure you adapt them to your real needs and environment.
man pages exist for a reason.

1) the wget you are already using; adding a cd beforehand to change to a working directory will help, as suggested

2) cat * > tmp.txt

3) uniq -iu tmp.txt > tmp_uniq.txt

4) sed -i ... tmp_uniq.txt, please figure out what the ... should be replaced with


1) insert a grep -v '^#' after the uniq above

2) sed -i ... tmp_uniq.txt, with similar comment for ...


3) I do not understand what you meant here
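Strung together on stand-in data (the filenames are placeholders), those pieces look roughly like this; note that uniq only collapses *adjacent* duplicates, so feed it sorted input:

```shell
#!/bin/sh
# stand-ins for two downloaded lists
printf '# a comment\nads.example.com\n' > list_a.txt
printf 'ads.example.com\ntracker.example.net\n' > list_b.txt

# 2) merge, then drop comment lines, then 3) de-duplicate
#    (sort first, since uniq only removes adjacent duplicates)
cat list_a.txt list_b.txt | grep -v '^#' | sort | uniq > tmp_uniq.txt
```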
 
1 members found this post helpful.
Old 06-30-2019, 02:08 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
And I thought the hosts file had passed its use-by date. I haven't used one in years for blocking like that.
This seems (massively) overly manual. What about something like pihole to handle everything on the LAN?
 
Old 06-30-2019, 03:19 AM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Have a look at how this script does it.
Do not reinvent the wheel (and don't ask us to help you reinvent the wheel).
 
3 members found this post helpful.
Old 06-30-2019, 11:08 AM   #6
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
I'm not going to write a full script, but here are some pointers:

1) Look into here-docs. You can use that to group your URLs together.

2) By using #1, xargs can solve your downloading problem in a shorter way.

3) Why bother with temporary files when pipes work just as well?

4) Don't use sed for something that can be accomplished with echo/printf and cat.
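A sketch of how those pointers combine (the URLs and filenames are placeholders, and a printf stands in for the downloaded content so the sketch runs offline):

```shell
#!/bin/sh
# 1) a here-doc groups the URLs in one place
#    (unused below; shown only for the grouping idea)
urls=$(cat <<'EOF'
http://example.com/hosts1.txt
http://example.com/hosts2.txt
EOF
)

# 2) with real lists the download step would be:
#      echo "$urls" | xargs -n1 wget -q -O -
# 3) pipes replace the temp files; 4) printf and cat add the header, no sed
printf 'ads.example.com\ntracker.example.net\nads.example.com\n' |
    sort -u |
    { printf '#<ip-address>    <host-name>\n'; cat; } > hosts.out
```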

EDIT:
I skimmed past your filtering problem, but I'd recommend a mix of sed and grep's extended regex.

Last edited by individual; 06-30-2019 at 11:21 AM.
 
1 members found this post helpful.
Old 06-30-2019, 12:58 PM   #7
anon033
Member
 
Registered: Mar 2019
Posts: 188

Original Poster
Rep: Reputation: 13
Quote:
Originally Posted by individual View Post
I'm not going to write a full script, but here are some pointers:

1) Look into here-docs. You can use that to group your URLs together.

2) By using #1, xargs can solve your downloading problem in a shorter way.

3) Why bother with temporary files when pipes work just as well?

4) Don't use sed for something that can be accomplished with echo/printf and cat.

EDIT:
I skimmed past your filtering problem, but I'd recommend a mix of sed and grep's extended regex.
Noted thank you

Last edited by anon033; 06-30-2019 at 01:08 PM.
 
Old 06-30-2019, 02:16 PM   #8
anon033
Member
 
Registered: Mar 2019
Posts: 188

Original Poster
Rep: Reputation: 13
Quote:
Originally Posted by ondoho View Post
Have a look at how this script does it.
Do not reinvent the wheel (and don't ask us to help you reinvent the wheel).
I am not reinventing the wheel; I am writing a script to do a task, and there isn't anything that works for me and my system that is fully portable across ALL UNIX systems, is pure UNIX shell script (not written in Perl, Python, or Bash), and uses Tor. I didn't mean to come across as asking you all to solve this for me; I just needed some pointers. I got everything working (I need to fix some issues, because some of these lists aren't formatted and Windows leaves stupid encodings in the files ._.), but thank you for your advice.
 
Old 06-30-2019, 02:35 PM   #9
anon033
Member
 
Registered: Mar 2019
Posts: 188

Original Poster
Rep: Reputation: 13
Quote:
Originally Posted by syg00 View Post
And I thought host file had passed its use-by date. I haven't used one in years for blocking like that.
This seems (massively) overly manual. What about something like pihole to handle everything on the LAN ?.
My goal is to see just how much I can do with the bare system and a few tools. It's half anal-retentive systemness and half let's see how far we can get with this.
 
Old 06-30-2019, 03:52 PM   #10
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Since you solved it on your own, I'll show you my solution to it.
Code:
#!/usr/bin/env dash

files="$(cat - <<'EOF')"
PUT ALL OF YOUR URLS HERE
EOF

# pipe the URLs to xargs, which will call wget with output silenced.
# pipe the downloaded content to sort -u and store it in a temporary file.
echo "$files" | xargs -I{} wget -q -O - {} | sort -u > tmp.txt

# filter hosts to only include valid hostnames.
sed 's/#.*//' tmp.txt | grep -Eo '([a-zA-Z0-9-]+)(\.[a-zA-Z0-9-]+){1,}' | grep -Ev '[0-9]{1,3}(\.[0-9]{1,3}){3}' > tmp2.txt

# format the file to align urls under host-name and put 0.0.0.0 under ip-address.
sort -u tmp2.txt | sed 's/^/                /' > tmp.txt
printf "<ip-address>    <host-name>\n0.0.0.0\n" | cat - tmp.txt > hosts

# remove temporary files.
rm tmp*
 
Old 06-30-2019, 04:01 PM   #11
anon033
Member
 
Registered: Mar 2019
Posts: 188

Original Poster
Rep: Reputation: 13
Quote:
Originally Posted by individual View Post
Since you solved it on your own, I'll show you my solution to it.
Code:
#!/usr/bin/env dash

files="$(cat - <<'EOF')"
PUT ALL OF YOUR URLS HERE
EOF

# pipe the URLs to xargs, which will call wget with output silenced.
# pipe the downloaded content to sort -u and store it in a temporary file.
echo "$files" | xargs -I{} wget -q -O - {} | sort -u > tmp.txt

# filter hosts to only include valid hostnames.
sed 's/#.*//' tmp.txt | grep -Eo '([a-zA-Z0-9-]+)(\.[a-zA-Z0-9-]+){1,}' | grep -Ev '[0-9]{1,3}(\.[0-9]{1,3}){3}' > tmp2.txt

# format the file to align urls under host-name and put 0.0.0.0 under ip-address.
sort -u tmp2.txt | sed 's/^/                /' > tmp.txt
printf "<ip-address>    <host-name>\n0.0.0.0\n" | cat - tmp.txt > hosts

# remove temporary files.
rm tmp*
There are only two things I just can't figure out. I can't for the life of me fix one of them: some of these files were written on MS-DOS and other machines and thus have annoying encodings at the end (^W). I looked online, but none of the solutions will remove these symbols. For example:

Code:
's/^W//g'
Now, I know that to sed 's/^W//g' means remove all Ws at the start of lines, but I can't find how to remove this symbol. From what I can find, it appears to be a carriage return from old MS-DOS encoding.

The other is that I need to move all the hostnames under column two (some files have only hostnames and no IP, so when I replace all of column one with "0.0.0.0" it removes a lot of hostnames). I can't seem to figure this out. Not all of them start with "www" or "http" or any specific string, so I can't just find all lines with X and move them to Y. How would I go about this?

Last edited by anon033; 06-30-2019 at 04:05 PM.
 
Old 06-30-2019, 04:07 PM   #12
individual
Member
 
Registered: Jul 2018
Posts: 315
Blog Entries: 1

Rep: Reputation: 233
Quote:
Originally Posted by FOSSilized_Daemon View Post
now I know that to sed 's/^W/' means remove all lines starting with W, but I can't find how to remove this symbol.
When talking about regex, yes, but that's not what you're trying to do here. ^W (or ^M as I'm looking at it in vim) signifies carriage return/line feed (CRLF), which is the line terminator Windows uses. Tr (man tr for more info) can help you, though.
Code:
tr -d '\r'
That means "delete all carriage return characters."
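For example, run against a CRLF-terminated line:

```shell
#!/bin/sh
# a line with a Windows CRLF ending; tr -d '\r' strips the carriage return,
# leaving a plain Unix LF-terminated line
printf 'ads.example.com\r\n' | tr -d '\r' > clean.txt
```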

EDIT: For your second problem, that's what sed 's/^/                /' in my script does. It moves all lines 16 spaces to the right. Of course, you'll need to do that before placing the header in the file.

EDIT2: I misread your question. You can insert 0.0.0.0 in the same sed statement, but modify the number of spaces.
Code:
sort -u tmp2.txt | sed 's/^/0.0.0.0         /' > tmp.txt

Last edited by individual; 06-30-2019 at 04:15 PM.
 
Old 06-30-2019, 04:21 PM   #13
anon033
Member
 
Registered: Mar 2019
Posts: 188

Original Poster
Rep: Reputation: 13
Talking

Quote:
Originally Posted by individual View Post
When talking about regex, yes, but that's not what you're trying to do here. ^W (or ^M as I'm looking at it in vim) signifies carriage return/line feed (CRLF), which is the line terminator Windows uses. Tr (man tr for more info) can help you, though.
Code:
tr -d '\r'
That means "delete all carriage return characters."

EDIT: For your second problem, that's what sed 's/^/                /' in my script does. It moves all lines 16 spaces to the right. Of course, you'll need to do that before placing the header in the file.

EDIT2: I misread your question. You can insert 0.0.0.0 in the same sed statement, but modify the number of spaces.
Code:
sort -u tmp2.txt | sed 's/^/0.0.0.0         /' > tmp.txt
Thank you so much! Seriously this helps so much
 
Old 06-30-2019, 04:44 PM   #14
anon033
Member
 
Registered: Mar 2019
Posts: 188

Original Poster
Rep: Reputation: 13
Quote:
Originally Posted by individual View Post
When talking about regex, yes, but that's not what you're trying to do here. ^W (or ^M as I'm looking at it in vim) signifies carriage return/line feed (CRLF), which is the line terminator Windows uses. Tr (man tr for more info) can help you, though.
Code:
tr -d '\r'
That means "delete all carriage return characters."

EDIT: For your second problem, that's what sed 's/^/                /' in my script does. It moves all lines 16 spaces to the right. Of course, you'll need to do that before placing the header in the file.

EDIT2: I misread your question. You can insert 0.0.0.0 in the same sed statement, but modify the number of spaces.
Code:
sort -u tmp2.txt | sed 's/^/0.0.0.0         /' > tmp.txt
I hate to bother you again, but I am having some odd issues I can't understand. The script is now

Code:
#!/bin/sh

echo "configuring /etc/hosts..."

# download the hosts lists
torsocks wget -q --show-progress -i host-list.txt

# merge all files into a temporary text file
cat * > tmp.txt

# filter hosts to only include valid hostnames
sed 's/#.*//' tmp.txt | grep -Eo '([a-zA-Z0-9-]+)(\.[a-zA-Z0-9-]+){1,}' | grep -Ev '[0-9]{1,3}(\.[0-9]{1,3}){3}' > tmp2.txt

# format the file to align urls under host-name and put 0.0.0.0 under ip-address
sort -u tmp2.txt | sed 's/^/                /' > tmp.txt
cat tmp.txt | awk '$1="0.0.0.0"' >> tmp3.txt

# add formatting header
sed '1 s/^/#<ip-address>    <host-name>/' tmp2.txt > host

echo "/etc/hosts configured"
and this script works as expected up to

Code:
sort -u tmp2.txt | sed 's/^/                /' > tmp.txt
after that things get weird. I am using

Code:
cat tmp.txt | awk '$1="0.0.0.0"' >> tmp3.txt
to put the IP next to all URLs. This places the IP under <ip-address> (which is column one); however, after this is done, I cat tmp3.txt and all the hostnames are gone. I need to have the IP next to all hostnames, and I am unsure how else to do this. I am also very confused about how this removes all the hostnames. I know I am likely missing something obvious, but do you notice anything that would cause this?

EDIT:

Doing

Code:
sort -u tmp2.txt | sed 's/^/0.0.0.0         /' > tmp.txt
doesn't work; it just inserts ip, spaces, ip.

Nevermind, I made a mistake; all good.
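(For anyone hitting the same thing: awk splits fields on runs of whitespace, so on an indented line the hostname is $1, and assigning $1="0.0.0.0" overwrites the hostname itself. Prepending the address instead keeps it; a minimal illustration with a made-up hostname:)

```shell
#!/bin/sh
# leading spaces do not create an empty first field: the hostname IS $1,
# so assigning to $1 replaces it and the rebuilt $0 is just "0.0.0.0"
printf '                ads.example.com\n' | awk '$1="0.0.0.0"' > gone.txt

# prepending instead of assigning keeps the hostname in column two
printf 'ads.example.com\n' | awk '{print "0.0.0.0\t" $0}' > kept.txt
```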

Last edited by anon033; 06-30-2019 at 05:08 PM.
 
Old 06-30-2019, 04:48 PM   #15
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,727

Rep: Reputation: 2211
RE: Windows text files and Linux.
Two more options I'm aware of:
1. Do an ascii file transfer in sftp (or ftp). That will change the line-end character to match the destination OS.
(so if transferring from Windows to Linux, the Windows CRLF will be converted to the Linux LF, and vice-versa)
2. Run the file through dos2unix to convert the line-end characters (or unix2dos going the other way)

This is not to say that tr or the other methods already posted aren't good...just tossing out a couple of other options.
 
1 members found this post helpful.
  


