Old 01-27-2012, 02:57 AM   #1
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Rep: Reputation: 15
script help? modifying text files?


I want to do something with a script, and I tried, but it's hard for me. I wonder if you could give me a hand.

I'd like to have a file called "hosts-mods" containing modifications to my hosts file. First, additions, then a section of deletions.

Then, when I download a hosts file from MVPS or hosts-file dot net, I will run my script (probably written in awk; it seemed like a good choice when I started this 6 years ago, so I got an O'Reilly book).

The script will go through the hosts-mods file one line at a time and take the corresponding action on the main hosts file: adding a line, or commenting one out (and optionally moving all the commented-out lines to the end of the file).

So the script will alter the main hosts file so it will have the modifications I want.

Looks easy. Sounds easy. But for me, it is unbelievably hard.

Later, I can add fancy stuff like printing "these lines were added" to the screen and listing them, or "these lines were commented out" and listing those.

Programming manuals introduce many concepts per page to save paper, but that is very confusing and too fast for me (plus I don't know which structures I want in my program/script and which are useless). It would be nice if they said "here is a good way to set up a counter" or "this type of variable is often used for this purpose", but they don't want to limit people (someone might use a piece of code in a new way).

I know Linux is great at searching for and comparing text in files. Can anybody give me a hand getting started? I'm a smart man, but sometimes I have to have some help.


Thanks!
 
Old 01-27-2012, 03:29 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Can you show what's in a hosts-mods file and explain how you want to use it to make changes?
 
Old 01-27-2012, 11:45 AM   #3
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Original Poster
Rep: Reputation: 15
Basic idea of functioning

hosts-mods is a text file. I broke the sites for this post, but you can still see how it is supposed to work:

127.0.0.1 www dot exscn dot com
127.0.0.1 forums dot exscn dot com
127.0.0.1 freezonesurvivors dot to
127.0.0.1 www dot dd-wrt dot org
127.0.0.1 www dot openwrt dot com
# 127.0.0.1 googlesyndication dot com
# 127.0.0.1 pagead dot googlesyndication dot com
# 127.0.0.1 pagead2 dot googlesyndication dot com
# 127.0.0.1 searchportal dot information dot com
# 127.0.0.1 www dot thepiratebay dot org

The top 5 lines are sites I want to block. The bottom 5 (commented-out) lines are sites that popular hosts files often block, but that I want to allow.

My script will take the first line of the modification file and probably load it into a variable, then open the hosts file I got from the internet and see if any line matches it (maybe ignoring the 127.0.0.1 part). If no line matches www dot exscn dot com, for example, I want to add it to the hosts file. If it is already in the hosts file, do nothing and move on to the second line of the additions section of hosts-mods.

(of course, I want to search for both exscn dot com and www dot exscn dot com.)

For the last 5 lines, I also want to search for the domains. If any are found in the hosts file I got from the internet, delete the line and move on to the next line of hosts-mods. If it is not found, do nothing and move on to the next line of hosts-mods until all lines are processed.

(of course, I want to search for both piratebay dot org and www dot piratebay dot org.)
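
Roughly, the shape I have in mind is something like this pseudo-script (untested, the file names are just placeholders, and I'm sure the details are wrong; it is only meant to show the flow):
Code:
#!/bin/bash
# rough idea only: untested, file names are placeholders

newhosts=hosts.downloaded     # the hosts file I got from MVPS etc.
mods=hosts-mods               # my additions/deletions file

# additions: the uncommented lines of hosts-mods
grep -v '^#' "$mods" | while read ip host ; do
    [ -n "$host" ] || continue
    # add the host only if no existing line already ends with it
    grep -q " $host\$" "$newhosts" || echo "127.0.0.1 $host" >> "$newhosts"
done

# deletions: the commented-out lines of hosts-mods
grep '^#' "$mods" | sed 's/^# *//' | while read ip host ; do
    [ -n "$host" ] || continue
    # delete any line that ends with this host name
    # (dots in $host act as regex wildcards here: sloppy, but it shows the idea)
    sed -i "/ $host\$/d" "$newhosts"
done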

Last edited by saturndude; 01-27-2012 at 11:50 AM. Reason: had to break URLs of "bad" sites
 
Old 01-27-2012, 01:02 PM   #4
King_DuckZ
Member
 
Registered: Nov 2009
Location: Rome, IT
Distribution: Sabayon
Posts: 61

Rep: Reputation: 2
I'd do that in Ruby (or Python). Conceptually, you load the whole content of your original file into memory, as a list of lines. Let's call it listOriginal. Then you load the other files into two other lists, listAdditions and listDeletions. Now what you probably need is:

a: the intersection of listOriginal and listDeletions
b: the union of listOriginal and listAdditions
c: the difference of b and listDeletions
d: a new list made from a, but with comments before each line
e: the union of c and d

then you can dump e to disk and you're done.
In Ruby this is quite short and not hard at all, but if you don't know about scripting it is not the easiest exercise to begin with. I can try to write something for you if you need, although I won't have much time to test it, or maybe you can try a combination of the more classical sed, grep & co (rough sketch below).
Learning Ruby, however, might come in very handy, so if you don't need the script urgently I'd make that investment in myself and start scripting right away.
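
Just to illustrate those five steps with the classical tools, something like this might do (untested; original, additions and deletions are placeholder file names, each holding complete hosts lines, one per line):
Code:
#!/bin/bash
# whole-line set operations with sort and comm (comm needs sorted input)

sort -u original  > original.sorted
sort -u additions > additions.sorted
sort -u deletions > deletions.sorted

# a: lines present in both the original and the deletion list
comm -12 original.sorted deletions.sorted > a

# b: union of the original and the additions
sort -u original.sorted additions.sorted > b

# c: b minus the deletion list
comm -23 b deletions.sorted > c

# d: the deleted lines kept as comments; e: the final result
sed 's/^/#/' a > d
cat c d > e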

Edit: there's a variety of alternative URLs to access piratebay. This is probably not the best place to look for them (and I'd be very sad if you used it to download my games), but I'm sure you can find mirrors or other links around that are not banned in your country.

Last edited by King_DuckZ; 01-27-2012 at 01:05 PM.
 
Old 01-27-2012, 01:35 PM   #5
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
This'll do it.
Code:
#!/bin/bash
{ 
sed -f /dev/fd/10 10<<-EOD
/^[^#][^ ]* www dot example dot com$/    s/^/#/   
$ a\
127.0.0.1 insertion dot this dot that
EOD
} | sort | uniq
 
Old 01-27-2012, 07:09 PM   #6
King_DuckZ
Member
 
Registered: Nov 2009
Location: Rome, IT
Distribution: Sabayon
Posts: 61

Rep: Reputation: 2
Quote:
Originally Posted by jthill
This'll do it.
Code:
#!/bin/bash
{ 
sed -f /dev/fd/10 10<<-EOD
/^[^#][^ ]* www dot example dot com$/    s/^/#/   
$ a\
127.0.0.1 insertion dot this dot that
EOD
} | sort | uniq
For big lists that is going to mean quite a lot of copy & paste.
 
Old 01-27-2012, 07:25 PM   #7
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
? It's one line per insertion or deletion; you can't get shorter than that. For "example" read "deletion", and duplicate those lines to taste.
 
Old 01-28-2012, 01:49 AM   #8
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Original Poster
Rep: Reputation: 15
Ruby? Python? I can't. I don't have a lot of time, and I'm not really smart. I mean, I'm smart, but when it comes to programming, I am not smart AT ALL.

The problem is that every concept of awk is thrown at me in just a few pages and it's hard to ignore anything, because I don't know what parts will turn out to be irrelevant.

Look at the examples of the Linux "find" command in O'Reilly's "Linux in a Nutshell". That's what I need -- LOTS of examples to help me get started. But O'Reilly's "Effective awk Programming" doesn't have many.

(I'm using ADSL from Cincinnati Bell, a rare mid-size telco that was never swallowed up by the big boys. They have no problems with Pirate Bay. They substitute their own "domain not found" page when I should see the one from OpenDNS; nothing I can do about that. That's my only complaint.)

I'm trying to combine case insensitivity, searching for a pattern, four possible outcomes, opening a couple of files, and appending stuff to a final hosts file (instead of the default "print pattern matches to screen"). Each of those is hard enough by itself, and I want to combine them all. Very hard.

I thought of using IF/THEN/ELSE statements with awk. One line for each addition, and one for each deletion. It would go something like this (forgive the line numbers):

10 define any variables, if any (the "BEGIN" part of an awk script)

20 If a line in /etc/hosts contains www dot exscn dot com, delete it. Make it field #2, and make field #1 the 127.0.0.1 part and add it to the end of the hosts file so the site is blocked. ELSE make it field #2, and make field #1 the 127.0.0.1 part and add it to the end of the hosts file.

30 repeat this for other sites, one line for each site.

40 If googleadsyndication dot com is found, delete that line. Make googleadsyndication dot com field #2, make field #1 "#127.0.0.1", append to hosts file. ELSE make googleadsyndication dot com field #2, make field #1 "#127.0.0.1", append to hosts file.

50 Repeat line 40 for the next site I want commented out of the final hosts file.

Why am I writing both lines that I added and lines that I commented out to the bottom of the hosts file? Personal preference, that's all. I think I'll skip this at first.

(My efforts have more lines so I can break down the steps, it just helps me think better.)


Your sed example is pretty neat, and it helps me think. Will I need a line for each "www dot example dot com" that I want to delete? Will I need one line of "www dot insertion dot com" for each line I want to add? Looks good, except for the /dev/fd/10 part. Is that a storage space for variables, like extra CPU registers? I don't normally do anything in /dev (or /proc). Or does that line ignore the first 10 lines of /etc/hosts where the "localhost" entry is?


Any help still greatly appreciated!

Last edited by saturndude; 01-28-2012 at 02:15 AM. Reason: "wrote a book", had to cut it down
 
Old 01-28-2012, 12:32 PM   #9
King_DuckZ
Member
 
Registered: Nov 2009
Location: Rome, IT
Distribution: Sabayon
Posts: 61

Rep: Reputation: 2
Quote:
Originally Posted by saturndude
Ruby? Python? I can't. I don't have a lot of time, and I'm not really smart. I mean, I'm smart, but when it comes to programming, I am not smart AT ALL.
You don't really need to dig into the intricacies. One of Ruby's strong points is its ease of use. I'm typing off the top of my head, but what you need can be done with something like this:
Code:
#!/usr/bin/env ruby

require 'set'

YourDownloadedFile = "/home/hello/hosts-mods"
HostsFile = "/etc/hosts"

#This will contain the whole hosts file, as an array of lines
aHosts = File.readlines(HostsFile)
#Clean trailing spaces, line endings etc and drop empty lines
aHosts = aHosts.collect {|s| s.chomp.strip}.select {|s| s.length > 0}

#Here we put the lines you want to add...
aAddLines = File.readlines(YourDownloadedFile)
#As before, but only keep lines starting with "+", and strip that marker off
aAddLines = aAddLines.select {|s| s[0] == "+"}.collect {|s| s[1..-1].chomp.strip}.select {|s| s.length > 0}

#...and here we put the lines you want to blacklist (lines starting with "-")
aRemoveLines = File.readlines(YourDownloadedFile)
aRemoveLines = aRemoveLines.select {|s| s[0] == "-"}.collect {|s| s[1..-1].chomp.strip}.select {|s| s.length > 0}

#We have three Arrays so far, but Sets are better suited to our task, so let's convert the data.
#Note that a Set can't contain duplicates, so they will be stripped away here if any.
sHosts = Set.new(aHosts)
sAddLines = Set.new(aAddLines)
sRemoveLines = Set.new(aRemoveLines)

#Do as said in my previous post
a = sHosts.intersection sRemoveLines
b = sHosts.union sAddLines
c = b.difference sRemoveLines
d = Set.new(a).collect! {|s| "#" + s}
e = c.union d

#Print the result on stdout
e.each {|s| puts s}
That's all. I didn't test it, but maybe someone else here can give you a hand. You can redirect stdout to /etc/hosts to overwrite it, or do whatever you need. Note that merging in the last Set wasn't really necessary, we could've printed c and then d, right away. Storing intermediate values in variables is not even necessary, but I thought it was clearer like this. I hope it helps you (if it works!)
Also, I'm assuming that additions are regular lines beginning with a +, and removals begin with -. So something like:

+1.2.3.4 www hello it
+5.6.7.8 www ciao it
-9.1.2.3 www salut fr

would make sure that hello and ciao are in your merged result, and would ensure that no salut appears anywhere in it.
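
If it does work, note that redirecting the output straight onto /etc/hosts would truncate the file before Ruby gets a chance to read it, so write to a new file first and then move it into place (the script name here is only a placeholder):
Code:
ruby merge-hosts.rb > /etc/hosts.new && mv /etc/hosts.new /etc/hosts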
 
Old 01-28-2012, 05:01 PM   #10
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Original Poster
Rep: Reputation: 15
Well then.....

Should I get a book on Ruby?

I thought of just comparing, adding, and deleting whole lines in awk (simplifying things by not breaking lines into fields), but then I saw this page:

http://www.notesbit.com/index.php/sc...nix-help-page/

which contained these pieces of code (and others) to use in awk:

"$3 !~ /regexp/"      # regexp does not match in the 3rd field
'{print NR ": " $0}'  # prefix a line number, colon, space to each line


Ruby sounds like a darn good idea as well. There is a limit to what I can learn from man pages (I'm "old school" and like a paper book), so I'll have to grab a book on Ruby and try it out. Anything is better than manual pattern searching and replacing with joe, especially with 50 or more modifications each time I update /etc/hosts.

I'll let you know how it goes.....

Last edited by saturndude; 01-28-2012 at 05:04 PM. Reason: insert number signs to comment what the code samples do
 
Old 01-28-2012, 05:39 PM   #11
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
Oh boy, you'd think I'd know better than to type it straight into the reply box. Apologies for the sloppy work.

Here's the script with the quoting right so it actually works:
Code:
#!/bin/bash
{ 
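# -f /dev/fd/10 makes sed read its editing commands from file descriptor 10,
# which the 10<<-'EOD' here-document below supplies; the hosts data itself
# arrives on sed's standard input.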
sed -f /dev/fd/10 10<<-'EOD'

# don't blackhole these hosts:

/.*127.0.0.1 googlesyndication.com$/            s/^#* */#/
/.*127.0.0.1 pagead.googlesyndication.com$/     s/^#* */#/
/.*127.0.0.1 pagead2.googlesyndication.com$/    s/^#* */#/
/.*127.0.0.1 searchportal.information.com$/     s/^#* */#/
/.*127.0.0.1 www.thepiratebay.org$/             s/^#* */#/

# do blackhole these hosts:

$ a\
127.0.0.1 www.exscn.com\
127.0.0.1 forums.exscn.com\
127.0.0.1 freezonesurvivors.to\
127.0.0.1 www.dd-wrt.org\
127.0.0.1 www.openwrt.com
EOD
} | sort | uniq
Put that in file edithosts, run it like this
Code:
# cat /path/to/newly/downloaded/blackhole/file /etc/hosts | edithosts > /etc/hosts.new
# ln -f /etc/hosts /etc/hosts.old
# mv -f /etc/hosts.new /etc/hosts
Note the backslash after each inserted host except the last.
 
Old 01-28-2012, 08:34 PM   #12
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
How about using a somewhat more complex Bash/awk script, with:
  • /etc/blackhole.always
    List of hostnames always blackholed
  • /etc/blackhole.never
    List of hostnames (domains if starting with a .) never blackholed
  • /etc/blackhole.urls
    List of URLs containing blackhole host lists to use

Oh yes, this one is grossly overengineered, but I can't help it; I like it this way.

This variant uses a fixed blackhole IP address, 127.0.0.254. You can pick any 127.x.y.z address you want; they all point to the loopback. Reserving one loopback address for blackholing means you can autodetect the blackholed entries in the hosts file, and you can add a rule to your iptables rules to reject all accesses to that address.
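
As an example of such a rule (assuming the default 127.0.0.254; adapt it to your own firewall setup):
Code:
iptables -A OUTPUT -d 127.0.0.254 -j REJECT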

In /etc/hosts, the hostnames mapped to the blackhole IP address are replaced with the new blackhole list. If you decide to change the blackhole IP address, you'll need to edit the hosts file to match before you run the script, or you'll get duplicate addresses for the blackholed hosts. (The script ignores the IP address used in the blackhole host lists obtained from the web, always using this one instead.)

To maintain your own blackholes, add the host names (only!) to /etc/blackhole.always.

If there are some hosts you don't want to be blackholed, add them to /etc/blackhole.never.

If there are entire domains you need to keep non-blackholed, add the domain with a dot prefix (say, .linuxquestions.org) to /etc/blackhole.never. All host names explicitly listed in /etc/blackhole.always will still be blackholed, even if they might match a hostname or domain in /etc/blackhole.never; this file filters only the host lists you obtain from the web.

To combine blackhole host lists from the web, add the URLs to /etc/blackhole.urls. The script uses wget to get them.

The script takes no command line arguments, as it is designed to be run from e.g. cron. Just stick it into /etc/cron.weekly/ or /etc/cron.monthly/.

You can create /etc/blackhole.conf to set different paths to the ones I listed above.
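
For example, a blackhole.conf might look like this (the values are only illustrations; the variable names are the ones the script checks):
Code:
# /etc/blackhole.conf -- sourced by the script, so plain shell assignments
BLACKHOLE_ALWAYS="/usr/local/etc/blackhole.always"
BLACKHOLE_NEVER="/usr/local/etc/blackhole.never"
BLACKHOLE_LISTS="/usr/local/etc/blackhole.urls"
BLACKHOLE_ADDRESS="127.0.0.254"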

The script:
Code:
#!/bin/bash

# Load file names from /etc/blackhole.conf if any.
[ -r /etc/blackhole.conf ] && . /etc/blackhole.conf

# Set defaults.
[ -n "$BLACKHOLE_ALWAYS"  ] || BLACKHOLE_ALWAYS="/etc/blackhole.always"
[ -n "$BLACKHOLE_NEVER"   ] || BLACKHOLE_NEVER="/etc/blackhole.never"
[ -n "$BLACKHOLE_LISTS"   ] || BLACKHOLE_LISTS="/etc/blackhole.urls"
[ -n "$BLACKHOLE_ADDRESS" ] || BLACKHOLE_ADDRESS="127.0.0.254"
[ -n "$HOSTS_FILE"        ] || HOSTS_FILE="/etc/hosts"

# Make sure we use POSIX locale, so we handle all input charsets as-is.
export LANG=C LC_ALL=C

# Create a temporary working directory.
WORK="$(mktemp -d)" || exit $?
trap "rm -rf '$WORK'" EXIT

# Copy non-blackhole items from the current hosts file.
awk -v "addr=$BLACKHOLE_ADDRESS" '
    BEGIN {
        RS="[\t\v\f ]*(\r\n|\n\r|[\r\n])"
        FS="[\t\v\f ]+"
    }

    ($1 != addr) { printf("%s\n", $0) }
' "$HOSTS_FILE" > "$WORK/header" || exit $?

(
    # List the host names from the blackhole lists
    if [ -r "$BLACKHOLE_LISTS" ]; then
        while read URL ; do

            # Skip comment lines.
            [ "$URL" = "${URL##[#;]}" ] || continue

            echo -n "Downloading $URL: " >&2

            # Get the list, but filter to keep just the host names.
            wget -q -O - "$URL" | awk '
                (NF >= 2 && $1 !~ /[^.:0-9A-Fa-f]/) {
                    if (index($0, "#") > 0)
                        sub(/#.*$/, "")
                    for (i = 2; i <= NF; i++)
                        if (length($i) > 0)
                            printf("%s\n", $i)
                }
            ' && echo "Success" >&2 || echo "Error [$?]" >&2

        done < "$BLACKHOLE_LISTS"
    fi

) | (
    # Omit those that should never be blackholed.
    # Start domains with a dot.
    if [ -r "$BLACKHOLE_NEVER" ]; then
        awk -v "file=$BLACKHOLE_NEVER" '
            BEGIN {
                split("", host)
                split("", domain)
                while ((getline < file) > 0) {
                    if (index($0, "#") > 0)
                        sub(/#.*$/, "")
                    for (i = 1; i <= NF; i++)
                        if (length($i) > 1 && substr($i, 1, 1) == ".")
                            domain[substr($i, 2)] = 1   # key without the leading dot, so the suffix test below can match
                        else if (length($i) > 0)
                            host[$i] = 1
                }
                domains = length(domain)
            }

            ($0 in host) { next }
            (domains > 0) {
                temp = $0
                while ((i = index(temp, ".")) > 0) {
                    temp = substr(temp, i + 1)
                    if (temp in domain)
                        next
                }
            }
            { print }
        '
    else
        cat
    fi

    # Add the always-blackholed host names
    if [ -r "$BLACKHOLE_ALWAYS" ]; then
        awk '   (NF >= 2 && $1 !~ /[^.:0-9A-Fa-f]/) {
                    if (index($0, "#") > 0)
                        sub(/#.*$/, "")
                    for (i = 2; i <= NF; i++)
                        if (length($i) > 0)
                            printf("%s\n", $i)
                }' "$BLACKHOLE_ALWAYS"
    fi

) | awk -v "to=$BLACKHOLE_ADDRESS" '
    ($0 in listed) { next }
    {
        listed[$0] = 1
        printf("%s %s\n", to, $0)
    }
' > "$WORK/footer" || exit $?

# Combine into a new hosts file..
HOSTS_TEMP="$HOSTS_FILE.$(hostname -s).$$"
if ! cat "$WORK/header" "$WORK/footer" > "$HOSTS_TEMP" ; then
    rm -f "$HOSTS_TEMP"
    rm -rf "$WORK"
    exit 1
fi

# .. create a hardlink (so the file will exist at all times) ..
if ! ln -f "$HOSTS_FILE" "$HOSTS_FILE.old" ; then
    rm -f "$HOSTS_TEMP"
    rm -rf "$WORK"
    exit 1
fi

# .. and try to replace the hosts file.
if ! mv -f "$HOSTS_TEMP" "$HOSTS_FILE" ; then
    rm -f "$HOSTS_TEMP"
    rm -rf "$WORK"
    exit 1
fi

# Success.
rm -rf "$WORK"
trap - EXIT

echo "$HOSTS_FILE updated successfully."
exit 0
I recommend you test the script (before moving it to /etc/cron.weekly/ or /etc/cron.monthly/) by copying your hosts file to the current working directory, then running the script against the copy and looking at the results:
Code:
cp /etc/hosts hosts
env HOSTS_FILE=hosts ./this-script
After that, you might wish to remove the final echo from the script, or cron will try to send you an e-mail every time the host list has been updated.
 
  

