Old 01-27-2012, 02:57 AM   #1
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Rep: Reputation: 15
script help? modifying text files?


I want to do something with a script, and I tried, but it's hard for me. I wonder if you could give me a hand.

I'd like to have a file called "hosts-mods" containing modifications to my hosts file. First, additions, then a section of deletions.

Then, when I download a hosts file from MVPS or hosts-file dot net, I will run my script (probably written in awk; it seemed like a good choice when I started this 6 years ago, so I got an O'Reilly book).

The script will go through the hosts-mods file one line at a time and take the corresponding action on the main hosts file: adding a line, or commenting one out (and optionally moving all the commented-out lines to the end of the file).

So the script will alter the main hosts file so it will have the modifications I want.

Looks easy. Sounds easy. But for me, it is unbelievably hard.

Later, I can add fancy stuff like printing "these lines were added" to the screen and listing them, or "these lines were commented out" and listing those.

Programming manuals introduce many concepts per page to save paper, but that is very confusing and too fast for me (plus I don't know which structures I want in my program/script and which are useless). It would be nice if they said "here is a good way to set up a counter" or "this type of variable is often used for this purpose", but they don't want to limit people (someone might use a piece of code in a new way).

I know Linux is great at searching for and comparing text in files. Can anybody give me a hand getting started? I'm a smart man, but sometimes I have to have some help.


Thanks!
 
Old 01-27-2012, 03:29 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Can you show what's in a hosts-mods file and explain how you want to use it to make changes?
 
Old 01-27-2012, 11:45 AM   #3
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Original Poster
Rep: Reputation: 15
Basic idea of functioning

hosts-mods is a text file. I broke the sites for this post, but you can still see how it is supposed to work:

127.0.0.1 www dot exscn dot com
127.0.0.1 forums dot exscn dot com
127.0.0.1 freezonesurvivors dot to
127.0.0.1 www dot dd-wrt dot org
127.0.0.1 www dot openwrt dot com
# 127.0.0.1 googlesyndication dot com
# 127.0.0.1 pagead dot googlesyndication dot com
# 127.0.0.1 pagead2 dot googlesyndication dot com
# 127.0.0.1 searchportal dot information dot com
# 127.0.0.1 www dot thepiratebay dot org

The top 5 lines are sites I want to block. The bottom 5 (commented-out) lines are sites that popular hosts files often block, but that I want to allow.

My script will take the first line of the modification file and probably load it into a variable, then open the hosts file I got from the internet and see if any line matches it (maybe ignoring the 127.0.0.1 part). If no line matches www dot exscn dot com, for example, I want to add it to the hosts file. If it is already in the hosts file, do nothing and move on to the second line of the additions section of hosts-mods.

(of course, I want to search for both exscn dot com and www dot exscn dot com.)

For the last 5 lines, I also want to search for the domains. If any are found in the hosts file I got from the internet, delete the line and move on to the next line of hosts-mods. If it is not found, do nothing and move on to the next line of hosts-mods until all lines are processed.

(of course, I want to search for both piratebay dot org and www dot piratebay dot org.)
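
Roughly, the shape I have in mind is something like this pseudo-script (untested, the file names are just placeholders, and I'm sure the details are wrong; it is only meant to show the flow):
Code:
#!/bin/bash
# rough idea only: untested, file names are placeholders

newhosts=hosts.downloaded     # the hosts file I got from MVPS etc.
mods=hosts-mods               # my additions/deletions file

# additions: the uncommented lines of hosts-mods
grep -v '^#' "$mods" | while read ip host ; do
    [ -n "$host" ] || continue
    # add the host only if no existing line already ends with it
    grep -q " $host\$" "$newhosts" || echo "127.0.0.1 $host" >> "$newhosts"
done

# deletions: the commented-out lines of hosts-mods
grep '^#' "$mods" | sed 's/^# *//' | while read ip host ; do
    [ -n "$host" ] || continue
    # delete any line that ends with this host name
    # (dots in $host act as regex wildcards here: sloppy, but it shows the idea)
    sed -i "/ $host\$/d" "$newhosts"
done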

Last edited by saturndude; 01-27-2012 at 11:50 AM. Reason: had to break URLs of "bad" sites
 
Old 01-27-2012, 01:02 PM   #4
King_DuckZ
Member
 
Registered: Nov 2009
Location: Rome, IT
Distribution: Sabayon
Posts: 61

Rep: Reputation: 2
I'd do that in Ruby (or Python). Conceptually, you load the whole content of your original file into memory, as a list of lines. Let's call it listOriginal. Then you load the other files into two other lists, listAdditions and listDeletions. Now what you probably need is:

a: the intersection of listOriginal and listDeletions
b: the union of listOriginal and listAdditions
c: the difference of b and listDeletions
d: a new list made from a, but with comments before each line
e: the union of c and d

then you can dump e to disk and you're done.
In Ruby this is quite short and not hard at all, but if you don't know about scripting it is not the easiest exercise to begin with. I can try to write something for you if you need, although I won't have much time to test it, or maybe you can try a combination of the more classical sed, grep & co (rough sketch below).
Learning Ruby, however, might come in very handy, so if you don't need the script urgently I'd make that investment in myself and start scripting right away.
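
Just to illustrate those five steps with the classical tools, something like this might do (untested; original, additions and deletions are placeholder file names, each holding complete hosts lines, one per line):
Code:
#!/bin/bash
# whole-line set operations with sort and comm (comm needs sorted input)

sort -u original  > original.sorted
sort -u additions > additions.sorted
sort -u deletions > deletions.sorted

# a: lines present in both the original and the deletion list
comm -12 original.sorted deletions.sorted > a

# b: union of the original and the additions
sort -u original.sorted additions.sorted > b

# c: b minus the deletion list
comm -23 b deletions.sorted > c

# d: the deleted lines kept as comments; e: the final result
sed 's/^/#/' a > d
cat c d > e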

Edit: there's a variety of alternative URLs to access piratebay. This is probably not the best place to look for them (and I'd be very sad if you used it to download my games), but I'm sure you can find mirrors or other links around that are not banned in your country.

Last edited by King_DuckZ; 01-27-2012 at 01:05 PM.
 
Old 01-27-2012, 01:35 PM   #5
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
This'll do it.
Code:
#!/bin/bash
{ 
sed -f /dev/fd/10 10<<-EOD
/^[^#][^ ]* www dot example dot com$/    s/^/#/   
$ a\
127.0.0.1 insertion dot this dot that
EOD
} | sort | uniq
 
Old 01-27-2012, 07:09 PM   #6
King_DuckZ
Member
 
Registered: Nov 2009
Location: Rome, IT
Distribution: Sabayon
Posts: 61

Rep: Reputation: 2
Quote:
Originally Posted by jthill
This'll do it.
Code:
#!/bin/bash
{ 
sed -f /dev/fd/10 10<<-EOD
/^[^#][^ ]* www dot example dot com$/    s/^/#/   
$ a\
127.0.0.1 insertion dot this dot that
EOD
} | sort | uniq
For big lists that is going to mean quite a lot of copy & paste.
 
Old 01-27-2012, 07:25 PM   #7
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
? It's one line per insertion or deletion; you can't get shorter than that. For "example" read "deletion", and duplicate those lines to taste.
 
Old 01-28-2012, 01:49 AM   #8
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Original Poster
Rep: Reputation: 15
Ruby? Python? I can't. I don't have a lot of time, and I'm not really smart. I mean, I'm smart, but when it comes to programming, I am not smart AT ALL.

The problem is that every concept of awk is thrown at me in just a few pages and it's hard to ignore anything, because I don't know what parts will turn out to be irrelevant.

Look at the examples of the Linux "find" command in O'Reilly's "Linux in a Nutshell". That's what I need -- LOTS of examples to help me get started. But O'Reilly's "Effective awk Programming" doesn't have many.

(I'm using ADSL from Cincinnati Bell, a rare mid-size telco that was never swallowed up by the big boys. They have no problems with Pirate Bay. They substitute their own "domain not found" page when I should see the one from OpenDNS; nothing I can do about that. That's my only complaint.)

I'm trying to combine case insensitivity, searching for a pattern, four possible outcomes, opening a couple of files, and appending stuff to a final hosts file (instead of the default "print pattern matches to screen"). Each of those is hard enough by itself, and I want to combine them all. Very hard.

I thought of using IF/THEN/ELSE statements with awk. One line for each addition, and one for each deletion. It would go something like this (forgive the line numbers):

10 define any variables, if any (the "BEGIN" part of an awk script)

20 If a line in /etc/hosts contains www dot exscn dot com, delete it. Make it field #2, and make field #1 the 127.0.0.1 part and add it to the end of the hosts file so the site is blocked. ELSE make it field #2, and make field #1 the 127.0.0.1 part and add it to the end of the hosts file.

30 repeat this for other sites, one line for each site.

40 If googleadsyndication dot com is found, delete that line. Make googleadsyndication dot com field #2, make field #1 "#127.0.0.1", append to hosts file. ELSE make googleadsyndication dot com field #2, make field #1 "#127.0.0.1", append to hosts file.

50 Repeat line 40 for the next site I want commented out of the final hosts file.

Why am I writing both lines that I added and lines that I commented out to the bottom of the hosts file? Personal preference, that's all. I think I'll skip this at first.

(My efforts have more lines so I can break down the steps, it just helps me think better.)


Your sed example is pretty neat, and it helps me think. Will I need a line for each "www dot example dot com" that I want to delete? Will I need one line of "www dot insertion dot com" for each line I want to add? Looks good, except for the /dev/fd/10 part. Is that a storage space for variables, like extra CPU registers? I don't normally do anything in /dev (or /proc). Or does that line ignore the first 10 lines of /etc/hosts where the "localhost" entry is?


Any help still greatly appreciated!

Last edited by saturndude; 01-28-2012 at 02:15 AM. Reason: "wrote a book", had to cut it down
 
Old 01-28-2012, 12:32 PM   #9
King_DuckZ
Member
 
Registered: Nov 2009
Location: Rome, IT
Distribution: Sabayon
Posts: 61

Rep: Reputation: 2
Quote:
Originally Posted by saturndude
Ruby? Python? I can't. I don't have a lot of time, and I'm not really smart. I mean, I'm smart, but when it comes to programming, I am not smart AT ALL.
You don't really need to dig into the intricacies. One of Ruby's strong points is its ease of use. I'm typing off the top of my head, but what you need can be done with something like this:
Code:
#!/usr/bin/env ruby

require 'set'

YourDownloadedFile = "/home/hello/hosts-mods"
HostsFile = "/etc/hosts"

#This will contain the whole hosts file, as an array of lines
aHosts = File.readlines(HostsFile)
#Clean trailing spaces, line endings etc and drop empty lines
aHosts = aHosts.collect {|s| s.chomp.strip}.select {|s| s.length > 0}

#Here we put the lines you want to add...
aAddLines = File.readlines(YourDownloadedFile)
#As before, but only keep lines starting with "+", and strip that marker off
aAddLines = aAddLines.select {|s| s[0] == "+"}.collect {|s| s[1..-1].chomp.strip}.select {|s| s.length > 0}

#...and here we put the lines you want to blacklist (lines starting with "-")
aRemoveLines = File.readlines(YourDownloadedFile)
aRemoveLines = aRemoveLines.select {|s| s[0] == "-"}.collect {|s| s[1..-1].chomp.strip}.select {|s| s.length > 0}

#We have three Arrays so far, but Sets are better suited to our task, so let's convert the data.
#Note that a Set can't contain duplicates, so they will be stripped away here if any.
sHosts = Set.new(aHosts)
sAddLines = Set.new(aAddLines)
sRemoveLines = Set.new(aRemoveLines)

#Do as said in my previous post
a = sHosts.intersection sRemoveLines
b = sHosts.union sAddLines
c = b.difference sRemoveLines
d = Set.new(a).collect! {|s| "#" + s}
e = c.union d

#Print the result on stdout
e.each {|s| puts s}
That's all. I didn't test it, but maybe someone else here can give you a hand. You can redirect stdout to /etc/hosts to overwrite it, or do whatever you need. Note that merging in the last Set wasn't really necessary, we could've printed c and then d, right away. Storing intermediate values in variables is not even necessary, but I thought it was clearer like this. I hope it helps you (if it works!)
Also, I'm assuming that additions are regular lines beginning with a +, and removals begin with -. So something like:

+1.2.3.4 www hello it
+5.6.7.8 www ciao it
-9.1.2.3 www salut fr

would make sure that hello and ciao are in your merged result, and would ensure that no salut appears anywhere in it.
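
If it does work, note that redirecting the output straight onto /etc/hosts would truncate the file before Ruby gets a chance to read it, so write to a new file first and then move it into place (the script name here is only a placeholder):
Code:
ruby merge-hosts.rb > /etc/hosts.new && mv /etc/hosts.new /etc/hosts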
 
Old 01-28-2012, 05:01 PM   #10
saturndude
Member
 
Registered: Mar 2005
Location: across the river from Louisville KY
Distribution: Mandriva 2010.2 (64-bit)
Posts: 57

Original Poster
Rep: Reputation: 15
Well then.....

Should I get a book on Ruby?

I thought of just comparing, adding, and deleting whole lines in awk (simplifying things by not breaking lines into fields), but then I saw this page:

http://www.notesbit.com/index.php/sc...nix-help-page/

which contained these pieces of code (and others) to use in awk:

"$3 !~ /regexp/"      # regexp does not match in the 3rd field
'{print NR ": " $0}'  # prefix a line number, colon, space to each line


Ruby sounds like a darn good idea as well. There is a limit to what I can learn from man pages (I'm "old school" and like a paper book), so I'll have to grab a book on Ruby and try it out. Anything is better than manual pattern searching and replacing with joe, especially with 50 or more modifications each time I update /etc/hosts.

I'll let you know how it goes.....

Last edited by saturndude; 01-28-2012 at 05:04 PM. Reason: insert number signs to comment what the code samples do
 
Old 01-28-2012, 05:39 PM   #11
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
Oh boy, you'd think I'd know better than to type it straight into the reply box. Apologies for the sloppy work.

Here's the script with the quoting right so it actually works:
Code:
#!/bin/bash
{ 
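# -f /dev/fd/10 makes sed read its editing commands from file descriptor 10,
# which the 10<<-'EOD' here-document below supplies; the hosts data itself
# arrives on sed's standard input.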
sed -f /dev/fd/10 10<<-'EOD'

# don't blackhole these hosts:

/.*127.0.0.1 googlesyndication.com$/            s/^#* */#/
/.*127.0.0.1 pagead.googlesyndication.com$/     s/^#* */#/
/.*127.0.0.1 pagead2.googlesyndication.com$/    s/^#* */#/
/.*127.0.0.1 searchportal.information.com$/     s/^#* */#/
/.*127.0.0.1 www.thepiratebay.org$/             s/^#* */#/

# do blackhole these hosts:

$ a\
127.0.0.1 www.exscn.com\
127.0.0.1 forums.exscn.com\
127.0.0.1 freezonesurvivors.to\
127.0.0.1 www.dd-wrt.org\
127.0.0.1 www.openwrt.com
EOD
} | sort | uniq
Put that in file edithosts, run it like this
Code:
# cat /path/to/newly/downloaded/blackhole/file /etc/hosts | edithosts > /etc/hosts.new
# ln -f /etc/hosts /etc/hosts.old
# mv -f /etc/hosts.new /etc/hosts
Note the backslash after each inserted host except the last.
 
Old 01-28-2012, 08:34 PM   #12
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
How about using a somewhat more complex Bash/awk script, with:
  • /etc/blackhole.always
    List of hostnames always blackholed
  • /etc/blackhole.never
    List of hostnames (domains if starting with a .) never blackholed
  • /etc/blackhole.urls
    List of URLs containing blackhole host lists to use

Oh yes, this one is grossly overengineered, but I can't help it; I like it this way.

This variant uses a fixed blackhole IP address, 127.0.0.254. You can pick any 127.x.y.z address you want; they all point to the loopback. Reserving one loopback address for blackholing means you can autodetect the blackholed entries in the hosts file, and you can add a rule to your iptables rules to reject all accesses to that address.
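
As an example of such a rule (assuming the default 127.0.0.254; adapt it to your own firewall setup):
Code:
iptables -A OUTPUT -d 127.0.0.254 -j REJECT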

In /etc/hosts, the hostnames mapped to the blackhole IP address are replaced with the new blackhole list. If you decide to change the blackhole IP address, you'll need to edit the hosts file to match before you run the script, or you'll get duplicate addresses for the blackholed hosts. (The script ignores the IP address used in the blackhole host lists obtained from the web, always using this one instead.)

To maintain your own blackholes, add the host names (only!) to /etc/blackhole.always.

If there are some hosts you don't want to be blackholed, add them to /etc/blackhole.never.

If there are entire domains you need to keep non-blackholed, add the domain with a dot prefix (say, .linuxquestions.org) to /etc/blackhole.never. All host names explicitly listed in /etc/blackhole.always will still be blackholed, even if they might match a hostname or domain in /etc/blackhole.never; this file filters only the host lists you obtain from the web.

To combine blackhole host lists from the web, add the URLs to /etc/blackhole.urls. The script uses wget to get them.

The script takes no command line arguments, as it is designed to be run from e.g. cron. Just stick it into /etc/cron.weekly/ or /etc/cron.monthly/.

You can create /etc/blackhole.conf to set different paths to the ones I listed above.
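
For example, a blackhole.conf might look like this (the values are only illustrations; the variable names are the ones the script checks):
Code:
# /etc/blackhole.conf -- sourced by the script, so plain shell assignments
BLACKHOLE_ALWAYS="/usr/local/etc/blackhole.always"
BLACKHOLE_NEVER="/usr/local/etc/blackhole.never"
BLACKHOLE_LISTS="/usr/local/etc/blackhole.urls"
BLACKHOLE_ADDRESS="127.0.0.254"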

The script:
Code:
#!/bin/bash

# Load file names from /etc/blackhole.conf if any.
[ -r /etc/blackhole.conf ] && . /etc/blackhole.conf

# Set defaults.
[ -n "$BLACKHOLE_ALWAYS"  ] || BLACKHOLE_ALWAYS="/etc/blackhole.always"
[ -n "$BLACKHOLE_NEVER"   ] || BLACKHOLE_NEVER="/etc/blackhole.never"
[ -n "$BLACKHOLE_LISTS"   ] || BLACKHOLE_LISTS="/etc/blackhole.urls"
[ -n "$BLACKHOLE_ADDRESS" ] || BLACKHOLE_ADDRESS="127.0.0.254"
[ -n "$HOSTS_FILE"        ] || HOSTS_FILE="/etc/hosts"

# Make sure we use POSIX locale, so we handle all input charsets as-is.
export LANG=C LC_ALL=C

# Create a temporary working directory.
WORK="$(mktemp -d)" || exit $?
trap "rm -rf '$WORK'" EXIT

# Copy non-blackhole items from the current hosts file.
awk -v "addr=$BLACKHOLE_ADDRESS" '
    BEGIN {
        RS="[\t\v\f ]*(\r\n|\n\r|[\r\n])"
        FS="[\t\v\f ]+"
    }

    ($1 != addr) { printf("%s\n", $0) }
' "$HOSTS_FILE" > "$WORK/header" || exit $?

(
    # List the host names from the blackhole lists
    if [ -r "$BLACKHOLE_LISTS" ]; then
        while read URL ; do

            # Skip comment lines.
            [ "$URL" = "${URL##[#;]}" ] || continue

            echo -n "Downloading $URL: " >&2

            # Get the list, but filter to keep just the host names.
            wget -q -O - "$URL" | awk '
                (NF >= 2 && $1 !~ /[^.:0-9A-Fa-f]/) {
                    if (index($0, "#") > 0)
                        sub(/#.*$/, "")
                    for (i = 2; i <= NF; i++)
                        if (length($i) > 0)
                            printf("%s\n", $i)
                }
            ' && echo "Success" >&2 || echo "Error [$?]" >&2

        done < "$BLACKHOLE_LISTS"
    fi

) | (
    # Omit those that should never be blackholed.
    # Start domains with a dot.
    if [ -r "$BLACKHOLE_NEVER" ]; then
        awk -v "file=$BLACKHOLE_NEVER" '
            BEGIN {
                split("", host)
                split("", domain)
                while ((getline < file) > 0) {
                    if (index($0, "#") > 0)
                        sub(/#.*$/, "")
                    for (i = 1; i <= NF; i++)
                        if (length($i) > 1 && substr($i, 1, 1) == ".")
                            domain[substr($i, 2)] = 1   # key without the leading dot, so the suffix test below can match
                        else if (length($i) > 0)
                            host[$i] = 1
                }
                domains = length(domain)
            }

            ($0 in host) { next }
            (domains > 0) {
                temp = $0
                while ((i = index(temp, ".")) > 0) {
                    temp = substr(temp, i + 1)
                    if (temp in domain)
                        next
                }
            }
            { print }
        '
    else
        cat
    fi

    # Add the always-blackholed host names
    if [ -r "$BLACKHOLE_ALWAYS" ]; then
        awk '   (NF >= 2 && $1 !~ /[^.:0-9A-Fa-f]/) {
                    if (index($0, "#") > 0)
                        sub(/#.*$/, "")
                    for (i = 2; i <= NF; i++)
                        if (length($i) > 0)
                            printf("%s\n", $i)
                }' "$BLACKHOLE_ALWAYS"
    fi

) | awk -v "to=$BLACKHOLE_ADDRESS" '
    ($0 in listed) { next }
    {
        listed[$0] = 1
        printf("%s %s\n", to, $0)
    }
' > "$WORK/footer" || exit $?

# Combine into a new hosts file..
HOSTS_TEMP="$HOSTS_FILE.$(hostname -s).$$"
if ! cat "$WORK/header" "$WORK/footer" > "$HOSTS_TEMP" ; then
    rm -f "$HOSTS_TEMP"
    rm -rf "$WORK"
    exit 1
fi

# .. create a hardlink (so the file will exist at all times) ..
if ! ln -f "$HOSTS_FILE" "$HOSTS_FILE.old" ; then
    rm -f "$HOSTS_TEMP"
    rm -rf "$WORK"
    exit 1
fi

# .. and try to replace the hosts file.
if ! mv -f "$HOSTS_TEMP" "$HOSTS_FILE" ; then
    rm -f "$HOSTS_TEMP"
    rm -rf "$WORK"
    exit 1
fi

# Success.
rm -rf "$WORK"
trap - EXIT

echo "$HOSTS_FILE updated successfully."
exit 0
I recommend you test the script (before moving it to /etc/cron.weekly/ or /etc/cron.monthly/) by copying your hosts file to the current working directory, then running the script against the copy and looking at the results:
Code:
cp /etc/hosts hosts
env HOSTS_FILE=hosts ./this-script
After that, you might wish to remove the final echo from the script, or cron will try to send you an e-mail every time the host list has been updated.
 
  

