LinuxQuestions.org


pkramerruiz 05-06-2012 05:44 AM

IP to DNS converter that can handle a huge number of entries
 
Hi guys!
Last week I began collecting a large number of unwanted IP addresses, and last night I finished.
The result is a 22.9 MB text file on my desktop with exactly 1,631,301 entries, one IP address per line.

I'm desperately looking for a script that:
* Can handle this large number of entries.
* Detects entries that are not IP addresses: if a line contains a letter, or is not four numbers separated by three dots (like x.x.x.x or xx.xx.xx.xx), the whole line should be replaced with the words NO IP.
* Replaces each IP address with its DNS name.
* Produces no error output for IP addresses that have no name, and simply leaves those lines intact.
* If possible, does not put too much load on the server with so many requests. I would much rather put the heavy load on my own 64-bit PC.

Nominal Animal 05-06-2012 02:55 PM

dig is your best bet, but I'd use a different approach.

First, I'd use awk to extract only the valid IPv4 addresses from your list, and convert them to the reversed form used for reverse DNS requests:
Code:

awk '#
    BEGIN {
        RS="[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"
        FS="[.]"
    }

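    # Four dot-separated fields, digits only, each in the range 0..255.
    # The digits-only test rejects lines such as 1a.2.3.4 or 1.2.3.4 junk.
    ($0 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ &&
     NF==4 && $1<=255 && $2<=255 && $3<=255 && $4<=255) {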

        # Loopback address?
        if ($1 == 127) next

        # Private address?
        if ($1 == 10) next
        if ($1 == 172 && $2 >= 16 && $2 <= 31) next
        if ($1 == 192 && $2 == 168) next

        # Link-local address?
        if ($1 == 169 && $2 == 254) next

        # Multicast address?
        if ($1 >= 224 && $1 <= 239) next

        # This seems like a real IP address.
        printf("%d.%d.%d.%d.in-addr.arpa.\n", $4, $3, $2, $1)
    }
' original-file > ipv4.list
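
To get a feel for the reversed format before running the filter on the full file, a cut-down version can be tried on a few sample lines. This is only a sketch: to_arpa is a made-up helper name, the sample addresses come from the reserved documentation ranges, and the loopback/private/link-local/multicast checks are left out for brevity.

```shell
# Hypothetical helper: reverse dotted-quad lines read from standard input
# into in-addr.arpa names, dropping anything that is not four numbers 0..255.
to_arpa() {
    awk -F'[.]' '
        /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ &&
        $1<=255 && $2<=255 && $3<=255 && $4<=255 {
            printf("%d.%d.%d.%d.in-addr.arpa.\n", $4, $3, $2, $1)
        }'
}

# Two valid sample addresses and two lines that should be rejected.
printf '%s\n' 192.0.2.1 not-an-ip 999.1.1.1 203.0.113.77 | to_arpa
```

The two valid addresses should come out as 1.2.0.192.in-addr.arpa. and 77.113.0.203.in-addr.arpa., and the other two lines should be dropped.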

Now you can use dig to go through the IPv4 address list in batch mode; it is basically the most lightweight option. If you want to reduce the load on your normal name servers, you can install dnscache so that the queries go directly to the target name servers instead -- but I would not bother. The command to run is:
Code:

dig +noall +answer -t ptr -f ipv4.list > ipv4.lookup

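Every line in ipv4.list becomes one DNS query, so if the original file might contain the same address more than once, dropping duplicates before the batch run should cut the number of requests. A minimal sketch, assuming a POSIX sort; dedupe_list and ipv4.unique are hypothetical names:

```shell
# Hypothetical helper: write the unique lines of file $1 to file $2.
dedupe_list() {
    sort -u -- "$1" > "$2"
}

# Only run when the list from the filtering step is present.
if [ -f ipv4.list ]; then
    dedupe_list ipv4.list ipv4.unique
    wc -l ipv4.list ipv4.unique    # compare line counts before and after
fi
```

If the counts differ, ipv4.unique can be passed to dig -f in place of ipv4.list. The later steps look names up by address, so they should not care that duplicates were queried only once.
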
After that completes, you can edit the lookup results so they are easier to process:
Code:

awk '#
    BEGIN {
        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"
        FS = "[\t\v\f ]+"
    }

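    # Answer lines look like: NAME TTL CLASS TYPE DATA. Keep PTR records only.
    NF > 3 && $4 == "PTR" {
        # Need at least d.c.b.a.in-addr.arpa; anything shorter is not a reverse name.
        if (split($1, ip, ".") < 6) next
        name = $NF
        sub(/\.$/, "", name)    # drop the trailing dot of the fully qualified name
        printf("%d.%d.%d.%d %s\n", ip[4], ip[3], ip[2], ip[1], name)
    }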
' ipv4.lookup > ipv4.names
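
One thing to be aware of: an address can have more than one PTR record, in which case ipv4.names contains several lines for it, and a lookup table keyed by address (such as name[$1] = $2) keeps only the last name read. To see whether that matters for your data, something like the following could list the affected addresses. It is a sketch; multi_names is a made-up helper name and the sample lines use reserved documentation addresses and names.

```shell
# Hypothetical helper: list addresses that appear more than once in a
# names file (one "ADDRESS NAME" pair per line), with the number of names.
multi_names() {
    awk '{ count[$1]++ }
         END { for (a in count) if (count[a] > 1) print a, count[a] }' "$@" | sort
}

# Sample data: 192.0.2.1 has two names, 192.0.2.2 has one.
printf '%s\n' \
    '192.0.2.1 a.example.com' \
    '192.0.2.1 b.example.com' \
    '192.0.2.2 c.example.com' | multi_names
```

On the sample data only 192.0.2.1 is reported, with a count of 2. On the real data it would be run as multi_names ipv4.names.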

At this point, you have a list of IPv4 addresses and matching hostnames in ipv4.names. Now you can repeat the filtering step you did first, except that this time the name list is used to classify each address:
Code:

awk -v names="ipv4.names" '#
    BEGIN {
        RS="[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"
        FS="[\t\v\f ]+"

        while ((getline < names) > 0)
            if (NF == 2)
                name[$1] = $2
    }

    (NF > 1) {
        printf("%s BAD_INPUT\n", $1)
        next
    }

    (NF == 1) {
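        # Anything that is not exactly four dot-separated runs of digits is not an IP.
        if ($1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
            printf("%s NO_IP\n", $1)
            next
        }
        split($1, ip, ".")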
        if (ip[1] < 0 || ip[1] > 255 || ip[2] < 0 || ip[2] > 255 ||
            ip[3] < 0 || ip[3] > 255 || ip[4] < 0 || ip[4] > 255) {
            printf("%s NO_IP\n", $1)
            next
        }

        if (ip[1] == 127) {
            printf("%s LOOPBACK\n", $1)
            next
        }

        if ((ip[1] == 10) ||
            (ip[1] == 172 && ip[2] >= 16 && ip[2] <= 31) ||
            (ip[1] == 192 && ip[2] == 168)) {
            printf("%s PRIVATE\n", $1)
            next
        }

        if (ip[1] == 169 && ip[2] == 254) {
            printf("%s LINK_LOCAL\n", $1)
            next
        }

        if (ip[1] >= 224 && ip[1] <= 239) {
            printf("%s MULTICAST\n", $1)
            next
        }

        addr = sprintf("%d.%d.%d.%d", ip[1], ip[2], ip[3], ip[4])
        if (addr in name)
            printf("%s KNOWN %s\n", $1, name[addr])
        else
            printf("%s UNKNOWN\n", $1)
    }
' original-file > final-results

In the final-results file, the first column holds the original IP address and the second column its classification; if the second column is KNOWN, the third column holds the name.
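
If the goal is the format asked for in the original question (the name in place of the address, NO IP for invalid lines, everything else left as it was), final-results can be converted with one more pass. A sketch only: to_requested_format and replaced-list are made-up names, and the sample lines use reserved documentation addresses.

```shell
# Hypothetical helper: turn "ADDRESS CLASS [NAME]" lines into the
# one-entry-per-line format requested in the question.
to_requested_format() {
    awk '
        $2 == "KNOWN" { print $3; next }                            # resolved: print the name
        $2 == "NO_IP" || $2 == "BAD_INPUT" { print "NO IP"; next }  # not an address
        { print $1 }                                                # anything else: keep the address
    ' "$@"
}

# Sample lines covering a resolved, an invalid, an unresolved and a private entry.
printf '%s\n' \
    '192.0.2.1 KNOWN host.example.com' \
    'abc NO_IP' \
    '198.51.100.9 UNKNOWN' \
    '10.0.0.1 PRIVATE' | to_requested_format
```

On the real data this would be run as to_requested_format final-results > replaced-list. Addresses without a name, as well as private, loopback, link-local and multicast addresses, are printed unchanged.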

Note: The above scriptlets have not been thoroughly tested, so there might be typos.

