Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux, and any language is fair game.
A newbie requests your help.
I'm working on creating authentic wordlists as part of my linguistic studies. I chanced upon a script the other day that makes wordlists from Twitter:
I saw the same script, and it wasn't for linguistic studies. If I remember correctly, it was to build a wordlist for attempting to crack the passwords of Twitter users with specified interests. As such, I'm pretty sure that kind of content is not looked upon kindly here.
Well, I guess linguists and crackers use the same tools. That makes sense. In all honesty, I'm not trying to crack Twitter accounts. I'm trying to establish a list of authentic words related to a specific field of interest and analyze them. I've done it for Wikipedia, and since I saw it was possible for Twitter, I thought why not give it a try... That's all.
That's a very ugly bit of code, by the way. It could certainly be replaced by a single, and much more efficient, awk command. If we had an example of the input text to work with (and what needs to be extracted from it), we might give it a try.
Not to be inflammatory, but how difficult would it be to run:
# Replace "search" below with your actual query term.
wget -q -O - 'http://search.twitter.com/search.json?q='"search"'&rpp=1000' | awk '
BEGIN {
    # Each JSON "text" field starts a new record; fields split on quotes,
    # so $1 of each record (after the first) is the tweet text itself.
    RS = "\"text\":\""
    FS = "\""
}
(NF > 1 && length($1) > 0) {
    # Split the tweet text into whitespace-separated tokens.
    n = split($1, temp, /[\t\n\v\f\r ]+/)
    for (i = 1; i <= n; i++) {
        w = tolower(temp[i])
        # Strip typical punctuation (\047 is an octal escape for the apostrophe).
        gsub(/[-!?.,_:;$\047()<>+]+/, "", w)
        if (w ~ /^[@#]/) continue          # mentions and hashtags
        if (w ~ /[0-9]/) continue          # words containing digits
        if (w ~ /^https*:/) continue       # http/https URLs
        if (w ~ /^ftps*:/) continue        # ftp/ftps URLs
        if (w ~ /^www\./) continue         # bare www addresses
        if (w ~ /[.\/].*[.\/]/) continue   # paths and domain-like tokens
        word[w]++
    }
}
END {
    for (w in word)
        printf("%s %d\n", w, word[w])
}' | sort
Each line of output contains one word, followed by its frequency (count) as an integer.
The middle of the snippet filters out unwanted tokens: URLs, hashtags, and @mentions. The gsub() call removes typical punctuation.
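If you want to see what those filters actually keep and discard without hitting Twitter at all, you can feed the same awk program a hand-made sample line on stdin instead of the wget output. The JSON string below is made up purely for illustration:

```shell
# Made-up sample mimicking a Twitter JSON "text" field; piped through
# the same awk word-counting program as above.
printf '%s' '{"text":"Check out https://example.com #linux @user Hello hello world!"}' | awk '
BEGIN {
    RS = "\"text\":\""
    FS = "\""
}
(NF > 1 && length($1) > 0) {
    n = split($1, temp, /[\t\n\v\f\r ]+/)
    for (i = 1; i <= n; i++) {
        w = tolower(temp[i])
        gsub(/[-!?.,_:;$\047()<>+]+/, "", w)
        if (w ~ /^[@#]/) continue
        if (w ~ /[0-9]/) continue
        if (w ~ /^https*:/) continue
        if (w ~ /^ftps*:/) continue
        if (w ~ /^www\./) continue
        if (w ~ /[.\/].*[.\/]/) continue
        word[w]++
    }
}
END {
    for (w in word)
        printf("%s %d\n", w, word[w])
}' | sort
# Output:
# check 1
# hello 2
# out 1
# world 1
```

Note that "#linux" and "@user" are dropped by the `^[@#]` check, while the URL survives the punctuation stripping (gsub removes its colon, so `^https*:` no longer matches) and is instead caught by the `[.\/].*[.\/]` rule, which rejects any token containing two or more dots or slashes.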
Last edited by Nominal Animal; 06-07-2012 at 02:51 PM.