Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux, and any language is fair game.
A newbie requests your help.
I'm working on creating authentic wordlists as part of my linguistic studies. I chanced upon a script the other day that makes wordlists from Twitter:
I saw the same script, and it wasn't for linguistic studies. If I remember correctly, it was to build a wordlist for attempting to crack the passwords of Twitter users with specified interests. As such, I'm pretty sure that kind of content is not looked upon kindly here.
Well, I guess linguists and crackers use the same tools. That makes sense. In all honesty, I'm not trying to crack Twitter accounts. I'm trying to establish a list of authentic words related to a specific field of interest and analyze them. I've done it for Wikipedia, and since I saw it was possible for Twitter, I thought why not give it a try... That's all.
That's a very ugly bit of code, by the way. It could certainly be replaced by a single, and much more efficient, awk command. If we had an example of the input text to work with (and what needs to be extracted from it), we might give it a try.
Not to be inflammatory, but how difficult would it be to run:
# Replace "search" below with your actual query term.
wget -q -O - 'http://search.twitter.com/search.json?q='"search"'&rpp=1000' | awk '
BEGIN {
    # Each JSON "text" field starts a new record; fields split on quotes,
    # so $1 of each record (after the first) is the tweet text itself.
    RS = "\"text\":\""
    FS = "\""
}
(NF > 1 && length($1) > 0) {
    # Split the tweet text into whitespace-separated tokens.
    n = split($1, temp, /[\t\n\v\f\r ]+/)
    for (i = 1; i <= n; i++) {
        w = tolower(temp[i])
        # Strip typical punctuation (\047 is an octal escape for the apostrophe).
        gsub(/[-!?.,_:;$\047()<>+]+/, "", w)
        if (w ~ /^[@#]/) continue          # mentions and hashtags
        if (w ~ /[0-9]/) continue          # words containing digits
        if (w ~ /^https*:/) continue       # http/https URLs
        if (w ~ /^ftps*:/) continue        # ftp/ftps URLs
        if (w ~ /^www\./) continue         # bare www addresses
        if (w ~ /[.\/].*[.\/]/) continue   # paths and domain-like tokens
        word[w]++
    }
}
END {
    for (w in word)
        printf("%s %d\n", w, word[w])
}' | sort
Each line of output contains one word, followed by its frequency (count) as an integer.
The middle of the snippet filters out unwanted tokens: URLs, hashtags, and @mentions. The gsub() call removes typical punctuation.
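If you want to see what those filters actually keep and discard without hitting Twitter at all, you can feed the same awk program a hand-made sample line on stdin instead of the wget output. The JSON string below is made up purely for illustration:

```shell
# Made-up sample mimicking a Twitter JSON "text" field; piped through
# the same awk word-counting program as above.
printf '%s' '{"text":"Check out https://example.com #linux @user Hello hello world!"}' | awk '
BEGIN {
    RS = "\"text\":\""
    FS = "\""
}
(NF > 1 && length($1) > 0) {
    n = split($1, temp, /[\t\n\v\f\r ]+/)
    for (i = 1; i <= n; i++) {
        w = tolower(temp[i])
        gsub(/[-!?.,_:;$\047()<>+]+/, "", w)
        if (w ~ /^[@#]/) continue
        if (w ~ /[0-9]/) continue
        if (w ~ /^https*:/) continue
        if (w ~ /^ftps*:/) continue
        if (w ~ /^www\./) continue
        if (w ~ /[.\/].*[.\/]/) continue
        word[w]++
    }
}
END {
    for (w in word)
        printf("%s %d\n", w, word[w])
}' | sort
# Output:
# check 1
# hello 2
# out 1
# world 1
```

Note that "#linux" and "@user" are dropped by the `^[@#]` check, while the URL survives the punctuation stripping (gsub removes its colon, so `^https*:` no longer matches) and is instead caught by the `[.\/].*[.\/]` rule, which rejects any token containing two or more dots or slashes.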
Last edited by Nominal Animal; 06-07-2012 at 02:51 PM.