Script to Chop a Text File

JoshMiller · 04-30-2009, 09:20 AM

Hey,

I'm working on some scripts to run a TwitterBot that will ultimately do several different tasks (download files, store data, tweet various bits of info) based on what tweet is sent to it.

At the moment, I'm using the following to get the tweets.

curl -u ACCOUNT:PASSWORD http://twitter.com/statuses/friends_timeline.xml > recent.txt

This gives me a huge file, though, since it returns a lot of tweets. What I'd like to do is split it up by tweet and only process the last three tweets. The file length seems to be pretty regular, so I should be able to do this by line, i.e. lines 1-20 > tweet1.txt, lines 21-40 > tweet2.txt, and so on.

Alternately, if someone knows of a better way to pull down only the last three tweets from twitter (even better, only new tweets) I'm open to suggestions on that too.

Sergei Steshenko · 04-30-2009, 09:22 AM

First, post an example of your 'recent.txt' here (a sufficiently long tail of it will do).

Telemachos · 04-30-2009, 09:38 AM

Quote:

Originally Posted by JoshMiller

Alternately, if someone knows of a better way to pull down only the last three tweets from twitter (even better, only new tweets) I'm open to suggestions on that too.

I would start by looking at Twitter's API. There seem to be pre-rolled libraries available for many languages, so you should be able to find something to your liking.

JoshMiller · 04-30-2009, 09:46 AM

Quote:

Originally Posted by Sergei Steshenko

First, post an example of your 'recent.txt' here (a sufficiently long tail of it will do).

Here's a chunk. After this point, everything between <status> and </status> repeats for each tweet. A lot of this information will eventually be discarded.

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
<status>
<created_at>Thu Apr 30 12:11:48 +0000 2009</created_at>
<id>1658251631</id>
<text>07:11:48 up 2 days, 11:16, 3 users, load average: 0.80, 0.52, 0.23</text>
<source>web</source>
<truncated>false</truncated>
<in_reply_to_status_id></in_reply_to_status_id>
<in_reply_to_user_id></in_reply_to_user_id>
<favorited>false</favorited>
<in_reply_to_screen_name></in_reply_to_screen_name>
<user>
<id>35845774</id>
<name>Selphie Server</name>
<screen_name>SelphieBot</screen_name>
<location></location>
<description>Hi! I am a Server Bot for twitter.com/JoshMiller</description>
<profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/186733952/8-selphie-c_normal.jpg</profile_image_url>
<url>http://www.joshmiller.net</url>
<protected>false</protected>
<followers_count>1</followers_count>
<profile_background_color>9ae4e8</profile_background_color>
<profile_text_color>000000</profile_text_color>
<profile_link_color>0000ff</profile_link_color>
<profile_sidebar_fill_color>e0ff92</profile_sidebar_fill_color>
<profile_sidebar_border_color>87bc44</profile_sidebar_border_color>
<friends_count>1</friends_count>
<created_at>Mon Apr 27 20:11:58 +0000 2009</created_at>
<favourites_count>0</favourites_count>
<utc_offset>-21600</utc_offset>
<time_zone>Central Time (US &amp; Canada)</time_zone>
<profile_background_image_url>http://static.twitter.com/images/themes/theme1/bg.gif</profile_background_image_url>
<profile_background_tile>false</profile_background_tile>
<statuses_count>13</statuses_count>
<notifications>false</notifications>
<following>false</following>
</user>
</status>

Sergei Steshenko · 04-30-2009, 10:43 AM

Quote:

Originally Posted by JoshMiller

Here's a chunk. After this point, everything between <status> and </status> repeats for each tweet. A lot of this information will eventually be discarded.

So, do I understand correctly: the tweet begins with

Code:

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
<status>

and ends with

Code:

</status>

and there are no nested <status> ... </status> pairs in the tweet?

JoshMiller · 04-30-2009, 10:59 AM

Quote:

Originally Posted by Sergei Steshenko

So, do I understand correctly: the tweet begins with

Code:

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
<status>

and ends with

Code:

</status>

and there are no nested <status> ... </status> pairs in the tweet?

No nested status pairs inside the tweet. Also

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">

is just the file header and not part of the tweet.

For what it's worth, the ultimate goal is to read information out of these tweets, most notably the sections inside <text> (to choose actions), <created_at> (to tell if it's new), and <screen_name> (so it'll only respond to me).

Sergei Steshenko · 04-30-2009, 11:14 AM

Quote:

Originally Posted by JoshMiller

No nested status pairs inside the tweet. Also

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">

is just the file header and not part of the tweet.

For what it's worth, the ultimate goal is to read information out of these tweets, most notably the sections inside <text> (to choose actions), <created_at> (to tell if it's new), and <screen_name> (so it'll only respond to me).

Well, either use the API, as Telemachos already suggested, or write a script in Perl.

Regarding the latter, there is the Text::Balanced module ( http://perldoc.perl.org/Text/Balanced.html ), though it is probably overkill for your case.

There is a simple solution:

http://perldoc.perl.org/perlfaq6.htm...erent-lines%3f
.

ghostdog74 · 04-30-2009, 11:20 AM

Code:

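awk 'BEGIN{RS="</status>"}   # multi-character RS needs gawk or mawk
/<status>/{ a[d++]=$0 }      # skip the trailing "</statuses>" record
END{
 #print last 3 sections of status
 for(o=(d>3?d-3:0);o<d;o++){
  print a[o]
 }
}' file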

Sergei Steshenko · 04-30-2009, 11:28 AM

Quote:

Originally Posted by ghostdog74

Code:

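awk 'BEGIN{RS="</status>"}   # multi-character RS needs gawk or mawk
/<status>/{ a[d++]=$0 }      # skip the trailing "</statuses>" record
END{
 #print last 3 sections of status
 for(o=(d>3?d-3:0);o<d;o++){
  print a[o]
 }
}' file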

Well, again the whole file in an array?

ghostdog74 · 04-30-2009, 11:36 AM

Quote:

Originally Posted by Sergei Steshenko

Well, again the whole file in an array?

I forgot to add the phrase "If your file is not too big, you can try this".

@OP, if it's really huge, here is another way:

Code:

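awk 'BEGIN{RS="</status>"}   # multi-character RS needs gawk or mawk
/<status>/{                  # skip the trailing "</statuses>" record
  last=second
  second=first
  first=$0
}END{
  # the three records nearest the end of the file, in file order
  print last
  print second
  print first
}' file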
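
A note on ordering: timeline listings like this one normally put the newest tweet at the top of the file. Assuming that holds for recent.txt, the three most recent tweets are the first three <status> blocks rather than the last three. The sketch below is one way to grab them without a multi-character RS, so it should run on any POSIX awk. It assumes each <status> and </status> tag sits on its own line, as in the sample posted earlier, and first_tweets is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: print the first N <status>...</status> blocks of a timeline file.
# Assumes <status> and </status> each sit on their own line and that the
# newest tweets come first. first_tweets is a hypothetical helper name.
first_tweets() {
    # $1 = number of tweets to keep, $2 = input file
    awk -v n="$1" '
        /<status>/   { inside = 1 }              # a tweet starts here
        inside       { print }                   # copy lines while inside a tweet
        /<\/status>/ { inside = 0                # the tweet is complete
                       if (++count >= n) exit }  # stop reading early
    ' "$2"
}
```

Calling it as first_tweets 3 recent.txt prints three tweets and stops, so the rest of a large file is never read and nothing is held in memory.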

JoshMiller · 04-30-2009, 02:31 PM

Quote:

Originally Posted by ghostdog74

I forgot to add the phrase "If your file is not too big, you can try this".

@OP, if it's really huge, here is another way:

Code:

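awk 'BEGIN{RS="</status>"}   # multi-character RS needs gawk or mawk
/<status>/{                  # skip the trailing "</statuses>" record
  last=second
  second=first
  first=$0
}END{
  # the three records nearest the end of the file, in file order
  print last
  print second
  print first
}' file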

This looks like it will work correctly. I also think I can adapt it to pull out the string variables as well.

JoshMiller · 04-30-2009, 03:15 PM

Ok, using the above example and some information I found online, I've adapted the code some.

awk 'BEGIN{
RS=""
FS="\n" }{
}END{
# first tweet id and text
print $8
print $9
#second tweet id and text
print $100
print $101
#third tweet id and text
print $192
print $193
}' /home/josh/scripts/recent.txt

This works... sort of. I picked out which lines I needed, and when recent.txt contained only tweets made by the bot, things worked out. When I sent an @reply to it, the lines shifted and things stopped working. Based on the old code, I'm wondering how to simply output the lines based on tags, i.e. <id> and <text>. Just one example will do, since I should be able to fill in the rest.

Basically, at this point this is looking like a good way to bypass outputting the tweets to another file and just pull the info I want from the raw file.

Also, I'm wondering what the command is to put the printed information into a string variable. I tried echo and > and a few variations on those but got nothing but errors.
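
For reference, a minimal sketch of both pieces: picking a line by its tag instead of its position, and capturing the result in a shell variable. It assumes every tag sits on its own line, as in the sample posted earlier, and tweet_field is a made-up helper name. It does not decode XML entities such as &amp;amp;.

```shell
#!/bin/sh
# Sketch only. tweet_field is a hypothetical helper name, not part of any tool.
# Assumes every tag sits on its own line, as in the sample posted above.
tweet_field() {
    # $1 = tweet number (1 = first in the file), $2 = tag name, $3 = input file
    awk -v want="$1" -v tag="$2" '
        /<status>/ { n++ }                        # count tweets as they start
        n == want && index($0, "<" tag ">") {
            line = $0
            sub(/^[^>]*>/, "", line)              # strip indentation and the opening tag
            sub(/<\/[^>]*>[ \t\r]*$/, "", line)   # strip the closing tag
            print line
            exit    # first match only, so <id> is the tweet id, not the user id
        }
    ' "$3"
}

# Command substitution stores the printed value in a shell variable:
#   text=$(tweet_field 1 text /home/josh/scripts/recent.txt)
```

Because the match is on the tag name, an @reply that shifts the line count no longer matters. The value is captured with $( ... ), not with echo or >, since > redirects output to a file rather than to a variable.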

Sergei Steshenko · 05-01-2009, 12:05 AM

Quote:

Originally Posted by JoshMiller

Ok, using the above example and some information I found online, I've adapted the code somewhat.

awk 'BEGIN{
RS=""
FS="\n" }{
}END{
# first tweet id and text
print $8
print $9
#second tweet id and text
print $100
print $101
#third tweet id and text
print $192
print $193
}' /home/josh/scripts/recent.txt

This works... sort of. I picked out which lines I needed, and when recent.txt contained only tweets made by the bot, things worked out. When I sent an @reply to it, the lines shifted and things stopped working. Based on the old code, I'm wondering how to simply output the lines based on tags, i.e. <id> and <text>. Just one example will do, since I should be able to fill in the rest.

Basically, at this point this is looking like a good way to bypass outputting the tweets to another file and just pull the info I want from the raw file.

Also, I'm wondering what the command is to put the printed information into a string variable. I tried echo and > and a few variations on those but got nothing but errors.

1) extract the whole tweet;
2) extract the tagged portion from it.

I think the Text::Balanced module ( http://perldoc.perl.org/Text/Balanced.html ) is now becoming more relevant.

I am not sure what you mean by "bypass outputting the tweets to another file", but 'curl' can write to STDOUT, and Perl, like other scripting languages, can read from STDIN.
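
The same two steps can also be sketched in plain awk reading from STDIN, so no intermediate file is needed. This is only an illustration under the assumption that every tag sits on its own line, as in the posted sample. tweets_from is a made-up helper name, and the output format (id, a tab, then the text) is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: read a timeline from standard input and print "id<TAB>text"
# for every tweet sent by one account. tweets_from is a hypothetical helper.
tweets_from() {
    # $1 = screen name to accept
    awk -v who="$1" '
        # return the text between <tag> and </tag> on a single line
        function inner(line) {
            sub(/^[^>]*>/, "", line)
            sub(/<\/[^>]*>[ \t\r]*$/, "", line)
            return line
        }
        # step 1: collect one whole tweet, from <status> to </status>
        /<status>/           { id = ""; text = ""; name = ""; in_user = 0 }
        /<user>/             { in_user = 1 }
        /<\/user>/           { in_user = 0 }
        # step 2: pull the tagged portions out of it
        !in_user && /<id>/   { id = inner($0) }    # tweet id, not the user id
        !in_user && /<text>/ { text = inner($0) }
        /<screen_name>/      { name = inner($0) }
        /<\/status>/         { if (name == who) print id "\t" text }
    '
}
```

With this in place, the curl command from the first post can be piped straight into it, for example curl -u ACCOUNT:PASSWORD http://twitter.com/statuses/friends_timeline.xml | tweets_from JoshMiller, and only tweets from that account are printed.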

JoshMiller · 05-05-2009, 11:26 AM

Hey all, thanks for the help. I recently discovered ttytter, which looks like it does what I want in a much more elegant fashion than I could ever manage. I can easily output the recent tweets to a file for processing by the system.

This leads to more questions on how to do a few things, but those are probably better suited to another thread.

bigearsbilly · 05-05-2009, 06:29 PM

what's a tweet?