LinuxQuestions.org
04-30-2009, 09:20 AM   #1
JoshMiller
LQ Newbie
 
Registered: Apr 2009
Posts: 10

Rep: Reputation: 0
Script to Chop a Text File


Hey,

I'm working on some scripts to run a TwitterBot that will ultimately do several different tasks (download files, store data, tweet various bits of info) based on what tweet is sent to it.

At the moment, I'm using the following to get the tweets.

curl -u ACCOUNT:PASSWORD http://twitter.com/statuses/friends_timeline.xml > recent.txt

This gives me a huge file, though, as it gets a lot of tweets. What I'd like to do is chunk this out by tweet and only process the last three tweets. The file length seems to be pretty regular, so I should be able to do this by line, e.g. lines 1-20 > tweet1.txt, lines 21-40 > tweet2.txt.

Alternatively, if someone knows of a better way to pull down only the last three tweets from Twitter (or even better, only new tweets), I'm open to suggestions on that too.
 
04-30-2009, 09:22 AM   #2
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
First, post an example of your 'recent.txt' here (a sufficient tail of it will do).
 
04-30-2009, 09:38 AM   #3
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
Quote:
Originally Posted by JoshMiller
Alternatively, if someone knows of a better way to pull down only the last three tweets from Twitter (or even better, only new tweets), I'm open to suggestions on that too.
I would start by looking at Twitter's API. There seem to be pre-rolled libraries available for many languages, so you should be able to find something to your liking.
 
04-30-2009, 09:46 AM   #4
JoshMiller
LQ Newbie
 
Registered: Apr 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by Sergei Steshenko
First, post an example of your 'recent.txt' here (a sufficient tail of it will do).
Here's a chunk. After this point, everything between <status> and </status> repeats for each tweet. A lot of this information will eventually be discarded.

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
<status>
<created_at>Thu Apr 30 12:11:48 +0000 2009</created_at>
<id>1658251631</id>
<text>07:11:48 up 2 days, 11:16, 3 users, load average: 0.80, 0.52, 0.23</text>
<source>web</source>
<truncated>false</truncated>
<in_reply_to_status_id></in_reply_to_status_id>
<in_reply_to_user_id></in_reply_to_user_id>
<favorited>false</favorited>
<in_reply_to_screen_name></in_reply_to_screen_name>
<user>
<id>35845774</id>
<name>Selphie Server</name>
<screen_name>SelphieBot</screen_name>
<location></location>
<description>Hi! I am a Server Bot for twitter.com/JoshMiller</description>
<profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/186733952/8-selphie-c_normal.jpg</profile_image_url>
<url>http://www.joshmiller.net</url>
<protected>false</protected>
<followers_count>1</followers_count>
<profile_background_color>9ae4e8</profile_background_color>
<profile_text_color>000000</profile_text_color>
<profile_link_color>0000ff</profile_link_color>
<profile_sidebar_fill_color>e0ff92</profile_sidebar_fill_color>
<profile_sidebar_border_color>87bc44</profile_sidebar_border_color>
<friends_count>1</friends_count>
<created_at>Mon Apr 27 20:11:58 +0000 2009</created_at>
<favourites_count>0</favourites_count>
<utc_offset>-21600</utc_offset>
<time_zone>Central Time (US &amp; Canada)</time_zone>
<profile_background_image_url>http://static.twitter.com/images/themes/theme1/bg.gif</profile_background_image_url>
<profile_background_tile>false</profile_background_tile>
<statuses_count>13</statuses_count>
<notifications>false</notifications>
<following>false</following>
</user>
</status>
 
04-30-2009, 10:43 AM   #5
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by JoshMiller
Here's a chunk. After this point, everything between <status> and </status> repeats for each tweet. A lot of this information will eventually be discarded.

So, do I understand correctly: the tweet begins with

Code:
<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
<status>
and ends with

Code:
</status>
and there are no nested <status> ... </status> pairs in the tweet?
 
04-30-2009, 10:59 AM   #6
JoshMiller
LQ Newbie
 
Registered: Apr 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by Sergei Steshenko
So, do I understand correctly: the tweet begins with

Code:
<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">
<status>
and ends with

Code:
</status>
and there are no nested <status> ... </status> pairs in the tweet?
No nested status pairs inside the tweet. Also

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">

is just the file header and not part of the tweet.

For what it's worth, the ultimate goal will be to read out information in these tweets, most notably the sections inside <text> (to choose actions), <created_at> (to tell if it's new), and <screen_name> (so it'll only respond to me).
 
04-30-2009, 11:14 AM   #7
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by JoshMiller
No nested status pairs inside the tweet. Also

<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">

is just the file header and not part of the tweet.

For what it's worth, the ultimate goal will be to read out information in these tweets, most notably the sections inside <text> (to choose actions), <created_at> (to tell if it's new), and <screen_name> (so it'll only respond to me).
Well, either use the API, as Telemachos already suggested, or write a script in Perl.

Regarding the latter, there is the Text::Balanced module ( http://perldoc.perl.org/Text/Balanced.html ), which is probably overkill for your case.

There is a simple solution:

http://perldoc.perl.org/perlfaq6.htm...erent-lines%3f
.
 
04-30-2009, 11:20 AM   #8
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Code:
awk 'BEGIN{RS="</status>"}
{ a[d++]=$0}
END{
 #print last 3 sections of status
 for(o=d-3;o<d;o++){
  print a[o] 
 }
}' file
 
04-30-2009, 11:28 AM   #9
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by ghostdog74
Code:
awk 'BEGIN{RS="</status>"}
{ a[d++]=$0}
END{
 #print last 3 sections of status
 for(o=d-3;o<d;o++){
  print a[o] 
 }
}' file
Well, the whole file in an array again?
 
04-30-2009, 11:36 AM   #10
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by Sergei Steshenko
Well, the whole file in an array again?
I forgot to add the phrase "If your file is not too big, you can try this".
@OP, if it's really huge, here is another way:
Code:
awk 'BEGIN{RS="</status>"}{
  last=second
  second=first
  first=$0 
}END{
  print last
  print second
  print first
}' file

Last edited by ghostdog74; 04-30-2009 at 11:56 AM.
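
A caveat that applies to both versions, offered as a sketch rather than a definitive fix: with RS="</status>", whatever follows the final </status> (the closing </statuses> line, assuming the file ends that way) is read as one more record, so it takes up one of the three slots. A multi-character RS is also left unspecified by POSIX, so not every awk handles it the same way. The variant below works line by line instead and keeps only complete <status> blocks. It assumes each tag sits on its own line, as in the sample posted earlier; the function name last_three_statuses is made up for illustration.

```shell
# Print the last three complete <status>...</status> blocks.
# Usage: last_three_statuses [FILE]   (reads stdin when FILE is omitted)
# Assumption: every tag is on its own line, as in the sample above.
last_three_statuses() {
  awk '
    /<status>/   { inside = 1; block = "" }      # a new tweet starts here
    inside       { block = block $0 "\n" }       # collect its lines
    /<\/status>/ {                                # tweet complete: shift it in
      inside = 0
      third_last = second_last; second_last = last; last = block
    }
    END { printf "%s%s%s", third_last, second_last, last }
  ' "$@"
}
```

Called as last_three_statuses recent.txt, it holds at most three tweets in memory at a time, like the three-variable version above, and prints them in file order.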
 
04-30-2009, 02:31 PM   #11
JoshMiller
LQ Newbie
 
Registered: Apr 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by ghostdog74
I forgot to add the phrase "If your file is not too big, you can try this".
@OP, if it's really huge, here is another way:
Code:
awk 'BEGIN{RS="</status>"}{
  last=second
  second=first
  first=$0 
}END{
  print last
  print second
  print first
}' file
This looks like it will work correctly. I think I can also adapt it to pull out the string variables.
 
04-30-2009, 03:15 PM   #12
JoshMiller
LQ Newbie
 
Registered: Apr 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Ok, using the above example and some information I found online, I've adapted the code some.

awk 'BEGIN{
RS=""
FS="\n" }{
}END{
# first tweet id and text
print $8
print $9
#second tweet id and text
print $100
print $101
#third tweet id and text
print $192
print $193
}' /home/josh/scripts/recent.txt

This works... sort of. I picked out which lines I needed, and when the recent.txt file contained only tweets made by the bot, things worked out. When I sent an @reply to it, the lines shifted and things didn't work. Based on the old code, I'm wondering how to simply output the lines based on tags, i.e. <id> and <text>. Just one example will do, since I should be able to fill in the rest.

Basically, at this point, this is looking like a good way to simply bypass outputting the tweets to another file and just pull the info I want from the raw file.

Also, I'm wondering what the command is to put the printed information into a string variable. I tried echo and > and a few variations on those but got nothing but errors.
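
One possible way to select by tag rather than by line number, and to get the result into a shell variable, is sketched below. The helper name tag_values is hypothetical, and the sketch assumes each opening and closing tag pair sits on a single line, as in the sample file. Capturing output in a variable is done with command substitution, var=$(command), rather than with echo or >.

```shell
# Print the text between <TAG> and </TAG> for every line that holds such a pair.
# Usage: tag_values TAG [FILE]   (reads stdin when FILE is omitted)
# Assumption: the opening and closing tag are on the same line.
tag_values() {
  t=$1
  shift
  awk -v otag="<$t>" -v ctag="</$t>" '
    {
      s = index($0, otag)
      if (s == 0) next                      # no opening tag on this line
      rest = substr($0, s + length(otag))   # everything after the opening tag
      e = index(rest, ctag)
      if (e > 0) print substr(rest, 1, e - 1)
    }
  ' "$@"
}

# Example (not run here): store the first <text> value of the file in a variable.
#   first_text=$(tag_values text /home/josh/scripts/recent.txt | head -n 1)
#   echo "$first_text"
```

Note that <id> and <created_at> each appear twice per tweet in the sample (once for the tweet and once inside <user>), so for those tags the values alternate between the tweet and its author.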
 
05-01-2009, 12:05 AM   #13
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by JoshMiller
Ok, using the above example and some information I found online, I've adapted the code some.

awk 'BEGIN{
RS=""
FS="\n" }{
}END{
# first tweet id and text
print $8
print $9
#second tweet id and text
print $100
print $101
#third tweet id and text
print $192
print $193
}' /home/josh/scripts/recent.txt

This works... sort of. I picked out which lines I needed, and when the recent.txt file contained only tweets made by the bot, things worked out. When I sent an @reply to it, the lines shifted and things didn't work. Based on the old code, I'm wondering how to simply output the lines based on tags, i.e. <id> and <text>. Just one example will do, since I should be able to fill in the rest.

Basically, at this point, this is looking like a good way to simply bypass outputting the tweets to another file and just pull the info I want from the raw file.

Also, I'm wondering what the command is to put the printed information into a string variable. I tried echo and > and a few variations on those but got nothing but errors.
1) extract the whole tweet;
2) extract the tagged portion from it.

I think the Text::Balanced module ( http://perldoc.perl.org/Text/Balanced.html ) is now becoming more relevant.

I am not sure what you mean by "bypass outputting the tweets to another file", but 'curl' can output to STDOUT and, say, Perl can read from STDIN, as can other scripting languages.
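
The two steps and the STDOUT-to-STDIN idea can be sketched with awk as well, with no intermediate file. This is only an illustration under assumptions: the function name summarize_tweets is invented, each tag is assumed to sit on its own line as in the posted sample, and the curl line is shown as a comment because it needs real credentials and the URL from the first post.

```shell
# Read the timeline XML on stdin and print one tab-separated line per tweet:
# created_at, screen_name, text.
# Assumption: one tag per line, as in the sample earlier in the thread.
summarize_tweets() {
  awk '
    # Return the text between <tag> and </tag> on one line, or "" if absent.
    function inner(line, tag,    o, c, s, rest, e) {
      o = "<" tag ">"; c = "</" tag ">"
      s = index(line, o)
      if (s == 0) return ""
      rest = substr(line, s + length(o))
      e = index(rest, c)
      return (e > 0) ? substr(rest, 1, e - 1) : rest
    }
    /<status>/      { created = ""; name = ""; text = "" }   # step 1: tweet starts
    /<created_at>/  { if (created == "") created = inner($0, "created_at") }
    /<screen_name>/ { name = inner($0, "screen_name") }      # step 2: tagged parts
    /<text>/        { text = inner($0, "text") }
    /<\/status>/    { print created "\t" name "\t" text }    # tweet complete
  '
}

# Example pipeline (not run here):
#   curl -s -u ACCOUNT:PASSWORD http://twitter.com/statuses/friends_timeline.xml | summarize_tweets
```

Only the first <created_at> inside each <status> is kept, since the second one in the sample belongs to the <user> block and records when the account was created.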
 
05-05-2009, 11:26 AM   #14
JoshMiller
LQ Newbie
 
Registered: Apr 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Hey all, thanks for the help. I have recently discovered ttytter, which looks to do what I want in a much, much more elegant fashion than I could ever manage. I can easily output the recent tweets to a file for processing by the system.

This leads to more questions on how to do a few things, but that is likely a better topic for another thread.
 
05-05-2009, 06:29 PM   #15
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
what's a tweet?
 
  

