[SOLVED] bash: Split a text file into an array? (NOT line-by-line)

DJCharlie · 09-18-2010, 10:25 AM

Hi all. First-time poster here...

Say I have a file (called twitterstatus.tmp) that looks like this:

Code:

<status>
  <id>24854489768</id>
  <text>Are we gonna ride the sun home?</text>
    <id>55266987</id>
    <screen_name>dj_johnnyfever</screen_name>
</status>
<status>
  <id>24852047832</id>
  <text>@dj_johnnyfever Hey Johnny! Can you see this yet?</text>
    <id>51269031</id>
    <screen_name>DJCharlieKJSR</screen_name>
</status>
<status>
  <id>24845941995</id>
  <text>Dog... donkey... Well, they both start with the letter &quot;N&quot;...</text>
    <id>55266987</id>
    <screen_name>dj_johnnyfever</screen_name>
</status>

How could I feed this into an array, with each element containing everything between the <status> </status> tags?

Thanks in advance!

kurumi · 09-18-2010, 11:36 AM

Are you going to convert it to CSV?

DJCharlie · 09-18-2010, 11:43 AM

No. Once I have each segment set, I'll be searching for a specific string contained in the segment to act on.

So, from the sample I posted, say the script sees this segment:

Code:

<status>
  <id>24852047832</id>
  <text>@dj_johnnyfever Hey Johnny! Can you see this yet?</text>
    <id>51269031</id>
    <screen_name>DJCharlieKJSR</screen_name>
</status>

It would trigger on the @dj_johnnyfever keyword, and act accordingly.

The trouble I'm having is splitting the file into segments.

PTrenholme · 09-18-2010, 12:02 PM

Since that's XML, why not just use the XML functionality to do what you want?

By the way, since this is your first post, I suggest that you "Report" your thread to the moderators and request that they move it to the Programming sub-forum where you'd get much better responses. (It's not, really, a "general" question.)

DJCharlie · 09-18-2010, 12:04 PM

Well, ideally, I'd prefer it not be in XML. I'm actually stripping out the XML further along in the script. I need plain-text variables for that. But first I need to divide it into easily digestible segments bound by the <status> </status> tags.

And thanks, I'll report it.

quanta · 09-18-2010, 12:44 PM

I haven't a solution to convert directly into an array, but I found the following command to split into multiple files:

Code:

awk '/<status>/{ close("twitter" c ".status"); c++ } { print $0 > ("twitter" c ".status") }' twitterstatus.tmp
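
If the goal is still an array, the per-segment files can be read back in afterwards. A rough sketch, assuming bash (for the array); the shortened sample data, the temporary directory and the prefix variable are only there so the example runs on its own:

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cat > "$workdir/twitterstatus.tmp" <<'EOF'
<status>
  <id>24854489768</id>
  <text>Are we gonna ride the sun home?</text>
</status>
<status>
  <id>24852047832</id>
  <text>@dj_johnnyfever Hey Johnny! Can you see this yet?</text>
</status>
EOF

# Same idea as the one-liner above: start a new output file at every <status>.
awk -v prefix="$workdir/twitter" '
  /<status>/ { close(prefix c ".status"); c++ }
  { print > (prefix c ".status") }
' "$workdir/twitterstatus.tmp"

# Read the numbered files back in order, one array element per file.
arr=()
i=1
while [ -f "$workdir/twitter$i.status" ]; do
  arr+=("$(cat "$workdir/twitter$i.status")")
  i=$((i + 1))
done
rm -rf "$workdir"

echo "${#arr[@]} segments loaded"
echo "${arr[1]}"
```

Counting up from 1 instead of globbing twitter*.status keeps the elements in file order once there are ten or more segments.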

Kenhelm · 09-18-2010, 01:06 PM

Try

Code:

eval arr=("$(sed "s/'/'\"'\"'/g; s/<status>/'&/; s/<\/status>/&'/" file)")

echo "${arr[0]}"
<status>
  <id>24854489768</id>
  <text>Are we gonna ride the sun home?</text>
    <id>55266987</id>
    <screen_name>dj_johnnyfever</screen_name>
</status>

sed puts each block into single quotes.
s/'/'\"'\"'/g protects any literal single quotes from eval by placing them in double quotes in a gap in the single quotes, e.g.

Code:

......It's now or never....
would become
'......It'"'"'s now or never....'
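
For comparison, the same array can be built without eval, which sidesteps the quoting problem altogether. A minimal sketch, assuming bash and that every block ends on a line containing </status>; the sample input is made up for the example:

```shell
#!/usr/bin/env bash
# Made-up sample input, including a literal single quote.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<status>
  <id>1</id>
  <text>It's now or never</text>
</status>
<status>
  <id>2</id>
  <text>@dj_johnnyfever hello</text>
</status>
EOF

arr=()      # one element per <status>...</status> block
block=""    # lines collected so far for the current block
while IFS= read -r line; do
  block+="$line"$'\n'
  if [[ $line == *"</status>"* ]]; then
    arr+=("${block%$'\n'}")   # drop the trailing newline, then store the block
    block=""
  fi
done < "$tmpfile"
rm -f "$tmpfile"

echo "${#arr[@]} blocks"
echo "${arr[0]}"
```

Because the text is never re-parsed by the shell, quotes, dollar signs and backticks inside a tweet are stored as-is.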

DJCharlie · 09-18-2010, 01:10 PM

Solved it! It's a bit crufty, but it works.

Basically, each segment I need is 5 lines long. So I do a post=$(head -n 5 twitterstatus.tmp), scan it for the keyword, and then use sed to delete the top 5 lines of the file. If the keyword is found, I split $post into individual variables and process from there!
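
In rough outline, that loop could look like the following. This is a sketch only: the five-line sample layout, the keyword and the variable names are stand-ins, and the segment length is kept in a variable because the sample posted at the top of the thread has six lines per block.

```shell
#!/usr/bin/env bash
# Stand-in input with five lines per segment (an assumption for this sketch).
file=$(mktemp)
cat > "$file" <<'EOF'
<status>
  <id>24854489768</id>
  <text>Are we gonna ride the sun home?</text>
  <screen_name>dj_johnnyfever</screen_name>
</status>
<status>
  <id>24852047832</id>
  <text>@dj_johnnyfever Hey Johnny! Can you see this yet?</text>
  <screen_name>DJCharlieKJSR</screen_name>
</status>
EOF

keyword='@dj_johnnyfever'
lines_per_segment=5
matches=0

# Keep going until the file has been consumed.
while [ -s "$file" ]; do
  post=$(head -n "$lines_per_segment" "$file")
  case $post in
    *"$keyword"*)
      matches=$((matches + 1))
      printf 'keyword found in:\n%s\n' "$post"
      ;;
  esac
  # Delete the segment just read; stop if the rewrite fails.
  sed "1,${lines_per_segment}d" "$file" > "$file.rest" && mv "$file.rest" "$file" || break
done
rm -f "$file" "$file.rest"

echo "$matches segment(s) matched"
```

Writing to a second file and moving it back avoids relying on sed -i, whose syntax differs between GNU and BSD sed.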

Thanks for letting me bounce ideas off you, everyone!

XavierP · 09-18-2010, 01:59 PM

As requested, moved to Programming

grail · 09-19-2010, 09:22 PM

Well I am guessing it depends on what other things you wish to do, but here is something you could consider:

Code:

#!/usr/bin/awk -f

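# A multi-character RS is an extension (gawk and mawk support it);
# POSIX awk only guarantees the first character is used.
BEGIN{ RS="</status>" }

# Replace the print with whatever processing the record needs.
/@dj_johnnyfever/{ print }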
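
To try the idea from a shell, the same program can be passed inline. A small sketch, assuming an awk that accepts a multi-character RS (gawk and mawk do); the temporary file and the printed message are invented for the example:

```shell
#!/usr/bin/env bash
# Sample data copied from the first post.
file=$(mktemp)
cat > "$file" <<'EOF'
<status>
  <id>24854489768</id>
  <text>Are we gonna ride the sun home?</text>
    <id>55266987</id>
    <screen_name>dj_johnnyfever</screen_name>
</status>
<status>
  <id>24852047832</id>
  <text>@dj_johnnyfever Hey Johnny! Can you see this yet?</text>
    <id>51269031</id>
    <screen_name>DJCharlieKJSR</screen_name>
</status>
<status>
  <id>24845941995</id>
  <text>Dog... donkey... Well, they both start with the letter &quot;N&quot;...</text>
    <id>55266987</id>
    <screen_name>dj_johnnyfever</screen_name>
</status>
EOF

# Each record is everything up to the next </status>; report the ones that mention the keyword.
matches=$(awk '
  BEGIN { RS = "</status>" }
  /@dj_johnnyfever/ { print "record " NR " matched" }
' "$file")
rm -f "$file"

echo "$matches"
```

Note that the closing </status> tag is consumed as the separator, so it is not part of $0 inside the action.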