grep detecting carriage return, how ?

Grafbak · 03-17-2005, 12:44 PM

Hello,

I am trying to get grep to detect a carriage return, but I'm not sure how to do it. I have looked around, but I can't find a good grep tutorial that covers this, or how to detect an ASCII code with grep.

I have a large xml-file which contains several blocks of <BF:EVENT> </BF:EVENT> and i want to extract these blocks, and save these in a separate file.
It's a little tricky to detect the last </bf:event> because they come in groups of like 50000 times <bf:event> <bf:param> </bf:param></bf:event>.
I want to track the last line number in the block containing </bf:event> by the fact that </bf:event> is NOT followed by <bf:event> on the next line.

I think it should be something like this :

for i in counter
do
x = `grep -n "<bf:event"|cut -f1 -d: | head -n1`
y = `grep -n "</bf:event>/CR<bf:event>"|cut -f1 -d: | tail -n1`
head -n"$X" xmlfile.xml|tail ("$X"-"$Y") > partfile"$i".xml
done

However i can't get it to do what i want. Can somebody please help me out ?

Matir · 03-17-2005, 12:48 PM

Grep only works on a line-by-line basis. You cannot match the grep expression to multiple lines. I believe awk can do this, however.
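
As a rough sketch of how awk could do this: it can remember whether the previous line was a closing </bf:event> and report that line's number whenever the current line does not open a new <bf:event>. The function name last_event_lines is made up for illustration, and the sketch assumes lowercase tags, one tag per line, and no blank lines between events.

```shell
#!/bin/bash
# last_event_lines FILE (hypothetical helper, not from the thread)
# Prints the line number of every </bf:event> that ends a run of event
# blocks, i.e. one whose next line does not open another <bf:event>.
last_event_lines() {
    awk '
        # previous line closed an event and this line does not open one
        prev_closed && $0 !~ /<bf:event/ { print NR - 1 }
        # remember whether the current line is a closing tag
        { prev_closed = ($0 ~ /<\/bf:event>/) }
        # the file may end directly after a closing tag
        END { if (prev_closed) print NR }
    ' "$1"
}
```

Calling it as last_event_lines xmlfile.xml would print one line number per run of events.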

keefaz · 03-17-2005, 12:53 PM

Try this perl script from another thread :
http://www.linuxquestions.org/questi...=xml+file+perl

Grafbak · 03-17-2005, 02:01 PM

Thank you for the Perl program, Cedrik. However, I would like to avoid using it, because I have already written some lines in bash and I want to see whether I can solve it in bash. There is some educational purpose in it for me as well; I like bash, almost anything is allowed. But I hope you are willing to help me using bash.

This is a scheme of the xml-file i am dealing with :

<BF:LOG>
<BF:ROUND>
<BF:SERVER>
<BF:SETTING>
</BF:SETTING>
<BF:EVENT>
</BF:EVENT>
<BF:SETTING>
</BF:SETTING>
<BF:EVENT>
</BF:EVENT>
<BF:ROUNDSTATS>
<BF:WINNINGTEAM> </BF:WINNINGTEAM>
<BF:VICTORYTYPE> </BF:VICTORYTYPE>
<BF:TEAMTICKETS> </BF:TEAMTICKETS>
<BF:PLAYERSTAT>
<BF:STATPARAM> </BF:STATPARAM>
</BF:PLAYERSTAT>
</BF:ROUNDSTATS>
</BF:ROUND>
</BF:LOG>

The event-blocks keep popping up everywhere in the file, and take up 99% of the space in the file.
I need to split these files into parts of at most 550000 lines each, so the parser does not give me a memory error.
Once I have split them up, I need to put the other tags back in the smaller files to make each log complete.
I am not exactly a highly experienced bash programmer, so forgive me my code.

This is what i have so far :

#!/bin/bash
filelist=""
filename=""
file_max_size="30000k"
i=""
a=0
xml_dest_dir="$HOME/download/logs/xmltest/test2/"
xml_feed_dir="$HOME/download/logs/xmltest/"
max_number_of_lines=0
global_line_counter=45
filelist=`ls "$xml_feed_dir"*.xml`
echo "$filelist"
#copy every file bigger than 30 mb to xml_dest_dir
find -size +"$file_max_size" -exec mv {} "$xml_dest_dir". \;
filelist_to_process=`ls "$xml_dest_dir"*.xml`
echo "filelist of files being processed: " $filelist_to_process
#MAIN LOOP
for i in $filelist_to_process;
do
let a=a+1
echo "now working on file: ""$i"
#strip the first 45 lines of setting, always the same in every file
head -n 45 "$i">"$xml_dest_dir""topfile_$a"
#set some variables for the next loop
max_number_of_lines=`wc -l "$i"|cut -f1 -d/`
echo "value for max_number_of_lines is " $max_number_of_lines
number_of_event_entries=`grep -n "<bf:event" "$i"|cut -f1 -d:|wc -l`
number_of_event_terms=`grep -n "</bf:event" "$i"|cut -f1 -d:|wc -l`
number_of_lines_in_file=`wc -l "$i"`
# subloop to detect the first event-blocks with create players
for j in $number_of_event_terms
do
linenumber_event_entry=`grep -n "<bf:event" "$i"|cut -f1 -d:|head -n"$j"|tail -n1`
linenumber_event_term=`grep -n "</bf:event" "$i"|cut -f1 -d:|head -n"$j"|tail -n1`
let linenumber_next=$linenumber_event_term+1
echo "value of linenumber_event_entry is " $linenumber_event_entry
echo "value of linenumber_event_terms is " $linenumber_event_term
echo "value of linenumber_next is " $linenumber_next

if [ $linenumber_next -ne $number_of_event_entries ]
then
let s=($j - 45)
head -n"$j" "$i"|tail -n"$s">"$xml_dest_dir/eventfile1_$a"
fi
done
done
exit

I know it is not working, but i seem to have more of a methodical problem doing it with grep, head and tail.
Yet i believe it must be possible to detect the end of a <bf:event> block by the fact that the line following </bf:event> may not be equal to <bf:event>. I don't know how to build my counter correctly for the line number i need for the if-statement.
It seemed so simple -=sigh=-

Can you help me with this please ?
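
One possible way to build that counter, offered only as a sketch: let awk print a "start end" pair of line numbers for each run of consecutive event blocks, then cut each range out with sed. The name extract_event_runs and the output naming are invented for illustration, and the sketch assumes lowercase tags with one tag per line, as in the real file.

```shell
#!/bin/bash
# extract_event_runs FILE PREFIX (hypothetical helper, not from the thread)
# Writes every run of consecutive event blocks to PREFIX1.xml, PREFIX2.xml, ...
extract_event_runs() {
    # awk prints one "start end" pair of line numbers per run
    awk '
        /<bf:event/ && !in_run { in_run = 1; start = NR }
        in_run && closed && $0 !~ /<bf:event/ { print start, NR - 1; in_run = 0 }
        { closed = ($0 ~ /<\/bf:event>/) }
        END { if (in_run && closed) print start, NR }
    ' "$1" | {
        n=0
        while read -r start end; do
            n=$((n + 1))
            # sed -n "A,Bp" prints only lines A through B
            sed -n "${start},${end}p" "$1" > "$2$n.xml"
        done
    }
}
```

For example, extract_event_runs xmlfile.xml eventfile would produce eventfile1.xml, eventfile2.xml and so on, one per run.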

ahh · 03-17-2005, 02:11 PM

How about using tac and looking for the first occurrence of </bf:event>?

Or I could be barking up the wrong tree.
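
A small sketch of the tac idea, assuming GNU tac and a grep that supports -m: reversing the file makes the last </bf:event> the first match, and its position from the end converts back to a normal line number. The function name last_event_close is made up here, and this only finds the final closing tag in the whole file, not the end of each group.

```shell
#!/bin/bash
# last_event_close FILE (hypothetical helper, not from the thread)
# Prints the line number of the final </bf:event> in FILE.
last_event_close() {
    total=$(wc -l < "$1")
    # tac reverses the file, so the first match is the last closing tag
    from_end=$(tac "$1" | grep -n -m1 '</bf:event>' | cut -d: -f1)
    # convert the distance from the end into a line number from the top
    [ -n "$from_end" ] && echo $((total - from_end + 1))
}
```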

Grafbak · 03-17-2005, 02:18 PM

Well the problem is that inside a real file it looks like this :

<bf:event name="createPlayer" timestamp="12.5835">
<bf:param type="int" name="player_id">0</bf:param>
<bf:param type="vec3" name="player_location">937.2/18.07/961.78</bf:param>
<bf:param type="string" name="name"><BeC.bF>Pank</bf:param>
<bf:param type="int" name="is_ai">0</bf:param>
<bf:param type="int" name="team">2</bf:param>
</bf:event>
<bf:event name="playerKeyHash" timestamp="13.1593">
<bf:param type="int" name="player_id">0</bf:param>
<bf:param type="string" name="keyhash">9b15a1ae21024d3e978398603bb636f4</bf:param>
</bf:event>
<bf:event name="createPlayer" timestamp="13.6766">
<bf:param type="int" name="player_id">1</bf:param>
<bf:param type="vec3" name="player_location">937.2/18.07/961.78</bf:param>
<bf:param type="string" name="name">Razorlight</bf:param>
<bf:param type="int" name="is_ai">0</bf:param>
<bf:param type="int" name="team">1</bf:param>
</bf:event>
<bf:event name="playerKeyHash" timestamp="13.8777">
<bf:param type="int" name="player_id">1</bf:param>
<bf:param type="string" name="keyhash">77688ed320616376490dfdf7a5ac288a</bf:param>
</bf:event>
<bf:event name="roundInit" timestamp="22.5044">
<bf:param type="int" name="tickets_team1">0</bf:param>
<bf:param type="int" name="tickets_team2">0</bf:param>
</bf:event>
<bf:event name="createPlayer" timestamp="37.067">
<bf:param type="int" name="player_id">2</bf:param>
<bf:param type="vec3" name="player_location">937.2/18.07/961.78</bf:param>
<bf:param type="string" name="name">DAS_BeKiffte_SChaAf</bf:param>
<bf:param type="int" name="is_ai">0</bf:param>
<bf:param type="int" name="team">1</bf:param>

So I have several of these blocks, I don't know how many, and I want the line number of the last </bf:event> in every such block. Hope that helps.

ahh · 03-17-2005, 02:33 PM

Does the <bf:param type...> exist outside of the event tags?

Grafbak · 03-17-2005, 02:37 PM

No the bf:param does not exist outside of the event tags.

ahh · 03-17-2005, 03:09 PM

So if you want to move all occurrences of bf:event ... /bf:event to another file,

Code:

grep "bf:event\|bf:param" xmlfile.xml > newfile

should do it.

And

Code:

grep -v "bf:event\|bf:param" xmlfile.xml > anotherfile

will give you the rest.

Is this what you were after?

Grafbak · 03-17-2005, 03:21 PM

That is almost what i'm after. However, would you know a simple command to split up the output from that grep command into separate 30 megabyte files ? (i'll be thinking along with you)
I need to put all the files back together again as well..

ahh · 03-17-2005, 03:48 PM

Split could do that.

Maybe

Code:

grep "bf:event\|bf:param" xmlfile.xml | split -C 30m

Matir · 03-17-2005, 03:50 PM

The disadvantage of the split is that the individual files would likely be completely unparsable.

ahh · 03-17-2005, 03:55 PM

Well, with the -C option as opposed to the -b option, at least you will only get complete lines.

It should be possible to add required tags to the top & bottom of these files to enable them to be parsed. And of course, they will still be readable as text.

Grafbak · 03-17-2005, 03:57 PM

I think I can use that command; I'm going to try and play with it. The problem is that I cannot just collect all the <bf:event> blocks and throw them in one file. I need to split them up into separate files of at most 30 MB.
But when I reconstruct the file I need all the other blocks, like <bf:server>, in the correct place again. I have just tried to parse the output of grep "bf:event\|bf:param" > newfile, but the parser gives an error because the server settings and other tags are missing.
So if I have 700000 lines with bf:event tags, I want to take 550000 lines (roughly 30 MB) of bf:event and simply copy the bf:server tags and closing tags into the 30 MB file so it will get through the parser. The remaining 150000 lines of bf:event will get the same bf:server tags, and will also be made 'complete' again.
I'm sorry if i wasn't clear on that.
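
What is described above could be sketched roughly like this: cut the events-only output into chunks that only ever break between two events, then put a copy of the shared opening and closing tags around each chunk. Everything named here is an assumption for illustration: the helper wrap_event_chunks, a header file holding the opening tags, and a footer file holding the closing tags. A chunk may overshoot the line limit by up to one event, since the break is only made at the next <bf:event>.

```shell
#!/bin/bash
# wrap_event_chunks EVENTS HEADER FOOTER MAX_LINES PREFIX
# (hypothetical helper; all five arguments are illustrative)
# EVENTS holds only bf:event/bf:param lines; HEADER and FOOTER hold the
# tags that every part needs before and after its events.
wrap_event_chunks() {
    awk -v max="$4" -v prefix="$5" '
        # open a new chunk only at the start of an event, once the current one is full
        /<bf:event/ && (n == 0 || count >= max) {
            if (n > 0) close(out)
            n++; out = prefix n ".body"; count = 0
        }
        n == 0 { next }              # skip anything before the first event
        { print > out; count++ }
    ' "$1"
    for body in "$5"*.body; do
        [ -e "$body" ] || continue   # no chunks were written
        # header + events + footer gives one complete, parsable part
        cat "$2" "$body" "$3" > "${body%.body}.xml"
        rm -f "$body"
    done
}
```

The header could come from something like head -n 45 on the original file, as in the script earlier in the thread; the footer with the closing tags would have to be prepared by hand or cut from the end of the original file.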

ahh · 03-17-2005, 04:11 PM

Sorry if I seem a bit dim here, but let's see if I've got this straight:

You have one large file.

You want to split it into several 30M files with the events, and several 30M files with the rest.

Then you want to be able to put it back together?

If this is correct, do you need to reconstruct it? The original file will not have changed by grepping it.