LinuxQuestions.org


Grafbak 03-17-2005 12:44 PM

grep detecting carriage return, how ?
 
Hello,

I am trying to get grep to detect a carriage return, but I'm not sure how to do it. I have looked around, but I can't find a good grep tutorial that covers this, or how to match an ASCII code with grep.

I have a large XML file which contains several blocks of <BF:EVENT> </BF:EVENT>, and I want to extract these blocks and save them in a separate file.
It's a little tricky to detect the last </bf:event> of a block, because the events come in runs of something like 50000 repetitions of <bf:event> <bf:param> </bf:param> </bf:event>.
I want to find the line number of the last </bf:event> in each block, using the fact that this </bf:event> is NOT followed by <bf:event> on the next line.

I think it should be something like this :

for i in counter
do
x = `grep -n "<bf:event"|cut -f1 -d: | head -n1`
y = `grep -n "</bf:event>/CR<bf:event>"|cut -f1 -d: | tail -n1`
head -n"$X" xmlfile.xml|tail ("$X"-"$Y") > partfile"$i".xml
done


However, I can't get it to do what I want. Can somebody please help me out?

Matir 03-17-2005 12:48 PM

Grep works on a line-by-line basis, so a grep expression cannot match across multiple lines. I believe awk can do this, however.
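
As a rough illustration of the awk route (a sketch only, not something posted in the thread): awk can remember whether the previous line was a closing </bf:event> and then check whether the current line opens a new <bf:event. The sample file and the variable names are made up for the example, and the sketch assumes one tag per line, as in the files described in this thread.

```shell
#!/bin/bash
# Made-up sample file; the real log is assumed to have one tag per line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<bf:server>
</bf:server>
<bf:event name="a">
<bf:param>1</bf:param>
</bf:event>
<bf:event name="b">
</bf:event>
<bf:setting>
</bf:setting>
<bf:event name="c">
</bf:event>
</bf:round>
EOF

# prev_closed records whether the previous line was </bf:event>.
# If it was, and the current line does not open a new event, the previous
# line was the last </bf:event> of a run, so its line number is printed.
block_ends=$(awk '
    prev_closed && $0 !~ /<bf:event/ { print NR - 1 }
    { prev_closed = ($0 ~ /<\/bf:event>/) }
    END { if (prev_closed) print NR }
' "$sample")

echo "$block_ends"
rm -f "$sample"
```

For this sample it prints 7 and 11, the line numbers of the two </bf:event> lines that are not followed by another <bf:event.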

keefaz 03-17-2005 12:53 PM

Try this perl script from another thread :
http://www.linuxquestions.org/questi...=xml+file+perl

Grafbak 03-17-2005 02:01 PM

Thank you for the Perl program, Cedrik. However, I would like to avoid using it, because I have already written some lines in bash and I want to see whether I can solve it using bash. There is some educational purpose in it for me as well; I like bash, and almost anything is allowed. I hope you are willing to help me using bash.

This is a scheme of the xml-file i am dealing with :

<BF:LOG>
<BF:ROUND>
<BF:SERVER>
<BF:SETTING>
</BF:SETTING>
<BF:EVENT>
</BF:EVENT>
<BF:SETTING>
</BF:SETTING>
<BF:EVENT>
</BF:EVENT>
<BF:ROUNDSTATS>
<BF:WINNINGTEAM> </BF:WINNINGTEAM>
<BF:VICTORYTYPE> </BF:VICTORYTYPE>
<BF:TEAMTICKETS> </BF:TEAMTICKETS>
<BF:PLAYERSTAT>
<BF:STATPARAM> </BF:STATPARAM>
</BF:PLAYERSTAT>
</BF:ROUNDSTATS>
</BF:ROUND>
</BF:LOG>

The event blocks keep popping up everywhere in the file and take up 99% of its space.
I need to split these files into pieces of at most 550000 lines, so the parser does not give me a memory error.
Once I have split them up, I need to put the other tags back into the smaller files to make each log complete.
I am not exactly a highly experienced bash programmer, so forgive me my code.

This is what i have so far :


#!/bin/bash
filelist=""
filename=""
file_max_size="30000k"
i=""
a=0
xml_dest_dir="$HOME/download/logs/xmltest/test2/"
xml_feed_dir="$HOME/download/logs/xmltest/"
max_number_of_lines=0
global_line_counter=45
filelist=`ls "$xml_feed_dir"*.xml`
echo "$filelist"
#copy every file bigger than 30 mb to xml_dest_dir
find -size +"$file_max_size" -exec mv {} "$xml_dest_dir". \;
filelist_to_process=`ls "$xml_dest_dir"*.xml`
echo "filelist of files being processed: " $filelist_to_process
#MAIN LOOP
for i in $filelist_to_process;
do
let a=a+1
echo "now working on file: ""$i"
#strip the first 45 lines of setting, always the same in every file
head -n 45 "$i">"$xml_dest_dir""topfile_$a"
#set some variables for the next loop
max_number_of_lines=`wc -l "$i"|cut -f1 -d/`
echo "value for max_number_of_lines is " $max_number_of_lines
number_of_event_entries=`grep -n "<bf:event" "$i"|cut -f1 -d:|wc -l`
number_of_event_terms=`grep -n "</bf:event" "$i"|cut -f1 -d:|wc -l`
number_of_lines_in_file=`wc -l "$i"`
# subloop to detect the first event-blocks with create players
for j in $number_of_event_terms
do
linenumber_event_entry=`grep -n "<bf:event" "$i"|cut -f1 -d:|head -n"$j"|tail -n1`
linenumber_event_term=`grep -n "</bf:event" "$i"|cut -f1 -d:|head -n"$j"|tail -n1`
let linenumber_next=$linenumber_event_term+1
echo "value of linenumber_event_entry is " $linenumber_event_entry
echo "value of linenumber_event_terms is " $linenumber_event_term
echo "value of linenumber_next is " $linenumber_next

if [ $linenumber_next -ne $number_of_event_entries ]
then
let s=($j - 45)
head -n"$j" "$i"|tail -n"$s">"$xml_dest_dir/eventfile1_$a"
fi
done
done
exit


I know it is not working, but my problem seems to be more one of method when doing this with grep, head and tail.
Yet I believe it must be possible to detect the end of a run of <bf:event> blocks by the fact that the line following the last </bf:event> is not <bf:event>. I don't know how to build my counter correctly to get the line number I need for the if-statement.
It seemed so simple -=sigh=-

Can you help me with this, please?

ahh 03-17-2005 02:11 PM

How about using tac and looking for the first occurrence of </bf:event>?

Or I could be barking up the wrong tree.
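
A minimal sketch of the tac idea, assuming GNU tac and a grep that supports -m (both are present on a typical Linux system). The file contents and variable names are invented for the example. Note that this only finds the last </bf:event> in the whole file, not the end of every run of events.

```shell
#!/bin/bash
# Made-up sample file with one tag per line.
sample=$(mktemp)
printf '%s\n' \
    '<bf:event name="a">' \
    '</bf:event>' \
    '<bf:event name="b">' \
    '</bf:event>' \
    '<bf:roundstats>' \
    '</bf:roundstats>' > "$sample"

# Total number of lines in the file.
total=$(wc -l < "$sample")

# tac prints the file last line first, so the first match that grep reports
# is the last </bf:event>; its number is counted from the end of the file.
from_end=$(tac "$sample" | grep -n -m1 '</bf:event>' | cut -d: -f1)

# Convert "lines from the end" back into an ordinary line number.
last_close=$((total - from_end + 1))

echo "$last_close"
rm -f "$sample"
```

Here the result is 4. The same number could also be had without tac from grep -n '</bf:event>' piped through tail -n1 and cut.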

Grafbak 03-17-2005 02:18 PM

Well the problem is that inside a real file it looks like this :

<bf:event name="createPlayer" timestamp="12.5835">
<bf:param type="int" name="player_id">0</bf:param>
<bf:param type="vec3" name="player_location">937.2/18.07/961.78</bf:param>
<bf:param type="string" name="name">&lt;BeC.bF&gt;Pank</bf:param>
<bf:param type="int" name="is_ai">0</bf:param>
<bf:param type="int" name="team">2</bf:param>
</bf:event>
<bf:event name="playerKeyHash" timestamp="13.1593">
<bf:param type="int" name="player_id">0</bf:param>
<bf:param type="string" name="keyhash">9b15a1ae21024d3e978398603bb636f4</bf:param>
</bf:event>
<bf:event name="createPlayer" timestamp="13.6766">
<bf:param type="int" name="player_id">1</bf:param>
<bf:param type="vec3" name="player_location">937.2/18.07/961.78</bf:param>
<bf:param type="string" name="name">Razorlight</bf:param>
<bf:param type="int" name="is_ai">0</bf:param>
<bf:param type="int" name="team">1</bf:param>
</bf:event>
<bf:event name="playerKeyHash" timestamp="13.8777">
<bf:param type="int" name="player_id">1</bf:param>
<bf:param type="string" name="keyhash">77688ed320616376490dfdf7a5ac288a</bf:param>
</bf:event>
<bf:event name="roundInit" timestamp="22.5044">
<bf:param type="int" name="tickets_team1">0</bf:param>
<bf:param type="int" name="tickets_team2">0</bf:param>
</bf:event>
<bf:event name="createPlayer" timestamp="37.067">
<bf:param type="int" name="player_id">2</bf:param>
<bf:param type="vec3" name="player_location">937.2/18.07/961.78</bf:param>
<bf:param type="string" name="name">DAS_BeKiffte_SChaAf</bf:param>
<bf:param type="int" name="is_ai">0</bf:param>
<bf:param type="int" name="team">1</bf:param>


So I have several of these blocks, I don't know how many, and I want the line number of the last </bf:event> in every such block. Hope that helps.

ahh 03-17-2005 02:33 PM

Does <bf:param type...> exist outside of the event tags?

Grafbak 03-17-2005 02:37 PM

No, bf:param does not exist outside of the event tags.

ahh 03-17-2005 03:09 PM

So if you want to move all occurrences of bf:event ... /bf:event to another file,
Code:

grep "bf:event\|bf:param" xmlfile.xml > newfile
should do it.

And
Code:

grep -v "bf:event\|bf:param" xmlfile.xml > anotherfile
will give you the rest.

Is this what you were after?

Grafbak 03-17-2005 03:21 PM

That is almost what I'm after. However, would you know a simple command to split the output of that grep command into separate 30 megabyte files? (I'll be thinking along with you.)
I need to put all the files back together again as well.

ahh 03-17-2005 03:48 PM

Split could do that.

Maybe
Code:

grep "bf:event\|bf:param" xmlfile.xml | split -C 30m

Matir 03-17-2005 03:50 PM

The disadvantage of the split is that the individual files would likely be completely unparsable.

ahh 03-17-2005 03:55 PM

Well, with the -C option as opposed to the -b option, at least you will only get complete lines.

It should be possible to add required tags to the top & bottom of these files to enable them to be parsed. And of course, they will still be readable as text.
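
A sketch of how the tags could be added back, under some loud assumptions: the file name log.xml, the chunk_ prefix and the header and footer line counts are all made up for this toy example, and split -l 3 stands in for split -C 30m only to keep the demo small.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up miniature log; real files are assumed to have one tag per line.
cat > log.xml <<'EOF'
<bf:log>
<bf:round>
<bf:event name="a">
<bf:param>1</bf:param>
</bf:event>
<bf:event name="b">
<bf:param>2</bf:param>
</bf:event>
</bf:round>
</bf:log>
EOF

# Save the opening and closing lines so they can be re-added to every chunk.
# The counts (2 and 2) fit this toy file only.
head -n 2 log.xml > header.txt
tail -n 2 log.xml > footer.txt

# Split the event lines into pieces named chunk_aa, chunk_ab, ...
grep 'bf:event\|bf:param' log.xml | split -l 3 - chunk_

# Wrap every chunk in the saved header and footer.
for chunk in chunk_*; do
    cat header.txt "$chunk" footer.txt > "wrapped_$chunk.xml"
done

ls wrapped_*
```

The chunks only contain whole events here because every event in the toy file is exactly three lines long. With -C or -l on a real log, a chunk can still end in the middle of an event, so the cut points would need checking.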

Grafbak 03-17-2005 03:57 PM

I think I can use that command; I'm going to try and play with it. The problem is that I cannot just collect all the <bf:event> blocks and throw them in one file. I need to split them up into separate files of at most 30 MB.
But when I reconstruct each file I need all the other blocks, like <bf:server>, in the correct place again. I have just tried to parse the output of grep "bf:event\|bf:param" > newfile, but then the parser gives an error because the server settings and other tags are missing.
So if I have 700000 lines with bf:event tags, I want to take 550000 lines (roughly 30 MB) of bf:event, and simply copy the bf:server tags and closing tags into the 30 MB file so it will get through the parser. The remaining 150000 lines of bf:event will get the same bf:server tags, and will also be made 'complete' again.
I'm sorry if I wasn't clear on that.
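
One possible way to cut by line count without breaking an event in half is to let awk start a new output file only right after a </bf:event> line. This is a sketch under assumptions: the input is taken to contain event lines only, the names events.xml and part_NNN.xml are invented, and max_lines=4 stands in for the 550000 mentioned above.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up event-only input with one tag per line.
cat > events.xml <<'EOF'
<bf:event name="a">
<bf:param>1</bf:param>
<bf:param>2</bf:param>
</bf:event>
<bf:event name="b">
<bf:param>3</bf:param>
</bf:event>
<bf:event name="c">
</bf:event>
EOF

max_lines=4   # stands in for 550000

# Every line goes to the current part file. A new part is started only when
# the line just written closes an event and the part has reached max lines,
# so a part can run over the limit by at most one event.
awk -v max="$max_lines" '
    BEGIN { part = 1; out = sprintf("part_%03d.xml", part) }
    { print > out; count++ }
    $0 ~ /<\/bf:event>/ && count >= max {
        close(out)
        part++
        count = 0
        out = sprintf("part_%03d.xml", part)
    }
' events.xml

wc -l part_*.xml
```

Each part file ends on a </bf:event> line. The parts would still need the surrounding tags, such as bf:server and the closing tags, copied in before the parser accepts them.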

ahh 03-17-2005 04:11 PM

Sorry if I seem a bit dim here, but let's see if I've got this straight:

You have one large file.

You want to split it into several 30M files with the events, and several 30M files with the rest.

Then you want to be able to put it back together?

If this is correct, do you need to reconstruct it? The original file will not have changed by grepping it.


All times are GMT -5.