[SOLVED] data processing

mailbox-1691 · 04-21-2011, 05:02 AM

Dear users,
I have data like this:
"x\x\xxxxxxxxxx\x\xxyyyyyyyyyxxxxxxxx\xxxyyyxxxxxxx\xx
xxx\A,(minus or plus)floating point,(minus or plus)floating point,(minus or plus)floatingpoint\B,... , .... , ..\C,.. , ...., ..\....\\@"

I need to extract it like this, e.g.:

A -2.300 0. 5.03
B -2.300 2.34 0.00
...........
..........

In short, the final results are xyz coordinates. Can anyone tell me how to do this in perl, sed, awk, or any other program?

Thanks a lot

David the H. · 04-21-2011, 08:56 AM

A picture, or in this case a sample, is worth a thousand words.

IOW, I do not fully understand the line format or your requirements as you've explained them. Could you please give us an actual representative sample of the text you're using, highlighting exactly what you want to extract from it, and how you want the output to appear?

Are we talking multiple lines, or just one? How high do you expect the ABC lettering to go? Are there any variations in the text that might cause problems?

Give us some background so we understand your requirements.

Finally, please enclose everything in [code][/code] tags, to preserve formatting and to improve readability.

mailbox-1691 · 04-22-2011, 01:21 AM

Okay, thanks for the reply. I understand; I have attached a copy of the text.

It seems cryptic, but I did find a pattern and tried to match it with the following regular expression in vim, so that I could then use it in sed or perl:

Code:

/\W\w,\d.,\A\d\+.\d\+,\A\d\+.\d\+

but it matches very few entries, and I was not able to generalize it to all of them so that the result looks like this:

Quote:

C 0. 6.1325527512 0.6911442287
C 0. 4.9424684093 1.4312934211
..........
..........
H 0. 7.0464046059 -5.5938128729

I have many files like this, where I need to match the three floating-point numbers for x, y, z along with the corresponding label. Any help is appreciated. Thanks.

David the H. · 04-22-2011, 06:58 AM

THIS is a perfect example of why providing actual sample text is important.

Is the file really formatted like that, with newlines scattered at random and a space at the beginning of each line? Because that really causes headaches in processing. Having to work across lines and remove spaces means several times the work. Ugh!

Anyway, I think I have something for you. Instead of trying to directly match the desired strings with a regex, I went a slightly different route.

Code:

tr -d '[:space:]' <sing.txt | awk 'BEGIN{ RS="\\" } /^[[:upper:]],/ { gsub(","," "); print }'

The tr command at the beginning is there simply to clean up the file format. It removes all spaces and newlines, so that the whole file is turned into one single unbroken line.

This is piped into awk, which breaks it back up into one record per "\"-delimited field. Then if a record starts with an upper-case letter followed by a comma, it replaces the commas with spaces and prints it.

(Note that gsub is missing from some very old awk implementations; gawk, nawk, and any POSIX-compliant awk support it.)

The initial cleanup could certainly also be done by awk, but it's simpler this way, IMO.
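For readers who would rather follow the same steps outside the shell, here is a minimal Python sketch of the pipeline described above: delete the whitespace, split on backslashes, keep records that start with an upper-case letter and a comma, then swap commas for spaces. The function name `extract_coordinates` and the `sample` string are made up for illustration; the sample only imitates the wrapped, backslash-delimited layout discussed in this thread and is not the poster's actual file.

```python
def extract_coordinates(text):
    """Mimic the tr | awk pipeline: strip whitespace, split on backslashes, filter."""
    # Step 1 (tr): delete every whitespace character, which joins the wrapped lines.
    flat = "".join(text.split())
    rows = []
    # Step 2 (awk RS): treat each backslash-delimited chunk as one record.
    for record in flat.split("\\"):
        # Step 3: keep records that begin with one upper-case letter and a comma.
        if len(record) >= 2 and "A" <= record[0] <= "Z" and record[1] == ",":
            # Step 4 (gsub): replace the commas with spaces.
            rows.append(record.replace(",", " "))
    return rows


# Hypothetical sample: leading spaces and line breaks fall in the middle of numbers.
sample = (
    " 1\\1\\title line\\\\0,1\\C,0.,6.1325527512,0.69114\n"
    " 42287\\C,0.,4.9424684093,1.4312934211\\H,0.,7.0464046059,-5.59381\n"
    " 28729\\\\Version=x\\\\@"
)

for row in extract_coordinates(sample):
    print(row)
```

Because the filter only looks at the first two characters of each record, header fields such as `0,1` or `Version=x` in the sample are skipped without needing a pattern for the numbers themselves.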

kurumi · 04-22-2011, 09:33 AM

here's a Ruby command you can try

Code:

$ ruby -0777 -ne '$_.scan(/,(-?[0-9.]+),([0-9.]+)\\([A-Z])/).each{|x| print "#{x[-1]},#{x[1]},#{x[0]}\n" }' file
C,0.6911442287,6.1325527512
C,0.7165833967,3.7095474353
C,1.426622055,2.4652112046
C,0.71899088,1.2389133759
C,0.71899088,-1.2389133759
C,1.426622055,-2.4652112046
C,0.7165833967,-3.7095474353
C,0.6911442287,-6.1325527512
C,5.6966058882,2.4498989676
C,3.5617348632,1.2346328478
C,2.86197007,2.463592843
C,5.6891543825,0.
C,2.8528109717,0.
C,5.0091360672,-1.2279289318
C,5.0224928115,-3.6797065617
C,2.86197007,-2.463592843
C,5.048424537,-6.1065799124
C,3.6464412319,-6.1266798118
C,2.8941073542,-4.9440117491
C,5.048424537,6.1065799124
C,2.8941073542,4.9440117491
H,3.1510513472,-7.0902553374
H,6.8182586514,-4.887465462
H,6.7840514357,-2.443376658
H,6.7765480345,0.
H,6.7840514357,2.443376658
H,5.5938128729,7.0464046059
H,1.1981389566,7.0902192127
Q,0.,0.

mailbox-1691 · 04-22-2011, 10:43 AM

David,
Thanks. The newlines look random, but they always fall at the 71st character of each line, excluding the white space at the beginning. I checked other similar files and it works. At least so far, each record after the delimiter (\) reliably starts with an upper-case letter. Can you please explain what the options in awk would be if I also have two-letter labels (an upper-case letter followed by a lower-case one)?
e.g.

Quote:

C 3.3508756125 1.4163640333 0.
Fe 2.30000 3.2496341 0.

Kurumi,
It misses one of the floating-point values, but I suppose that would not be difficult to fix with minor modifications. Thanks.
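As a rough illustration of the "minor modifications" idea, the sketch below matches the label and all three numbers directly with one regular expression. It is written in Python rather than Ruby, and it is not kurumi's command. It assumes each number is an optional sign followed by digits with an optional decimal part (so `0.` and `-2.300` match, but `.5` or exponent notation would not). The names `NUMBER`, `ATOM`, `find_atoms` and the `sample` string are hypothetical.

```python
import re

# Optional sign, then digits with an optional decimal part such as "0." or "2.34".
NUMBER = r"[-+]?\d+\.?\d*"
# Backslash, a one- or two-letter label, three comma-separated numbers,
# and a lookahead for the backslash that closes the record.
ATOM = re.compile(rf"\\([A-Z][a-z]?),({NUMBER}),({NUMBER}),({NUMBER})(?=\\)")


def find_atoms(text):
    """Return (label, x, y, z) tuples found in wrapped, backslash-delimited text."""
    # Remove the wrapping whitespace first so that numbers split across lines rejoin.
    flat = "".join(text.split())
    return [
        (m.group(1), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in ATOM.finditer(flat)
    ]


# Hypothetical sample in the layout discussed in this thread.
sample = (
    " \\0,1\\C,0.,6.1325527512,0.6911442287\\Fe,2.30000,3.2496341,0.\\H,0.,7.0\n"
    " 464046059,-5.5938128729\\\\@"
)

for atom in find_atoms(sample):
    print(*atom)
```

The lookahead leaves the closing backslash unconsumed, so it can also serve as the opening delimiter of the next record.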

David the H. · 04-22-2011, 11:45 AM

I've updated it so that the entire thing is done in awk. It was so stupidly simple I should've seen it before. I'd failed to realize earlier that with RS set to backslash only, newlines can be treated just like any other character. So all we need to do is add a second gsub command.

A ? in regex means "zero or one" of the previous character (or expression), so to match an optional lowercase letter simply expand it to this:

Code:

awk 'BEGIN{ RS="\\" } {gsub("[[:space:]]","") } /^[[:upper:]][[:lower:]]?,/ { gsub(","," "); print }' sing.txt

The nice thing with this is that the main regex only has to match a partial string for it to print. You simply need to be able to differentiate the wanted from the unwanted fields in the file.

And what I meant by "random" was that the file wraps in such a way that newlines or spaces can appear pretty much anywhere inside the actual data. That's a hard thing to deal with when you're trying to extract regular patterns.
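The awk-only version can also be sketched in Python, for anyone who wants to step through it: split on backslashes first, strip whitespace inside each record, then test only the start of the record against a label pattern with an optional lower-case letter. The names `LABEL` and `extract` and the `sample` string are assumptions made for this illustration; the sample reuses the two rows quoted earlier in the thread, wrapped by hand.

```python
import re

# One upper-case letter, an optional lower-case letter, then a comma.
LABEL = re.compile(r"[A-Z][a-z]?,")


def extract(text):
    """Split on backslashes, clean each record, keep those that start with a label."""
    rows = []
    for record in text.split("\\"):         # split first, like RS set to a backslash
        record = re.sub(r"\s", "", record)  # then drop whitespace inside the record
        if LABEL.match(record):             # match() only tests the start, like ^ in awk
            rows.append(record.replace(",", " "))
    return rows


# Hypothetical sample: a line break and a leading space fall inside the first record.
sample = " 0,1\\C,3.3508756125,1.41636\n 40333,0.\\Fe,2.30000,3.2496341,0.\\\\@"

for row in extract(sample):
    print(row)
```

Since only the start of each record is tested, the pattern never has to describe the numbers, which is the same partial-match idea the awk one-liner relies on.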