Dear users,
I have data like this:
"x\x\xxxxxxxxxx\x\xxyyyyyyyyyxxxxxxxx\xxxyyyxxxxxxx\xx
xxx\A,(minus or plus)floating point,(minus or plus)floating point,(minus or plus)floatingpoint\B,... , .... , ..\C,.. , ...., ..\....\\@"
I need to extract it like this, e.g.:
A -2.300 0. 5.03
B -2.300 2.34 0.00
...........
..........
The final results are simply xyz coordinates. Can anyone tell me how to do this in perl, sed, awk, or any other program?
A picture, or in this case a sample, is worth a thousand words.
IOW, I do not fully understand the line format or your requirements as you've explained them. Could you please give us an actual representative sample of the text you're using, highlighting exactly what you want to extract from it, and how you want the output to appear?
Are we talking multiple lines, or just one? How high do you expect the ABC lettering to go? Are there any variations in the text that might cause problems?
Give us some background so we understand your requirements.
Finally, please enclose everything in [code][/code] tags, to preserve formatting and to improve readability.
Last edited by David the H.; 04-21-2011 at 08:57 AM.
Reason: minor wording change
THIS is a perfect example of why providing actual sample text is important.
Is the file really formatted like that, with newlines scattered at random and a space at the beginning of each line? Because that really causes headaches in processing. Having to work across lines and remove spaces means several times the work. Ugh!
Anyway, I think I have something for you. Instead of trying to directly match the desired strings with a regex, I went a slightly different route.
The tr command at the beginning is there simply to clean up the file format. It removes all spaces and newlines, so that the whole file is turned into one single unbroken line.
This is piped into awk, which breaks it back up into one record per "\"-delimited field. Then if a record starts with an upper-case letter followed by a comma, it replaces the commas with spaces and prints it.
(Note that gsub is not available in the original, pre-POSIX awk; it needs gawk, nawk, or another POSIX-compliant awk such as mawk.)
The initial cleanup could certainly also be done by awk, but it's simpler this way, IMO.
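The commands themselves are not reproduced in this copy of the thread, so here is a minimal sketch of the pipeline as described, not the poster's exact code. The names `sample_data` and `extract_coords` and the sample text are made up for illustration; the sample imitates the wrapped layout (leading space, line break in the middle of a number).

```shell
# Hypothetical sample: leading space on each line, a newline splitting "-2.300".
sample_data() {
    printf '%s\n' ' x\x\xxxx\A,-2.300,0.,5.03\B,-2.3' ' 00,2.34,0.00\\@'
}

# tr deletes every space and newline, leaving one unbroken line.
# awk then reads one record per backslash (RS is a single backslash) and,
# for records starting with an upper-case letter plus a comma,
# turns the commas into spaces and prints the record.
extract_coords() {
    tr -d ' \n' | awk -v RS='\\' '/^[A-Z],/ { gsub(/,/, " "); print }'
}

sample_data | extract_coords
```

On this sample it prints `A -2.300 0. 5.03` and `B -2.300 2.34 0.00`, one per line. With real data you would run something like `extract_coords < yourfile` instead.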
David,
Thanks. The newlines look random, but they always fall at the 71st character of each line, excluding the whitespace at the beginning. I checked with other similar files and it works. At least so far, the text after each delimiter (\) always starts with an uppercase letter. Can you please explain what the awk pattern would be if I also have two-letter labels (an uppercase letter followed by a lowercase one)?
For example:
C 3.3508756125 1.4163640333 0.
Fe 2.30000 3.2496341 0.
Kurumi,
It misses one of the floating-point values, but I suppose that would not be difficult to fix with minor modifications. Thanks.
I've updated it so that the entire thing is done in awk. It was so stupidly simple I should've seen it before. I'd failed to realize earlier that with RS set to backslash only, newlines can be treated just like any other character. So all we need to do is add a second gsub command.
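A rough sketch of what such an awk-only version could look like, reconstructed from the description rather than copied from the post. The function names and the sample text are hypothetical.

```shell
# Hypothetical sample: leading space on each line, a newline splitting "-2.300".
sample_data() {
    printf '%s\n' ' x\x\xxxx\A,-2.300,0.,5.03\B,-2.3' ' 00,2.34,0.00\\@'
}

# With RS set to a single backslash, a newline is just another character
# inside the record, so one gsub can strip it along with the spaces.
extract_coords_awk() {
    awk -v RS='\\' '
        { gsub(/[ \n]/, "") }                 # remove wrapping spaces and newlines
        /^[A-Z],/ { gsub(/,/, " "); print }   # atom records only: commas become spaces
    '
}

sample_data | extract_coords_awk
```

The first rule runs on every record before the pattern in the second rule is tested, so a label that ends up right after a line break is still matched.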
A ? in regex means "zero or one" of the previous character (or expression), so to match an optional lowercase letter simply expand it to this:
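The snippet that followed is not preserved in this copy. As a sketch of the idea only, assuming the awk-only approach described above: the pattern `/^[A-Z],/` becomes `/^[A-Z][a-z]?,/`. The function names and the input are hypothetical; the input is built to match the C and Fe lines from the earlier post.

```shell
# Hypothetical input with one-letter (C) and two-letter (Fe) labels,
# wrapped in the middle of a number as in the real files.
sample_data() {
    printf '%s\n' ' 1\1\title\C,3.3508756125,1.41636' ' 40333,0.\Fe,2.30000,3.2496341,0.\\@'
}

# [a-z]? allows zero or one lower-case letter after the upper-case one.
extract_elements() {
    awk -v RS='\\' '
        { gsub(/[ \n]/, "") }                       # remove wrapping spaces and newlines
        /^[A-Z][a-z]?,/ { gsub(/,/, " "); print }   # C or Fe style labels
    '
}

sample_data | extract_elements
```

For this input the output is `C 3.3508756125 1.4163640333 0.` followed by `Fe 2.30000 3.2496341 0.`.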
The nice thing with this is that the main regex only has to match a partial string for it to print. You simply need to be able to differentiate the wanted from the unwanted fields in the file.
And what I meant by "random" was that the file wraps in such a way that newlines or spaces can appear pretty much anywhere inside the actual data. That's a hard thing to deal with when you're trying to extract regular patterns.