LinuxQuestions.org
Old 01-05-2007, 08:29 AM   #1
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Rep: Reputation: 30
parsing data - better way of doing that


Hi All,

Code:
>cat filename
(ab.) (bigone)

I need to extract the data between "(" and ")" and display each element on its own line.

Currently it's done as:

Code:
>sed 's/(/#/g;s/)/#/g' filename | sed 's/#\(.*\)# #\(.*\)#/\1 \2/' | tr ' ' '\n'
Code:
>ab.
bigone
But this is not a decent way of doing it; too many processes are involved.

Any ideas or pointers to make it better?

Thanks
 
Old 01-05-2007, 08:50 AM   #2
taylor_venable
Member
 
Registered: Jun 2005
Location: Indiana, USA
Distribution: OpenBSD, Ubuntu
Posts: 892

Rep: Reputation: 41
Regexes are a pretty standard way of parsing out data like this. An alternative to the scenario you've presented is to use Perl, one of the original intentions of which was to build text processing like sed and awk into a "real" language.

Another possible set of tools would be Lex and maybe Yacc, but I think that's even more complicated than what you've got (Lex uses regexes as well).

If you're asking how you can improve this particular set of commands, you could do this:
Code:
[ taylor @ zeltennia ] : ~ > echo "(ab.) (bigone)" | sed 's/(\(.*\)) (\(.*\))/\1\
> \2/'
ab.
bigone
This puts everything into one sed regex.
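One caveat with the regex above is that `.*` is greedy, so it only works as long as the line holds exactly two groups. A minimal sketch of a slightly stricter variant, assuming the same two-group input: `[^)]*` stops each group at its own closing parenthesis. The function name `extract_two` is made up for illustration and is not part of the original post.

```shell
#!/bin/sh
# Hypothetical helper: split "(x) (y)" into two lines with one sed process.
# [^)]* cannot run past a ")" the way .* can.
extract_two() {
  sed 's/(\([^)]*\)) (\([^)]*\))/\1\
\2/'
}

echo "(ab.) (bigone)" | extract_two
```

The backslash followed by a real newline in the replacement is the same trick used in the command above, so it does not depend on sed understanding "\n".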
 
Old 01-05-2007, 12:13 PM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
If you have Python, there's no need for regular expressions.
Sample input file:
Code:
(ab.) (bigone) sometext1 (1) sometext2 (2) (3)
Code:
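#!/usr/bin/python
result = []      # characters collected from inside parentheses
inside = False   # True while we are between "(" and ")"
with open("filename") as f:
    for line in f:
        for ch in line:
            if ch == "(":
                inside = True
            elif ch == ")":
                inside = False
                result.append("\n")   # end of one element
            elif inside:
                result.append(ch)

print(''.join(result))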
output:
Code:
#/home/test> python test.py
ab.
bigone
1
2
3
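For comparison, here is a rough sketch of the same extraction as a single awk process, which may suit the original "too many processes" complaint. It does use a regex as the field separator, and it assumes the parentheses are balanced and not nested; the function name `extract_awk` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split each line on "(" or ")".
# With balanced, non-nested parentheses the even-numbered fields
# are exactly the texts that were inside the parentheses.
extract_awk() {
  awk -F'[()]' '{ for (i = 2; i <= NF; i += 2) print $i }' "$@"
}

echo "(ab.) (bigone) sometext1 (1) sometext2 (2) (3)" | extract_awk
```

It can also be given file names, for example `extract_awk filename`.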

Last edited by ghostdog74; 01-08-2007 at 01:59 AM.
 
Old 01-05-2007, 05:28 PM   #4
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 719

Rep: Reputation: 72
Hi, kshkid.

You appear to want to optimize this task.

The -1 rule of optimization is that you optimize only where you absolutely need to.

So if you think that your sed solution needs to be improved because it has "too many process involved", then you can probably eliminate perl and python because you're hauling in a lot of code (the interpreters) in at least one additional process.

So that really leaves you with one option that I can think of, and that is to code up a program in a compiled language and keep the binary around -- preferably statically linked to save resources.

That doesn't sound like it would save much in people time, however, which is the expensive resource. Machine resources are cheap.

I like to use the tools that we have. So, here is my solution, which generalizes, so that you can have any number of parenthesized strings on a line ... cheers, makyo
Code:
#!/bin/sh

# @(#) s4       Demonstrate extraction from within parentheses.
# Break into separate lines, front-trim, back-trim, remove empty.

echo "(ab.) (bigone) (1) (2) (3)" |
sed -e "s/)/)\n/g" |
sed \
-e "s/^.*(//" \
-e "s/).*//" \
-e "/^$/d"
Which produces:
Code:
% ./s4
ab.
bigone
1
2
3
 
Old 01-05-2007, 06:49 PM   #5
muha
Member
 
Registered: Nov 2005
Distribution: xubuntu, grml
Posts: 451

Rep: Reputation: 37
I was thinking: use bash expansion.
Code:
test="(ab.) (bigone) (and one)"
$ echo -e "${test//)/\n}"
(ab.
 (bigone
 (and one
Might not work in ksh, though.
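To get from that output to what the OP expects, the leading "(" has to go as well. A minimal sketch that does the whole job with prefix and suffix expansions (`#` and `%%`), which are POSIX and so should also run in ksh. The function name `extract_parens` is made up for this example, and balanced, non-nested parentheses are assumed.

```shell
#!/bin/sh
# Hypothetical helper: print each parenthesized element of $1 on its own line,
# using only shell parameter expansion (no external process).
extract_parens() {
  rest=$1
  while :; do
    case $rest in
      *"("*")"*) ;;          # at least one "(...)" pair is left
      *) break ;;
    esac
    rest=${rest#*"("}        # drop everything up to and including the first "("
    item=${rest%%")"*}       # keep the text before the next ")"
    printf '%s\n' "$item"
    rest=${rest#*")"}        # continue after that ")"
  done
}

extract_parens "(ab.) (bigone) (and one)"
```

This handles one string at a time, so for a file it would need a `while read` loop around it, with the line-IO cost that implies.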
 
Old 01-05-2007, 08:30 PM   #6
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 719

Rep: Reputation: 72
Hi, muha.

You raise a good point. The OP did not say how long the files were. So using bash expansion methods might be very useful, assuming that kshkid is satisfied with the relatively slow speed of line IO into and out of the shell.

Did you intend to add more processing so that your output matched what he was expecting? ... cheers, makyo
 
Old 01-08-2007, 12:14 AM   #7
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by taylor_venable
This puts everything into one sed regex.
Thanks a lot, this is what I wanted and you gave me that exactly!

Cheers
 
Old 01-08-2007, 12:22 AM   #8
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by makyo

Hi, thank you very much for the reply.

If I am not wrong, this is the single-command equivalent of your two sed commands:

Code:
echo "(ab.) (bigone) (1) (2) (3)" | sed -e "s/)/)\n/g;s/^.*(//;s/).*//;/^$/d"
and since you used
"s/^.*(//"

everything up to the 3 would be stripped off.

Could you please explain the output?

I don't think we would get all the characters within '(' and ')'.
 
Old 01-08-2007, 01:14 AM   #9
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 719

Rep: Reputation: 72
Hi, kshkid.

The reason I used 2 sed commands is that the first one is needed to break the strings onto separate lines, which then feed the second sed. If the two sed commands are combined into one, you will have inserted an embedded newline after each ")", but the embedded newline is just another character to the matching engine, so the greedy match will span the longest string of characters, which is up to the "3".
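The difference can be seen side by side with a small sketch. To stay independent of whether sed understands "\n" in the replacement, it inserts the newline as a backslash followed by a real line break; the function names `one_sed` and `two_seds` are made up for the comparison.

```shell
#!/bin/sh
# One sed: the inserted newlines stay inside a single pattern space,
# so the greedy "^.*(" runs through them to the last "(".
one_sed() {
  sed -e 's/)/)\
/g' -e 's/^.*(//' -e 's/).*//' -e '/^$/d'
}

# Two seds: the second sed reads the first one's output line by line.
two_seds() {
  sed -e 's/)/)\
/g' | sed -e 's/^.*(//' -e 's/).*//' -e '/^$/d'
}

echo "(ab.) (bigone) (1) (2) (3)" | one_sed
echo "(ab.) (bigone) (1) (2) (3)" | two_seds
```

The first call prints only 3; the second prints all five elements on separate lines.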

It's possible that you could do the entire operation with one sed command by making use of the hold space, but that would add complexity. I did not investigate that possibility.
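For what it's worth, here is one possible single-sed sketch. It does not use the hold space; instead it swaps in bracket expressions that cannot cross a parenthesis, so greediness is no longer a problem. It assumes balanced, non-nested parentheses, and the function name `extract_one_sed` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: one sed process, any number of groups per line.
# First s: replace "junk(text)junk" with "text" plus a newline, repeatedly.
# Second s: drop the final extra newline and print; lines that had no
# parenthesized group are not printed at all (because of -n).
extract_one_sed() {
  sed -n -e 's/[^()]*(\([^()]*\))[^()]*/\1\
/g' -e 's/\n$//p'
}

echo "(ab.) (bigone) (1) (2) (3)" | extract_one_sed
```

Whether this is simpler than two short sed commands in a pipeline is a matter of taste.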
Quote:
Everything should be made as simple as possible, but not simpler. -- Albert Einstein
Best wishes ... cheers, makyo
 
Old 01-08-2007, 01:41 AM   #10
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Original Poster
Rep: Reputation: 30
Hi makyo,

I tried the script that you gave,
and it promptly returns only '3' with the same input you had specified.

Cheers
 
Old 01-08-2007, 07:05 AM   #11
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 719

Rep: Reputation: 72
Hi, kshkid.

The only way I could make this fail was to run the script with an old version of sed on a Solaris box. The older versions do not recognize "\n" as a symbol for NEWLINE.
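A quick way to find out which kind of sed is installed is to try the substitution on a tiny input and count the output lines. This is only a sketch; `sed_newline_ok` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical probe: does this sed turn "\n" in the replacement into a NEWLINE?
# If it does, the one input line becomes two output lines.
sed_newline_ok() {
  n=$(printf 'a)b\n' | sed 's/)/)\n/' | wc -l | tr -d ' ')
  [ "$n" -eq 2 ]
}

if sed_newline_ok; then
  echo "this sed inserts a NEWLINE for \\n in the replacement"
else
  echo "this sed does not; use a workaround"
fi
```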

The script was successful on the versions of sed on Debian sarge and SuSE 9.

I was able to work around that old sed limitation by using a BELL (control-G) character instead of the "\n" in the first sed and then adding a tr command in the pipeline between the two seds to substitute a NEWLINE (control-J) for the BELL character. Those control characters will not copy-paste into the forum, so if you are interested in that solution, you'll need to make the changes on your own.

A method to check that intermediate step is to add a tee in the pipeline to see what is being passed:
Code:
echo stuff | first sed changes ")" to ")BELL" |
  tee t1 |
    tr change BELL to NEWLINE |
      tee t2 |
        second sed
Then cat t1, cat t2 to see contents of intermediate steps, and when you get it working, remove the tee commands. You could also check to see if there is a newer version of sed available to you or to install a more recent version ... cheers, makyo
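The outline above can be sketched as a runnable script by generating the BELL character with printf instead of typing it, which also gets around the copy-paste problem. This is one way to fill in the outline, not necessarily the exact script used; the function name `extract_with_bell` is made up, and t1 and t2 are written to the current directory.

```shell
#!/bin/sh
# BELL (octal 007) stands in for NEWLINE until tr swaps it.
bell=$(printf '\007')

# Hypothetical helper: first sed marks each ")" with BELL, tr turns BELL
# into NEWLINE, second sed trims each line. tee keeps copies in t1 and t2.
extract_with_bell() {
  sed -e "s/)/)$bell/g" |
    tee t1 |
      tr '\007' '\n' |
        tee t2 |
          sed -e 's/^.*(//' -e 's/).*//' -e '/^$/d'
}

echo "(ab.) (bigone) (1) (2) (3)" | extract_with_bell
```

Once it works, the two tee lines can simply be deleted from the pipeline.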

PS the version of sed that works for me is:
Code:
% sed --version
GNU sed version 4.1.2
( edit 1: add version note )

Last edited by makyo; 01-08-2007 at 09:12 AM.
 
  

