LinuxQuestions.org


sopiaz57 06-01-2005 04:01 PM

text searching
 
Hey guys, I am a bit new to text searching and expressions. Basically I have a text file filled with garbage; what is valuable is the data in between ( )

What sort of string would process all the text in between these parens?

Thanks in advance.

homey 06-01-2005 05:51 PM

Hi,
This may get you started where the info is on one line. If the ( and ) are on separate lines, you may need to temporarily replace the \n with something else, like a tilde ~

Code:

cat file.txt | grep -o '([^)]*)'
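
For the multi-line case mentioned above, here is a minimal sketch of the tilde trick. It assumes GNU grep (for -o) and that ~ never occurs in the real data; the sample text is made up for illustration.

```shell
# Hypothetical sample: the parenthesised text is split across two lines.
# 1. tr turns every newline into '~', so the whole input becomes one line.
# 2. grep -o prints only the "( ... )" parts.
# 3. The second tr puts the original line breaks back.
restored=$(printf 'foo (ab\ncd) bar\n' \
    | tr '\n' '~' \
    | grep -o '([^)]*)' \
    | tr '~' '\n')

printf '%s\n' "$restored"
```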

tvynr 06-02-2005 02:44 AM

I had luck with the following command. It is capable of handling files with corresponding ( and ) on multiple lines.

Code:

cat test.txt | tr -d '\012' | egrep -o '\([^\(]*\)'
I escaped the parentheses shown in the previous post to make it work on my machine. Using this command without the escape characters produced some very odd results which I intend to investigate at some point.

Cheers,

Zachary Palmer

sopiaz57 06-02-2005 09:17 AM

Looks interesting, can you break it down for me?

tvynr 06-02-2005 01:00 PM

I'd be glad to. :) If I cover something you already know, bear with me; I have no idea how much experience you have. :)

Code:

cat test.txt
The output of this part is the contents of the file. cat reads the file's contents and writes them to its standard output. The pipe symbol ('|') makes the standard output for the left-hand command readable as the standard input to the right-hand command, sort of plugging them together.

Code:

tr -d '\012'
tr stands for "translate"; this program is designed to translate characters on the standard input to another set of characters, written to the standard output. In this case, rather than translating the characters, the -d flag causes tr to delete the characters instead. There is only one character specified in the set (between the ' marks): \012. tr assumes that patterns of the form \nnn indicate the character with the octal value nnn. In this case, the octal value 12 (decimal value 10) represents the Linux newline character.
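
As a quick illustration of the tr step on its own, here is a small sketch. The sample text and variable names are made up for the example.

```shell
# Hypothetical two-line sample with a parenthesised group split over a newline.
sample=$(printf 'foo (ab\ncd) bar')

# tr -d '\012' deletes every newline, so the two lines are glued together.
joined=$(printf '%s\n' "$sample" | tr -d '\012')

printf '%s\n' "$joined"
```

Note that the newline is removed outright rather than replaced by a space, so "ab" and "cd" end up adjacent.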

Code:

egrep -o '\([^\(]*\)'
This snippet is fairly complicated. To my understanding, egrep does the same thing as grep -E: it interprets the pattern as an extended regular expression rather than as the basic regular expression syntax that plain grep uses. If you're not familiar with regular expressions, check out the egrep manpage or have a Google. I'll explain the expression itself below. The -o flag causes egrep to only display the part of the line which matched the expression; normally grep displays the whole line on which the pattern occurred.
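
The difference between grep and egrep (grep -E) probably also accounts for the odd results from the unescaped pattern mentioned earlier. A minimal sketch, assuming GNU grep and a made-up input line:

```shell
# Hypothetical one-line input used only for illustration.
line='x(ab)y'

# Basic regular expressions (plain grep): unescaped ( and ) are literal.
bre=$(printf '%s\n' "$line" | grep -o '([^)]*)')

# Extended regular expressions (grep -E): unescaped ( and ) only group,
# so the pattern reduces to [^)]* and no longer requires any parentheses.
ere_unescaped=$(printf '%s\n' "$line" | grep -E -o '([^)]*)')

# Extended regular expressions with escaped, literal parentheses.
ere_escaped=$(printf '%s\n' "$line" | grep -E -o '\([^)]*\)')

printf 'grep:            %s\n' "$bre"
printf 'grep -E, no \\:   %s\n' "$ere_unescaped"
printf 'grep -E, with \\: %s\n' "$ere_escaped"
```

The first and third commands should both print (ab); the second matches runs of characters that are not ')' and so prints something else.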

The pattern contains a number of backslash characters because extended regular expressions use parentheses as a grouping symbol. For example, the regular expression "ab*" means 'a' followed by zero or more 'b's (e.g., it matches "a", "ab", and "abb" in full, but not "aba" or "abab"). The regular expression "(ab)*", on the other hand, means zero or more 'ab's (e.g., "ab", "abab", and even "", but not "abb" or "aab"). Since we want to look for literal parentheses, we use the backslash character to tell egrep that the character immediately following it isn't regular expression syntax but an actual character in the pattern.
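
The grouping examples can be checked with a small sketch like the one below. It uses grep -E with -x, which accepts a line only when the whole line matches; the test words are the ones from the paragraph above.

```shell
# "ab*": an 'a' followed by zero or more 'b's.
star_b=$(printf '%s\n' a ab abb aba abab | grep -E -x 'ab*')

# "(ab)*": zero or more repetitions of the group "ab".
star_group=$(printf '%s\n' ab abab abb aab | grep -E -x '(ab)*')

# Escaped parentheses are literal characters, not a group.
literal=$(printf '%s\n' '(ab)' ab | grep -E -x '\(ab\)')

printf '%s\n' "$star_b" "---" "$star_group" "---" "$literal"
```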

If you rewrite the pattern without the backslashes, treating the parentheses as literal characters and only the [^ ] and * as regular expression syntax, it looks like this:

([^(]*)

That is, the expression looks for an open parenthesis, then the pattern [^(] zero or more times, then a closed parenthesis.

The pattern [^(] is fairly simple. The [] characters tell the regular expression that the contents are to be interpreted as a character set; any of the characters or character representations contained within match it. For example, [abd] matches 'a', 'b', or 'd', [a-ce] matches 'a', 'b', 'c', or 'e', and [a-zA-Z0-9] matches any alphanumeric character.

The ^ symbol in sets indicates "not." That is, [^a] represents any character which isn't 'a'. Therefore, the expression [^(] matches any character which isn't '('.
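
Character sets are easy to try out with grep -o, which prints each match on its own line. A minimal sketch, assuming GNU grep; the input strings are made up, and LC_ALL=C is set only so that the a-c range behaves predictably.

```shell
# [abd] matches exactly one character: 'a', 'b', or 'd'.
listed=$(printf 'abcde\n' | LC_ALL=C grep -o '[abd]')

# [a-ce] matches 'a', 'b', 'c', or 'e'.
ranged=$(printf 'abcde\n' | LC_ALL=C grep -o '[a-ce]')

# [^(] matches any one character except '('; with + it matches whole runs.
negated=$(printf 'ab(cd\n' | LC_ALL=C grep -E -o '[^(]+')

printf '%s\n' "$listed" "---" "$ranged" "---" "$negated"
```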

So, the expression matches an open parenthesis, any number of characters which aren't an open parenthesis, and then a closed parenthesis.

***I should note that I made an assumption about your intentions. Take, for example, the file

Code:

0123(abcd)efgh)4567(ijk)
If you want this expression to provide:

Code:

(abcd)efgh)
(ijk)

you use the expression I gave you. If, however, you want

Code:

(abcd)
(ijk)

(stopping at the first available closed parenthesis rather than the last one), you want to use

Code:

cat test.txt | tr -d '\012' | egrep -o '\([^\)]*\)'
(noting that the escaped parenthesis inside of the set by the ^ is a closed parenthesis and not an open one).
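
To compare the two variants side by side on the sample line, a sketch along these lines should do, assuming GNU grep. The backslash inside the brackets is left out here because a backslash is an ordinary character inside a bracket expression, so the parenthesis needs no escaping there.

```shell
# The sample line from the example above.
line='0123(abcd)efgh)4567(ijk)'

# Exclude '(' inside the group: a match runs to the last ')' before the next '('.
to_last=$(printf '%s\n' "$line" | grep -E -o '\([^(]*\)')

# Exclude ')' inside the group: a match stops at the first ')'.
to_first=$(printf '%s\n' "$line" | grep -E -o '\([^)]*\)')

printf '%s\n' "$to_last" "---" "$to_first"
```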

Any questions? :)

Cheers,

Zachary Palmer


All times are GMT -5.