text searching
Hey guys, I am a bit new to text searching and regular expressions. Basically, I have a text file filled with garbage; the valuable data is whatever sits between ( and ).
What sort of expression would pull out all the text between these parens? Thanks in advance.
Hi,
This may get you started where the info is on one line. If the ( and ) are on separate lines, you may need to temporarily replace the \n with something else, like a tilde (~).
Code:
cat file.txt | grep -o '([^)]*)'
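For anyone who wants to try the one-line case without a real file, here is a minimal sketch. The sample text and the variable names (sample, matches, inner) are made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep do; the option is not in POSIX). The sed step at the end is an optional extra, not part of the command above, that strips the surrounding parentheses.

```shell
# Hypothetical sample line standing in for the "garbage" file.
sample='junk (keep this) more junk (and this) tail'

# In grep's default (basic) syntax a bare ( or ) is a literal character,
# so this matches "(", then any run of characters other than ")", then ")".
# -o prints each match on its own line.
matches=$(printf '%s\n' "$sample" | grep -o '([^)]*)')
printf '%s\n' "$matches"

# Optional: drop the leading "(" and trailing ")" from each match.
inner=$(printf '%s\n' "$matches" | sed 's/^(//; s/)$//')
printf '%s\n' "$inner"
```

Feeding the text through printf instead of cat keeps the example self-contained; with a real file, the grep part stays the same.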
I had luck with the following command. It can handle files where the matching ( and ) are on different lines.
Code:
cat test.txt | tr -d '\012' | egrep -o '\([^(]*\)'
Cheers, Zachary Palmer
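To see what the tr step does to text that spans lines, here is a small sketch. The input string and variable names are hypothetical, and grep -E is used as the equivalent of egrep. It also shows one side effect worth knowing: deleting the newline glues the words on either side together, so a variant that turns each newline into a space is included for comparison.

```shell
# Hypothetical input where one parenthesized group spans two lines.
input='noise (first
part) noise (second) noise'

# tr -d '\012' deletes every newline (octal 012), so grep sees "(" and
# its ")" on the same line. Note that "first" and "part" get joined.
joined=$(printf '%s\n' "$input" | tr -d '\012' | grep -Eo '\([^)]*\)')
printf '%s\n' "$joined"

# Variant: replace each newline with a space instead of deleting it.
spaced=$(printf '%s\n' "$input" | tr '\012' ' ' | grep -Eo '\([^)]*\)')
printf '%s\n' "$spaced"
```

Which variant fits depends on whether the line breaks in the file separate words or merely wrap them.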
Looks interesting. Can you break it down for me?
I'd be glad to. :) If I cover something you already know, bear with me; I have no idea how much experience you have. :)
The command is three programs joined by pipes.
Code:
cat test.txt
This prints the contents of test.txt so the next program can read them.
Code:
tr -d '\012'
This deletes every newline character (octal 012), which joins the whole file into a single line.
Code:
egrep -o '\([^(]*\)'
This searches for the pattern; the -o option prints only the matching text rather than the whole line.
The pattern contains backslash characters because regular expressions use parentheses as a grouping symbol. For example, the regular expression "ab*" means 'a' followed by zero or more 'b's (e.g., "a", "ab", "abb", but not "aba" or "abab"). The regular expression "(ab)*", on the other hand, means zero or more 'ab's (e.g., "ab", "abab", and even "", but not "abb" or "aab"). Since we want to look for literal parentheses, we use the backslash character to tell grep that the character immediately following it isn't regular expression syntax but an actual character in the pattern. Inside [] a parenthesis is already literal, so no backslash is needed there.
Of the characters in the pattern, only [^ ] and * are regular expression syntax; the three parentheses are literal characters. That is, the expression looks for an open parenthesis, then the pattern [^(] zero or more times, then a closed parenthesis.
The pattern [^(] is fairly simple. The [] characters tell the regular expression that the contents are to be interpreted as a character set; any of the characters or character representations contained within match it. For example, [abd] matches 'a', 'b', or 'd', [a-ce] matches 'a', 'b', 'c', or 'e', and [a-zA-Z0-9] matches any alphanumeric character. The ^ symbol at the start of a set indicates "not." That is, [^a] represents any character which isn't 'a'. Therefore, the expression [^(] matches any character which isn't '('.
So, the expression matches an open parenthesis, any number of characters which aren't an open parenthesis, and then a closed parenthesis.
I should note that I made an assumption about your intentions. Take, for example, the file
Code:
0123(abcd)efgh)4567(ijk)
The first match the command above reports is
Code:
(abcd)efgh)
because a match only has to stop before the next open parenthesis. If you would rather get
Code:
(abcd)
then exclude the closed parenthesis from the set instead:
Code:
cat test.txt | tr -d '\012' | egrep -o '\([^)]*\)'
Any questions? :)
Cheers, Zachary Palmer
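The pieces of the explanation above can be checked one at a time with a short sketch. The matches helper and the variable names are hypothetical, introduced only for this example, and grep -E is used as the equivalent of egrep. The first half tests the building blocks (repetition, grouping, negated sets) against whole strings; the second half runs both versions of the pattern on the sample line from the post.

```shell
# matches PATTERN STRING: hypothetical helper. Succeeds when the whole STRING
# matches the extended regular expression PATTERN (-x requires a full-line match).
matches() {
    printf '%s\n' "$2" | grep -Eqx -e "$1"
}

matches 'ab*'   'abb'  && echo 'ab* matches abb'
matches 'ab*'   'abab' || echo 'ab* does not match abab'
matches '(ab)*' 'abab' && echo '(ab)* matches abab'
matches '(ab)*' 'abb'  || echo '(ab)* does not match abb'
matches '[^(]'  'x'    && echo '[^(] matches x'
matches '[^(]'  '('    || echo '[^(] does not match ('

# The sample line from the post.
line='0123(abcd)efgh)4567(ijk)'

# Set excludes "(": a match may run past the first ")" it meets.
up_to_open=$(printf '%s\n' "$line" | grep -Eo '\([^(]*\)')

# Set excludes ")": each match ends at the first ")" after its "(".
up_to_close=$(printf '%s\n' "$line" | grep -Eo '\([^)]*\)')

printf 'excluding "(":\n%s\n' "$up_to_open"
printf 'excluding ")":\n%s\n' "$up_to_close"
```

Both versions find (ijk); they differ only on the stray ")" after efgh, which is the case the assumption in the post is about.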