Can you parse text with regex?
There are a few tools that can parse text in a limited fashion. grep can select lines of text containing a phrase or a particular pattern match. awk can go a little further and select certain 'fields' of information in the selected lines. cut can select a range of characters in a line of text, but it is limited to a contiguous range.

As an example, one might be able to winnow a line down to an IP address by assuming the maximum width is 15 characters (xxx.xxx.xxx.xxx). However, any IP address that did not use all of that space, such as 71.25.125.14, would be padded with blanks, and there is no way to consistently select only the numbers in question. Further, what if you wanted to parse the address and use only the first 3 sets of digits? You could use cut again, but you would have to examine each address before knowing how to cut.

Is it possible to somehow use regexes to parse data such as this simply? For instance, selecting an IP address pattern is somewhat simple:

[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*

but how can you use this to cut/parse this information? I know Perl could do it, but it takes a rather lengthy script. There must be a simpler way! |
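On the point about cut needing a fixed width: cut can also split on a delimiter, so the width of each octet does not matter. The following is a minimal sketch, assuming a grep that supports the -o option (GNU and BSD grep do, but it is not required by POSIX). The sample log text and the function names extract_ips and first_three_octets are made up for illustration, and the pattern does not check that each number is at most 255.

```shell
#!/bin/sh
# Hypothetical sample log lines; the addresses have different widths on purpose.
log='Accepted connection from 71.25.125.14 port 22
Failed login from 192.168.100.200 port 2222'

# extract_ips: print every dotted-quad-looking string, one per line.
# -o prints only the matched text; -E enables extended regex syntax.
extract_ips() {
    grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}'
}

# first_three_octets: split each line on "." and keep fields 1 to 3.
first_three_octets() {
    cut -d. -f1-3
}

printf '%s\n' "$log" | extract_ips
# 71.25.125.14
# 192.168.100.200

printf '%s\n' "$log" | extract_ips | first_three_octets
# 71.25.125
# 192.168.100
```

Because cut works on fields here rather than character positions, no padding or per-line inspection is needed.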
Using sed, for example, you can save the IP information on a line and throw away the rest.

sed 's/^.*\([[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\).*$/\1/'

Suppose that you use k3b to back up items in a download directory, and you want to delete the backed-up items to free up more space. You saved the k3b file as backup.k3b. Using "file backup.k3b" you discover that the .k3b file is a zip file. Unzipping it, you find two files: mimetype and maindata.xml. The files that you backed up are inside <url>...</url> tags.

unzip backup.k3b
sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm

The replacement "\1" is a placeholder for the saved information \(<filename>\), so you end up with a list of the files backed up. The "tr" command replaces newlines with nulls so that you can handle file names containing white space. In this example, we don't have information about the contents of the filename entries, as we do in the IP example, but we can use the tags as anchors, so we know the location of the information to extract. |
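Before piping such a list into rm, it may help to see what the capture group actually produces. The sketch below only prints the captured file names and deletes nothing. The sample text is a made-up stand-in for maindata.xml (the real file's layout may differ), list_urls is a hypothetical helper name, and allowing leading whitespace before the tag is an assumption.

```shell
#!/bin/sh
# Hypothetical sample standing in for maindata.xml; the real layout may differ.
sample='<k3b_data_project>
<url>/home/user/download/file one.iso</url>
<title>not a url line</title>
<url>/home/user/download/file2.tar.gz</url>
</k3b_data_project>'

# list_urls: print only the text between <url> and </url>.
# -n suppresses sed's default output; the trailing p prints a line only
# when the substitution succeeded, so non-matching lines are dropped.
list_urls() {
    sed -n 's/^[[:space:]]*<url>\(.*\)<\/url>.*$/\1/p'
}

printf '%s\n' "$sample" | list_urls
# /home/user/download/file one.iso
# /home/user/download/file2.tar.gz
```

Using -n with the p flag does the filtering and the extraction in a single expression. Once the printed list looks right, the tr and xargs stages from the post can be appended.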
I read several sed tutorials and analyzed the code, and it seems that it should work. However, when I try to use it, I get:

sed -e expression #1, char 88: Invalid content of \{\}

Char 88 refers to the first \. encountered in the regular expression pattern. The sed command you recommended was copied "as is" into the script, but it is syntactically incorrect. It should read:

sed 's/^.*\([[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}\).*$/\1/'

This produces output; however, the first grouping is missing, i.e. if the IP address was 192.168.0.100, the output would be 168.0.100. I'm sure that this is a small logic error, but I just don't see it. |
This sed program may work better for extracting an IP address from text:

s/^.*[^[:digit:]]\([[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\).*$/\1/

If the text might start with an IP address, then you may need to add another sed command.

There are two problems with your line:

sed 's/^.*\([[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}\).*$/\1/'

First, the dots are not escaped, so each one matches any character rather than a literal dot. Second, the leading ^.* is greedy: it consumes as much of the line as it can, including the leading digits of the address, as long as the rest of the pattern can still match.

Also, consider what you want to happen if there are two or more IP addresses on a line. Written one way, a sed command might extract the first IP address. Written another way, it could discard the first and extract the second. |
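The effect of the greedy prefix, and the difference between taking the first and the last address on a line, can be tried out with a small script. This is a sketch under stated assumptions, not a definitive solution: the function names naive_ip, last_ip and first_ip are hypothetical, the pattern does not validate octet ranges, and first_ip swaps sed for grep with the -o option (available in GNU and BSD grep, not required by POSIX).

```shell
#!/bin/sh
# One to three digits, written as a POSIX basic regular expression.
octet='[[:digit:]]\{1,3\}'
# Four such groups separated by literal dots (no range check, so 999.1.1.1 also matches).
ip="$octet\\.$octet\\.$octet\\.$octet"

# naive_ip: greedy ^.* directly in front of the group eats leading digits.
naive_ip() {
    printf '%s\n' "$1" | sed "s/^.*\\($ip\\).*\$/\\1/"
}

# last_ip: require a non-digit before the group, and add a second command
# for lines that start with an address. Only the second command prints:
# it also matches the bare address left behind when the first one succeeded.
last_ip() {
    printf '%s\n' "$1" |
        sed -n -e "s/^.*[^[:digit:]]\\($ip\\).*\$/\\1/" \
               -e "s/^\\($ip\\).*\$/\\1/p"
}

# first_ip: print each match on its own line and keep the first one.
first_ip() {
    printf '%s\n' "$1" | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' | head -n 1
}

naive_ip 'host 192.168.0.100 up'           # 2.168.0.100
last_ip  'host 192.168.0.100 up'           # 192.168.0.100
last_ip  '192.168.0.100 is up'             # 192.168.0.100
last_ip  'from 10.0.0.1 to 192.168.0.100'  # 192.168.0.100
first_ip 'from 10.0.0.1 to 192.168.0.100'  # 10.0.0.1
```

The naive version shows why the first octet gets truncated: the greedy prefix leaves the group only the minimum it needs to match.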
Quote:
way. The following also works to produce the correct output in this case: Quote:
|