
raypen 03-19-2006 07:16 PM

Can you parse text with regex?
There are a few tools that can parse text in a limited
fashion. GREP can select lines of text containing a phrase
or particular pattern match. AWK can go a little further and
select certain 'fields' of information in selected lines.
CUT can select a range of characters in a line of text, but
it is limited to contiguous space.
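
A small sketch of what each of those three tools selects. The sample text, interface names, and variable names are made up for illustration:

```shell
# Hypothetical sample input; the interface names and addresses are made up.
sample='eth0 up 10.20.30.40
lo up 127.0.0.1'

# grep: print whole lines that match a pattern
by_grep=$(printf '%s\n' "$sample" | grep 'eth0')

# awk: print the third whitespace-separated field of matching lines
by_awk=$(printf '%s\n' "$sample" | awk '/eth0/ { print $3 }')

# cut: print a fixed character range (columns 1-4) from every line
by_cut=$(printf '%s\n' "$sample" | cut -c1-4)

printf '%s\n' "$by_grep" "$by_awk" "$by_cut"
```

Note that cut -c works purely by column position, so it returns "lo u" for the second line; it has no notion of where a word or an address ends.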

As an example, one might be able to winnow down to an IP
address if you consider the max characters would be 15.
However, any IP address that did not use all of the space
would be padded with blanks, and there is no way to
consistently select only the numbers in question.

Further, what if you wanted to parse the address and use only
the first 3 sets of digits. You could use cut again, but you would
have to examine each before knowing how to cut.
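
For what it is worth, cut can also split on a delimiter instead of on character positions, which may avoid having to examine each address first. A minimal sketch, assuming the address has already been isolated on its own line (the function name and the sample addresses are made up):

```shell
# Hypothetical helper: keep the first three dot-separated fields of each line.
first_three() {
    cut -d. -f1-3
}

echo '10.20.30.40' | first_three   # prints 10.20.30
echo '1.2.3.4'     | first_three   # prints 1.2.3
```

Because the split is on the dots, the width of each group of digits does not matter.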

Is it possible to somehow use regexes to parse data such as this?

For instance, to select an IP address pattern is somewhat simple:


but how can you use this to cut/parse this information?

I know Perl could do it but it takes a rather lengthy script.
There must be a simpler way!

jschiwal 03-19-2006 08:05 PM

Using sed for example, you can save the IP information on a line and throw away the rest.
sed 's/^.*\([[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\).*$/\1/'

Suppose that you use k3b to back up items in a download directory, and you want to delete the items backed up to free up more space. You saved the k3b file as backup.k3b.
Using "file backup.k3b" you discover that the .k3b file is a zip file. Unzipping it, you find two files: mimetype and maindata.xml. The files that you backed up are inside <url>...</url> tags.

unzip backup.k3b
sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm

The replacement "\1" is a placeholder for the saved information \(<filename>\), so you end up with a list of files backed up. The "tr" command replaces newlines with nulls so that you can handle files containing white space.

In this example, we don't have information about the contents of the filename entries, as in the IP example, but we can use the tags as anchors, so we know the location of the information to extract.
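
To try the tag-anchored extraction safely, here is a sketch that only lists the names and deletes nothing. The sample lines are a made-up stand-in for maindata.xml (the real file contains more markup), and extract_urls is a hypothetical helper name:

```shell
# Hypothetical helper: keep lines that begin with <url>, then strip the tags.
# With no file argument, sed reads standard input.
extract_urls() {
    sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' "$@"
}

# Made-up sample data, including a file name that contains a space.
printf '%s\n' \
    '<files>' \
    '<url>/home/user/download/a file.iso</url>' \
    '<url>/home/user/download/b.tar.gz</url>' \
    '</files>' | extract_urls
```

Once the listing looks right, the same output could be piped through tr and xargs -0 as in the command above.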

raypen 03-20-2006 01:28 AM

I read several SED tutorials and analyzed the code and it
seems that it should work. However, when I try to use it:

sed: -e expression #1, char 88: Invalid content of \{\}

char 88 refers to the first \. encountered in the regular
expression pattern.

The sed command you recommended was copied "as is" into
the script, but is syntactically incorrect. It should read:

sed 's/^.*\([[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}\).*$/\1/'

This produces output, however the first grouping is missing, i.e.
if the IP address was, the output would be 168.0.100.
I'm sure that this is a small logic error but I just don't see it.

jschiwal 03-21-2006 03:22 AM

This sed program may work better for extracting IP address from text:

If the text might start with an IP address, then you may need to add another sed command.

There are two problems with your line:
sed 's/^.*\([[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}\).*$/\1/'
  • The "^.*" will swallow up some of the digits, up to the last digit before the first dot.
    I made the same mistake in my first post. The first wild card ".*" expands as far as it can, up to the [[:digit:]]\. anchor. So it matches '^.*\([[:digit:]]\{1\}\.' instead of '^.*\([[:digit:]]\{3\}\.'
  • The dots need to be escaped "\." to be taken literally, otherwise, they are regex wild cards.
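
The greedy-match problem in the list above can be reproduced on a sample line. This is only a sketch: the sample text and the function names are made up, and the second function is one possible workaround, not necessarily the program referred to earlier in this post. It requires a non-digit character just before the address, so it prints nothing for a line that starts with the address.

```shell
# Greedy version: the leading ".*" eats every digit it can spare.
ip_greedy() {
    sed 's/^.*\([[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\).*$/\1/'
}

# Assumed workaround: force a non-digit right before the captured address,
# and print only lines where the substitution succeeded (-n with the p flag).
ip_anchored() {
    sed -n 's/^.*[^[:digit:]]\([[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\).*$/\1/p'
}

line='inet addr 10.20.30.40 up'    # made-up sample line
echo "$line" | ip_greedy           # prints 0.20.30.40
echo "$line" | ip_anchored         # prints 10.20.30.40
```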

Also, consider what you want to happen if there are two or more IP addresses on a line. Written one way, a sed command might extract the first IP address. Written another way, it could discard the first and extract the second.
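
One way to make the first-or-last choice explicit is to list every match and then pick one. A sketch, assuming a grep that supports the -o option (GNU and BSD grep do; it is not required by POSIX). The function names and the sample line are made up, and the pattern does not check that each number is 255 or less:

```shell
# Extended regex for four dot-separated groups of 1-3 digits.
ip_re='[0-9]{1,3}(\.[0-9]{1,3}){3}'

# Each helper takes one line of text as its argument.
first_ip() { printf '%s\n' "$1" | grep -oE "$ip_re" | head -n 1; }
last_ip()  { printf '%s\n' "$1" | grep -oE "$ip_re" | tail -n 1; }

line='from 10.0.0.1 to 10.0.0.2'   # made-up sample line
first_ip "$line"                   # prints 10.0.0.1
last_ip "$line"                    # prints 10.0.0.2
```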

raypen 03-21-2006 01:26 PM


The dots need to be escaped "\." to be taken literally, otherwise, they are regex wild cards.
I had already added the backslashes to be proper, but in this case it didn't matter; the code works either way.

The following also works to produce the correct output in this case:


sed 's/^.*:\..........
