LinuxQuestions.org
Old 03-19-2006, 06:16 PM   #1
raypen
Member
 
Registered: Jun 2002
Location: Midwest
Distribution: Slackware
Posts: 365

Rep: Reputation: 30
Can you parse text with regex?


There are a few tools that can parse text in a limited
fashion. GREP can select lines of text containing a phrase
or particular pattern match. AWK can go a little further and
select certain 'fields' of information in selected lines.
CUT can select a range of characters in a line of text, but
it is limited to contiguous space.

As an example, one might be able to winnow down to an IP
address if you consider the max characters would be 15.
(xxx.xxx.xxx.xxx). However, any IP address that did not
use all of the space, such as 71.25.125.14, would be padded
with blanks and there is no way to consistently select only
the numbers in question.

Further, what if you wanted to parse the address and use only
the first 3 sets of digits. You could use cut again, but you would
have to examine each before knowing how to cut.

Is it possible to somehow use regexes to parse data such as this
simply?

For instance, selecting an IP address pattern is fairly simple:

[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*

but how can you use this to cut/parse this information?

I know Perl could do it but it takes a rather lengthy script.
There must be a simpler way!
 
Old 03-19-2006, 07:05 PM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 683
Using sed for example, you can save the IP information on a line and throw away the rest.
sed 's/^.*\([[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\.[[:digit:]]\{1,3}\).*$/\1/'

Suppose that you use k3b to back up items in a download directory, and you want to delete the items backed up to free up more space. You saved the k3b file as backup.k3b.
Using "file backup.k3b" you discover that the .k3b file is a zip file. Unzipping it, you find two files: mimetype and maindata.xml. The files that you backed up are inside <url>...</url> tags.

unzip backup.k3b
sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm

The replacement "\1" is a placeholder for the saved information \(<filename>\), so you end up with a list of files backed up. The "tr" command replaces newlines with nulls so that you can handle files containing white space.

In this example, we don't have information about the contents of the filename entries, as in the IP example, but we can use the tags as anchors, so we know the location of the information to extract.
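
A minimal sketch of the tag-anchor approach on made-up data. The sample text below is a hypothetical stand-in for maindata.xml (a real k3b project file has more markup), the function name extract_urls is only for illustration, and the pipeline ends in printf rather than rm so that nothing is deleted while trying it out.

```shell
#!/bin/sh
# Hypothetical stand-in for the contents of maindata.xml.
sample='<files>
<url>/home/user/download/file one.iso</url>
<name>ignored</name>
<url>/home/user/download/file2.tar.gz</url>
</files>'

extract_urls() {
    # Delete every line that does not start with <url>,
    # then strip the surrounding tags and keep what was between them.
    sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/'
}

# Print each extracted name in brackets instead of removing the file.
# The tr/xargs -0 pair keeps names containing spaces in one piece.
printf '%s\n' "$sample" | extract_urls | tr '\n' '\000' | xargs -0 printf '[%s]\n'
```

Once the bracketed list looks right, the final printf can be swapped back to rm.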

Last edited by jschiwal; 03-19-2006 at 07:08 PM.
 
Old 03-20-2006, 12:28 AM   #3
raypen
Member
 
Registered: Jun 2002
Location: Midwest
Distribution: Slackware
Posts: 365

Original Poster
Rep: Reputation: 30
I read several SED tutorials and analyzed the code and it
seems that it should work. However, when I try to use it:

sed: -e expression #1, char 88: Invalid content of \{\}

char 88 refers to the first \. encountered in the regular
expression pattern.

The sed command you recommended was copied "as is" into
the script, but is syntactically incorrect. It should read:

sed 's/^.*\([[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}\).*$/\1/'

This produces output; however, the first group is missing. For example,
if the IP address was 192.168.0.100, the output would be 168.0.100.
I'm sure that this is a small logic error but I just don't see it.
 
Old 03-21-2006, 02:22 AM   #4
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 683
This sed program may work better for extracting an IP address from text:
s/^.*[^[:digit:]]\([[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\).*$/\1/

If the text might start with an IP address, then you may need to add another sed command.
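
One possible shape for that extra command, as a sketch: a second expression anchored at the start of the line handles a leading address, since the program above requires a non-digit character in front of the address. The variable ip and the function extract_ip are names made up for this illustration.

```shell
#!/bin/sh
# Basic-regex fragment for one dotted quad (digits are not range-checked).
ip='[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}'

extract_ip() {
    # First expression: the line begins with the address.
    # Second expression: the address follows some non-digit character.
    sed -e "s/^\($ip\).*\$/\1/" \
        -e "s/^.*[^[:digit:]]\($ip\).*\$/\1/"
}

printf '%s\n' '192.168.0.100 gateway' 'eth0 inet 10.0.0.1 up' | extract_ip
```

With these two sample lines the function should print 192.168.0.100 and 10.0.0.1.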

There are two problems with your line:
sed 's/^.*\([[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}.[[:digit:]]\{1,3\}\).*$/\1/'
  • The "^.*" will swallow up some of the digits, up to the last digit before the first dot.
    I made the same mistake in my first post. The first wild card ".*" expands as far as it can while still leaving a match for the [[:digit:]]\{1,3\}\. anchor. So it matches '^.*\([[:digit:]]\{1\}\.' instead of '^.*\([[:digit:]]\{3\}\.'
  • The dots need to be escaped "\." to be taken literally, otherwise, they are regex wild cards.
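
The greedy-match problem can be reproduced with a short sketch. The sample line and the variable names are made up for illustration; both commands use escaped dots, so the only difference is the [^[:digit:]] in front of the group.

```shell
#!/bin/sh
ip='[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}'
line='inet addr 192.168.0.100 up'

# Leading ".*" grabs as much as it can, leaving one digit for the first group.
greedy=$(printf '%s\n' "$line" | sed "s/^.*\($ip\).*\$/\1/")

# Requiring a non-digit before the group stops ".*" from eating into the address.
anchored=$(printf '%s\n' "$line" | sed "s/^.*[^[:digit:]]\($ip\).*\$/\1/")

printf 'greedy:   %s\n' "$greedy"
printf 'anchored: %s\n' "$anchored"
```

The first result should come out as 2.168.0.100 and the second as 192.168.0.100.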

Also, consider what you want to happen if there are two or more IP addresses on a line. Written one way, a sed command might extract the first IP address. Written another way, it could discard the first and extract the second.
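
A sketch of the difference, using a made-up line with two addresses. The sed command is the one from this post, which keeps the last address because of the greedy ".*". The grep -o option is swapped in here as an alternative for listing every match; it is available in GNU and BSD grep but is not required by POSIX, so treat it as an assumption about your system.

```shell
#!/bin/sh
ip='[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}'
line='from 10.0.0.1 to 10.0.0.2'

# Greedy ".*" runs past the first address, so only the last one survives.
last=$(printf '%s\n' "$line" | sed "s/^.*[^[:digit:]]\($ip\).*\$/\1/")

# grep -o prints each match on its own line; head then picks the first.
all=$(printf '%s\n' "$line" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}')
first=$(printf '%s\n' "$all" | head -n 1)

printf 'last:  %s\n' "$last"
printf 'first: %s\n' "$first"
```

Here last should be 10.0.0.2 and first should be 10.0.0.1.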

Last edited by jschiwal; 03-21-2006 at 02:40 AM.
 
Old 03-21-2006, 12:26 PM   #5
raypen
Member
 
Registered: Jun 2002
Location: Midwest
Distribution: Slackware
Posts: 365

Original Poster
Rep: Reputation: 30
Quote:
The dots need to be escaped "\." to be taken literally, otherwise, they are regex wild cards.
I had already added the backslashes to be proper, but in this case it didn't matter; the code works either
way.

The following also works to produce the correct output in this case:

Quote:
sed 's/^.*:\..........
 
  

