LinuxQuestions.org


2007fld 08-05-2007 07:47 AM

parse a string in perl
 
I have a string
$_ = "The following is a directory /U01/abc/def/dir3";

How do I parse $_ so that I extract just the directory and get
my $dir="/U01/abc/def/dir3";


The directory always starts with /U01, but its length varies. How do I use a regular expression match for this?

Also, the string before the directory has a fixed length, so maybe there is a function like "strpos" ?

Thanks!!

wjevans_7d1@yahoo.co 08-05-2007 08:37 AM

There's a simple regular expression to do this, but I'd rather teach you to fish, rather than just give you a fish.

I'm sure someone will be along any moment to give the answer away and spoil the learning (and the fun!), but let's try anyway.

Normally, to remove a chunk of a string from the string, you'd use a statement like this:

Code:

$some_string=~s/to_be_removed//;
So, care to take a first crack at what that statement should be in your case? It doesn't have to be correct, but it needs to be at least an attempt. I'll show you, step by step, how to fix it so it does exactly what you want.

2007fld 08-05-2007 11:48 AM

Hi thanks so much! After replacing "to_be_removed" with "The following is a directory", I got exactly what I need: "/U01/abc/def/dir3". Thanks!
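For reference, a minimal sketch of that substitution, assuming the fixed prefix from the first post. The variable names are only illustrative. Note that the pattern here includes the trailing space and is anchored at the start; without the space, the result would begin with a blank.

```perl
use strict;
use warnings;

my $line = "The following is a directory /U01/abc/def/dir3";

# Copy the line first, then strip the fixed prefix from the copy,
# so the original string is left untouched.
(my $dir = $line) =~ s/^The following is a directory //;

print "$dir\n";
```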

But the problem is that in my real case, the part of the original string before the directory is not a fixed string; it's actually something with time stamps and hostnames. And after the directory, there is also some more info.

so the $_ is more like this:
$_ = "2007 07 28 hostname 1 /U01/abc/def/dir3 filename1 1234byte";
and I need to extract the directory.

Thanks!

wjevans_7d1@yahoo.co 08-05-2007 04:10 PM

Cool. To continue the fishing lesson, let's take a further look at the question: How do I remove everything before the first slash?

To start, take a gander at these documents on regular expressions, in this order:

Code:

man perlrequick
man perlretut
man perlre

There's a lot in those. Don't read them all from cover to cover right now, although you'll eventually want to do that. Just go until you think you can post a guess at an answer to the above question. You may not even have to read all three, or even two, of these documents.

Don't worry about getting it 100% right, just post a stab at it.

You're doing fine so far!

2007fld 08-05-2007 05:12 PM

Hi, I think I got it;)

Here is what I used:

$_ =~ / (\/U01[\w*|\d*|\/]*)/;
print $1;

This extracts the directory very well.
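A small sketch of the same match run against the sample line, with one addition that is not in the post above: the capture is only used when the match succeeds, since $1 is not reset by a failed match. The names $line and $dir are illustrative.

```perl
use strict;
use warnings;

my $line = "2007 07 28 hostname 1 /U01/abc/def/dir3 filename1 1234byte";

my $dir;
# Same pattern as above; $1 is only meaningful if the match succeeded.
if ($line =~ / (\/U01[\w*|\d*|\/]*)/) {
    $dir = $1;
    print "$dir\n";
}
else {
    print "no directory found\n";
}
```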

Thanks so much!!

Looking forward to seeing your way to solve the problem!

wjevans_7d1@yahoo.co 08-05-2007 07:50 PM

Your solution works, but we can discover things by looking at the details.

The main thing is the [] construct in a regular expression. It means "accept any character that appears within the brackets." The asterisk (*), as seen here:
Code:

$_ =~ / (\/U01[\w*|\d*|\/]*)/;
                          ^
                          ^

means "accept anything represented by the previous item (in this case, the [] list) as many times as it occurs."

So you will be accepting, as many times as they occur, any of these items you placed in the [] list:
Code:

anything in the \w list, which includes letters, digits, and underscore
anything in the \d list, which includes digits
|  (the pipe character)
*  (asterisk)
/  (slash, which you correctly spelled \/)

.. because they're all in the list.

You don't need the pipe character, which I'm sure you placed there to indicate "or", because a [] list already implies that you're listing alternatives. Indeed, if you change your experiment so that the test string contains | somewhere (within the abc, say), you'll find that your regular expression will accept that. But since you don't really want | in the list, you could simplify the regular expression thus:
Code:

$_ =~ / (\/U01[\w*\d*\/]*)/;
You also don't need the * within the list, because the list only needs to mention each character once, and the * outside the list allows repetition. Indeed, if you change your experiment so that the test string contains * somewhere (within the abc, say), you'll find that your regular expression will accept that. But since you don't really want * in the list, you could simplify the regular expression thus:
Code:

$_ =~ / (\/U01[\w\d\/]*)/;
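The two experiments described above can be sketched like this. The test line, with a | and a * placed inside the abc part, is made up for the purpose; the two patterns are the original one and the simplified one.

```perl
use strict;
use warnings;

# Hypothetical test line: a pipe and an asterisk inside the path.
my $line = "2007 07 28 hostname 1 /U01/a|b*c/def/dir3 filename1 1234byte";

# A match in list context returns the captured groups.
my ($original)   = $line =~ / (\/U01[\w*|\d*|\/]*)/;
my ($simplified) = $line =~ / (\/U01[\w\d\/]*)/;

print "original:   $original\n";    # accepts | and * as part of the path
print "simplified: $simplified\n";  # stops at the first | it meets
```

The original pattern swallows the whole path, odd characters included, while the simplified one stops capturing as soon as it reaches a character outside the list.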
Since \w includes numerical digits, and \d includes only numerical digits, you can lose the \d:
Code:

$_ =~ / (\/U01[\w\/]*)/;
You could generalize this by leaving the U01 out of the string, thus making the script more generally useful:
Code:

$_ =~ / (\/[\w\/]*)/;
But perhaps some other character will appear in the directory path. ab-c is perfectly valid; so is ab.c; so are quite a few other characters. So this is better, because it accepts everything up to the next space or tab:
Code:

$_ =~ / (\/[^ \t]*)/;
I put in the tab also as a defensive measure. If you don't want that, do this instead:
Code:

$_ =~ / (\/[^ ]*)/;
You still need the [] to let the ^ mean "a list that includes everything but".
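A quick sketch of the difference, using a made-up path that contains a hyphen and a dot: the \w-based list stops at the first character it does not know, while the negated list keeps going until a space or tab.

```perl
use strict;
use warnings;

# Hypothetical line whose path contains '-' and '.'.
my $line = "2007 07 28 hostname 1 /U01/ab-c/de.f/dir3 filename1 1234byte";

my ($word_class) = $line =~ / (\/[\w\/]*)/;   # list of allowed characters
my ($negated)    = $line =~ / (\/[^ \t]*)/;   # everything but space and tab

print "word class: $word_class\n";
print "negated:    $negated\n";
```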

And speaking of tabs, just to be defensive, maybe you want to do this:
Code:

$_ =~ /[ \t](\/[^ \t]*)/;
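One way to package that last pattern, as a sketch only: extract_dir is a hypothetical helper name, and returning undef when nothing matches is a choice made here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first field that starts with a slash
# and follows a space or tab, or undef when there is none.
sub extract_dir {
    my ($line) = @_;
    return $line =~ /[ \t](\/[^ \t]*)/ ? $1 : undef;
}

my $spaces = extract_dir("2007 07 28 hostname 1 /U01/abc/def/dir3 filename1 1234byte");
my $tabs   = extract_dir("2007 07 28\thostname\t1\t/U01/abc/def/dir3\tfilename1");
my $none   = extract_dir("no directory here");

print "$spaces\n";
print "$tabs\n";
print defined $none ? $none : "(no match)", "\n";
```

Because the pattern insists on a space or tab before the slash, a line that begins with the path itself would not match.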
Hope you had fun with this.

2007fld 08-05-2007 08:27 PM

Thank you sooooo much! This is fun!

chrism01 08-06-2007 01:28 AM

or you could use rindex() http://perldoc.perl.org/functions/rindex.html to get the last '/', and substr() http://perldoc.perl.org/functions/substr.html for a more readable solution.
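A sketch of the index/substr approach on the longer sample line. One substitution to flag: this uses index() rather than rindex(), which is an assumption on my part, because rindex() returns the position of the last '/', and the path contains several. index() also roughly plays the role of the "strpos" asked about in the first post; it returns -1 when nothing is found.

```perl
use strict;
use warnings;

my $line = "2007 07 28 hostname 1 /U01/abc/def/dir3 filename1 1234byte";

my $dir;
my $start = index($line, '/U01');            # -1 when not found
if ($start >= 0) {
    my $end = index($line, ' ', $start);     # first space after the path
    $dir = $end >= 0
        ? substr($line, $start, $end - $start)
        : substr($line, $start);             # path runs to the end of the line
    print "$dir\n";
}
```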

ghostdog74 08-06-2007 02:04 AM

if you think your directory path is always a field by itself, you can do away with the complicated regexp and use split (or others) instead
Code:

my $string = "The following is a directory /U01/abc/def/dir3 blah blah";
my @array = split /\s+/, $string;
foreach my $item (@array) {
    # print only the fields that contain a slash
    print "$item\n" if $item =~ /\//;
}
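If the path is wanted in a variable rather than printed, a variant along the same lines could look like this. It is a sketch that assumes the path is the first whitespace-separated field beginning with a slash; grep is used here in place of the loop.

```perl
use strict;
use warnings;

my $string = "2007 07 28 hostname 1 /U01/abc/def/dir3 filename1 1234byte";

# split ' ' splits on runs of whitespace; grep keeps the fields that
# start with a slash; the list assignment takes the first of them.
my ($dir) = grep { m{^/} } split ' ', $string;

print defined $dir ? $dir : "(no path field)", "\n";
```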


wjevans_7d1@yahoo.co 08-06-2007 08:28 AM

The nice thing about Perl is that there are several ways to do practically everything. Perl is known as the Swiss Army chainsaw of programming languages.

bigearsbilly 08-06-2007 09:25 AM

For correctness and portability, I think one should maybe use...

Code:

use File::Basename;
# $file is assumed to already hold a full path to a file
my $filepath = dirname $file;
my $filename = basename $file;
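A small runnable sketch of those two calls. The value of $file is an assumption: it joins the directory and file name from the sample line into one path. File::Basename splits a path that has already been isolated; it does not pull the path out of a longer line.

```perl
use strict;
use warnings;
use File::Basename;

# Assumed input: directory and file name from the sample line, joined.
my $file = "/U01/abc/def/dir3/filename1";

my $filepath = dirname($file);    # everything before the last component
my $filename = basename($file);   # the last component

print "$filepath\n";
print "$filename\n";
```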


chrism01 08-06-2007 06:33 PM

Actually, re-reading this "the string before the directory has a fixed length", just use substr().
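A sketch of that, assuming the prefix really is the fixed text from the first post. The prefix length is computed with length() rather than counted by hand, and $prefix is just an illustrative name.

```perl
use strict;
use warnings;

my $line   = "The following is a directory /U01/abc/def/dir3";
my $prefix = "The following is a directory ";

# Everything after the fixed-length prefix is the directory.
my $dir = substr($line, length($prefix));

print "$dir\n";
```

This only holds while nothing follows the directory on the line; with trailing fields, as in the later sample, the end of the path has to be found as well.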

ghostdog74 08-06-2007 07:43 PM

Quote:

Originally Posted by wjevans_7d1@yahoo.co
The nice thing about Perl is that there are several ways to do practically everything.

too much of nice things is not a good thing.

2007fld 08-07-2007 02:41 PM

Thank you all for the helpful info.!


All times are GMT -5.