PCRE Regex

Osiris990 · 10-18-2007, 12:38 PM

I'm looking for a regular expression that will match options in a configuration file. The format is...

optionname = optionvalue

...where 'optionname' can be any string consisting of a-z (case-insensitive), 0-9, underscores, dashes, and periods; 'optionvalue' can be a string of any length containing any characters (excluding newlines, of course); and there can be any number of spaces, or none, before or after the equals sign.

What I have now: ^[a-zA-Z0-9\-_\.]{1,}+[\s?]+=+[\s?]+[.+]$

It doesn't seem to be working right though... Suggestions?

Also, if anyone knows of a good config file parsing API or something similar that will keep me from having to write my own, I would much prefer that.

Thanks
Shane

raskin · 10-18-2007, 01:11 PM

^[-a-zA-Z0-9_.]+\s*=\s*.+$

Explanation: you do not need to insert "+" between the parts of an expression; it is a quantifier equivalent to {1,}, just as "*" is equivalent to {0,}. In POSIX bracket expressions a backslash is literal, so nothing inside [] needs escaping and "-" cannot be escaped; instead, "-" is taken literally when it is the first symbol inside [] (there are other cases, but they are all distinct from sane use for a range). PCRE does accept backslash escapes inside [], but putting "-" first works there as well. \s should stand on its own rather than inside []; I guess [[:space:]] should have a similar effect.
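
To try the pattern from this reply against the format in the first post, here is a minimal sketch. It uses Python's `re` module purely as a convenient stand-in (an assumption on my part, since the thread is about PCRE; as far as I know both engines read these particular constructs the same way). The helper name `is_option_line` and the sample lines are made up for illustration.

```python
import re

# Pattern from the reply above: name, optional whitespace, "=",
# optional whitespace, then a non-empty value.
OPTION_RE = re.compile(r"^[-a-zA-Z0-9_.]+\s*=\s*.+$")

def is_option_line(line):
    """Return True if the line looks like 'optionname = optionvalue'."""
    return OPTION_RE.match(line) is not None

if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real config file.
    for sample in ["name = value", "log.file-name=/var/log/app.log",
                   "bad name = x", "empty ="]:
        print(repr(sample), is_option_line(sample))
```

Note that because the value part is `.+`, a line with nothing after the equals sign is rejected; use `.*` instead if empty values should be allowed.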

PS. I do not use Perl, and I cannot even write a simple regular-expression replace application in it, so I didn't test this in Perl; everything I say comes from being a sed and vim user and from 'man regex'. So there may be some subtle error.

matthewg42 · 10-18-2007, 01:26 PM

The expression you posted can be improved quite a lot:

I'd add \s* to either side of the = symbol, and also at the beginning and end of the line. This means "any amount of whitespace, including none", which gives you some formatting freedom.
\w can be used instead of [a-zA-Z0-9_]
The {1,} means "1 or more times", which can more clearly be specified with +
+ goes after an expression, not before it - unless you want to be able to use any number of = characters as your assignment operator, you don't want to put =+
Putting + inside [square brackets] changes its meaning: there it is a literal + character, so [.+] matches a single "." or "+". I don't think you want that at the end of your expression.
Please put code in [code] tags to improve readability.
In Perl itself, you can put parts of an expression in (brackets) to extract those sections to $1 $2 and so on.

The total expression can be changed to this:

Code:

^\s*(\w+)\s*=\s*(.*)$

For example, in a perl program:

Code:

#!/usr/bin/perl

use strict;
use warnings;

# some test data
my @input = split /\n/, <<EOD;
setting1 = value1
setting2=value2
  setting3 = value3
invalid setting = value4
EOD

foreach (@input) {
    if (/^\s*(\w+)\s*=\s*(.*)$/) {
        printf("id=%s; value=%s\n", $1, $2);
    }
    else {
        warn "invalid format: $_\n";
    }
}
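
One caveat about \w: it covers letters, digits and the underscore, but not the dashes and periods that the original question allows in option names. The sketch below compares the \w version with a variant that adds "-" and "." to the name class. It again uses Python's `re` as a stand-in for trying the patterns (an assumption, not the PCRE C API); `parse_with` and the sample lines are hypothetical.

```python
import re

# The pattern from this post: \w+ for the option name.
# re.ASCII keeps \w limited to [a-zA-Z0-9_].
WORD_RE = re.compile(r"^\s*(\w+)\s*=\s*(.*)$", re.ASCII)

# Same shape, but the name class also accepts "-" and ".".
CLASS_RE = re.compile(r"^\s*([-\w.]+)\s*=\s*(.*)$", re.ASCII)

def parse_with(pattern, line):
    """Return (name, value) if the line matches, otherwise None."""
    m = pattern.match(line)
    return m.groups() if m else None

if __name__ == "__main__":
    for sample in ["setting1 = value1", "log-file.name = x"]:
        print(sample, parse_with(WORD_RE, sample), parse_with(CLASS_RE, sample))
```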

makyo · 10-18-2007, 01:56 PM

Hi.

If you are willing to use INI-style config files, Damian Conway's Perl Best Practices suggests the modules Config::General, Config::Std and Config::Tiny, available on http://cpan.org/

Otherwise, matthewg42's code looks good for the equal-style files ... cheers, makyo

chrism01 · 10-18-2007, 10:36 PM

I use this:

Code:

    # Process cfg file records
    while ( defined ( $cfg_rec = <CONFIG_FILE> ) )
    {
        # Remove unwanted chars
        chomp $cfg_rec;                 # newline
        $cfg_rec =~ s/#.*//;            # comments
        $cfg_rec =~ s/^\s+//;           # leading whitespace
        $cfg_rec =~ s/\s+$//;           # trailing whitespace

        next unless length($cfg_rec);   # anything left?

        # Split 'key=value' string
        ($key, $value) = split( /\s*=\s*/, $cfg_rec, 2);

        # Assign to global hash, forcing uppercase keys
        $cfg::params{uc($key)} = $value;
    }

As you can see, I end up with a hash where the name of the option (key) is forced to uppercase (so it stands out in the code and is always consistent), but the associated hash value is unchanged apart from having leading/trailing spaces removed.
The hash is in a package cfg, which makes it effectively 'global', as I otherwise avoid global variables.

It also allows the cfg file to have blank lines and comments, which this routine strips out first.
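
For readers who do not know Perl, the same steps (strip comments, trim, skip blanks, split on the first "=", uppercase the key) can be sketched in Python. This is a rough equivalent under my own assumptions, not a line-for-line port: `parse_config` is a hypothetical name, it returns a plain dict instead of filling a package-level hash, and a line without "=" is stored with the value None to mirror Perl's undef.

```python
def parse_config(lines):
    """Parse 'key = value' lines into a dict with uppercase keys."""
    params = {}
    for rec in lines:
        rec = rec.split("#", 1)[0]   # drop comments (everything after '#')
        rec = rec.strip()            # drop leading/trailing whitespace
        if not rec:                  # anything left?
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = rec.partition("=")
        params[key.rstrip().upper()] = value.lstrip() if sep else None
    return params

if __name__ == "__main__":
    # Hypothetical sample input.
    print(parse_config(["# comment", "", "  name = value  ", "path=/tmp/x # note"]))
```

As in the Perl version, a "#" inside a value is treated as the start of a comment.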

Osiris990 · 10-19-2007, 11:18 AM

Er... Something I neglected to mention. I'm using PCRE in *C* not in Perl. Any snippets you give me would be most helpful if given in C... Perl is Greek to me. =/ Thanks for all the attention though, guys. =] I'll try those variations on the regex string and see if any of them work.

Edit: Also, the \s* at the beginning and end is unnecessary, as I have it strip whitespace from the beginning and end of the string as it's read in.

matthewg42 · 10-19-2007, 12:41 PM

I think you need to escape all \ characters in your C strings, i.e. write \\ for each \ in the pattern.
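
The doubling rule can be sketched as follows. C string literals treat \ as an escape character, so a pattern such as \s*=\s* has to be typed as "\\s*=\\s*" in C source. Python's ordinary (non-raw) string literals treat \\ the same way, so the snippet below uses them as a stand-in to show that the doubled form yields the single-backslash pattern the regex engine expects. The variable names are illustrative only.

```python
import re

# What the regex engine should receive: backslash, "s", "*", "=", ...
wanted = r"\s*=\s*"       # raw string: backslashes are kept as typed

# What you type in a literal that processes escapes, as C literals do.
escaped = "\\s*=\\s*"     # each "\\" becomes a single backslash

if __name__ == "__main__":
    print(escaped)                             # shows the pattern as the engine sees it
    print(re.split(escaped, "key  =  value"))  # splits around the "="
```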

Osiris990 · 10-19-2007, 02:07 PM

Okay, so with a little bit of editing, the code posted in the first reply worked. Now I'm faced with the problem of pulling the values out (similar to how you can in javascript/PHP/perl[so I hear]). I don't think it works quite the same way as Perl. =/ I've checked around Google, and I've searched through the PCRE man pages, but I've come up with nothing. Anyone know what I need to do?

raskin · 10-19-2007, 02:31 PM

I think you should read 'man 3 pcreapi', about pcre_get_substring_list() and similar.

Osiris990 · 10-22-2007, 10:53 AM

Alright, well I seem to have that working... Ish. The problem is, I don't know how to group it right (I guess) to get it to extract the right thing. I can extract substring 0 (which is just the whole match), but the list goes no further. I get error code -7 (no substring matching that number) when I set the index to 1 or higher. How would I group them so that I can have it pull out the right substrings? I want to pull out the option name and the option value, e.g.:

thisoption = this option value rules

I want to pull out the 'thisoption' and the 'this option value rules'.

Thanks,
Shane

raskin · 10-22-2007, 02:43 PM

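'(' and ')' group parts of a regular expression, and each group is captured as its own numbered substring (1, 2, ...) after substring 0, the whole match. For example: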
^([-a-zA-Z0-9_.]+)\s*=\s*(.+)$
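
To illustrate what the two groups capture, here is a minimal sketch using the example line from the question. It uses Python's `re` module as a stand-in rather than the PCRE C API (an assumption for the sake of a short, runnable example); the group numbering, with 0 for the whole match and 1 and 2 for the parenthesised parts, follows the same convention as the substring numbers discussed above. `split_option` is a hypothetical helper name.

```python
import re

# Group 1 is the option name, group 2 is the option value.
OPTION_RE = re.compile(r"^([-a-zA-Z0-9_.]+)\s*=\s*(.+)$")

def split_option(line):
    """Return (name, value) for an option line, or None if it does not match."""
    m = OPTION_RE.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

if __name__ == "__main__":
    print(split_option("thisoption = this option value rules"))
```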