LinuxQuestions.org

- - regex (https://www.linuxquestions.org/questions/linux-newbie-8/regex-853637/)

Could someone please help me figure out the regex formula for this string?

City, ST 12345

City can be more than one word, and ST is always any two capital letters.
Since the city name can be almost anything but is always followed by a comma, it might be a good idea to start matching at the comma.

Thanks.

I would really appreciate it if the member darkcrimson got in touch with me regarding this thread. Thanks again to everyone.

Basic (and more) regex info: http://www.regular-expressions.info/ . You should read through it if you want to learn regex.

We need to know where you are using this regular expression (e.g. grep, or a Perl script?), because there are different flavors of regex engines with different capabilities. We also need to know what you want to do with the match: just print it, or parse out the city, state, and zip?

Perl-Compatible Regular Expression:

Code:

^(.+?), ([A-Z]{2}) (\d{5})$

The city will be in $1, the state in $2, and the zip in $3.

That's:
- ^ start of line
- (.+?) One or more characters, non-greedy, captured
- , literal comma, literal space
- ([A-Z]{2}) two capital letters, captured
- literal space
- (\d{5}) five digits, captured
- $ end of line

You have done yourself the favor of describing how the regex should match the input. Having done this much is most of the work; the rest is just translating the long-form description to the concise regex version. AlucardZero has already mentioned the distinction between regex implementations in different tools and languages, so I'll just use Perl as an example.
You said 'anything but is always followed by a comma', which I will translate to the more accurate 'at least one of anything, followed by a comma'. Happily, there is an almost direct translation of these words to regex code.

Code:

.+,

dot (anything)
+ (at least one of the preceding)
, (literal comma)

Then, you didn't mention the whitespace, but it's there, and whitespace can sometimes be in multiples, so as long as we specify at least one, we'll be robust in how we match:

Code:

\s+

\s (whitespace of any sort)
+ (at least one of the preceding)

Then, you said 'always any two capital letters'. Nice, concise, and once again, it translates directly to regex code:

Code:

[A-Z][A-Z]

This should need no explanation, I would guess. However, it is distinct from AlucardZero's example in that it seems simpler to read, and from my understanding of the way regexes work, may be slightly more efficient. Until the regex has been used a few million times, I doubt the difference is measurable.

Now, more whitespace, as before, followed by what many would guess to be a US zip code of five digits. For the five digits, I will agree with AlucardZero's example:

Code:

\s+[0-9]{5}

The whole thing, as a snippet of Perl code:

Code:

if ($address =~ m/(.+),\s+([A-Z][A-Z])\s+([0-9]{5})/) {
    # $1, $2 and $3 are only meaningful after a successful match
    $city  = $1;
    $state = $2;
    $zip   = $3;
}

Now, next time you think about how to describe what you want, simply go the extra step and translate it to code. The problem almost solves itself.

--- rod.
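For anyone working in a shell rather than Perl, the same capture-and-assign idea can be sketched with sed. This is only an illustration under a few assumptions: the sample line is made up, the pattern is anchored with ^ and $ so that sed replaces the whole line, [[:space:]]+ stands in for \s+ (which not every sed understands), and the sed in use accepts -E (GNU and BSD sed do).

```shell
# Hypothetical sample line; note the two spaces after the comma.
line='Salt Lake City,  UT 84101'

# ERE version of the pattern above, anchored so the whole line is replaced
# by the three captured groups joined with '|'.
# LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
fields=$(printf '%s\n' "$line" |
  LC_ALL=C sed -E 's/^(.+),[[:space:]]+([A-Z][A-Z])[[:space:]]+([0-9]{5})$/\1|\2|\3/')

# Split the pipe-separated result into three shell variables.
IFS='|' read -r city state zip <<EOF
$fields
EOF

printf 'city=%s state=%s zip=%s\n' "$city" "$state" "$zip"
```

The '|' separator is an arbitrary choice for this sketch; any character that cannot appear in a city name would do.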

Quote:

Originally Posted by AlucardZero (Post 4210009)

Code:

^(.+?), ([A-Z]{2}) (\d{5})$

Sorry, yes, it's just a regular Unix expression combined with grep. Thanks for any and all replies.
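Since the target is grep, one caveat about the quoted pattern: \d and the lazy +? are Perl extensions, and grep's basic and extended modes do not understand them. A rough sketch of an extended-regex (grep -E) equivalent follows. The sample addresses are invented for illustration, and the plain greedy + is assumed to be safe here only because the ^ and $ anchors pin the state and zip to the end of the line. GNU grep also has a -P option for Perl-style patterns, but it is not available in every build.

```shell
# Hypothetical sample input; the addresses are made up for illustration.
addresses='Salt Lake City, UT 84101
Winston-Salem, NC 27101
springfield, il 62701
Boston, MA 0210'

# ERE equivalent of ^(.+?), ([A-Z]{2}) (\d{5})$
# \d becomes [0-9] and the lazy +? becomes a plain +.
pattern='^.+, [A-Z]{2} [0-9]{5}$'

# LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
matches=$(printf '%s\n' "$addresses" | LC_ALL=C grep -E "$pattern")
printf '%s\n' "$matches"
```

Only the first two sample lines should be printed: the third has a lowercase state and the fourth has a four-digit zip.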