LinuxQuestions.org


linuxmaveric 03-31-2008 07:56 PM

help :() How to find an address using "grep" & regEx's.
 
Hello all! I am new to Linux and the command line. How could I set up a regular expression using grep that would find a "City, ST Zip" in a file, where the city can be any word, the state can only be two capital letters ("ST"), and the zip is 5 digits? How would I set that up? I understand basic grep arguments, but regular expressions seem almost like hieroglyphics :)

I hope this better explains the problem:


Write a regular expression that would match lines of the following form:

City, ST 12345

"City" can be more than one word, and ST is always two capital letters.

HINT: Since the city name can be almost anything, it might be good to start matching at the comma.

ophirg 03-31-2008 09:04 PM

Hi linuxmaveric

I think the best regular expression would be:
"\w+\s*,\s*ST\s+[0-9]{5,5}"

But wait...
If you think you are going to work with the command line or with scripts, then my advice is to learn regular expressions. They are really handy with tools like grep. And later, you can learn how to use them with awk and then with programming languages like perl and python.

Look at http://www.regular-expressions.info/.
They have nice resources there.
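For anyone trying an expression of this kind at the prompt, here is a minimal sketch, not ophirg's exact command. It assumes the two-letter state is generalised to [A-Z]{2} (the original poster later says "ST" was only an example) and swaps \w and \s for POSIX character classes, since those shorthands are GNU extensions. The sample lines are made up for illustration.

```shell
# Made-up sample input, one candidate address per line.
sample='Springfield, IL 62704
Salt Lake City , UT   84101
no address on this line
Boston, ma 02108'

# -E switches grep to extended syntax, where + and {5} are quantifiers.
# [[:alnum:]_] and [[:space:]] stand in for \w and \s.
# LC_ALL=C keeps [A-Z] limited to ASCII capital letters.
matches=$(printf '%s\n' "$sample" |
  LC_ALL=C grep -E '[[:alnum:]_]+[[:space:]]*,[[:space:]]*[A-Z]{2}[[:space:]]+[0-9]{5}')

printf '%s\n' "$matches"
```

Only the first two sample lines should be printed: the third has no comma at all, and the fourth has a lowercase state.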

prad77 03-31-2008 09:04 PM

regex="\s*(.*)\s*,\s*([A-Z]{2})\s+(\d{5}?)\s*"

It could look like the one above. You may have to tune it further, too...
Of course, it is interesting to explore further as well:

http://www.grymoire.com/Unix/Regular.html#uh-8


linuxmaveric 03-31-2008 09:28 PM

Thanks for the code examples!
 
Thanks for the code. I'll work with these to start.
Thanks for the head start and links to resources.
RR.

linuxmaveric 04-02-2008 05:26 PM

I figured out a simpler and quick solution to my original problem.
 
I created a simpler and easier solution for my problem that I originally posted.
Using grep & regEx to find an address consisting of "City, State and zip." City could be anything, state is always two capitals ("ST"), and the zip is 5 digits.

Here is my basic solution:

grep ", [A-Z][A-Z] [0-9][0-9][0-9][0-9][0-9]" /filename

This would do the job nicely. But thanks everyone for the suggestions. They gave me the ideas to figure this out.
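A quick way to check a pattern like this is to run it against a small test file. The sketch below assumes a throwaway file created with mktemp and invented sample lines; neither comes from the original post.

```shell
# Hypothetical test file with one good line and several near misses.
addr_file=$(mktemp)
cat > "$addr_file" <<'EOF'
Ann Arbor, MI 48104
Ann Arbor, Mi 48104
Ann Arbor, MI 4810
Ann Arbor,MI 48104
Mail to Dover, DE 19901 today
EOF

# Comma, one space, two capitals, one space, five digits.
# LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
found=$(LC_ALL=C grep ', [A-Z][A-Z] [0-9][0-9][0-9][0-9][0-9]' "$addr_file")

printf '%s\n' "$found"
rm -f "$addr_file"
```

The first and last lines should be printed. The others fail on a lowercase state letter, a four-digit zip, and a missing space after the comma.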

sundialsvcs 04-02-2008 10:33 PM

Sounds like a homework problem ... :tisk: ... but it's a valid exercise nonetheless.

A good strategy for planning a regular-expression is to look for "the rocks in the stream." These are the definable anchor-points, and the variable content flows around them.

So, what are the "rocks" in this scenario? Let's see...
  1. The "comma followed by one whitespace."
  2. "A sequence of exactly two capital letters" (which, significantly, is both a "rock" and a data-item that we'll want to capture...)
  3. A series of one-or-more whitespace characters between the state and the zip-code.

Another pair of "rocks" is the beginning of the line and the end of the line ... denoted by the characters '^' and '$'. If you know that the pattern you're looking for must start at the first position of the line and/or must conclude at the last position, you should include this in your pattern since that's very useful to the computer.
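To see what the '^' and '$' anchors change, here is a small sketch that counts matching lines with and without them. The sample lines and variable names are invented for illustration.

```shell
# Made-up input: a clean line, an address buried in other text,
# and a line whose zip has six digits.
lines='Reno, NV 89501
Ship to: Reno, NV 89501 (rear entrance)
Reno, NV 895012'

# Without anchors the pattern may match anywhere inside a line.
unanchored=$(printf '%s\n' "$lines" | LC_ALL=C grep -c -E ', [A-Z]{2} [0-9]{5}')

# With ^ and $ the whole line has to fit the pattern.
anchored=$(printf '%s\n' "$lines" | LC_ALL=C grep -c -E '^.*, [A-Z]{2} [0-9]{5}$')

echo "unanchored: $unanchored, anchored: $anchored"
```

This should report 3 unanchored matches but only 1 anchored match, because the trailing text and the sixth digit are rejected once the pattern must end at the end of the line.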

So... what's between those "rocks?" Data, of course, and that's the other thing to consider when you're building a regular-expression. You'll enclose these pieces in (parentheses) as a signal that you want to capture whatever characters match these things. If the string that you've been given "matches" the regular-expression, you'll have (some kind of) "easy" way to extract these pieces.

Okay... so what do we have here? Let's see:
  1. A data-item that we want to extract... beginning at the "^"start of the line"^" and consisting of zero-or-more "*" any-characters ".", which we want to capture. (Parentheses...)
  2. The sequence of characters {comma, white"\s"pace}, which is just a rock.
  3. Exactly two "[A-Z]"alphabetic characters. Data.
  4. ... Followed by one or more white"\s"pace characters. Another rock.
  5. Followed by "{5}" "\d"igit-characters, which we want to (capture) as data. ...
  6. Followed by the "$"end-of-the-line"$".

Now, since this undoubtedly is a homework assignment, I'm gonna stop right there. :D But, each and every one of those bullet points corresponds to one-or-more somethings in a regular-expression pattern.
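For readers who want to check their own attempt, the six points above could be assembled roughly as follows. This is only one possible reading, not sundialsvcs's own answer. grep cannot print captured groups, so the sketch swaps in sed -E (supported by GNU and BSD sed); the function name parse_address and the "|" output separator are made up for illustration, and [[:space:]] and [0-9] replace \s and \d because those shorthands are not portable.

```shell
# Hypothetical helper: prints "city|state|zip" if the line matches,
# and prints nothing otherwise.
parse_address() {
  printf '%s\n' "$1" |
    LC_ALL=C sed -E -n \
      's/^(.*),[[:space:]]([A-Z]{2})[[:space:]]+([0-9]{5})$/\1|\2|\3/p'
}

parse_address 'Salt Lake City, UT 84101'
parse_address 'not an address'
```

The first call should print Salt Lake City|UT|84101 and the second should print nothing. Each parenthesised group in the pattern comes back as \1, \2, or \3 in the replacement.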

linuxmaveric 04-03-2008 12:45 AM

Naughty naughty :) that's funny
 
Actually, this was not homework or a test, as all our tests are done in class. It was part of a classroom project, and forums were OK'd, so no harm done. But your emoticon is pretty funny, LOL! I got good ideas from the previous posts, but our project was to simplify the solution as much as possible. My instructor did accept my solution: grep ", [A-Z][A-Z] [0-9][0-9][0-9][0-9][0-9]" /filename.

"ST" is just an example I used. State could be anything. Thanks for the reply but I already nailed this one on the head. I imagine there are so many different ways to figure this out; using perl, php, ruby...heck even python has its own twist. Regex's can be so varied. :)

Wow, thanks for your detailed explanation. You sound like you have a strong command of regex. I will be there someday too. Wish me luck!
Once again Thank you everyone for your help.
RR.

Darkcrimson 04-27-2010 11:07 PM

Looks like someone's taking the Unix/Linux System Administration course with O'Reilly. I remember this question... taken right from the quiz, haha. Good luck to you.

grail 04-28-2010 12:31 AM

May I also suggest looking up regular-expression quantifiers? Look for something like {n,m}; it might help make your simple solution even shorter.
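As a sketch of what that shortening might look like, the snippet below compares a spelled-out pattern with an interval-quantifier version on some invented sample lines. It assumes grep -E; in basic (non -E) mode the braces would need backslashes, as in \{2\}.

```shell
# Spelled-out pattern and a shorter equivalent using {n} intervals.
long=', [A-Z][A-Z] [0-9][0-9][0-9][0-9][0-9]'
short=', [A-Z]{2} [0-9]{5}'

# Made-up sample input.
data='Tulsa, OK 74103
Tulsa, OK 7410
Tulsa, ok 74103'

long_out=$(printf '%s\n' "$data" | LC_ALL=C grep "$long")
short_out=$(printf '%s\n' "$data" | LC_ALL=C grep -E "$short")

# Both patterns should select the same lines.
[ "$long_out" = "$short_out" ] && echo "same matches"
```

{2} means "exactly two" and {5} "exactly five"; the general {n,m} form allows between n and m repetitions.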


All times are GMT -5.