LinuxQuestions.org

buldir 03-16-2006 09:58 PM

regexp question: first instance for each line
 
So I have some text like:
Code:

Identification_Information:
  Citation:
    Citation_Information:
      Originator: Shmo, Joe
      Originator: Shmoe, Jan
      Publication_Date: 092005
      Title: Some Title Goes Here
      Geospatial_Data_Presentation_Form: map
      Series_Information:
        Series_Name: Report
        Issue_Identification: PIR 2005-6
      Publication_Information:
        Publication_Place: Backwoods, USA
        Publisher: Department of Natural Resources
      Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000
      Online_Linkage: http://www.bananas.org

and I want to match the first word on each line that begins with a capital letter and ends with a ":". So far I have:
Code:

(\w?[A-Z][a-z].+:)
Which gives:
Code:

Identification_Information:
Citation:
Citation_Information:
Originator:
Originator:
Publication_Date:
Title:
Geospatial_Data_Presentation_Form:
Series_Information:
Series_Name:
Issue_Identification:
Publication_Information:
Publication_Place:
Publisher:
Other_Citation_Details: 15 p., 1 sheet, scale 1:
Online_Linkage: http:

I need to get rid of the " 15 p., 1 sheet, scale 1:" and " http:" in the last two lines. Any help would be greatly appreciated.
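
The overshoot on those two lines comes from ".+" being greedy: it consumes the rest of the line and then backtracks only as far as the last colon. A minimal sketch that reproduces this, assuming Python's re module (the thread does not name an engine at this point; the variable names are illustrative only):

```python
import re

# Two of the problem lines, copied from the sample text above.
lines = [
    "      Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000",
    "      Online_Linkage: http://www.bananas.org",
]

# '.+' is greedy: it consumes the rest of the line, then backtracks
# only as far as the LAST colon it can find.
pattern = re.compile(r"(\w?[A-Z][a-z].+:)")

greedy_matches = [pattern.search(line).group(1) for line in lines]
for match in greedy_matches:
    print(match)
```

Both results run through to the final colon on the line, which matches the unwanted output shown above.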

xhi 03-16-2006 10:14 PM

Code:

(\w?[A-Z][a-z].+?:)
I think the ? should make the match lazy, so it stops at the first colon.

Edit: actually, this should do it:
Code:

(.+?:)
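
For reference, here is a small sketch of what the added ? changes, assuming Python's re module (an assumption; other engines may differ in details). With ".+?" the engine extends the match one character at a time and stops at the first colon it reaches:

```python
import re

line = "Online_Linkage: http://www.bananas.org"

# Lazy quantifier: '.+?' grows one character at a time, so the match
# ends at the FIRST colon after the capitalised word.
lazy = re.search(r"(\w?[A-Z][a-z].+?:)", line).group(1)

# The shorter pattern behaves the same way on this line.
short = re.search(r"(.+?:)", line).group(1)

print(lazy)
print(short)
```

Note that search() reports only the first match in the string; a tool that lists every match will keep scanning after that colon.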

buldir 03-17-2006 12:26 AM

Thanks for the quick response.
Code:

(.+?:)
gives me:
Code:

Supplemental_Information:
(contact information below). web site (http:
Process_Description:
environment to a 1:
Other_Citation_Details:
15 p., 1 sheet, scale 1:
Online_Linkage: http:

for the text:

Supplemental_Information: (contact information below). web site (http://www.
Process_Description: environment to a 1:250,000 topographic basemap.
Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000
Online_Linkage: http://www.

which is close. I still need to get rid of any other text beyond the first colon. I tried placing:
Code:

{1}
after the colon, but no luck.
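
A short sketch of why this happens, assuming Python's re module as a stand-in for the tool being used: an unanchored lazy pattern stops at the first colon, but the engine then resumes scanning right after it and reports a second match on the same line. Adding {1} after the colon changes nothing, because ":" already means exactly one colon.

```python
import re

line = "Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000"

# Unanchored and lazy: each match ends at the next colon, then the
# scan resumes after it and finds another match on the same line.
all_matches = re.findall(r"(.+?:)", line)

# ':{1}' means "exactly one colon", which is what ':' already means,
# so the result is identical.
with_quantifier = re.findall(r"(.+?:{1})", line)

print(all_matches)
print(with_quantifier)
```

What is needed is a way to tie the match to the start of the line, rather than a count on the colon.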

buldir 03-17-2006 02:43 AM

Here's my last attempt before I hit the sack...
Code:

(\w?[A-Z][a-z].+[a-z]:[^//0-9])
which takes care of the four troublesome lines I mentioned above and gives me
Code:

Supplemental_Information:
Process_Description:
Other_Citation_Details:
Online_Linkage:

but not for the text:

Ordering: Order by phone, Payment accepted: Cash, check, money order, VISA, or MasterCard

which is still:

Code:

Ordering: Order by phone, Payment accepted:
Almost...
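
A sketch of why this attempt fixes some lines but not the "Ordering" one, again assuming Python's re module: the ".+" is still greedy, so the match runs to the last place where a lowercase letter is followed by a colon and then by a character that is neither "/" nor a digit. That rules out "1:250,000" and "http://", but "accepted: " still qualifies.

```python
import re

pattern = re.compile(r"(\w?[A-Z][a-z].+[a-z]:[^//0-9])")

# The "1:" is preceded by a digit, so the match falls back to "Details: ".
fixed = pattern.search(
    "Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000"
).group(1)

# "accepted: " is a lowercase letter, a colon and a space, so the greedy
# match still runs all the way to the second colon.
still_wrong = pattern.search(
    "Ordering: Order by phone, Payment accepted: Cash, check, money order, VISA, or MasterCard"
).group(1)

print(repr(fixed))
print(repr(still_wrong))
```

In this sketch the match also includes the character after the colon (here a space), since the trailing character class consumes one character.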

muha 03-17-2006 05:32 AM

Using sed I get this:
Code:

$ sed -n 's/\ *\([A-Z][^:]*:\).*/\1/p' file
Identification_Information:
Citation:
Citation_Information:
Originator:
Originator:
Publication_Date:
Title:
Geospatial_Data_Presentation_Form:
Series_Information:
Series_Name:
Issue_Identification:
Publication_Information:
Publication_Place:
Publisher:
Other_Citation_Details:
Online_Linkage:

Does that work?
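
For anyone without sed, a rough Python port of the same idea (a sketch, not an exact equivalent: re.match() anchors at the start of each line, whereas the sed expression is unanchored). The key part is "[^:]*", which cannot cross a colon, so the match always ends at the first one:

```python
import re

text = """\
      Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000
      Online_Linkage: http://www.bananas.org
Ordering: Order by phone, Payment accepted: Cash, check, money order, VISA, or MasterCard
"""

# Optional leading spaces, a capital letter, then any run of
# non-colon characters up to and including the first colon.
pattern = re.compile(r" *([A-Z][^:]*:)")

labels = []
for line in text.splitlines():
    m = pattern.match(line)  # match() only tries at the start of the line
    if m:
        labels.append(m.group(1))

print(labels)
```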

xhi 03-17-2006 08:31 AM

Oops, I should have anchored it to the start of the string:
Code:

^(.+?:)
See if that works. What language is this, by the way?

buldir 03-20-2006 01:20 PM

Thanks muha and xhi. This problem wasn't related to any specific language. I needed a regexp to highlight all elements in a metadata file using the program EditPad Pro.
Code:

^(.+?:)
works great. I couldn't use sed because the program only supports regular expressions. I was testing the regular expression in another program called Expresso, but because I forgot to check the "Multiline" option, the "^" at the beginning of the regexp was anchored to the start of the entire string rather than to the start of each line. After I checked the option, the regexp that xhi suggested worked like a charm. Thanks again to you both.
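
The effect of that option can be sketched as follows, assuming Python's re module, where the corresponding switch is the re.MULTILINE flag (Expresso and EditPad Pro expose it as a checkbox or setting instead):

```python
import re

text = (
    "Other_Citation_Details: 15 p., 1 sheet, scale 1:250,000\n"
    "Online_Linkage: http://www.bananas.org\n"
)

# Without the flag, '^' matches only at the very start of the string,
# so only the first line yields a match.
without_flag = re.findall(r"^(.+?:)", text)

# With MULTILINE, '^' also matches right after every newline,
# so each line yields exactly one match.
with_flag = re.findall(r"^(.+?:)", text, flags=re.MULTILINE)

print(without_flag)
print(with_flag)
```

On indented lines the match will include the leading spaces, since ".+?" starts right at the "^".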


All times are GMT -5.