select all text between a pattern using grep

mauran · 07-13-2007, 03:47 PM

How can I select all the text between a specific pattern using grep?
or can I?

Quote:

balahblah

<text> blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblahblah
</text>

bbbnnkkkmnmm

I need to select all the text situated between <text></text>

acid_kewpie · 07-13-2007, 04:40 PM

grep doesn't select text, it finds and prints entire matching lines from a regex. "man grep" for details.
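A small illustration of that point, as a sketch only: by default grep prints the whole matching line, while the -o option (an extension found in GNU and BSD grep, not required by POSIX) prints just the part of the line that matched. The sample line and the variable names are made up for the example.

```shell
# Hypothetical sample line; the tags sit in the middle of other text.
line='before <text>inner words</text> after'

# Default behaviour: the entire matching line is printed.
whole=$(printf '%s\n' "$line" | grep '<text>.*</text>')

# With -o: only the matched portion is printed (tags included).
matched=$(printf '%s\n' "$line" | grep -o '<text>.*</text>')

printf '%s\n' "$whole" "$matched"
```

Even with -o the tags are still part of the match, and grep works one line at a time, so text spread over several lines needs another tool.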

macemoneta · 07-13-2007, 04:57 PM

You can use pcregrep as in:

pcregrep -Mi "^<text>\s.*\s</text>" somefile

ghostdog74 · 07-13-2007, 10:56 PM

Code:

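awk '/^$/ {next}                          # skip empty lines
{
	if (match($0, "<text>")) {
		start = RSTART + RLENGTH          # first character after <text>
		line = $0
		if (match(line, "</text>")) {     # both tags on the same line
			print "line:" substr(line, start, RSTART - start)
		} else {                          # opening tag only: text continues below
			inside = 1
			if (start <= length(line)) print substr(line, start)
		}
	} else if (inside && match($0, "</text>")) {
		if (RSTART > 1) print substr($0, 1, RSTART - 1)   # text before </text>
		inside = 0
	} else if (inside) {
		print                             # a whole line between the tags
	}
}' "file"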

farkus888 · 07-13-2007, 11:13 PM

Pretty sure sed can do that; a quick Google for "sed one liners" should get you that method. I know it's quicker than that awk method if it's there.

syg00 · 07-13-2007, 11:55 PM

"sed -n '/<text>/,/<\/text>/p' filename" should do it. Will need some work if both are on the one line.
And yes, go look at the one-liners on the sed site on sf.

ghostdog74 · 07-14-2007, 12:09 AM

Quote:

Originally Posted by syg00

"sed -n '/<text>/,/<\/text>/p' filename" should do it. Will need some work if both are on the one line.
And yes, go look at the one-liners on the sed site on sf.

If I am not wrong, the OP wants to extract the text in between the tags. So I guess some more manipulation is required with the sed method.

Code:

awk '/<text>/,/<\/text>/' file   # equivalent to: sed -n '/<text>/,/<\/text>/p' file
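As a rough sketch of the extra manipulation needed to drop the tags themselves: inside the same range, delete everything up to and including the opening tag, delete the closing tag and whatever follows it, and print only lines that still have content. This assumes the opening and closing tags are on different lines; the sample input and the variable name are invented for the example.

```shell
# Hypothetical input modelled on the original question.
between=$(printf '%s\n' 'balahblah' '<text> first line' 'second line' '</text>' 'bbbnnkkkmnmm' |
sed -n '/<text>/,/<\/text>/{
s/.*<text>//
s/<\/text>.*//
/./p
}')

# Show what was captured between the tags.
printf '%s\n' "$between"
```

The final /./p skips lines left empty once a tag has been removed, so a closing tag on a line of its own produces no blank output line.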

ghostdog74 · 07-14-2007, 12:15 AM

Quote:

Originally Posted by farkus888

I know it's quicker than that awk method if it's there.

well, it really doesn't matter, does it?

syg00 · 07-14-2007, 12:29 AM

Quote:

Originally Posted by ghostdog74

If I am not wrong, the OP wants to extract the text in between the tags. So I guess some more manipulation is required with the sed method.

Yeah, you might be right - I was thinking "inclusive" of tags.
Oh well.

As always, lots of ways of getting the job done. Won't take much to clean-up, depending on what the OP actually wanted. Could be done any number of ways.

farkus888 · 07-14-2007, 12:58 AM

Quote:

Originally Posted by ghostdog74

well, it really doesn't matter, does it?

I always like to find the shortest [least code required] method to do something, especially for someone like this: if they couldn't figure this out on their own, they probably don't understand all of what's going on in the code you provided. Not trying to knock you by any means, just providing insight on a simpler method. I know when I am new to something it drives me crazy to have people show me overcomplicated methods for doing something very simple; it makes it harder to understand so I can do it on my own next time. I try to help people not have the problems learning that I had, not just give them a one-time fix for their problem.

jschiwal · 07-14-2007, 01:11 AM

If the tags and the contents are on the same line, then it can be done easily using sed:
sed -n '/<text>/,/<\/text>/s/.*<text>\(.*\)<\/text>/\1/p' file
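For the same-line case the range address is arguably optional, since the s command only succeeds on lines that hold both tags. A minimal sketch under that assumption, with sample input invented for the example; the trailing .* also drops anything that follows the closing tag:

```shell
# Two hypothetical input lines: one with both tags, one without.
inner=$(printf '%s\n' 'x <text>hello there</text> y' 'no tags here' |
sed -n 's/.*<text>\(.*\)<\/text>.*/\1/p')

# Only the line with both tags yields output.
printf '%s\n' "$inner"
```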

I've used something similar with k3b. If you save the project to a file, it actually creates a zip archive containing two files. One of them is named maindata.xml.

Code:

jschiwal@hpamd64:~> unzip podcasts.k3b
Archive:  podcasts.k3b
 extracting: mimetype
  inflating: maindata.xml

The xml file contains a catalog of backed up files. You could use this file to give you a list of names that are safe to delete because they are backed up.

Code:

...
<file name="JM-001.ogg" >
<url>/home/jschiwal/Podcasts/JM-001.ogg</url>
</file>
<file name="LQ-Podcast-050207.mp3" >
<url>/home/jschiwal/Podcasts/LQ-Podcast-050207.mp3</url>
</file>
<file name="LQ-Podcast-051207.mp3" >
<url>/home/jschiwal/Podcasts/LQ-Podcast-051207.mp3</url>
</file>

Notice the similar pattern. The filenames are between the <url></url> tags.

Code:

sed -n '/^<url>/s/^<url>\(.*\)<\/url>/\1/p' maindata.xml

...
/home/jschiwal/Podcasts/CrankyGeeks/crankygeeks.064.mp4
/home/jschiwal/Podcasts/CrankyGeeks/crankygeeks.066.mp4
/home/jschiwal/Podcasts/CrankyGeeks/crankygeeks.067.mp4
/home/jschiwal/Podcasts/JM-001.ogg
/home/jschiwal/Podcasts/LQ-Podcast-050207.mp3
/home/jschiwal/Podcasts/LQ-Podcast-051207.mp3
...

In this case, because the source is an XML file, you need to watch for the patterns &gt; &lt; &amp; and replace them with the characters >, <, & respectively. So adding three sed commands is necessary.

Code:

jschiwal@hpamd64:~> sed -n '/^<url>/{
> s/^<url>\(.*\)<\/url>/\1/
> s/&gt;/>/g
> s/&lt;/</g
> s/&amp;/\&/g
> p
> }' maindata.xml
...
/home/jschiwal/Podcasts/50@10712b865b6a420bdea05b6cc5bfde98
/home/jschiwal/Podcasts/CrankyGeeks/crankygeeks.064.mp4
/home/jschiwal/Podcasts/CrankyGeeks/crankygeeks.066.mp4
/home/jschiwal/Podcasts/CrankyGeeks/<crankygeeks>&.067.mp4
/home/jschiwal/Podcasts/JM-001.ogg
/home/jschiwal/Podcasts/LQ-Podcast-050207.mp3
/home/jschiwal/Podcasts/LQ-Podcast-051207.mp3

Whatever method you use, it is best to test it out. You may have forgotten some patterns that can trip you up. The first time I did this I forgot about the reserved characters in XML, and files containing these characters weren't being deleted.
In composing this message, I added one sed rule at a time and tested it before going to the next one. By simply pressing the up arrow in the shell and adding semicolons between the sed commands, I can convert this into a true one-liner:

Code:

sed -n '/^<url>/{s/^<url>\(.*\)<\/url>/\1/;s/&gt;/>/g;s/&lt;/</g;s/&amp;/\&/g;p}' maindata.xml
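One detail in the command above that is easy to miss: the &amp; substitution comes last. A small sketch of why the order matters, using a made-up string in which a literal "&lt;" was itself escaped as "&amp;lt;" (the variable names are only for illustration):

```shell
# Hypothetical encoded text: "&amp;lt;" stands for the literal characters "&lt;".
encoded='a &amp;lt; b &gt; c'

# Decoding &amp; last leaves the literal "&lt;" intact.
right=$(printf '%s\n' "$encoded" | sed 's/&gt;/>/g;s/&lt;/</g;s/&amp;/\&/g')

# Decoding &amp; first creates a new "&lt;" that the later rule decodes again.
wrong=$(printf '%s\n' "$encoded" | sed 's/&amp;/\&/g;s/&gt;/>/g;s/&lt;/</g')

printf '%s\n' "$right" "$wrong"
```

With the substitutions in the original order the text is decoded exactly once; with &amp; first, part of it is decoded twice.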

I hope I remember to change the filename back to "crankygeeks.067.mp4" after this demonstration!

syg00 · 07-14-2007, 02:13 AM

Quote:

Originally Posted by jschiwal

sed -n '/<text>/,/<\/text>/s/.*/<text>\(.*\)<\/text>/\1/p' file.

A small typo maybe ???

ghostdog74 · 07-14-2007, 03:32 AM

Quote:

Originally Posted by farkus888

I always like to find the shortest [least code required] method to do something.

that's the problem with one liners in general IMO. They are short and specific to do a task, but not necessarily easily understandable to the one reading/maintaining it.

jschiwal · 07-14-2007, 03:52 AM

Quote:

Originally Posted by syg00

A small typo maybe ???

Yes. That is the one line I didn't test out. I'll blame it on finger memory.

syg00 · 07-14-2007, 03:52 AM

Edit: Response to ghostdog74.

"quick and dirty" hacks are fine for ad hoc one-time needs.
In a corporate environment, it pays to have a better (and better documented) generic solution. Personally I prefer perl in such a circumstance, but each to their own.

For a home user it may not matter.