LinuxQuestions.org


hal8000b 03-06-2009 05:20 PM

grep, sed, awk or tr - searching words in a string
 
I'm making a number of changes to HTML web pages. I've used Quanta's "find in files" option, but I would like something fully automatic.

The first problem is that I need to get just the title of the page.
For example, from the string:
<title>Download Page</title>

I need to parse the string so that it returns just
"Download Page" (without quotes).

I've used
tr '</>' ' ' (which gets rid of the <, > and / characters), but how do I get rid of the string "title" while keeping the other characters in the string?

Thanks in advance

colucix 03-06-2009 05:35 PM

Using sed you can keep part of the pattern. Just embed it in escaped parentheses and refer to it as \1, like in the following example:
Code:

echo "<title>Download Page</title>" | sed 's/<title>\(.*\)<\/title>/\1/'
You have to choose the regular expression carefully to retrieve a unique result. In the case of the title it should be easy, but what if you have multiple HTML tags on the same line?
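
To illustrate the multiple-tags case, here is a minimal sketch. The input line with the extra <h1> element is a made-up example, and the variable names are only for illustration. It shows that the expression above leaves any text outside the match untouched, and one possible way to tighten it: match the whole line and capture only characters that are not "<".

```shell
# Hypothetical input: a second tag follows the title on the same line
line='<title>Download Page</title><h1>Files</h1>'

# Original expression: only the matched part is replaced,
# so the trailing <h1> element stays in the output.
first=$(printf '%s\n' "$line" | sed 's/<title>\(.*\)<\/title>/\1/')
echo "$first"    # Download Page<h1>Files</h1>

# Tightened expression: .* on both sides consumes the rest of the line,
# and [^<]* keeps the capture from running into another tag.
second=$(printf '%s\n' "$line" | sed 's/.*<title>\([^<]*\)<\/title>.*/\1/')
echo "$second"   # Download Page
```

The [^<]* form assumes the title text itself contains no "<" character, which is one more reason a real parser is safer for arbitrary pages.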

I'd suggest using an existing HTML parser. There are plenty of them available for free, written in different languages. Just google for them to get the idea! :)

Edit: I just thought of a simpler sed command that only removes the unwanted part:
Code:

echo "<title>Download Page</title>" | sed 's/<\/*title>//g'

syg00 03-06-2009 08:04 PM

I prefer the first offering: pick the data you want to keep. It's easy to make it handle the potential for extra data on the record, even the unlikely case of multiple <title>..</title> pairs.
The "simple" latter offering won't deal with extra data at all.
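
As a rough sketch of that difference (the page content below is invented for illustration, and the anchored expression with -n and the p flag is one possible extension of the first offering, not the exact command posted above):

```shell
# Hypothetical page content, used only for illustration
page='<html><head>
<meta charset="utf-8"><title>Download Page</title></head>
<body><p>Some text</p></body></html>'

# "Keep" approach: -n suppresses automatic printing and the p flag prints
# only lines where the substitution succeeded, so other lines are dropped.
keep=$(printf '%s\n' "$page" | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p')
echo "$keep"     # Download Page

# "Remove" approach: only the title tags are deleted; all other data
# on every line passes through unchanged.
strip=$(printf '%s\n' "$page" | sed 's/<\/*title>//g')
echo "$strip"
```

With this input the second command still prints all three lines, including the <meta> and </head> text around the title.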

Where regex is concerned I favour being as explicit as possible - it's way too easy for things to slip "under the radar".

