LinuxQuestions.org
Old 03-06-2009, 06:20 PM   #1
hal8000b
grep, sed, awk or tr - searching words in a string


I'm making a number of changes to HTML web pages. I've used Quanta's "find in files" option, but I would like something fully automatic.

The first problem is that I need to get just the title of the page.
For example, from the string:
<title>Download Page</title>

I need to parse the string so it just returns
"Download Page" (without quotes).

I've used
tr '</>' ' '
which replaces the <, / and > characters with spaces, but how do I get rid of the string "title" while keeping the other characters in the string?

Thanks in advance
 
Old 03-06-2009, 06:35 PM   #2
colucix
Using sed, you can keep part of the matched pattern. Enclose it in escaped parentheses and refer to it as \1, as in the following example:
Code:
echo "<title>Download Page</title>" | sed 's/<title>\(.*\)<\/title>/\1/'
You have to choose the regular expression carefully to retrieve a unique result. In the case of the title it should be easy, but what if you have multiple HTML tags on the same line?
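
To illustrate the multiple-tags caveat: .* is greedy, so if two <title> pairs sit on one line, the capture runs up to the last closing tag. Below is a minimal sketch, assuming the title text itself never contains a "<". The function names title_greedy and title_strict are made up for this example and are not part of any tool.

```shell
# Hypothetical helpers for illustration only.

# Greedy capture: \(.*\) extends to the LAST </title> on the line.
title_greedy() {
    printf '%s\n' "$1" | sed 's/<title>\(.*\)<\/title>/\1/'
}

# Stricter capture: [^<]* stops at the first '<', so only the first
# title text is kept; the trailing .* discards the rest of the line.
# Anything before the first <title> would be left in place.
title_strict() {
    printf '%s\n' "$1" | sed 's/<title>\([^<]*\)<\/title>.*/\1/'
}

line='<title>First</title> <title>Second</title>'
title_greedy "$line"   # prints: First</title> <title>Second
title_strict "$line"   # prints: First
```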

I'd suggest using an existing HTML parser. Plenty of them are available for free, written in different languages. Just google for them to get the idea!

Edit: I just thought of a simpler sed command that only removes the unwanted part:
Code:
echo "<title>Download Page</title>" | sed 's/<\/*title>//g'
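
As a side note on why tr alone cannot do this: tr treats its argument as a set of single characters, not as a string, so adding "title" to the set also removes those letters from the page title itself. A small sketch of the difference follows; the function names strip_with_tr and strip_with_sed are hypothetical.

```shell
# Hypothetical helpers for illustration only.

# tr -d deletes every occurrence of each listed character:
# <, /, >, t, i, l and e, including those inside the title text.
strip_with_tr() {
    printf '%s\n' "$1" | tr -d '</>title'
}

# sed matches the literal strings <title> and </title>
# (\/* means zero or more slashes) and leaves everything else alone.
strip_with_sed() {
    printf '%s\n' "$1" | sed 's/<\/*title>//g'
}

strip_with_tr  '<title>Download Page</title>'   # prints: Downoad Pag
strip_with_sed '<title>Download Page</title>'   # prints: Download Page
```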

Last edited by colucix; 03-06-2009 at 06:52 PM.
 
Old 03-06-2009, 09:04 PM   #3
syg00
I prefer the first offering: pick the data you want to keep. It's easy to make it handle extra data on the record, even the unlikely case of multiple <title>..</title> pairs.
The "simple" latter offering won't deal with extra data at all.

Where regex is concerned I favour being as explicit as possible - it's way too easy for things to slip "under the radar".
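
A rough sketch of the difference when the record carries extra data, assuming each title sits on a single line and contains no "<". The names keep_title and remove_tags are invented for this example. Both read standard input, so they can be pointed at a whole file.

```shell
# Hypothetical helpers for illustration only.

# "Keep what you want": anchor with .* on both sides, capture the title,
# and use -n with the p flag so only lines that matched are printed.
keep_title() {
    sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p'
}

# "Remove what you don't want": only the title tags disappear,
# any other markup on the line is passed through unchanged.
remove_tags() {
    sed 's/<\/*title>//g'
}

printf '%s\n' '<html>' '<head><title>Download Page</title></head>' | keep_title
# prints: Download Page

printf '%s\n' '<head><title>Download Page</title></head>' | remove_tags
# prints: <head>Download Page</head>
```

With a file, this would be run as keep_title < page.html, where page.html is a placeholder name.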
 
  


All times are GMT -5.
