LinuxQuestions.org
Old 04-05-2009, 12:59 PM   #1
Mountain
Member
 
Registered: Nov 2007
Location: A place with no mountains
Distribution: Kubuntu, sidux, openSUSE
Posts: 214

Rep: Reputation: 33
How to do search & replace on a text file--need to extract URLs from a sitemap file


I need a little search & replace help.

Here's how far I have gotten.
1. started with a site map file like this:
Code:
<?xml version="1.0" encoding="utf-8"?>
<!--Created by Devintelligence.com Sitemap Generator-->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://example.com</loc>
    <lastmod>2009-04-04</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
2. got rid of all lines except those with URLs
Code:
$ grep '^\s*<loc>' example.com.SiteMap.xml > example.com.UrlList.txt
result looks like this:
Code:
    <loc>http://example.com</loc>
    <loc>http://example.com/forums/43.aspx</loc>
    <loc>http://example.com/blogs/300.aspx</loc>
3. Now I need to get rid of the white space at the beginning of the line and keep just the URL between the opening and closing tags, and output that as a new file.

Not sure of the next step...
reading about awk, sed, etc. and just confused...

My final result should be a file with lines like this:
Code:
http://example.com
http://example.com/forums/43.aspx
http://example.com/blogs/300.aspx
 
Old 04-05-2009, 01:10 PM   #2
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,488

Rep: Reputation: 1956
All the tasks can be accomplished by a single sed command:
Code:
sed '/<loc>/!d; s/ *<loc>\(.*\)<\/loc>/\1/' file
The /<loc>/!d command deletes every line that does not contain <loc>; the substitution command then reduces each remaining line to the text between the <loc> and </loc> tags.
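To see the one-liner in action, here is a minimal sketch that runs it against a shortened copy of the sitemap from the first post. The temporary file and the variable names (sample, urls) are assumptions made for this illustration and are not part of the original thread.

```shell
# Build a small sample sitemap in a temporary file (hypothetical test data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://example.com</loc>
    <lastmod>2009-04-04</lastmod>
  </url>
  <url>
    <loc>http://example.com/forums/43.aspx</loc>
  </url>
</urlset>
EOF

# /<loc>/!d  deletes every line that does NOT contain <loc>;
# s/.../\1/  replaces what is left with the text between the tags.
urls=$(sed '/<loc>/!d; s/ *<loc>\(.*\)<\/loc>/\1/' "$sample")
printf '%s\n' "$urls"

# Clean up the temporary file.
rm -f "$sample"
```

Note that the pattern ' *' matches only spaces, so a sitemap indented with tabs would keep its leading tab characters.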
 
Old 04-05-2009, 01:20 PM   #3
Mountain
Member
 
Registered: Nov 2007
Location: A place with no mountains
Distribution: Kubuntu, sidux, openSUSE
Posts: 214

Original Poster
Rep: Reputation: 33
Solved
Code:
sed 's/^\s*<loc>\(.*\)<\/loc>/\1/' infile > outfile
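The command above expects input that has already been filtered down to the <loc> lines, and \s is a GNU extension. As a hedged sketch of a more portable single-step variant, the example below uses the POSIX class [[:space:]] together with sed -n and the p flag; the sample data and the variable names (infile, result) are assumptions made for this illustration.

```shell
# Sample input with mixed indentation (spaces and a tab); hypothetical data.
infile=$(mktemp)
printf '    <loc>http://example.com</loc>\n\t<loc>http://example.com/blogs/300.aspx</loc>\n    <priority>0.9</priority>\n' > "$infile"

# -n suppresses automatic printing; the trailing p prints only the lines
# where the substitution succeeded, so no separate grep step is needed.
# [[:space:]] is the POSIX spelling of GNU's \s and also matches tabs.
result=$(sed -n 's/^[[:space:]]*<loc>\(.*\)<\/loc>/\1/p' "$infile")
printf '%s\n' "$result"

# Clean up the temporary file.
rm -f "$infile"
```

Redirecting the sed output with > outfile, as in the command above, writes the URL list to a new file instead of the terminal.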
 
Old 04-05-2009, 01:22 PM   #4
Mountain
Member
 
Registered: Nov 2007
Location: A place with no mountains
Distribution: Kubuntu, sidux, openSUSE
Posts: 214

Original Poster
Rep: Reputation: 33
Quote:
Originally Posted by colucix
All the tasks can be accomplished by a single sed command:
Code:
sed '/<loc>/!d; s/ *<loc>\(.*\)<\/loc>/\1/' file
The /<loc>/!d command deletes every line that does not contain <loc>; the substitution command then reduces each remaining line to the text between the <loc> and </loc> tags.
Thank you! That's better than my two-step solution.
 
  

