Read a file and filter out specific tags
I am a newbie in shell scripting and would appreciate help with this question. Many thanks in advance, and apologies for the big input file.
I have an .xml file that is a concatenation of multiple RSS files. The requirement is to filter out all the extra content in the file and keep only the actual items. E.g.:
<?xml version="1.0" encoding="iso-8859-1"?>
<rss>
........... some text here
<channel>
.......... some more tags here
<item>
<title>Item Example 1</title>
<link>http://www.domain.com/link1.htm</link>
</item>
<item>
<title>Item Example 2</title>
<link>http://www.domain.com/link2.htm</link>
</item>
</channel>
</rss>
<rss>
..... some other tags ......
<item>
<title>Item Example 3</title>
<link>http://www.domain.com/link3.htm</link>
</item>
....... more tags .......
<item>
<title>Item Example 4</title>
<link>http://www.domain.com/link4.htm</link>
</item>
<item>
<title>Item Example 5</title>
<link>http://www.domain.com/link5.htm</link>
</item>
</rss>
// an item can have more attributes
The output should be:
<item>
<title>Item Example 1</title>
<link>http://www.domain.com/link1.htm</link>
</item>
and the other items.
Many thanks, Soni |
grep -v rss ?
I think if your RSS format is static (one tag per line), it's quite simple to remove the unwanted tags with grep. If the format changes to, say, a single line, you will need either harder regexes or an external programming language. |
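For the single-line case mentioned above, one possible workaround is to break the text at the item boundaries first and only then filter line by line. The sketch below is not a real XML parser: it assumes the tags appear literally as `<item>` and `</item>` (no attributes on the opening tag, no nesting), and the function name `extract_items` is made up for illustration.

```shell
# Hypothetical helper: works on one-tag-per-line and single-line input alike.
extract_items() {
  # Stage 1: start a new line before every <item> and after every </item>.
  # In awk's gsub(), "&" in the replacement stands for the matched text.
  awk '{ gsub(/<item>/, "\n&"); gsub(/<\/item>/, "&\n"); print }' "$@" |
    # Stage 2: set a flag at <item>, print while it is set,
    # and clear it only after the </item> line has been printed.
    awk '/<item>/{f=1} f; /<\/item>/{f=0}'
}

# Usage: a feed squeezed onto one line.
printf '%s\n' '<rss><channel><item><title>A</title></item><x/><item><title>B</title></item></channel></rss>' |
  extract_items
```

On that sample, each item comes out on its own line and the surrounding `<rss>`, `<channel>` and `<x/>` text is dropped. If the opening tag can carry attributes, the `<item>` pattern would have to be loosened accordingly.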
Code:
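# Set the flag at <item>, print while it is set, and clear it only after </item> has been printed.
$ awk '/<item>/{f=1} f; /<\/item>/{f=0}' file |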
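A shorter way to express the same idea, assuming one tag per line as in the question, is awk's range pattern, which prints every line from one matching `<item>` through the next line matching `</item>`, both included. The function name `items_by_range` and the sample input are only illustrative.

```shell
# Hypothetical helper: print each <item> ... </item> block, boundaries included.
items_by_range() {
  awk '/<item>/,/<\/item>/' "$@"
}

# Usage on a cut-down version of the sample from the question.
items_by_range <<'EOF'
<rss>
<channel>
<item>
<title>Item Example 1</title>
<link>http://www.domain.com/link1.htm</link>
</item>
</channel>
</rss>
<rss>
<item>
<title>Item Example 3</title>
<link>http://www.domain.com/link3.htm</link>
</item>
</rss>
EOF
```

This should print only the two item blocks. Like the flag version, it matches text rather than parsing XML, so it relies on the tags being written exactly as `<item>` and `</item>`.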