LinuxQuestions.org
Old 10-12-2007, 03:50 PM   #1
musther
Member
 
Registered: Sep 2007
Posts: 36

Rep: Reputation: 15
Bash script to strip some content from XML file.


I've got a large XML file containing TV listings for my MythTV box, and I want to filter it so I'm left with just the channels I receive.

It's in this form:

Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv lots=of tags=which give=basic_info>

<channel id="1012.dvb.guide">
	<display-name>Nat Geographic</display-name>
	<icon src="http://********/epg/icons/national_geographic.jpg" />
</channel>

#There are lots of the above channel sections, and then lots of the following programme sections.

<programme channel="1021.dvb.guide" start="20071013003000 +1300" stop="20071013012500 +1300">
	<title lang="eng">Sexiest Action Heroes</title>
	<desc>They are our sexy heroes and hell-raisers, the lucsious victors and vixens who play by their own rules and always hold the winning hand - smashing a few skulls in the process.</desc>
	<category>tvshow</category>
	<category>Reality</category>
	<rating system="SKY-NZ">
		<value>R16</value>
	</rating>
</programme>

</tv>

<!--
	[{'rating': '6', 'description': 'So then there's a bunch of footer info.....
So we have these sections:

Code:
Header stuff

Channels

Programmes

Footer stuff
So basically what I need to do is, from this, construct another file which has the header, the channels I have, the programmes on the channels I have, and then the footer stuff.

The difficult bit (as far as my bash text processing skills go) is filtering the channels and programmes. I figure I'll put the channel IDs of the channels I want in a list in a file, from which the script can get them. Then I suppose the best way would be to strip everything which doesn't match, so what I'll end up doing is:

Find a text string which goes:

Code:
<channel id="$i">******</channel>
and if $i isn't in the list, remove it, then do the same kind of thing for the programme entries.

Cheers!
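A minimal sketch of the "list of channel IDs in a file" idea, assuming one ID per line with blank lines and `#` comments ignored. The file name `channels.txt` and the function names are hypothetical; nothing in the thread fixes a format for this file.

```python
import tempfile
from pathlib import Path


def load_wanted_ids(path):
    """Return the set of channel IDs listed one per line in `path`.

    Blank lines and lines starting with '#' are skipped.
    """
    wanted = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            wanted.add(line)
    return wanted


def demo_wanted_ids():
    # Write a throwaway list file, then read it back
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "channels.txt"  # hypothetical file name
        list_file.write_text(
            "# channels I receive\n1012.dvb.guide\n\n1021.dvb.guide\n"
        )
        return load_wanted_ids(list_file)
```

A set is used so that the later "is this ID wanted?" checks stay cheap however long the list gets.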
 
Old 10-12-2007, 05:11 PM   #2
Disillusionist
Senior Member
 
Registered: Aug 2004
Location: England
Distribution: Ubuntu
Posts: 1,013

Rep: Reputation: 83
Try this Perl script:
Code:
#!/usr/bin/perl
use strict;
use warnings;

my $l_input="progs.xml";
my $l_output="output.xml";
my $l_selection="1012.dvb.guide";
my $l_print=1;   ## print the header until the first channel/programme

## Three-argument open with lexical filehandles; stop if either file fails
open(my $input, '<', $l_input) or die "Unable to open input file: $!";
open(my $output, '>', $l_output) or die "Unable to open output file: $!";

while (<$input>)
{
   if (/^<(?:channel|programme)\b/)
   {
      ## Found the start of a channel or programme entry, check if it
      ## matches the criteria. \Q...\E makes the dots in the ID literal.
      if (/\Q$l_selection\E/)
      {
         $l_print=1;
      } else {
         $l_print=0;
      }
   }

   if (/^<\/tv/)
   {
      ## Closing tag: print it and everything after it (the footer)
      $l_print=1;
   }

   if ( $l_print == 1 )
   {
      print $output $_;
   }
}

close($input);
close($output) or die "Unable to close output file: $!";
This prints every line up to the first line that starts with either <programme or <channel.
At each such line, it checks the selection and decides whether to print that section or not.

Once it gets to a line starting with </tv
it turns printing back on, so the closing tag and the footer are always written.

The code is not perfect and only allows for one selection criterion, but it's a starting point.
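For readers who don't know Perl, here is the same line-by-line idea sketched in Python, assuming (as the Perl does) that every `<channel>`, `<programme>` and `</tv>` tag starts its own line. The function name and the tiny `SAMPLE` listing are made up for illustration, and the sketch takes a set of IDs rather than a single one.

```python
def filter_listing_lines(lines, wanted):
    """Yield header/footer lines plus the blocks for wanted channel IDs."""
    printing = True  # header lines before the first block are kept
    for line in lines:
        if line.startswith("<channel") or line.startswith("<programme"):
            # A new block starts: keep it only if it names a wanted ID
            printing = any('"%s"' % cid in line for cid in wanted)
        elif line.startswith("</tv"):
            # The closing tag and everything after it (the footer) is kept
            printing = True
        if printing:
            yield line


# Made-up miniature listing in the same layout as the real file
SAMPLE = """<tv>
<channel id="1012.dvb.guide">
\t<display-name>Nat Geographic</display-name>
</channel>
<channel id="1021.dvb.guide">
\t<display-name>E!</display-name>
</channel>
<programme channel="1021.dvb.guide" start="20071013003000 +1300">
\t<title lang="eng">Sexiest Action Heroes</title>
</programme>
</tv>
""".splitlines(keepends=True)

kept = "".join(filter_listing_lines(SAMPLE, {"1012.dvb.guide"}))
```

The single `printing` flag is the whole state machine: it only changes at a block start or at `</tv>`, so every line in between inherits the decision made for its block.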
 
Old 10-13-2007, 09:31 PM   #3
musther
Member
 
Registered: Sep 2007
Posts: 36

Original Poster
Rep: Reputation: 15
Thanks for that, it's very helpful. I tried and tried to make it accept more strings using arrays etc., but I don't really know Perl and didn't manage to get it working.

Is there any chance you could help me get it to accept more? Filling them in at the top of the script is perfectly acceptable for this solution.

Thanks
 
Old 10-13-2007, 09:54 PM   #4
angrybanana
Member
 
Registered: Oct 2003
Distribution: Archlinux
Posts: 147

Rep: Reputation: 21
Can you provide a bigger sample of data? Maybe the full header/footer and a couple of channels, or maybe pastebin the whole file.

Also I'm guessing you want some sort of an array like this at the top of the script?
mychannels = [chan1, chan2, chan3]

Last edited by angrybanana; 10-13-2007 at 10:02 PM.
 
Old 10-13-2007, 10:50 PM   #5
musther
Member
 
Registered: Sep 2007
Posts: 36

Original Poster
Rep: Reputation: 15
Here's a much larger chunk of data (still far from the whole file):

Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="epgsnoop/0.12beta" generator-info-url="http://nice.net.nz/epgsnoop" date="20071013030543 +1300">
<channel id="1205.dvb.guide">
	<display-name>Unknown</display-name>
</channel>
<channel id="1012.dvb.guide">
	<display-name>Nat Geographic</display-name>
	<icon src="http://nice.net.nz/epg/icons/national_geographic.jpg" />
</channel>
<channel id="1021.dvb.guide">
	<display-name>E!</display-name>
	<icon src="http://nice.net.nz/epg/icons/e.jpg" />
</channel>
<channel id="1026.dvb.guide">
	<display-name>BBC World</display-name>
	<icon src="http://nice.net.nz/epg/icons/bbc_world.jpg" />
</channel>
<channel id="1031.dvb.guide">
	<display-name>TV One</display-name>
	<icon src="http://nice.net.nz/epg/icons/1.jpg" />
	<url>http://www.tv1.co.nz</url>
</channel>
<channel id="1028.dvb.guide">
	<display-name>Southland TV</display-name>
	<icon src="http://nice.net.nz/epg/icons/southland_tv.jpg" />
</channel>
<channel id="1038.dvb.guide">
	<display-name>Prime</display-name>
	<icon src="http://nice.net.nz/epg/icons/prime.jpg" />
</channel>
<channel id="1050.dvb.guide">
	<display-name>Preview</display-name>
</channel>
<channel id="1063.dvb.guide">
	<display-name>SKY Box Office</display-name>
	<icon src="http://nice.net.nz/epg/icons/sky_box_office.jpg" />
</channel>
<channel id="1071.dvb.guide">
	<display-name>SKY Box Office</display-name>
	<icon src="http://nice.net.nz/epg/icons/sky_box_office.jpg" />
</channel>
<channel id="1192.dvb.guide">
	<display-name>Weather</display-name>
	<icon src="http://nice.net.nz/epg/icons/weather_channel.jpg" />
</channel>
<channel id="1079.dvb.guide">
	<display-name>KTV 2</display-name>
	<icon src="http://nice.net.nz/epg/icons/ktv2.jpg" />
</channel>
<channel id="1086.dvb.guide">
	<display-name>CTV 6</display-name>
	<icon src="http://nice.net.nz/epg/icons/ctv6.jpg" />
</channel>
<channel id="1078.dvb.guide">
	<display-name>KTV 1</display-name>
	<icon src="http://nice.net.nz/epg/icons/ktv1.jpg" />
</channel>
<channel id="1007.dvb.guide">
	<display-name>Juice TV</display-name>
	<icon src="http://nice.net.nz/epg/icons/juice_tv.jpg" />
</channel>
<programme channel="1032.dvb.guide" start="20071016170000 +1300" stop="20071016173000 +1300">
	<title lang="eng">Neighbours</title>
	<desc>Paul's shocking revelation ends his relationship with Rebecca - but she bounces back with a bold proposition for Toadie and Rosie. Carmella negotiates a compromise with Ollie over baby planning.</desc>
	<category>tvshow</category>
	<category>General Show</category>
	<rating system="SKY-NZ">
		<value>G</value>
	</rating>
</programme>
<programme channel="1032.dvb.guide" start="20071016173000 +1300" stop="20071016180000 +1300">
	<title lang="eng">Hope And Faith</title>
	<desc>Hope and Faith try to help Sydney and end up damaging of Charley's vintage convertable.</desc>
	<category>tvshow</category>
	<category>General Show</category>
	<rating system="SKY-NZ">
		<value>G</value>
	</rating>
</programme>
<programme channel="1032.dvb.guide" start="20071016180000 +1300" stop="20071016183000 +1300">
	<title lang="eng">My Wife And Kids</title>
	<desc>Michael goes overboard when Jay requests more romantic attention from him.</desc>
	<category>tvshow</category>
	<category>General Show</category>
	<rating system="SKY-NZ">
		<value>G</value>
	</rating>
</programme>
<programme channel="1032.dvb.guide" start="20071016183000 +1300" stop="20071016190000 +1300">
	<title lang="eng">Friends</title>
	<desc>Chandler and Monica's relationship becomes less of a secret; Ross must prove to his boss that he is sane in order to return to his job.</desc>
	<category>tvshow</category>
	<category>General Show</category>
	<rating system="SKY-NZ">
		<value>G</value>
	</rating>
</programme>
<programme channel="1002.dvb.guide" start="20071015112000 +1300" stop="20071015124500 +1300">
	<title lang="eng">Date Movie</title>
	<desc>Before Julia can have her Big Fat Greek Wedding, she has to Meet the Parents, deal with The Wedding Planner and confront a woman who wants to stop her Best Friend's Wedding. Starring: Alyson Hannigan. (WS)</desc>
	<category>movie</category>
	<category>Comedy</category>
	<rating system="SKY-NZ">
		<value>M S</value>
	</rating>
</programme>
<programme channel="1002.dvb.guide" start="20071015124500 +1300" stop="20071015143000 +1300">
	<title lang="eng">Chaos</title>
	<desc>Two cops, one a rookie and one a grizzled veteran, are partnered up and must try to uncover how five bank robbers escaped from a bank during a heist. Starring: Jason Statham, Ryan Phillippe, Wesley Snipes. (WS)</desc>
	<category>movie</category>
	<category>Action</category>
	<rating system="SKY-NZ">
		<value>M VLS</value>
	</rating>
</programme>
<programme channel="1021.dvb.guide" start="20071013190000 +1300" stop="20071013193000 +1300">
	<title lang="eng">Girls Of The Playboy Mansion</title>
	<desc>Let Them Eat Birthday Cake. For Holly's birthday, the Girls decide to throw a lavish party with a Marie Antoinette theme.</desc>
	<category>tvshow</category>
	<category>Reality</category>
	<rating system="SKY-NZ">
		<value>18+ S</value>
	</rating>
</programme>
<programme channel="1021.dvb.guide" start="20071013193000 +1300" stop="20071013200000 +1300">
	<title lang="eng">E! News</title>
	<desc>The most comprehensive, up-to-the-minute reports on the day's top entertainment news.</desc>
	<category>tvshow</category>
	<category>Reality</category>
	<rating system="SKY-NZ">
		<value>PG</value>
	</rating>
</programme>
<programme channel="1021.dvb.guide" start="20071013200000 +1300" stop="20071013203000 +1300">
	<title lang="eng">The Daily 10</title>
	<desc>The Daily 10 is a fast-paced, hosts-driven, topical entertainment news show with attitude that recaps the top ten entertainment stories of the moment.</desc>
	<category>tvshow</category>
	<category>Reality</category>
	<rating system="SKY-NZ">
		<value>PG</value>
	</rating>
</programme>
<programme channel="1021.dvb.guide" start="20071013203000 +1300" stop="20071013213000 +1300">
	<title lang="eng">Best Of The Girls Of The...</title>
	<desc>Playboy Mansion. Hef and the girls have a very special 'movie night' at the mansion. The happy quartet snuggle up to watch and relive some of their favorite moments from the past three seasons.</desc>
	<category>tvshow</category>
	<category>Reality</category>
	<rating system="SKY-NZ">
		<value>18+ S</value>
	</rating>
</programme>
<programme channel="1021.dvb.guide" start="20071013213000 +1300" stop="20071013220000 +1300">
	<title lang="eng">Girls Of The Playboy Mansion</title>
	<desc>Snow Place Like Home. It's Christmas at the Mansion and the staff work like elves to create a winter wonderland for Hef and the Girls - complete with a snow covered front yard!</desc>
	<category>tvshow</category>
	<category>Reality</category>
	<rating system="SKY-NZ">
		<value>18+ S</value>
	</rating>
</programme>
</tv>

<!--
	[{'rating': '6', 'description': 'Catch all the latest moves on No Mercy as history was made one more time. See who will be the next champion to hold the title when the smoke clears.', 'language': 'eng', 'start': '0xffffffffff', 'country': 'NZL', 'durationinfo': '03:26:11 (UTC)', 'title': '<EM>WWE On Demand - Scheduling Only</EM>', 'channel_id': '1097', 'duration': '0x0032611', 'startinfo': '2038-04-22 ff:ff:ff (UTC)', 'ratinginfo': 'minimum age: 9 years'}, ValueError("invalid literal for int() with base 10: 'ff'",)]
	[{'rating': '6', 'description': "Action/Sport: Rocky Balboa comes out of retirement to step into the ring for the last time and face the heavyweight champ Mason 'The Line' Dixon. Starring Sylvester Stallone, Burt Young. (WS)", 'language': 'eng', 'start': '0xffffffffff', 'country': 'NZL', 'durationinfo': '01:40:24 (UTC)', 'title': '<EM>Rocky Balboa</EM>', 'channel_id': '1097', 'duration': '0x0014024', 'startinfo': '2038-04-22 ff:ff:ff (UTC)', 'ratinginfo': 'minimum age: 9 years'}, ValueError("invalid literal for int() with base 10: 'ff'",)]
-->
Of course the one thing this doesn't do is provide a sample of channels and then programmes belonging to those same channels. Or you can just download the whole file here:
http://musther.googlepages.com/listings.xml.tar.gz
 
Old 10-14-2007, 02:06 AM   #6
angrybanana
Member
 
Registered: Oct 2003
Distribution: Archlinux
Posts: 147

Rep: Reputation: 21
Hope you like Python!

Here's my Python solution:
Code:
#add the channels you want to this list
wanted = ['1035.dvb.guide',
        '1026.dvb.guide',
        ]


import sys
from xml.etree.ElementTree import ElementTree

#prints out usage message if number of arguments is wrong
if len(sys.argv) != 3:
        print("usage: %s input.xml output.xml" % sys.argv[0])
        sys.exit(1)

#reads the input xml file as text (the listings declare ISO-8859-1)
with open(sys.argv[1], encoding='iso-8859-1') as xml:
        data = xml.read()

#get header (everything before <tv) and footer (the trailing comment block)
header = data[:data.find('<tv')].rstrip('\n')
footer_start = data.find('<!--\n')
footer = data[footer_start:] if footer_start != -1 else ''
del data

#parse the data
tree = ElementTree(file=sys.argv[1])
root = tree.getroot()
#iterate over a copy of the children, because root is modified in the loop
for element in list(root):
        if not (element.attrib.get('id') in wanted or
                element.attrib.get('channel') in wanted):
                root.remove(element)

#write the data; characters outside ISO-8859-1 become character references
with open(sys.argv[2], 'w', encoding='iso-8859-1',
          errors='xmlcharrefreplace') as out_xml:
        out_xml.write(header+'\n')
        tree.write(out_xml, encoding='unicode')
        out_xml.write('\n'+footer)
Just add whatever you want to the wanted list. Give the script an input and an output file and it should work.
If you don't want to type in the ".dvb.guide" for every entry, then change the wanted code to be like this:
Code:
wanted = ['1035',
        '1026',
        ]
wanted = [x+'.dvb.guide' for x in wanted]
Hope that does what you wanted... and I hope you have Python, or are capable of getting it.

PS: Always wanted to learn a bit about ElementTree, and now I did.

Edit: Ignore this script. Disillusionist's script is much better; using ElementTree for this was overkill, and the performance suffers because of it.

Last edited by angrybanana; 10-14-2007 at 04:48 AM.
 
Old 10-14-2007, 03:26 AM   #7
Disillusionist
Senior Member
 
Registered: Aug 2004
Location: England
Distribution: Ubuntu
Posts: 1,013

Rep: Reputation: 83
Changing the Perl script to use an array for multiple values takes just a few small changes:

1. Change $l_selection into an array, @l_selection.
2. Define a scalar variable $l_choice to hold the individual elements of @l_selection.
3. Modify the search statement to use the new scalar.
4. Set the default of $l_print to 0 outside the inner loop.

Full listing here:
Code:
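#!/usr/bin/perl
use strict;
use warnings;

my $l_input="progs.xml";
my $l_output="output.xml";
my @l_selection=qw( 1032.dvb.guide 1026.dvb.guide 1192.dvb.guide );
my $l_choice;
my $l_print=1;   ## print the header until the first channel/programme

## Three-argument open with lexical filehandles; stop if either file fails
open (my $input, '<', $l_input) or die "Failed to open input file: $!\n";
open (my $output, '>', $l_output) or die "Failed to open output file: $!\n";

while (<$input>)
{
   ## Start of a channel or programme entry: print the entry only if
   ## the line mentions one of the selected channel IDs
   if (/<channel/ or /<programme/)
   {
      $l_print=0;
      foreach $l_choice (@l_selection) {
         ## \Q...\E makes the dots in the ID match literally
         if (/\Q$l_choice\E/)
         {
            $l_print=1;
         }
      }
   }

   ## Closing tag: print it and everything after it (the footer)
   if (/<\/tv/)
   {
      $l_print=1;
   }

   if ( $l_print == 1 )
   {
      print $output $_;
   }

}

close ($input);
close ($output) or die "Failed to close output file: $!\n";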
Angrybanana - Nice Python script!

Last edited by Disillusionist; 10-14-2007 at 03:33 AM.
 
Old 10-14-2007, 12:14 PM   #8
musther
Member
 
Registered: Sep 2007
Posts: 36

Original Poster
Rep: Reputation: 15
Thank you both very much. I've tried the Python, and it's perfect. I'll fiddle with the Perl later too (I might as well learn something while I'm doing this).

I've just noticed angrybanana's note (I'm glad to be able to give you the opportunity to learn about element trees!), so when I've had a fiddle with Disillusionist's Perl, I'll be using that.

Thank you both again.
 
Old 10-14-2007, 01:21 PM   #9
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
Whenever parsing XML, I'd advise against trying to parse the XML structure yourself using regular expressions and so on. The reason is that this sort of approach often makes assumptions about where line breaks and other non-syntax whitespace are, and while your program may work for some or most examples, it will likely break when the input is subtly different because of whitespace.
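To illustrate the whitespace point, here is a small sketch using Python's standard `xml.etree.ElementTree`. The `COMPACT` string is a made-up variant of the listing with all line breaks removed; a check that looks for lines starting with `<channel` finds nothing in it, while a parser still sees both channels.

```python
import xml.etree.ElementTree as ET

# The same kind of data as the listing, but with no line breaks at all
COMPACT = (
    '<tv><channel id="1012.dvb.guide"><display-name>Nat Geographic'
    '</display-name></channel><channel id="1021.dvb.guide">'
    '<display-name>E!</display-name></channel></tv>'
)

# A line-oriented check finds no line that starts with <channel
line_hits = [ln for ln in COMPACT.splitlines() if ln.startswith("<channel")]

# A parser sees both elements regardless of how the text is laid out
root = ET.fromstring(COMPACT)
parsed_ids = [ch.get("id") for ch in root.findall("channel")]
```

Both inputs are the same document as far as XML is concerned, which is why the parser-based approach is the safer one if the generator ever changes its formatting.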

There are good XML parser libraries for most languages, so it's a good idea to use them. Perl has the XML::Parser module, which is very flexible but a little tricky to use. There are also a bunch of easier-to-use modules built on top of it, which are much easier to get to grips with, although they don't offer all the flexibility.

There are a few programs which will allow you to manipulate XML from bash scripts, although they are more limited than a proper parsing library. xmlstarlet and xpath spring to mind. I had a brief go at using xmlstarlet, and found a nice mechanism to remove sections with a given ID, but not to remove all sections but a list of known IDs... I think this is just a little too complex for such a program, although I would love to be corrected if someone knows how to do it.

For the record, here's how to remove a named channel from your XML:
Code:
xmlstarlet ed -P -d "/tv/channel[@id='1035.dvb.guide']" input_file.xml > modified_file.xml
Angrybanana's python script looks like the right approach to me.
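On the "remove everything except a list of known IDs" question, one possible route is to negate the ID test inside the XPath predicate, since XPath 1.0 has `not()` and `or`. The sketch below only builds the expressions and an argument list; it does not run xmlstarlet, and this use of the `-d` option with such expressions is an assumption that has not been checked in this thread. The function names are hypothetical.

```python
def keep_only_xpaths(wanted):
    """Build XPath expressions selecting every channel/programme NOT in `wanted`."""
    if not wanted:
        # not() with no argument is not valid XPath, so refuse an empty list
        raise ValueError("wanted must contain at least one channel ID")
    id_test = " or ".join("@id='%s'" % cid for cid in wanted)
    chan_test = " or ".join("@channel='%s'" % cid for cid in wanted)
    return [
        "/tv/channel[not(%s)]" % id_test,
        "/tv/programme[not(%s)]" % chan_test,
    ]


def xmlstarlet_args(wanted, input_file):
    """Return an argument list suitable for subprocess.run(); nothing is executed."""
    args = ["xmlstarlet", "ed", "-P"]
    for xpath in keep_only_xpaths(wanted):
        args += ["-d", xpath]
    return args + [input_file]
```

Passing the result to `subprocess.run(..., stdout=...)` as a list avoids shell quoting problems with the quotes and brackets in the expressions.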
 
Old 10-14-2007, 01:56 PM   #10
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
Oh, one more thing - I just found an XMLTV module for Perl, and a bunch of command-line utilities. It looks like what you want to do is already implemented in the program tv_grep.
 
Old 10-14-2007, 05:47 PM   #11
angrybanana
Member
 
Registered: Oct 2003
Distribution: Archlinux
Posts: 147

Rep: Reputation: 21
Here's a non 3am 1/2 asleep version of my code :)

I looked over the code and found the performance issue. This script will work MUCH faster. Because it uses a real XML parser, it also deals with the blind-parsing issues matthewg42 mentioned.
If this script does what you want I would say use this.
Code:
#add the channels you want to this list
wanted = ('1035.dvb.guide',
        '1026.dvb.guide',
        '1021.dvb.guide',
        '1050.dvb.guide',
        '1071.dvb.guide',
        '1025.dvb.guide',
        )

import sys
from xml.etree.ElementTree import ElementTree

#prints out usage message if number of arguments is wrong
if len(sys.argv) != 3:
        print("usage: %s input.xml output.xml" % sys.argv[0])
        sys.exit(1)

#reads the input xml file as text (the listings declare ISO-8859-1)
with open(sys.argv[1], encoding='iso-8859-1') as xml:
        data = xml.read()

#get header (everything before <tv) and footer (the trailing comment block)
header = data[:data.find('<tv')].rstrip('\n')
footer_start = data.find('<!--\n')
footer = data[footer_start:] if footer_start != -1 else ''
del data

#parse the data
tree = ElementTree(file=sys.argv[1])
root = tree.getroot()
wanted_channels = [x for x in root.findall('channel')
                if x.attrib.get('id') in wanted]
wanted_programme = [x for x in root.findall('programme')
                if x.attrib.get('channel') in wanted]

#replace children with the ones we want
root[:] = wanted_channels + wanted_programme

#write the data; characters outside ISO-8859-1 become character references
with open(sys.argv[2], 'w', encoding='iso-8859-1',
          errors='xmlcharrefreplace') as out_xml:
        out_xml.write(header+'\n')
        tree.write(out_xml, encoding='unicode')
        out_xml.write('\n'+footer)
One thing I don't like about my script is how it handles the header/footer. If they are static, I'd feel a lot better if they were added in that way. Currently, they're being searched for in a way that could cause problems in the future if the format changes.

I also noticed that the footer is a comment... Do you need the footer?
If you don't need the footer, I'd make a small change to the script that'd make it a lot better.

Hope this helps.

PS: If you're wondering what the performance issue was, I think it was iterating over the elements, removing one, then reiterating to remove the next one, and so on. Now it simply gets all the wanted ones and sets the children to that in one go.
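The two strategies can be put side by side in a small sketch. It assumes the slowdown comes from each `remove()` call having to search the child list again, which matches the guess above but has not been measured here; `build_root` and both filter functions are hypothetical helpers that generate and filter made-up programme elements.

```python
import xml.etree.ElementTree as ET


def build_root(n):
    """Make a <tv> root with n <programme> children on channels 0..9."""
    root = ET.Element("tv")
    for i in range(n):
        ET.SubElement(root, "programme", channel="%d.dvb.guide" % (i % 10))
    return root


def filter_by_remove(root, wanted):
    # Each remove() searches the child list again, so the total work
    # grows roughly with the square of the number of children
    for element in list(root):
        if element.get("channel") not in wanted:
            root.remove(element)


def filter_by_slice(root, wanted):
    # One pass to collect the keepers, one slice assignment to replace
    # all the children at once
    root[:] = [e for e in root if e.get("channel") in wanted]
```

Both functions leave the same children behind; only the amount of repeated searching differs, which starts to matter on a listings file with many thousands of programmes.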

EDIT: So we don't have to post back and forth, Here's a version that ignores the footer and uses a static header that you put into the script.
Code:
#add the channels you want to this list
wanted = ('1035.dvb.guide',
        '1026.dvb.guide',
        '1021.dvb.guide',
        '1050.dvb.guide',
        '1071.dvb.guide',
        '1025.dvb.guide',
        )

#Put your custom header here 
header = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">"""

import sys
from xml.etree.ElementTree import ElementTree

#prints out usage message if number of arguments is wrong
if len(sys.argv) != 3:
        print("usage: %s input.xml output.xml" % sys.argv[0])
        sys.exit(1)

#reads and parses the input xml file
tree = ElementTree(file=sys.argv[1])
root = tree.getroot()
wanted_channels = [x for x in root.findall('channel')
                if x.attrib.get('id') in wanted]
wanted_programme = [x for x in root.findall('programme')
                if x.attrib.get('channel') in wanted]

#replace children with the ones we want
root[:] = wanted_channels + wanted_programme

#write the data; characters outside ISO-8859-1 become character references
with open(sys.argv[2], 'w', encoding='iso-8859-1',
          errors='xmlcharrefreplace') as out_xml:
        out_xml.write(header+'\n')
        tree.write(out_xml, encoding='unicode')

Last edited by angrybanana; 10-14-2007 at 06:10 PM.
 
  


All times are GMT -5.
