Bash script to strip some content from XML file.
I've got a large XML file containing TV listings for my MythTV box, and I want to filter it so I'm left with just the channels I receive.
It's in this form:
Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
Code:
Header stuff
The difficult bit (as far as my bash text processing skills go) is filtering the channels and programmes. I figure I'll put the channel IDs of the channels I want in a list in a file, from which the script can get them. Then I suppose the best way would be to strip everything which doesn't match, so what I'll end up doing is: find a text string which goes:
Code:
<channel id="$i">******</channel>
Cheers! |
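The "list of channel IDs in a file" idea from the question could look roughly like the sketch below, written in Python (which the thread also uses later on). It assumes one ID per line; skipping blank lines and lines starting with `#` is an added convenience, not something the original post asks for. The function names and the second sample ID are hypothetical.

```python
def parse_channel_ids(lines):
    """Return the channel IDs found in an iterable of text lines."""
    ids = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are not IDs.
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def load_channel_ids(path):
    """Read channel IDs from a text file, one ID per line."""
    with open(path, encoding="utf-8") as handle:
        return parse_channel_ids(handle)


# Example input as it might appear in such a file (IDs are illustrative).
example_lines = ["1035.dvb.guide\n", "\n", "# my channels\n", "  1036.dvb.guide  \n"]
example_ids = parse_channel_ids(example_lines)
```

The resulting list can then be handed to whichever filtering script ends up being used.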
Try this perl script:
Code:
#!/usr/bin/perl -w
In these instances, it checks the selection and decides if it should print the section or not. Once it gets to a line starting with </tv it sets the print option back on. The code is not perfect and only allows for one set of selection criteria, but it's a starting point. |
Thanks for that, it's very helpful. I tried and tried to make it accept more strings using arrays etc., but I don't really know Perl and didn't manage to get it working.
Is there any chance you could help me get it to accept more? Filling them in at the top of the script is perfectly acceptable for this solution. Thanks |
Can you provide a bigger sample of data? Maybe the full header/footer and a couple of channels, or maybe pastebin the whole file.
Also, I'm guessing you want some sort of array like this at the top of the script? mychannels = [chan1, chan2, chan3] |
Here's a much larger chunk of data (still far from the whole file):
Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
http://musther.googlepages.com/listings.xml.tar.gz |
Hope you like python!
Here's my python solution:
Code:
#add the channels you want to this list
If you don't want to type in the ".dvb.guide" for every entry then change the wanted code to be like this:
Code:
wanted = ['1035',
PS: Always wanted to learn a bit about elementtree, and now I did :)
Edit: Ignore this script. Disillusionist's script is much better; using element tree for this was overkill, and the performance suffers because of it. |
Changing the perl script to use an array for multiple values takes just a few small changes:
1. Change l_selection into an array.
2. Define a scalar variable l_choice to hold the individual contents of the array @l_selection.
3. Modify the search statement to use the new scalar.
4. Set the default of $l_print to 0 outside the inner loop.
Full listing here:
Code:
#!/usr/bin/perl -w |
Thank you both very much, I've tried the python - perfect, and I'll fiddle with the perl later too (I might as well learn something while I'm doing this).
I've just noticed angrybanana's note (I'm glad to be able to give you the opportunity to learn about element trees!), so when I've had a fiddle with Disillusionist's perl, I'll be using that. Thank you both again. |
Whenever parsing XML, I'd advise against trying to parse the XML structure yourself using regular expressions and so on. The reason is that this sort of approach often makes assumptions about where line breaks and other non-syntax whitespace are, and while your program may work for some or most examples, it will likely break when the input is subtly different because of whitespace.
There are good XML parser libraries for most languages, and so it's a good idea to use them. Perl has the XML::Parser module, which is very flexible but a little tricky to use. There are also a bunch of easier-to-use modules built on top of it, which can be much easier to get to grips with, although they don't offer all the flexibility.
There are a few programs which will allow you to manipulate XML from bash scripts, although they are more limited than a proper parsing library. xmlstarlet and xpath spring to mind. I had a brief go at using xmlstarlet, and found a nice mechanism to remove sections with a given ID, but not to remove all sections except a list of known IDs... I think this is just a little too complex for such a program, although I would love to be corrected if someone knows how to do it. For the record, here's how to remove a named channel from your XML:
Code:
xmlstarlet ed -P -d "/tv/channel[@id='1035.dvb.guide']" input_file.xml > modified_file.xml |
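For the opposite case (keep only a known list of IDs), one possibility is to negate the match inside the XPath expression itself. The Python sketch below only builds such an expression; it assumes that `xmlstarlet ed -d` accepts ordinary XPath 1.0 predicates using `not()` and `or`, and that `<programme>` elements carry a `channel` attribute as in the XMLTV format. Neither has been tested against this listings file. The function name and the second ID are hypothetical, and IDs containing a single quote are not handled.

```python
def build_delete_xpath(element, attribute, wanted_ids):
    """Build an XPath matching every /tv/<element> NOT in wanted_ids."""
    if not wanted_ids:
        raise ValueError("need at least one ID to keep")
    # One equality test per wanted ID, joined with 'or', then negated.
    tests = " or ".join("@{}='{}'".format(attribute, i) for i in wanted_ids)
    return "/tv/{}[not({})]".format(element, tests)


wanted = ["1035.dvb.guide", "1036.dvb.guide"]  # second ID is illustrative
channel_xpath = build_delete_xpath("channel", "id", wanted)
programme_xpath = build_delete_xpath("programme", "channel", wanted)
print(channel_xpath)
print(programme_xpath)
```

Each printed expression would then take the place of the single-ID expression in the command above.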
Oh, one more thing - I just found an XMLTV module for Perl, and a bunch of command-line utilities. It looks like what you want to do is already implemented in the program tv_grep.
|
Here's a non 3am 1/2 asleep version of my code :)
I looked over the code and found the performance issue. This script will work MUCH faster. This will deal with the issues matthewg42 mentioned of blind parsing.
If this script does what you want I would say use this.
Code:
#add the channels you want to this list
I also noticed that the footer is a comment... Do you need the footer? If you don't need the footer, I'd make a small change to the script that'd make it a lot better. Hope this helps.
PS: If you're wondering what the performance issue was, I think it was iterating over the elements, removing one, then reiterating again to remove the next one and so on. Now it simply gets all the wanted ones and sets the children to that in one go.
EDIT: So we don't have to post back and forth, here's a version that ignores the footer and uses a static header that you put into the script.
Code:
#add the channels you want to this list |
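The "get all the wanted ones and set the children in one go" approach described above can be sketched as follows. This is not the script from the post (which is not reproduced in full here), only an illustration of the idea with Python's standard ElementTree. It assumes the XMLTV layout in which `<channel>` has an `id` attribute and `<programme>` has a `channel` attribute; `filter_listings`, `SAMPLE`, and the second ID are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample; real listings have far more detail per element.
SAMPLE = """<tv>
  <channel id="1035.dvb.guide"><display-name>One</display-name></channel>
  <channel id="1036.dvb.guide"><display-name>Two</display-name></channel>
  <programme start="20080101180000" channel="1035.dvb.guide"><title>A</title></programme>
  <programme start="20080101180000" channel="1036.dvb.guide"><title>B</title></programme>
</tv>"""


def filter_listings(xml_text, wanted):
    """Keep only channels/programmes whose channel ID is in `wanted`."""
    wanted = set(wanted)
    root = ET.fromstring(xml_text)
    kept = []
    for elem in root:
        if elem.tag == "channel":
            key = elem.get("id")
        elif elem.tag == "programme":
            key = elem.get("channel")
        else:
            kept.append(elem)  # leave any other top-level element alone
            continue
        if key in wanted:
            kept.append(elem)
    # Single pass, then replace all children at once instead of
    # removing unwanted elements one by one.
    root[:] = kept
    return ET.tostring(root, encoding="unicode")


filtered = filter_listings(SAMPLE, ["1035.dvb.guide"])
```

Building the list first and assigning it with `root[:] = kept` avoids repeatedly searching the tree for each removal, which is the slowdown the post describes.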
Hi. This is an old thread, but I have a similar problem to the OP. I'm trying to filter an XML file that has multiple languages. I've done lots of trial and error but no success so far. My setup can't handle multiple languages, so I need to filter out the unneeded ones (colored red). Below is my example:
Code:
<?xml version="1.0" encoding="utf-8" ?> |
Old but interesting problem. Here is one more solution:
Code:
# works only with at least two channel ids
The output still needs to be wrapped in <tv></tv> tags. |
Quote:
Between <programme> and </programme>, every line that includes lang="fin" will be kept; any other line is removed. Thanks. |
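A parser-based take on this language filter might look like the Python sketch below. It works on elements rather than lines: inside each `<programme>`, child elements whose `lang` attribute is something other than the wanted language are dropped. Children with no `lang` attribute at all are kept, which is an assumption and differs slightly from the line-based description above. `keep_language`, `LANG_SAMPLE`, and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up two-language sample in the spirit of the question.
LANG_SAMPLE = """<tv>
  <programme start="20240101180000 +0200" channel="one.example">
    <title lang="fin">Uutiset</title>
    <title lang="swe">Nyheter</title>
    <desc lang="fin">Illan uutiset.</desc>
    <desc lang="swe">Kvallens nyheter.</desc>
  </programme>
</tv>"""


def keep_language(xml_text, lang="fin"):
    """Drop programme children tagged with a language other than `lang`."""
    root = ET.fromstring(xml_text)
    for programme in root.iter("programme"):
        # Iterate over a copy, since children are removed while looping.
        for child in list(programme):
            child_lang = child.get("lang")
            # Assumption: elements without a lang attribute are kept.
            if child_lang is not None and child_lang != lang:
                programme.remove(child)
    return ET.tostring(root, encoding="unicode")


finnish_only = keep_language(LANG_SAMPLE)
```

Because the filter looks at attributes instead of text lines, it does not depend on each element sitting on its own line.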