looking for expert sed/script help for translation/substitution

hotbaws11 · 09-10-2010, 06:16 AM

Hi All,

I'm looking for some expert help with sed or a script to work out the best way to transform one XML format into another. There are a few complexities around the translation, though.

The extra complexities are to:
1) take the start and stop times (YYYYMMDDHHMMSS), convert the start time to Unix time, and output the difference in seconds between the two times.
2) find oid, tsid and sid by looking up the channel in an external file. For example, one of the lines in the file will be 2:806:27e2=channel1

Is there any way to write piped sed commands that can do this? If not, any ideas on what the script should look like?

Thanks in advance.

Input File

Code:

<programme start="20100910060000 +0100" stop="20100910061000 +0100" channel="channel1">
<title lang="en">This is the title</title>
<desc>This is the description</desc>
</programme>

Output File

Code:

<service oid="0002" tsid="0806" sid="27e2">
<event id="0">
<name lang="OFF" string="This is the title"/>
<text lang="OFF" string="This is the description"/>
<time start_time="1284098400" duration="600"/>
</event>
</service>
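
For reference, the time handling in point 1 could be sketched in Python roughly as follows. This is only a sketch: the function name `to_start_and_duration` is made up for illustration, and it assumes that the expected `start_time="1284098400"` comes from reading the 14-digit stamp as UTC and ignoring the `+0100` suffix (honouring the offset would give a value 3600 seconds lower).

```python
import calendar
from datetime import datetime

def to_start_and_duration(start, stop):
    """Return (unix_start, duration_seconds) for two XMLTV-style stamps.

    Assumption: only the first 14 characters (YYYYMMDDHHMMSS) are used and
    they are read as UTC; the trailing "+0100" offset is ignored.
    """
    fmt = "%Y%m%d%H%M%S"
    t0 = calendar.timegm(datetime.strptime(start[:14], fmt).timetuple())
    t1 = calendar.timegm(datetime.strptime(stop[:14], fmt).timetuple())
    return t0, t1 - t0

print(to_start_and_duration("20100910060000 +0100", "20100910061000 +0100"))
```

For the sample programme this gives 1284098400 and 600, the values in the desired output.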

Look up file for oid, tsid and sid

Code:

2:806:27e2=channel1
2:756:37a3=channel2
5:4a06:42e5=channel3
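
The lookup in point 2 is cheap if the file is read once into a dictionary rather than searched again for every programme. A minimal Python sketch, assuming the `ids=channel` layout shown above; the name `load_lookup` is hypothetical, and padding every id to four characters is an assumption taken from the desired output (oid="0002", tsid="0806"):

```python
def load_lookup(lines):
    """Build {channel: (oid, tsid, sid)} from lines like '2:806:27e2=channel1'.

    Assumption: each id is zero-padded to four characters, as in the
    desired output.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ids, channel = line.split("=", 1)
        oid, tsid, sid = ids.split(":")
        table[channel] = (oid.zfill(4), tsid.zfill(4), sid.zfill(4))
    return table

sample = ["2:806:27e2=channel1", "2:756:37a3=channel2", "5:4a06:42e5=channel3"]
lookup = load_lookup(sample)
print(lookup["channel1"])
```

In real use the list would be replaced by an open file object, since iterating over a file yields its lines.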

ghostdog74 · 09-10-2010, 07:30 AM

Show all the sample files! You are only showing the output format that you want, right?

1) the external file to get the sid, oid, tsid
2) The actual xml file to transform.

And sed is definitely not the correct tool to use for this!

hotbaws11 · 09-10-2010, 07:43 AM

Quote:

Originally Posted by ghostdog74

Show all the sample files! You are only showing the output format that you want, right?

1) the external file to get the sid, oid, tsid
2) The actual xml file to transform.

And sed is definitely not the correct tool to use for this!

The updated post above shows this. If sed isn't the tool, what should my script look like to read and parse the file? The source file has about 22k programs, so I'm looking for the best way to parse it.

I tried to write a script which reads one line at a time and does the translation that way, but it is extremely slow and would probably take 12 hours to parse! So I need help on how to write 'good' code for parsing the file.

ghostdog74 · 09-10-2010, 09:00 AM

Ideally, you should use an XML parser. Here's an example using Python and lxml

Code:

#!/usr/bin/env python3
import time
from lxml import etree

lookup_file = "lookup"
input_file = "file"

# store channels: map channel name -> "oid:tsid:sid"
channels = {}
with open(lookup_file) as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        s, c = line.split("=")
        channels[c] = s

tree = etree.parse(input_file)
for item in tree.iter("programme"):
    unixtime = []
    for t in (item.get("start"), item.get("stop")):
        # Slice YYYYMMDDHHMMSS into integer fields. The " +0100" suffix is
        # ignored; mktime() treats the fields as local time on this machine.
        fields = (t[:4], t[4:6], t[6:8], t[8:10], t[10:12], t[12:14], 0, 0, -1)
        unixtime.append(time.mktime(tuple(map(int, fields))))
    ST = unixtime[0]
    timediff = unixtime[-1] - unixtime[0]
    oid, tsid, sid = channels[item.get("channel")].split(":")
    # Look up title and description on this programme, inside the loop,
    # so every programme is reported rather than only the last one.
    desc = item.findtext("desc")
    title = item.findtext("title")
    print("Description: ", desc)
    print("Title: ", title)
    print("oid,tsid and sid: ", oid.zfill(3), tsid, sid)
    print("Start Time: ", ST)
    print("Time diff: ", timediff)

output

Code:

$ ./python.py
Description:  This is the description
Title:  This is the title
oid,tsid and sid:  002 806 27e2
Start Time:  1284069600.0
Time diff:  600.0

I left out the creation of the output file. Either use lxml to write it, or use normal file I/O in Python (see the docs).
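
As a rough idea of what writing the output could look like, here is a sketch that builds the target structure with the standard library's `xml.etree.ElementTree` (swapped in here so the example has no third-party dependency; `lxml.etree` offers a similar `Element`/`SubElement` interface). The function name `build_service`, the tuple layout of `events`, and numbering events from 0 within each service are all assumptions for illustration, based only on the sample output in the first post.

```python
import xml.etree.ElementTree as ET

def build_service(oid, tsid, sid, events):
    """Build a <service> element.

    events is a list of (title, desc, start_time, duration) tuples.
    Assumption: event ids simply count up from 0 within the service.
    """
    service = ET.Element("service", {"oid": oid, "tsid": tsid, "sid": sid})
    for event_id, (title, desc, start, duration) in enumerate(events):
        event = ET.SubElement(service, "event", {"id": str(event_id)})
        ET.SubElement(event, "name", {"lang": "OFF", "string": title})
        ET.SubElement(event, "text", {"lang": "OFF", "string": desc})
        ET.SubElement(event, "time",
                      {"start_time": str(start), "duration": str(duration)})
    return service

service = build_service(
    "0002", "0806", "27e2",
    [("This is the title", "This is the description", 1284098400, 600)],
)
xml_text = ET.tostring(service, encoding="unicode")
print(xml_text)
```

Building elements this way means special characters in titles and descriptions are escaped by the library instead of by hand.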

hotbaws11 · 09-10-2010, 10:41 AM

Thanks for the info; however, I'd prefer to keep to script if it's possible to get good code that can parse 22k programs in an hour.

As I said, my code at the moment works, but it's very messy and thus terrible on the performance front. I'll keep trying to tweak it and will take suggestions on board.

Cheers.

grail · 09-11-2010, 02:00 AM

Quote:

I'd prefer to keep to script if it's possible

Not sure what you mean here. ghostdog has provided a Python script.

If you have a limitation on which language the script is to be in (although this is not always a good idea, as some are better suited than others), then you need to let us know. I was going to suggest Perl.

H_TeXMeX_H · 09-11-2010, 10:32 AM

I agree with the above comments: if time is of the essence, bash and even many scripting languages are not the way to go. I'd write a small C program to convert them, or use Perl; the Python above should also be good enough. Why do you need to stick to script?

hotbaws11 · 09-12-2010, 04:17 AM

Thanks for the responses, guys. Basically I need a shell script (sh) because a) that's all I know and b) I think that is the only thing available on the box I'll be running this on. I also have no idea how to install any new languages onto the box.

I've managed to get quite good performance (less than an hour) by:
1) extracting all the information out of the source file into a 'raw' format using multiple sed commands. The raw format is something like event_num~channel~startdate~enddate~title~desc
2) then using the usual read-file method to read each line and output it in XML format. I need the read-file method to a) convert startdate/enddate into Unix time (I can't get this to work in sed itself, i.e. taking a sed variable and passing it to the date function) and b) look up the external file for sid, onid, tsid.

I don't have the script handy, otherwise I would post it here. Again, if you have any suggestions on how I could improve my sed statements to do a) and b) above, that would make it even faster.
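
For comparison, should Python turn out to be available on the box, step 2 could handle each raw line without starting an external command per line. The sketch below covers only the splitting and the XML escaping of one raw line; the time conversion and channel lookup are left out. The function name `raw_to_event` is hypothetical, and it assumes `~` never occurs in the first five fields.

```python
from xml.sax.saxutils import quoteattr

def raw_to_event(line):
    """Turn one 'event_num~channel~start~end~title~desc' line into XML text.

    Assumption: "~" never occurs inside the first five fields; limiting the
    split to five separators keeps any "~" in the description intact.
    """
    event_num, channel, start, end, title, desc = line.rstrip("\n").split("~", 5)
    # quoteattr() escapes &, < and > and adds the surrounding quotes itself.
    return "\n".join([
        "<event id=%s>" % quoteattr(event_num),
        "<name lang=\"OFF\" string=%s/>" % quoteattr(title),
        "<text lang=\"OFF\" string=%s/>" % quoteattr(desc),
        "</event>",
    ])

print(raw_to_event("0~channel1~20100910060000~20100910061000~Tom & Jerry~A cat and a mouse"))
```

Escaping matters here because a title containing `&` or `<` would otherwise produce XML that does not parse.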

grail · 09-12-2010, 07:07 AM

Well, I agree with ghostdog's original post that sed is almost definitely not the right tool for this job.
You could parse it with awk if you can guarantee the format.
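
To illustrate what "guaranteeing the format" means for a line-oriented approach, here is the same idea sketched in Python rather than awk. It is not a real XML parser: it assumes every element sits on its own line, that the attributes always appear in the order start, stop, channel, and that no entities need decoding. The names `parse_programmes`, `PROGRAMME`, `TITLE` and `DESC` are made up for the example.

```python
import re

PROGRAMME = re.compile(
    r'<programme start="(\d{14})[^"]*" stop="(\d{14})[^"]*" channel="([^"]*)">'
)
TITLE = re.compile(r"<title[^>]*>(.*?)</title>")
DESC = re.compile(r"<desc[^>]*>(.*?)</desc>")

def parse_programmes(lines):
    """Yield (channel, start, stop, title, desc) tuples from line-oriented input."""
    current = None
    for line in lines:
        m = PROGRAMME.search(line)
        if m:
            current = {"start": m.group(1), "stop": m.group(2),
                       "channel": m.group(3), "title": "", "desc": ""}
            continue
        if current is None:
            continue  # ignore anything outside a <programme> block
        m = TITLE.search(line)
        if m:
            current["title"] = m.group(1)
        m = DESC.search(line)
        if m:
            current["desc"] = m.group(1)
        if "</programme>" in line:
            yield (current["channel"], current["start"], current["stop"],
                   current["title"], current["desc"])
            current = None

sample = [
    '<programme start="20100910060000 +0100" stop="20100910061000 +0100" channel="channel1">',
    '<title lang="en">This is the title</title>',
    '<desc>This is the description</desc>',
    '</programme>',
]
print(list(parse_programmes(sample)))
```

If any of those assumptions fails on the real file, for example a description wrapped over two lines, this approach silently loses data, which is the usual argument for a proper XML parser.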

As H has said, it may also be best to work out which scripting languages you have. Most distros have Perl and/or Python, which would be more adept in this case.

Could also be a good chance to expand your repertoire.