looking for expert sed/script help for translation/substitution
Hi All,
I'm looking for some expert help with sed/scripting to work out the best way to transform one XML format into another. There are a few complexities around the translation, though.
The extra complexities are to:
1) take the start and stop times (YYYYMMDDHHMMSS), convert the start time to Unix time, and output the difference in seconds between the two times.
2) look up oid, tsid and sid in an external file by finding the value for the channel. For example, one of the lines in the file will be 2:806:27e2=channel1
Is there any way to write piped sed commands that can do this? If not, any ideas on what the script should look like?
Thanks in advance.
Input File
Code:
<programme start="20100910060000 +0100" stop="20100910061000 +0100" channel="channel1">
<title lang="en">This is the title</title>
<desc>This is the description</desc>
</programme>
Output File
Code:
<service oid="0002" tsid="0806" sid="27e2">
<event id="0">
<name lang="OFF" string="This is the title"/>
<text lang="OFF" string="This is the description"/>
<time start_time="1284098400" duration="600"/>
</event>
</service>
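For reference, the start_time and duration in the expected output above can be reproduced with two standard-library calls. This is only a minimal sketch: it assumes the 14 digits are read as UTC and the trailing "+0100" is ignored, which is what the value 1284098400 in the sample corresponds to. The helper name to_epoch is made up for illustration.

```python
import calendar
import time

def to_epoch(stamp):
    """Convert 'YYYYMMDDHHMMSS +ZZZZ' to Unix time, reading the digits as UTC.

    Assumption: the '+ZZZZ' offset is ignored, to match the sample output.
    """
    digits = stamp.split()[0]
    return calendar.timegm(time.strptime(digits, "%Y%m%d%H%M%S"))

start = to_epoch("20100910060000 +0100")
stop = to_epoch("20100910061000 +0100")
duration = stop - start
print(start, duration)  # prints: 1284098400 600
```

If the offset should be honoured instead, the start time would come out 3600 seconds earlier; the duration is unaffected as long as both stamps carry the same offset.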
Show all the sample files! You are only showing the output format that you want, right? We also need:
1) the external file to get the sid, oid and tsid from
2) the actual XML file to transform.
And sed is definitely not the correct tool for this!
The updated post above shows this. If sed isn't the tool, what should my script look like to read and parse the file? The source file has about 22k programmes, so I'm looking for the best script for parsing.
I tried to write a script which reads one line at a time and does the translation that way, but it is extremely slow and would probably take 12 hours to parse, so I need help on how to write 'good' code for parsing the file.
Ideally, you should use an XML parser. Here's an example using Python and lxml:
Code:
#!/usr/bin/env python
import time
from lxml import etree

lookup_file = "lookup"
input_file = "file"

# store channels: each lookup line looks like "oid:tsid:sid=channel"
channels = {}
for line in open(lookup_file):
    line = line.rstrip()
    if not line:
        continue
    s, c = line.split("=")
    channels[c] = s

def to_unixtime(t):
    # t looks like "20100910060000 +0100"; only the first 14 digits are used.
    # Note: mktime() interprets them in the local timezone and the "+0100"
    # offset is ignored, so the result depends on where the script runs.
    fields = (t[:4], t[4:6], t[6:8], t[8:10], t[10:12], t[12:14])
    return time.mktime(tuple(int(f) for f in fields) + (0, 0, -1))

tree = etree.parse(input_file)
for item in tree.iter('programme'):
    ST = to_unixtime(item.get("start"))
    timediff = to_unixtime(item.get("stop")) - ST
    channel = item.get("channel")
    oid, tsid, sid = channels[channel].split(":")
    # search within this programme, not from the root of the tree
    desc = item.findtext('desc')
    title = item.findtext('title')
    print("Description: ", desc)
    print("Title: ", title)
    print("oid,tsid and sid: ", oid.zfill(3), tsid, sid)
    print("Start Time: ", ST)
    print("Time diff: ", timediff)
output
Code:
$ ./python.py
Description: This is the description
Title: This is the title
oid,tsid and sid: 002 806 27e2
Start Time: 1284069600.0
Time diff: 600.0
I left out the creation of the output file. Either use lxml to write it, or use normal file I/O in Python (see the docs).
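As a rough idea of what the missing output step could look like, here is a sketch that builds the service/event structure from the first post. It uses the standard library's xml.etree.ElementTree as a stand-in, since it needs no extra install and lxml's etree offers a very similar interface. The function name build_service, the zero-padding of oid and tsid to four characters, and the grouping of several events under one service are all assumptions read off the single sample, not something the thread confirms.

```python
import xml.etree.ElementTree as ET

def build_service(oid, tsid, sid, events):
    """Build one <service> element from (title, desc, start, duration) tuples.

    Assumption: oid and tsid are zero-padded to 4 characters, as in the sample.
    """
    service = ET.Element("service", oid=oid.zfill(4), tsid=tsid.zfill(4), sid=sid)
    for event_id, (title, desc, start, duration) in enumerate(events):
        event = ET.SubElement(service, "event", id=str(event_id))
        ET.SubElement(event, "name", lang="OFF", string=title)
        ET.SubElement(event, "text", lang="OFF", string=desc)
        ET.SubElement(event, "time", start_time=str(start), duration=str(duration))
    return service

service = build_service(
    "2", "806", "27e2",
    [("This is the title", "This is the description", 1284098400, 600)],
)
xml_text = ET.tostring(service, encoding="unicode")
print(xml_text)
```

Building the elements through the library rather than by string concatenation means characters such as & or " in titles and descriptions are escaped automatically.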
Thanks for the info. However, I'd prefer to stick to a shell script if it's possible to get good code that can parse 22k programmes in an hour.
As I said, my code works at the moment, but it's very messy and thus terrible on the performance front. I'll keep trying to tweak it and will take suggestions on board.
Not sure what you mean here. ghostdog has provided a Python script.
If you have a limitation on which language the script has to be in (although this is not always a good idea, as some are better suited than others), then you need to let us know. I was going to suggest Perl.
I agree with the above comments: if time is of the essence, bash and even many scripting languages are not the way to go. I'd write a small C program to convert them, or use Perl; the Python above should also be good enough. Why do you need to stick to a script?
Thanks for the responses, guys. Basically I need a shell script (sh) because a) that's all I know and b) I think that is the only thing available on the box I'll be running this on. Also, I have no idea how to install any new languages onto the box.
I've managed to get quite good performance (less than an hour) by:
1) extracting all the information out of the source file into a 'raw' format using multiple sed commands. The raw format is something like event_num~channel~startdate~enddate~title~desc
2) then using the usual read-file method to read each line and output it in the XML file format. I need the read-file method to a) convert startdate/enddate into Unix time, as I can't get this to work in sed itself (i.e. taking a sed variable and passing it to the date function), and b) look up the external file for sid, onid and tsid.
I don't have the script handy, otherwise I would post it here. Again, if you have any suggestions on how I could improve my sed statements to do a) and b) above, that would make it even faster.
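In a per-line read loop, launching external commands such as date or grep for every record is a common cause of slowness. One way around it is to load the lookup file into memory once and then process each raw record with plain string operations. The sketch below shows that shape in Python, to match the earlier example in the thread; an awk associative array would follow the same pattern. The function names and the sample raw line are made up for illustration, based on the event_num~channel~startdate~enddate~title~desc layout described above.

```python
def load_lookup(lines):
    """Read 'oid:tsid:sid=channel' lines into a dict keyed by channel."""
    table = {}
    for line in lines:
        line = line.strip()
        if line:
            ids, channel = line.split("=", 1)
            table[channel] = ids.split(":")
    return table

def parse_raw(line):
    """Split one raw record into its six '~'-separated fields."""
    names = ("event_num", "channel", "start", "stop", "title", "desc")
    # maxsplit=5 keeps any '~' inside the description intact
    return dict(zip(names, line.rstrip("\n").split("~", 5)))

# The lookup is read once, so each record costs only a dictionary access.
lookup = load_lookup(["2:806:27e2=channel1"])
record = parse_raw(
    "0~channel1~20100910060000~20100910061000~"
    "This is the title~This is the description"
)
oid, tsid, sid = lookup[record["channel"]]
print(oid, tsid, sid, record["title"])
```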
Well, I agree with ghostdog's original post that sed is almost definitely not the right tool for this job.
You could parse it with awk if you can guarantee the format.
As H has also said, it may be best to work out which scripting languages you have. Most distros have Perl and/or Python, which would be better suited in this case.
It could also be a good chance to expand your repertoire.