LinuxQuestions.org
Old 09-10-2010, 06:16 AM   #1
hotbaws11
LQ Newbie
 
Registered: Sep 2010
Posts: 4

Rep: Reputation: 0
looking for expert sed/script help for translation/substitution


Hi All,

I'm looking for some expert help with sed/scripting to work out the best way to transform one XML format into another. There are a few complexities around the translation, though.

The extra complexities are:
1) Take the start and stop times (YYYYMMDDHHMMSS), convert the start time to Unix time, and output the difference in seconds between the two times.
2) oid, tsid and sid are found by looking up the channel in an external file. For example, one of the lines in the file will be 2:806:27e2=channel1
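
For reference, a minimal sketch of the conversion in 1), assuming Python 3 is available. The names to_unix and honor_offset are made up for illustration. Reading the digits as UTC and dropping the "+0100" suffix gives the start_time that appears in the sample output file; applying the offset gives a value one hour earlier, so which reading is wanted is an open question here.

```python
import calendar
from datetime import datetime

def to_unix(stamp, honor_offset=False):
    """Convert 'YYYYMMDDHHMMSS +HHMM' to Unix seconds.

    honor_offset=False: read the digits as UTC and drop the offset.
    honor_offset=True:  apply the +HHMM offset.
    """
    if honor_offset:
        return int(datetime.strptime(stamp, "%Y%m%d%H%M%S %z").timestamp())
    digits = stamp.split()[0]
    return calendar.timegm(datetime.strptime(digits, "%Y%m%d%H%M%S").timetuple())

start = to_unix("20100910060000 +0100")
stop = to_unix("20100910061000 +0100")
print(start, stop - start)  # start time and duration in seconds
```

Neither branch depends on the time zone of the machine running it, which matters if the result has to be reproducible on another box.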

Is there any way to write piped sed commands that can do this? If not, any ideas on what the script should look like?

Thanks in advance.

Input File
Code:
<programme start="20100910060000 +0100" stop="20100910061000 +0100" channel="channel1">
<title lang="en">This is the title</title>
<desc>This is the description</desc>
</programme>
Output File
Code:
<service oid="0002" tsid="0806" sid="27e2">
<event id="0">
<name lang="OFF" string="This is the title"/>
<text lang="OFF" string="This is the description"/>
<time start_time="1284098400" duration="600"/>
</event>
</service>
Look up file for oid, tsid and sid
Code:
2:806:27e2=channel1
2:756:37a3=channel2
5:4a06:42e5=channel3
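
A possible way to read a lookup file like this into memory once, sketched in Python 3. parse_lookup is a made-up name, and padding each value to four hex digits is an assumption inferred from the sample output (oid="0002", tsid="0806"), not something stated in the thread.

```python
def parse_lookup(lines):
    """Map channel name -> (oid, tsid, sid), each zero-padded to 4 characters."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ids, channel = line.split("=", 1)
        oid, tsid, sid = ids.split(":")
        table[channel] = tuple(part.zfill(4) for part in (oid, tsid, sid))
    return table

# Sample lines copied from the lookup file in this post
sample = ["2:806:27e2=channel1", "2:756:37a3=channel2", "5:4a06:42e5=channel3"]
lookup = parse_lookup(sample)
print(lookup["channel1"])
```

With the table held in a dictionary, each channel lookup is a single in-memory access instead of a search through the file per programme.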

Last edited by hotbaws11; 09-10-2010 at 07:38 AM. Reason: updated to show input and output files
 
Old 09-10-2010, 07:30 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Show all the sample files! You are only showing the output format that you want, right?

1) The external file to get the sid, oid, tsid
2) The actual XML file to transform.

And sed is definitely not the correct tool for this!
 
Old 09-10-2010, 07:43 AM   #3
hotbaws11
LQ Newbie
 
Registered: Sep 2010
Posts: 4

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by ghostdog74
Show all the sample files! You are only showing the output format that you want, right?

1) The external file to get the sid, oid, tsid
2) The actual XML file to transform.

And sed is definitely not the correct tool for this!
The updated post above shows this. If sed isn't the tool, what should my script look like to read and parse the file? The source file has about 22k programs, so I'm looking for the best script for parsing.

I tried to write a script which reads one line at a time and does the translation that way, but it is extremely slow and would probably take 12 hours to parse! So I need help on how to write 'good' code for parsing the file.
 
Old 09-10-2010, 09:00 AM   #4
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Ideally, you should use an XML parser. Here's an example using Python and lxml:
Code:
#!/usr/bin/env python3
import time
from lxml import etree

lookup_file = "lookup"
input_file = "file"

# Store channels: map channel name -> "oid:tsid:sid"
channels = {}
with open(lookup_file) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        s, c = line.split("=")
        channels[c] = s

tree = etree.parse(input_file)
for item in tree.iter("programme"):
    unixtime = []
    for t in (item.get("start"), item.get("stop")):
        # Slice YYYYMMDDHHMMSS into its six fields; the "+0100" suffix is ignored
        fields = [int(t[i:j]) for i, j in ((0, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14))]
        # mktime() treats the fields as local time; -1 lets it work out DST
        unixtime.append(time.mktime(tuple(fields + [0, 0, -1])))
    ST = unixtime[0]
    timediff = unixtime[-1] - unixtime[0]
    oid, tsid, sid = channels[item.get("channel")].split(":")
    # Read title/desc from this <programme>, not from the tree root
    desc = item.findtext("desc")
    title = item.findtext("title")
    print("Description: ", desc)
    print("Title: ", title)
    print("oid,tsid and sid: ", oid.zfill(3), tsid, sid)
    print("Start Time: ", ST)
    print("Time diff: ", timediff)
output
Code:
$ ./python.py
Description:  This is the description
Title:  This is the title
oid,tsid and sid:  002 806 27e2
Start Time:  1284069600.0
Time diff:  600.0
I left out the creation of the output file. Either use lxml to write it, or use normal file I/O in Python (see the docs).
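
One way the missing output step could look, sketched with the standard library's xml.etree.ElementTree rather than lxml (that swap is mine, not something the post above used). build_service and the shape of the events list are made up for illustration, and grouping several events under one service element is an assumption based on the single-event sample in the first post.

```python
import xml.etree.ElementTree as ET

def build_service(oid, tsid, sid, events):
    """Build a <service> element.

    events is a list of (title, desc, start_time, duration) tuples;
    event ids are assigned in list order, starting at 0.
    """
    service = ET.Element("service", oid=oid, tsid=tsid, sid=sid)
    for event_id, (title, desc, start, duration) in enumerate(events):
        event = ET.SubElement(service, "event", id=str(event_id))
        ET.SubElement(event, "name", lang="OFF", string=title)
        ET.SubElement(event, "text", lang="OFF", string=desc)
        ET.SubElement(event, "time", start_time=str(start), duration=str(duration))
    return service

service = build_service(
    "0002", "0806", "27e2",
    [("This is the title", "This is the description", 1284098400, 600)],
)
print(ET.tostring(service, encoding="unicode"))
```

Building elements and letting the library serialize them means quotes and ampersands in titles are escaped automatically. The result is printed on a single line here, so the layout will not match the sample output file character for character.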
 
Old 09-10-2010, 10:41 AM   #5
hotbaws11
LQ Newbie
 
Registered: Sep 2010
Posts: 4

Original Poster
Rep: Reputation: 0
Thanks for the info; however, I'd prefer to keep to script if it's possible to get good code that can parse 22k programs in an hour.

As I said, my code at the moment works, but it's very messy and thus terrible on the performance front. I'll keep trying to tweak it and will take suggestions on board.

Cheers.
 
Old 09-11-2010, 02:00 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Quote:
I'd prefer to keep to script if it's possible
Not sure what you mean here? ghostdog has provided a Python script.

If you have a limitation on which language the script is to be in (although this is not always a good idea, as some are better suited than others), then you need to let us know. I was going to suggest Perl.
 
Old 09-11-2010, 10:32 AM   #7
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
I agree with the above comments: if time is of the essence, bash and even many scripting languages are not the way to go. I'd write a small C program to convert them, or use Perl; the Python above should also be good enough. Why do you need to stick to script?
 
Old 09-12-2010, 04:17 AM   #8
hotbaws11
LQ Newbie
 
Registered: Sep 2010
Posts: 4

Original Poster
Rep: Reputation: 0
Thanks for the responses, guys. Basically I need a shell script (sh) because a) that's all I know and b) I think that is the only thing available on the box I'll be running this on. Also, I have no idea how to install any new languages onto the box.

I've managed to get quite good performance (less than an hour) by:
1) Extracting all the information out of the source file into a 'raw' format using multiple sed commands. The raw format is something like event_num~channel~startdate~enddate~title~desc
2) Then using the usual read-file method to read each line and output it in the XML file format. I need the read-file method to a) convert startdate/enddate into Unix time (I can't get this to work in sed itself, i.e. taking a sed variable and passing it to the date function) and b) look up the external file for sid, onid, tsid.
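
For comparison, splitting one line of that raw format into named fields could be sketched like this in Python 3. RawEvent and parse_raw are made-up names, and the sample line is hypothetical, built from the field order given in 1).

```python
from collections import namedtuple

RawEvent = namedtuple("RawEvent", "event_num channel startdate enddate title desc")

def parse_raw(line):
    """Split one raw line into its six fields."""
    # maxsplit=5 keeps any "~" inside the description intact
    return RawEvent(*line.rstrip("\n").split("~", 5))

# Hypothetical raw line following the field order described above
event = parse_raw(
    "0~channel1~20100910060000~20100910061000~This is the title~This is the description\n"
)
print(event.channel, event.startdate, event.title)
```

A "~" inside a title would still break the split, so the delimiter needs to be one that cannot occur in the earlier fields.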

I don't have the script handy, otherwise I would post it here. Again, if you have any suggestions on how I could improve my sed statements to do a) and b) above, that would make it even faster.
 
Old 09-12-2010, 07:07 AM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Well I agree with ghostdog's original post that sed is almost definitely not the right tool for this job.
You could parse it with awk if you can guarantee the format.
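
To illustrate what "guaranteeing the format" implies, here is the same line-oriented idea sketched in Python 3 with only the standard library (Python rather than awk, to match the earlier example). It assumes every programme has exactly the attributes and child elements of the sample input, in that order, with title and desc each on one line. PROGRAMME and extract are made-up names.

```python
import re

# Matches one <programme> block laid out exactly like the sample input
PROGRAMME = re.compile(
    r'<programme start="(\d{14})[^"]*" stop="(\d{14})[^"]*" channel="([^"]*)">\s*'
    r'<title[^>]*>(.*?)</title>\s*'
    r'<desc[^>]*>(.*?)</desc>\s*'
    r'</programme>'
)

def extract(text):
    """Yield (channel, start, stop, title, desc) for each matching block."""
    for start, stop, channel, title, desc in PROGRAMME.findall(text):
        yield channel, start, stop, title, desc

sample = '''<programme start="20100910060000 +0100" stop="20100910061000 +0100" channel="channel1">
<title lang="en">This is the title</title>
<desc>This is the description</desc>
</programme>'''

for row in extract(sample):
    print("~".join(row))
```

Any programme that deviates from the layout (a missing desc, reordered attributes, a multi-line title) is skipped silently, and XML entities such as &amp; are left undecoded. That fragility is the reason a real XML parser is the safer choice.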

As H has also said, it may be best to work out which scripting languages you have. Most distros have Perl and/or Python, which would be more adept in this case.

Could also be a good chance to expand your repertoire.
 
  


All times are GMT -5.
