LinuxQuestions.org
Old 09-04-2008, 01:26 AM   #1
sxn
LQ Newbie
 
Registered: May 2006
Distribution: Gentoo
Posts: 8

Rep: Reputation: 1
A sed challenge


Dear All,

Suppose someone (me) gives you the following file, called yummies.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Catalogues>
<Vegies>
<Vegie name="Broccoli"/>
<Vegie name="Zuchini"/>
<Vegie name="Carrot"/>
</Vegies>
<Fruits>
<Fruit name="Raspberrie" validTo="2007-11-08"/>
<Fruit name="Date" validTo=""/>
<Fruit name="Peach" validTo="2008-04-29"/>
<Fruit name="Pear" validTo="2008-11-23"/>
<Fruit name="Mango"/>
</Fruits>
<Candies>
</Candies>
<Nuts>
<Nut name="Wallnut" validTo="2006-12-31"/>
<Nut name="Pecan" validTo="2008-01-15"/>
</Nuts>
</Catalogues>

The task is to generate from it a series of XML files, each named after its catalogue and containing only unexpired records. Ideally, files that would be empty are not created at all.

So, the best solution will be:
Vegies.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Vegies>
<Vegie name="Broccoli"/>
<Vegie name="Zuchini"/>
<Vegie name="Carrot"/>
</Vegies>

Fruits.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Fruits>
<Fruit name="Date" validTo=""/>
<Fruit name="Pear" validTo="2008-11-23"/>
<Fruit name="Mango"/>
</Fruits>

The second best will also include:
Nuts.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Nuts>
</Nuts>

And the worst will have one more file:
Candies.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Candies>
</Candies>

Any one of these solutions is acceptable. "Expired records" are records that have a validTo attribute whose date (always YYYY-MM-DD) precedes today. An empty validTo means a living record, and so does a missing validTo attribute.
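The expiry rule above can be sketched as a small predicate. This is only an illustration: the name `is_expired` is hypothetical, and the example date is simply the day this thread was posted. It relies on the stated YYYY-MM-DD format, in which string order matches date order.

```python
from datetime import date


def is_expired(valid_to, today):
    """Return True when a record's validTo date precedes today.

    valid_to is the attribute value, or None when the attribute is absent.
    (Hypothetical helper, not part of any library.)
    """
    if not valid_to:  # missing or empty attribute: a living record
        return False
    # YYYY-MM-DD strings sort chronologically, so a plain string
    # comparison is enough and no date parsing is needed.
    return valid_to < today.isoformat()


today = date(2008, 9, 4)  # example date only
print(is_expired("2007-11-08", today))  # True
print(is_expired("2008-11-23", today))  # False
print(is_expired("", today))            # False
print(is_expired(None, today))          # False
```

A record whose validTo equals today is kept, since only dates preceding today count as expired.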

I solved the problem in Python, as that's what I "speak" fluently. The problem is that the real XML is huge, and processing takes hours. So I turned my attention to sed. To test whether there is any significant performance improvement, I tried the following one-liner:

sed -r '/validTo=".+"/d' < yummies.xml > result.xml

It just deletes records with a non-empty validTo attribute (stating the obvious?), and it did so in a matter of seconds. The test input file had 447090 lines; after applying this sed filter, I ended up with 174652.

It looks like a very promising path, so I tried to go further and filter out only the expired records. And here is where I stumbled, as I'm not versed enough to write the regex to check the date.

So is it doable? How much of it? Can sed filter out the expired records? Can it also generate the resulting files (i.e. split the initial XML)? Can it also skip the would-be empty resulting files?
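For comparison, here is a rough line-oriented sketch of all three steps (filter, split, skip empty files) in Python rather than sed. It assumes the input always looks like the sample: one tag per line, with catalogues nested directly under the root. The names `split_lines` and `write_catalogues` are made up for this illustration, and this is not a general XML parser.

```python
import re
from datetime import date

VALID_TO = re.compile(r'validTo="(\d{4}-\d{2}-\d{2})"')
OPEN_TAG = re.compile(r'<(\w+)>')  # a bare opening tag such as <Fruits>

SAMPLE = '''<?xml version="1.0" encoding="UTF-8"?>
<Catalogues>
<Vegies>
<Vegie name="Broccoli"/>
<Vegie name="Zuchini"/>
<Vegie name="Carrot"/>
</Vegies>
<Fruits>
<Fruit name="Raspberrie" validTo="2007-11-08"/>
<Fruit name="Date" validTo=""/>
<Fruit name="Peach" validTo="2008-04-29"/>
<Fruit name="Pear" validTo="2008-11-23"/>
<Fruit name="Mango"/>
</Fruits>
<Candies>
</Candies>
<Nuts>
<Nut name="Wallnut" validTo="2006-12-31"/>
<Nut name="Pecan" validTo="2008-01-15"/>
</Nuts>
</Catalogues>
'''


def split_lines(lines, today):
    """Yield (catalogue_name, kept_record_lines) for each non-empty catalogue."""
    today_str = today.isoformat()
    depth, name, records = 0, None, []
    for line in lines:
        text = line.strip()
        opened = OPEN_TAG.fullmatch(text)
        if opened:
            depth += 1
            if depth == 2:  # depth 1 is the root element
                name, records = opened.group(1), []
        elif text.startswith("</"):
            if depth == 2 and records:  # skip catalogues with nothing left
                yield name, records
            depth -= 1
        elif depth == 2 and text.startswith("<"):
            found = VALID_TO.search(text)
            # Keep the record unless it carries a date earlier than today.
            if not (found and found.group(1) < today_str):
                records.append(text)


def write_catalogues(lines, today):
    """Write one <name>.xml file per non-empty catalogue."""
    header = '<?xml version="1.0" encoding="UTF-8"?>'
    for name, records in split_lines(lines, today):
        body = [header, "<%s>" % name] + records + ["</%s>" % name]
        with open(name + ".xml", "w", encoding="utf-8") as out:
            out.write("\n".join(body) + "\n")


kept = dict(split_lines(SAMPLE.splitlines(), date(2008, 9, 4)))
print(list(kept))  # ['Vegies', 'Fruits']
```

For a real file, something like `write_catalogues(open("yummies.xml", encoding="utf-8"), date.today())` would stream it line by line without loading the whole document.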

Thanks for your advice,
SxN
 
Old 09-04-2008, 01:59 AM   #2
w3bd3vil
Senior Member
 
Registered: Jun 2006
Location: Hyderabad, India
Distribution: Fedora
Posts: 1,189

Rep: Reputation: 49
You could probably do it, but I don't think any one-liner will do that. Build a shell script; it shouldn't take a lot of your time.
 
Old 09-04-2008, 02:22 AM   #3
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
Sed is the wrong tool for this. Use awk, or perl with an XML parsing module.

With only 450k lines, your Python implementation must have been pretty suboptimal to take hours.
 
Old 09-04-2008, 09:34 AM   #4
sxn
LQ Newbie
 
Registered: May 2006
Distribution: Gentoo
Posts: 8

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by Mr. C.
Sed is the wrong tool for this. Use awk, or perl with an XML parsing module.

With only 450k lines, your Python implementation must have been pretty suboptimal to take hours.
I'm not looking for a one-liner for this task; I'm looking for a fast, working solution. I thought about awk, but as far as I know it (and, I have to admit, that's not very far), I have to rely on a certain structure to be able to access fields. That's not the case here: attributes may or may not be present, in one position or another, and this frustrates my little awk group of neurons. As for Perl... it's as foreign as ancient Greek to me...

(for the record) The Python looks like this:

Code:
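#! /usr/bin/env python
# Python 2 script: split yummies.xml into one file per non-empty catalogue.

from xml.etree.cElementTree import *
from Tkinter import *

# Small Tk window that shows progress messages.
r=Tk()
tl1=StringVar()
Label(r,textvariable=tl1,justify=LEFT).pack()

tree=ElementTree(file='yummies.xml')
for i in tree.getroot():      # i: one catalogue (Vegies, Fruits, ...)
  for j in list(i):           # iterate over a copy, since records are removed inside the loop
    # Drop every record with a non-empty validTo, whatever its date.
    if 'validTo' in j.attrib and j.attrib['validTo']:
      i.remove(j)
  if len(i):                  # skip catalogues left without records
    tl1.set(tl1.get()+i.tag+(" : %s records" % len(i)))
    ElementTree(i).write("%s.xml" % i.tag,'utf-8')
    tl1.set(tl1.get()+' written\n')
tl1.set(tl1.get()+'\nDone')

r.mainloop()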
When I wrote it, I understood "expired records" to be those with a validTo attribute holding any value. Now I am told that records with validTo>=today are to be kept, which means an extra test and more processing time.
The Tkinter part is needed because this script is run without a console, and some feedback is needed.
I don't know of a faster module for XML parsing than cElementTree... and I'm worried about the real XML files, which will be big, I'm told.
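As a sketch of how the extra date test might fit into the ElementTree approach: the version below uses Python 3's `xml.etree.ElementTree` and leaves out the Tkinter feedback. The function name `live_catalogues` and the shortened sample are assumptions made for illustration. The test itself is a plain string comparison, which works because the dates are always YYYY-MM-DD.

```python
import xml.etree.ElementTree as ET
from datetime import date

SAMPLE = '''<Catalogues>
<Fruits>
<Fruit name="Raspberrie" validTo="2007-11-08"/>
<Fruit name="Date" validTo=""/>
<Fruit name="Pear" validTo="2008-11-23"/>
<Fruit name="Mango"/>
</Fruits>
<Candies>
</Candies>
<Nuts>
<Nut name="Pecan" validTo="2008-01-15"/>
</Nuts>
</Catalogues>
'''


def live_catalogues(root, today):
    """Remove expired records in place; yield catalogues that still have records."""
    today_str = today.isoformat()
    for catalogue in root:
        for record in list(catalogue):  # copy: records are removed while looping
            valid_to = record.get("validTo")  # None when the attribute is absent
            # Expired only if a date is present and it precedes today.
            if valid_to and valid_to < today_str:
                catalogue.remove(record)
        if len(catalogue):  # skip catalogues left without records
            yield catalogue


root = ET.fromstring(SAMPLE)
for catalogue in live_catalogues(root, date(2008, 9, 4)):
    print(catalogue.tag, len(catalogue))  # Fruits 3
```

Each yielded catalogue could then be saved with `ET.ElementTree(catalogue).write("%s.xml" % catalogue.tag, encoding="utf-8", xml_declaration=True)`.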

SxN
 
Old 09-04-2008, 08:01 PM   #5
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,226

Rep: Reputation: 2023
I suggest you ask the mods to move this to the programming forum (use the Report button).
 
Old 09-08-2008, 03:06 AM   #6
reddazz
Guru
 
Registered: Nov 2003
Location: N. E. England
Distribution: Fedora, CentOS, Debian
Posts: 16,298

Rep: Reputation: 73
Moved: This thread is more suitable in the Programming forum and has been moved accordingly to help your thread/question get the exposure it deserves.
 
  

