Originally Posted by Mr. C.
Sed is the wrong tool for this. Use awk, or perl with an XML parsing module.
With only 450k lines, your Python implementation must have been pretty suboptimal to take hours.
I'm not looking for a one-liner for this task; I'm looking for a fast, working solution. I thought about awk, but as far as I know it (and, I have to admit, that's not very far), I would have to rely on a fixed structure to be able to access fields. That's not the case here: attributes may or may not be present, in one position or another, and this frustrates my little awk group of neurons. As for Perl... it's as foreign to me as ancient Greek...
(for the record) The Python looks like this:
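#! /usr/bin/env python
from xml.etree.cElementTree import *
from Tkinter import *
# Not shown in this excerpt: `tree` is the parsed input document and `tl1` is
# the Tkinter variable (presumably a StringVar) behind the status label.
for i in tree.getroot():
    for j in i.getchildren():
        if 'validTo' in j.attrib and j.attrib['validTo']:
            i.remove(j)  # assumed: this line is missing from the paste; drops the expired record
    tl1.set(tl1.get()+i.tag+(" : %s records" % len(i)))
    ElementTree(i).write("%s.xml" % i.tag,'utf-8')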
When I wrote it, I understood "expired records" to be those with a validTo attribute holding any non-empty value. Now I am told that records with validTo>=today are to be kept, which means an extra test and more processing time.
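A minimal sketch of that extra test, under assumptions of mine: validTo is taken to hold an ISO date (YYYY-MM-DD), which I have not confirmed for the real files, and the helper names is_expired and drop_expired plus the sample elements are made up for illustration. If the dates really are ISO, plain string comparison gives the right ordering, so there is no date parsing per record. It is written for Python 3 (xml.etree.ElementTree) rather than the Python 2 script above.

```python
import xml.etree.ElementTree as ET
from datetime import date


def is_expired(record, today):
    """True when validTo is present, non-empty, and earlier than today."""
    value = record.get('validTo')
    if not value:
        return False  # missing or empty validTo: keep the record
    # Assumption: validTo is an ISO date, so strings sort chronologically.
    return value < today.isoformat()


def drop_expired(group, today):
    """Remove expired children of `group` in place; return how many were removed."""
    expired = [rec for rec in group if is_expired(rec, today)]
    for rec in expired:
        group.remove(rec)
    return len(expired)


# Hypothetical sample data, only to exercise the rule.
sample = ET.fromstring(
    '<group>'
    '<rec id="1"/>'
    '<rec id="2" validTo=""/>'
    '<rec id="3" validTo="2009-01-31"/>'
    '<rec id="4" validTo="2009-02-01"/>'
    '<rec id="5" validTo="2009-03-15"/>'
    '</group>'
)
removed = drop_expired(sample, date(2009, 2, 1))
kept_ids = [rec.get('id') for rec in sample]
```

Collecting the expired records first and removing them afterwards avoids changing the element while it is being iterated.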
The Tkinter part is needed as this script is run without a console, and there is a need for some feedback.
I don't know of a faster module for XML parsing than cElementTree... and I'm worried about the real XML files, which I'm told will be big.
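For the big files, one option worth trying is ElementTree's iterparse, which hands over elements as they are completed instead of building the whole tree first. The sketch below is only that, a sketch: split_groups, write_group and the sample tags are hypothetical names, the ISO-date assumption for validTo is mine, and I have not measured whether it is actually faster. What it does do is keep only one top-level group in memory at a time (a single huge group would still be held whole).

```python
import io
import xml.etree.ElementTree as ET


def split_groups(source, today_iso, write_group):
    """Stream `source`; filter each child of the root and pass it to write_group."""
    counts = {}
    depth = 0
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            depth += 1
            continue
        depth -= 1
        if depth != 1:
            continue  # act only when a direct child of the root is complete
        for rec in list(elem):
            value = rec.get('validTo')
            # Assumption: ISO dates, so string comparison orders them correctly.
            if value and value < today_iso:
                elem.remove(rec)
        counts[elem.tag] = len(elem)
        write_group(elem)
        elem.clear()  # release the group's children before parsing the next one
    return counts


# Hypothetical input, held in memory so the example touches no files.
demo = io.BytesIO(
    b'<root>'
    b'<customers><c validTo="2009-01-01"/><c/></customers>'
    b'<orders><o validTo="2009-12-31"/><o validTo="2008-05-05"/><o/></orders>'
    b'</root>'
)
written = {}
counts = split_groups(demo, '2009-02-01',
                      lambda g: written.update({g.tag: ET.tostring(g)}))
```

In the real script, write_group could be something like `lambda g: ET.ElementTree(g).write("%s.xml" % g.tag, 'utf-8')`, and the counts would feed the Tkinter label.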