Either one of these solutions is acceptable. "Expired records" are records that have a validTo attribute whose date (always YYYY-MM-DD) precedes today. An empty validTo means a living record, and so does a missing validTo attribute.
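As a minimal sketch of that rule (not code from the thread; the function name is_expired and its today parameter are made up for illustration), the check can be written as a small Python predicate. It relies on the fact that YYYY-MM-DD strings sort in the same order as the dates they name, so no date parsing is needed:

```python
from datetime import date


def is_expired(valid_to, today=None):
    """Return True when a validTo value names a date before today.

    None (attribute absent) and "" (empty attribute) both mean a
    living record, so they are never expired.
    """
    if not valid_to:
        return False
    if today is None:
        today = date.today()
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return valid_to < today.isoformat()
```

A record whose validTo equals today is kept, since only dates preceding today count as expired.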
I solved the problem in Python, as that's what I "speak" fluently. The problem is that the real xml is huge, and the processing takes hours. So I turned my attention to sed. To test whether there is any significant performance improvement, I tried the following one-liner:
sed -r '/validTo=".+"/d' < yummies.xml > result.xml
It just deletes records with a non-empty validTo attribute (stating the obvious?), and it did so in a matter of seconds. The test input file had 447090 lines; after applying this sed filter, I ended up with 174652.
It looks like a very promising path, so I tried to go further and filter all expired records. And here is where I stumbled, as I'm not versed enough to write the regex to check the date.
So is it doable? How much of it? Can sed filter out the expired records? Can it also generate the resulting files (i.e. split the initial xml)? Can it also skip the would-be empty resulting files?
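For comparison, here is a rough line-oriented sketch of the date check in Python, under the same assumption the sed one-liner makes: each record sits on its own line. The names VALID_TO, keep_line and filter_lines are invented for this example. The regex finds validTo wherever it appears among the attributes, and the date is compared as text rather than matched digit by digit:

```python
import re
from datetime import date

# Matches validTo="YYYY-MM-DD" anywhere on the line; an empty or
# malformed value does not match and the line is kept.
VALID_TO = re.compile(r'\bvalidTo="(\d{4}-\d{2}-\d{2})"')


def keep_line(line, today):
    """Keep a line unless it carries a validTo date before today."""
    match = VALID_TO.search(line)
    return match is None or match.group(1) >= today


def filter_lines(lines, today=None):
    """Yield only the lines that do not hold an expired record."""
    if today is None:
        today = date.today().isoformat()
    for line in lines:
        if keep_line(line, today):
            yield line
```

Because filter_lines is a generator, it can be fed an open file object and written out line by line without loading the whole file. It only answers the filtering part of the question; splitting into separate files needs some knowledge of the xml structure.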
Sed is the wrong tool for this. Use awk, or perl with an XML parsing module.
With only 450k lines, your Python implementation must have been pretty suboptimal to take hours.
I'm not looking for a one-liner for this task, I'm looking for a fast working solution. I thought about awk, but as far as I know it (and, I have to admit, that's not very far), I have to rely on a certain structure, so that I'm able to access fields. That's not the case here; attributes may be there or not, in one position or another, and this frustrates my little awk group of neurons. As for Perl... it's as foreign as ancient Greek to me...
(for the record) The Python looks like this:
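#! /usr/bin/env python
from xml.etree.cElementTree import *
from Tkinter import *
# (the lines that parse the input into `tree` and create the Tk variable `tl1` are not shown)
for i in tree.getroot():
    for j in i.getchildren():
        if 'validTo' in j.attrib and j.attrib['validTo']:
            i.remove(j)  # body missing from the post; dropping the record is an assumption
    tl1.set(tl1.get()+i.tag+(" : %s records" % len(i)))
    ElementTree(i).write("%s.xml" % i.tag,'utf-8')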
When I wrote it, I understood that "expired records" are those with a validTo attribute and a value (any) in it. Now I am told that records with validTo>=today are to be kept, which means an extra test and more processing time.
The Tkinter part is needed as this script is run without a console, and there is a need for some feedback.
I don't know of a faster module for xml parsing than cElementTree... and I'm worried about the real xml files, which will be big, I'm told.
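One possible direction for big files, sketched here in Python 3 and not tested against the real data, is to stream the document with iterparse instead of loading the whole tree. The layout is taken from the script above (root, then sections, then records carrying validTo); the function name split_sections and the out_dir parameter are made up. Each section is filtered and written as soon as its closing tag is read, empty sections are skipped, and the root is cleared so finished sections do not pile up in memory:

```python
import os
import xml.etree.ElementTree as ET
from datetime import date


def split_sections(source, out_dir, today=None):
    """Write one <tag>.xml per non-empty section; return the paths written."""
    if today is None:
        today = date.today().isoformat()
    written = []
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem  # first element opened is the document root
            continue
        depth -= 1
        if depth != 1:
            continue  # only act when a direct child of the root closes
        for record in list(elem):
            valid_to = record.get("validTo")
            # Missing or empty validTo means a living record.
            if valid_to and valid_to < today:
                elem.remove(record)
        if len(elem):  # skip would-be empty result files
            path = os.path.join(out_dir, elem.tag + ".xml")
            ET.ElementTree(elem).write(path, encoding="utf-8")
            written.append(path)
        root.clear()  # release the section that was just handled
    return written
```

Memory use is bounded by the largest single section rather than the whole file. As in the original script, two sections with the same tag would write to the same file name, the later one overwriting the earlier.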