[SOLVED] XML remove odd lines between tag

grlopes · 09-12-2012, 05:00 PM

Hi,

I have an XML file like this:

<itemResult>
<date>
<datex>something</datex>
</date>
<item itemname="xyz">
<a_1>85</a_1>
<a_2>62</a_2>
<a_3>48</a_3>
<a_4>78</a_4>
</item>
</itemResult>
<itemResult>
<date>
<datex>something_2</datex>
</date>
<item itemname="abc">
<a_8>85</a_8>
<a_7>62</a_7>
<a_9>48</a_9>
<a_3>78</a_3>
</item>
<item itemname="xpto">
<v_1>85</v_1>
<v_2>62</v_2>
<d_3>48</d_3>
<d_4>78</d_4>
</item>
</itemResult>

and I need to delete the odd-numbered lines between <item> and </item>, like this:

<itemResult>
<date>
<datex>something</datex>
</date>
<item itemname="xyz">
<a_2>62</a_2>
<a_4>78</a_4>
</item>
</itemResult>
<itemResult>
<date>
<datex>something_2</datex>
</date>
<item itemname="abc">
<a_7>62</a_7>
<a_3>78</a_3>
</item>
<item itemname="xpto">
<v_2>62</v_2>
<d_4>78</d_4>
</item>
</itemResult>

I tried with sed, but it matches from the first <item> to the last </item>.
Any solution using awk?
Any help?
Thanks for all

ntubski · 09-12-2012, 05:44 PM

Quote:

Originally Posted by grlopes

I have an XML file like this:

What you posted isn't well-formed XML because it has more than one root element. Assuming it is wrapped in a single root element:

Code:

<results>
<itemResult>
  <date>
    <datex>something</datex>
  </date>
  <item itemname="xyz">
    <a_1>85</a_1>
    <a_2>62</a_2>
    <a_3>48</a_3>
    <a_4>78</a_4>
  </item>
</itemResult>
<itemResult>
  <date>
    <datex>something_2</datex>
  </date>
  <item itemname="abc">
    <a_8>85</a_8>
    <a_7>62</a_7>
    <a_9>48</a_9>
    <a_3>78</a_3>
  </item>
  <item itemname="xpto">
    <v_1>85</v_1>
    <v_2>62</v_2>
    <d_3>48</d_3>
    <d_4>78</d_4>
  </item>
</itemResult>
</results>

You can use XMLStarlet:

Code:

xmlstarlet ed -d '//item/*[position() mod 2 = 1]' input.xml > output.xml

grlopes · 09-12-2012, 05:54 PM

Thank you ntubski.
I know that this is not valid XML; I only posted the critical part to explain my problem.
I will check out xmlstarlet, but I would prefer a solution using standard Linux commands.
Is that possible?

markush · 09-13-2012, 12:50 AM

Quote:

Originally Posted by grlopes

...
I tried with sed, but it matches from the first <item> to the last </item>.
...

Hi,

You should post your sed attempt; it may be a good starting point.

Markus

theNbomr · 09-13-2012, 09:07 AM

XML is not readily parsed with simple regex tools. That is the reason why tools like xmlstarlet and proper XML parser modules for scripting languages like Perl & Python were created. If your XML is known to always use a constant tag-per-line format, you will probably be able to solve your problem with AWK. With luck, it can be done as a one-liner suitable for embedding in a broader script.
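As a rough illustration of such an AWK one-liner, here is a minimal sketch. It assumes one tag per line, no nested <item> elements, and that "odd lines" means the 1st, 3rd, 5th, ... child line inside each <item> (which matches the sample output in the first post). The function name drop_odd_item_lines is made up for this example.

```shell
# Hypothetical helper: drop the 1st, 3rd, 5th, ... line inside each <item> block.
# Assumes one tag per line and no nested <item> elements.
drop_odd_item_lines() {
  awk '
    /<item[ >]/ { inside = 1; n = 0; print; next }   # opening tag: start counting
    /<\/item>/  { inside = 0; print; next }          # closing tag: stop counting
    inside      { if (++n % 2 == 0) print; next }    # keep only even-numbered children
                { print }                            # lines outside <item> pass through
  ' "$@"
}

# Usage on a small sample; prints the <item> line, <a_2>, <a_4> and </item>.
printf '%s\n' '<item itemname="xyz">' '<a_1>85</a_1>' '<a_2>62</a_2>' \
  '<a_3>48</a_3>' '<a_4>78</a_4>' '</item>' | drop_odd_item_lines
```

Because the counter is reset at every opening tag, each <item> is handled independently, which avoids matching from the first <item> to the last </item>.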
When solving problems like yours, it is helpful to explain verbosely what pattern of matching/deleting/substitution you are trying to accomplish. Use terms that describe the target text and the relationships to surrounding text. For example "text matching one alpha character followed by an underscore and one or more numeric characters, all enclosed in '<' & '>'". Using such language will force you to unambiguously identify the patterns, and once you have done this, the translation to code will be much easier. It is something like a mental specification of the problem, and working from a specification is always much more productive than making stuff up on the fly.
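For instance, the verbal specification quoted above translates almost word for word into an extended regular expression. This is only a sketch; the variable name tag_pattern and the sample lines are invented for the example.

```shell
# One alpha character, an underscore, one or more digits, enclosed in '<' and '>'.
tag_pattern='<[[:alpha:]]_[[:digit:]]+>'

# Hypothetical sample lines; only the first and third contain a matching tag.
printf '%s\n' '<a_1>85</a_1>' '<date>' '<d_34>48</d_34>' '<item itemname="xyz">' |
  grep -E "$tag_pattern"
# prints:
# <a_1>85</a_1>
# <d_34>48</d_34>
```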
I thought when you said "delete odd lines" it might mean something like "delete elements in 'item' tags where the tagname is suffixed with an odd number character". However, in your sample output, I see the tags '<a_3>78</a_3>', so my hypothesis about your intention must be incorrect. I cannot see any unambiguous pattern that could translate the input to the supplied output.

I guess it is unlikely that you have written the XML generator, but it is worth mentioning that the format is not well chosen, since the tag names appear to contain information about the content of the tag. Numeric indices applied to the tagname would be better implemented as attributes to the tag. This will reduce the complexity of any attached DTD and simplify the work of any parser. It probably makes the XML generator simpler as well.

--- rod.

firstfire · 09-13-2012, 02:15 PM

Hi.

This seems to do the job:

Code:

$ sed -r '/<item /{n; :a; N; /<\/item>/{ s/[^\n]*\n([^\n]*\n)/\1/g; b}; ba}' infile
<results>
<itemResult>
  <date>
    <datex>something</datex>
  </date>
  <item itemname="xyz">
    <a_2>62</a_2>
    <a_4>78</a_4>
  </item>
</itemResult>
<itemResult>
  <date>
    <datex>something_2</datex>
  </date>
  <item itemname="abc">
    <a_7>62</a_7>
    <a_3>78</a_3>
  </item>
  <item itemname="xpto">
    <v_2>62</v_2>
    <d_4>78</d_4>
  </item>
</itemResult>
</results>

EDIT: This is much better:

Code:

$ sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' infile

grlopes · 09-13-2012, 04:00 PM

thank you firstfire
it works like a charm

I will try to understand this sed syntax

markush · 09-13-2012, 04:32 PM

Hello,

Quote:

Originally Posted by grlopes

...
I will try to understand this sed syntax

Here's an interesting link with resources: http://sed.sourceforge.net/

Markus

David the H. · 09-16-2012, 05:07 AM

I highly suggest you carefully read again what ntubski and theNbomr posted. As always, use the right tool for the job.

Regex-based tools are designed for use on line-oriented text, but XML is tag-oriented. You can never be completely assured that a sed or awk solution will always parse it accurately.

FYI, xmlstarlet uses standard XPath expressions to match and modify entries. It takes a bit of learning (and I'm still pretty much a novice at it), but it really is quite clean and flexible once you know what you are doing.
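To illustrate, here is a small sketch (assuming GNU sed) that feeds the sed command posted earlier in this thread the same element in two layouts. The wrapper name strip_odd is made up for the example.

```shell
# The sed command from this thread, wrapped in a hypothetical helper.
strip_odd() {
  sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' "$@"
}

# One tag per line: works as intended, only <a_2> survives inside the item.
printf '%s\n' '<item itemname="xyz">' '<a_1>85</a_1>' '<a_2>62</a_2>' '</item>' |
  strip_odd

# Equivalent XML with both children on one line: the whole line is dropped,
# so <a_2> is lost as well.
printf '%s\n' '<item itemname="xyz">' '<a_1>85</a_1><a_2>62</a_2>' '</item>' |
  strip_odd
```

Both inputs describe the same XML content, but the line-based command gives different results because it counts lines, not elements.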

http://www.w3.org/TR/xpath

firstfire · 09-16-2012, 09:52 AM

Hi.

Quote:

Originally Posted by grlopes

thank you firstfire
it works like a charm

I will try to understand this sed syntax

My last command

Code:

$ sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' infile

is pretty simple. The flags -rn tell sed that we want to use extended regular expressions (-r; it is not needed here, so you can omit it) and that we don't want sed to print every line to standard output automatically (-n).

/<item /,/<\/item>/{commands} == "Execute the given commands for each line from the one matching the first regular expression /<item / through the one matching the second regular expression /<\/item>/, inclusive." This is a so-called address range.

/<item /be; /<\/item>/be; == "For lines matching either regular expression, branch (goto) to label :e (see the last two commands)." This effectively prints the opening and closing item tags, because we have the `p' (print) command after label `e'.

n; -- read the next line into the pattern space, replacing the current line (which is not printed, because of -n). This command does the job: it skips the first, third, etc. lines inside the current item. The other lines get printed by the `p' command at the end.
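The same command can be laid out one command per line with comments, which may be easier to follow. This is a sketch assuming GNU sed; the function name strip_odd_annotated is made up for the example, and -r is left out since it is not needed.

```shell
# Hypothetical wrapper around a commented, multi-line version of the command.
strip_odd_annotated() {
  sed -n '
    # Address range: from a line containing "<item " through the next "</item>".
    /<item /,/<\/item>/ {
      # Opening tag: jump to label e, which prints it.
      /<item /be
      # Closing tag: jump to label e as well.
      /<\/item>/be
      # Any other line here is an odd-numbered child: replace it with the next line.
      n
    }
    :e
    # Print whatever is now in the pattern space.
    p
  ' "$@"
}

# Usage; prints the <item> line, <v_2>, <d_4> and </item>.
printf '%s\n' '<item itemname="xpto">' '<v_1>85</v_1>' '<v_2>62</v_2>' \
  '<d_3>48</d_3>' '<d_4>78</d_4>' '</item>' | strip_odd_annotated
```

Note that with an odd number of children, `n' would read the closing </item> line itself, so the address range would not be closed at that tag. The sketch therefore assumes an even number of child lines per item, as in the sample data.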

BTW, I completely agree with previous speakers: line-oriented tools are not good for xml/html/.*ml.