LinuxQuestions.org
Old 09-12-2012, 05:00 PM   #1
grlopes
LQ Newbie
 
Registered: Sep 2012
Posts: 3

Rep: Reputation: Disabled
XML remove odd lines between tag


Hi,

I have an XML file like this:

<itemResult>
<date>
<datex>something</datex>
</date>
<item itemname="xyz">
<a_1>85</a_1>
<a_2>62</a_2>
<a_3>48</a_3>
<a_4>78</a_4>
</item>
</itemResult>
<itemResult>
<date>
<datex>something_2</datex>
</date>
<item itemname="abc">
<a_8>85</a_8>
<a_7>62</a_7>
<a_9>48</a_9>
<a_3>78</a_3>
</item>
<item itemname="xpto">
<v_1>85</v_1>
<v_2>62</v_2>
<d_3>48</d_3>
<d_4>78</d_4>
</item>
</itemResult>



and I need to delete the odd lines between <item> and </item>, like this:

<itemResult>
<date>
<datex>something</datex>
</date>
<item itemname="xyz">
<a_2>62</a_2>
<a_4>78</a_4>
</item>
</itemResult>
<itemResult>
<date>
<datex>something_2</datex>
</date>
<item itemname="abc">
<a_7>62</a_7>
<a_3>78</a_3>
</item>
<item itemname="xpto">
<v_2>62</v_2>
<d_4>78</d_4>
</item>
</itemResult>


I tried with sed, but it picks up the first <item> and the last </item>.
Is there a solution using awk?
Any help?
Thanks to all.
 
Old 09-12-2012, 05:44 PM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083
Quote:
Originally Posted by grlopes View Post
I have an XML file like this
What you posted isn't valid XML because it has more than one root element. Assuming valid XML:
Code:
<results>
<itemResult>
  <date>
    <datex>something</datex>
  </date>
  <item itemname="xyz">
    <a_1>85</a_1>
    <a_2>62</a_2>
    <a_3>48</a_3>
    <a_4>78</a_4>
  </item>
</itemResult>
<itemResult>
  <date>
    <datex>something_2</datex>
  </date>
  <item itemname="abc">
    <a_8>85</a_8>
    <a_7>62</a_7>
    <a_9>48</a_9>
    <a_3>78</a_3>
  </item>
  <item itemname="xpto">
    <v_1>85</v_1>
    <v_2>62</v_2>
    <d_3>48</d_3>
    <d_4>78</d_4>
  </item>
</itemResult>
</results>
You can use XMLStarlet:
Code:
xmlstarlet ed -d '//item/*[position() mod 2 = 1]' input.xml > output.xml
 
1 member found this post helpful.
Old 09-12-2012, 05:54 PM   #3
grlopes
LQ Newbie
 
Registered: Sep 2012
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thank you ntubski.
I know that this is not valid XML; I only posted the critical part to explain my problem.
I will check out xmlstarlet, but I would prefer a solution using standard Linux commands.
Is that possible?
 
Old 09-13-2012, 12:50 AM   #4
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: Disabled
Quote:
Originally Posted by grlopes View Post
...
I tried with sed, but it picks up the first <item> and the last </item>.
...
Hi,

You should post your sed attempt; it may be a good starting point.

Markus
 
Old 09-13-2012, 09:07 AM   #5
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
XML is not readily parsed with simple regex tools. That is the reason why tools like xmlstarlet and proper XML parser modules for scripting languages like Perl & Python were created. If your XML is known to always use a constant tag-per-line format, you will probably be able to solve your problem with AWK. With luck, it can be done as a one-liner suitable for embedding in a broader script.
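A minimal sketch of such an AWK filter, assuming one tag per line as in the sample and taking "odd lines" to mean the 1st, 3rd, ... line inside each item. The function name `drop_odd_item_lines` is made up for illustration, and this is not a general XML parser.

```shell
# Hypothetical helper: drop the 1st, 3rd, ... line inside each <item> block.
# Assumes one tag per line; it does not parse XML in general.
drop_odd_item_lines() {
  awk '
    /<item[ >]/ { inside = 1; n = 0; print; next }   # opening tag: reset the counter
    /<\/item>/  { inside = 0; print; next }          # closing tag: always kept
    inside      { if (++n % 2 == 0) print; next }    # keep only even-numbered children
                { print }                            # everything outside an item
  ' "$@"
}

# Example: filter a small sample read from standard input.
drop_odd_item_lines <<'EOF'
<item itemname="xyz">
<a_1>85</a_1>
<a_2>62</a_2>
<a_3>48</a_3>
<a_4>78</a_4>
</item>
EOF
```

Because the counter is reset at every opening tag, the sketch does not depend on an item having an even number of child lines. It can also be given file names as arguments, for example `drop_odd_item_lines input.xml > output.xml`.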
When solving problems like yours, it is helpful to explain verbosely what pattern of matching/deleting/substitution you are trying to accomplish. Use terms that describe the target text and the relationships to surrounding text. For example "text matching one alpha character followed by an underscore and one or more numeric characters, all enclosed in '<' & '>'". Using such language will force you to unambiguously identify the patterns, and once you have done this, the translation to code will be much easier. It is something like a mental specification of the problem, and working from a specification is always much more productive than making stuff up on the fly.
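As a sketch of how such a verbal specification maps onto a pattern, the example phrase above can be written as a POSIX extended regular expression. The helper name `count_index_tags` is hypothetical and only there for illustration.

```shell
# Hypothetical helper: count lines containing an opening tag of the form
# "<", one alphabetic character, "_", one or more digits, ">".
count_index_tags() {
  grep -cE '<[[:alpha:]]_[[:digit:]]+>' "$@"
}

# Example: only the first sample line carries such a tag.
printf '%s\n' '<a_1>85</a_1>' '<date>' | count_index_tags
```

Each piece of the sentence becomes one piece of the pattern: `[[:alpha:]]` is the single alpha character, `_` the underscore, and `[[:digit:]]+` the one or more numeric characters.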
I thought when you said "delete odd lines" it might mean something like "delete elements in 'item' tags where the tagname is suffixed with an odd number character". However, in your sample output, I see the tags '<a_3>78</a_3>', so my hypothesis about your intention must be incorrect. I cannot see any unambiguous pattern that could translate the input to the supplied output.

I guess it is unlikely that you have written the XML generator, but it is worth mentioning that the format is not well chosen, since the tag names appear to contain information about the content of the tag. Numeric indices applied to the tagname would be better implemented as attributes to the tag. This will reduce the complexity of any attached DTD and simplify the work of any parser. It probably makes the XML generator simpler as well.

--- rod.
 
Old 09-13-2012, 02:15 PM   #6
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

This seems to do the job:
Code:
$ sed -r '/<item /{n; :a; N; /<\/item>/{ s/[^\n]*\n([^\n]*\n)/\1/g; b}; ba}' infile
<results>
<itemResult>
  <date>
    <datex>something</datex>
  </date>
  <item itemname="xyz">
    <a_2>62</a_2>
    <a_4>78</a_4>
  </item>
</itemResult>
<itemResult>
  <date>
    <datex>something_2</datex>
  </date>
  <item itemname="abc">
    <a_7>62</a_7>
    <a_3>78</a_3>
  </item>
  <item itemname="xpto">
    <v_2>62</v_2>
    <d_4>78</d_4>
  </item>
</itemResult>
</results>
EDIT: This is much better:
Code:
$ sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' infile

Last edited by firstfire; 09-13-2012 at 02:52 PM.
 
2 members found this post helpful.
Old 09-13-2012, 04:00 PM   #7
grlopes
LQ Newbie
 
Registered: Sep 2012
Posts: 3

Original Poster
Rep: Reputation: Disabled
thank you firstfire
it works like a charm

I will try to understand this sed syntax
 
Old 09-13-2012, 04:32 PM   #8
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: Disabled
Hello,
Quote:
Originally Posted by grlopes View Post
...
I will try to understand this sed syntax
Here's an interesting link with resources http://sed.sourceforge.net/

Markus
 
Old 09-16-2012, 05:07 AM   #9
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I highly suggest you carefully read again what ntubski and theNbomr posted. As always, use the right tool for the job.

Regex-based tools are designed for use on line-oriented text, but xml is tag-oriented. You can never be completely assured that a sed or awk solution will always parse it accurately.

FYI, xmlstarlet uses standard XPath expressions to match and modify entries. It takes a bit of learning (and I'm still pretty much a novice at it), but it really is quite clean and flexible once you know what you are doing.

http://www.w3.org/TR/xpath
 
Old 09-16-2012, 09:52 AM   #10
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

Quote:
Originally Posted by grlopes View Post
thank you firstfire
it works like a charm

I will try to understand this sed syntax
My last command
Code:
$ sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' infile
is pretty simple. Flags -rn tell sed that we want to use extended regular expressions (-r; you can omit it, it is not necessary) and that we don't want sed to print every line to standard output automatically.

/<item /,/<\/item>/{commands} == "Execute the given commands for each line from the one matching the first regular expression /<item / through the one matching the second regular expression /<\/item>/, inclusive." This is a so-called address range.

/<item /be; /<\/item>/be; == "For lines matching either regular expression, branch (goto) to label :e (see the last two commands)." This effectively prints the opening and closing item tags, because the `p' (print) command follows label `e'.

n; -- read the next line, replacing the current line in the buffer (with -n the current line is not printed first). This command does the real work: it skips the first, third, etc. lines inside the current item. The other lines get printed by the `p' command at the end.
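The same command can also be laid out as a multi-line script with one command per line, which may make the control flow easier to follow. This is only a sketch: the function name `skip_odd_lines` is hypothetical, and -r is left out because no extended regular expression syntax is used.

```shell
# Hypothetical wrapper around the one-liner above, spread over several lines.
skip_odd_lines() {
  sed -n '
    # Only lines from an opening <item ...> through the next </item> enter the block.
    /<item /,/<\/item>/ {
      # The opening and closing tags jump straight to the print at label e.
      /<item /b e
      /<\/item>/b e
      # Any other line is replaced by the one after it, so it is never printed.
      n
    }
    :e
    p
  ' "$@"
}

# Example: filter a small sample read from standard input.
skip_odd_lines <<'EOF'
<item itemname="xyz">
<a_1>85</a_1>
<a_2>62</a_2>
<a_3>48</a_3>
<a_4>78</a_4>
</item>
EOF
```

Since `n` consumes the lines in pairs, this layout (like the one-liner) appears to rely on every item having an even number of child lines. With an odd count, `n` would read the closing tag itself, so the address range would not be closed at that point and later lines could be skipped as well.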

BTW, I completely agree with previous speakers: line-oriented tools are not good for xml/html/.*ml.
 
2 members found this post helpful.
  


All times are GMT -5. The time now is 02:52 AM.
