LinuxQuestions.org

viniciusandre 04-20-2009 10:41 AM

Remove sections of an XML file with sed

I've been trying to remove some lines from an XML file that looks like this:

Code:

<parent>
  <child>name1</child>
  <lots_of_other tags></lots_of_other_tags>
</parent>
<parent>
  <child>name2</child>
  <lots_of_other tags></lots_of_other_tags>
</parent>
<parent>
  <child>name3</child>
  <lots_of_other tags></lots_of_other_tags>
</parent>

How can I remove the '<parent>' to '</parent>' section for 'name2' only?
Thanks in advance.

choogendyk 04-20-2009 11:11 AM

Interesting. I would have said sed can't do that, but I found http://www.grymoire.com/Unix/Sed.html#uh-47, which indicates that sed can deal with multi-line patterns. You'll have to read through that and digest it to figure out how to do it.

Alternatively, you could switch to awk or perl, depending on your own preferences.
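
For the awk route, a minimal sketch might look like the following. It assumes each <parent> and </parent> tag sits on its own line, as in the sample, and that the blocks are not nested. The function name `remove_block` and its two arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch: drop every <parent>...</parent> block whose <child> equals a given name.
# Assumes one tag per line and no nested <parent> blocks (as in the sample).
remove_block() {
    # $1 = child name to look for, $2 = input file
    awk -v name="$1" '
        # opening tag: start buffering a new block
        /<parent>/ { inblock = 1; buf = ""; drop = 0 }
        inblock {
            # collect the current line of the block
            buf = buf $0 "\n"
            # literal (non-regex) match on the child element
            if (index($0, "<child>" name "</child>")) drop = 1
            # closing tag: emit the block unless it was marked for removal
            if ($0 ~ /<\/parent>/) {
                if (!drop) printf "%s", buf
                inblock = 0
            }
            next
        }
        # lines outside any block pass through unchanged
        { print }
    ' "$2"
}
```

Called as `remove_block name2 testfile`, this should print the file without the name2 block and leave any lines outside the <parent> blocks untouched.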

jschiwal 04-20-2009 01:18 PM

Code:

sed -n '/<parent>/,/<\/parent>/{ H
                                /<\/parent>/{ s/.*//;x
                                              /name2/d
                                              p
                                            }
                              }' testfile
<parent>
  <child>name1</child>
  <lots_of_other tags></lots_of_other_tags>
</parent>

<parent>
  <child>name3</child>
  <lots_of_other tags></lots_of_other_tags>
</parent>

This may need more testing, and there are probably better ways of doing it.
The first line selects the inclusive range from an opening <parent> tag to the closing </parent> tag. The `H' command appends each line in that range to the hold buffer.
The second line tests whether the line just read contains the closing tag. If it does, the pattern buffer is cleared and swapped with the hold buffer, which leaves the hold buffer empty for the next block.

At this point, the pattern buffer holds the entire range, with a `\n' character between lines.
The third line tests whether it contains `name2'. If so, it is deleted. If not, it is printed.
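
One thing to watch with the command above: because of `-n', any lines outside a <parent> block (an XML declaration or a root element, say) are not printed, and `H' on an empty hold buffer puts a newline in front of each block. A possible variant that avoids both is sketched below, under the same assumption that each tag sits on its own line and the blocks are not nested. The wrapper function name `drop_name2` is only for illustration.

```shell
#!/bin/sh
# Sketch: delete the <parent>...</parent> block whose <child> is name2,
# printing every other line unchanged. Assumes one tag per line, no nesting.
drop_name2() {
    # $1 = input file
    sed '/<parent>/,/<\/parent>/{
        # opening tag: start a fresh hold buffer and print nothing yet
        /<parent>/{h;d;}
        # every other line of the block is appended to the hold buffer
        H
        # not at the closing tag yet: keep collecting
        /<\/parent>/!d
        # closing tag: bring the whole block into the pattern buffer
        x
        # drop the block if it is the name2 one, otherwise it is auto-printed
        /<child>name2<\/child>/d
    }' "$1"
}
```

Running `drop_name2 testfile` on the sample should give the name1 and name3 blocks back to back, with no blank line between them.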


All times are GMT -5.