LinuxQuestions.org

gnashley 06-23-2008 12:24 PM

Reformat 'pretty' xml to one-line entries
 
I have some XML files (actually AIML) which are mostly formatted in standard XML style, with matching opening and closing tags. The content between the opening and closing tags stretches across multiple lines.
How can I reformat each tag set into one long line? I assume that awk is probably the easiest way to do it, but I'm happy to use sed or perl if they are easier.

The matching tags are <category> and </category>. It would save an extra step later if all tabs and multiple spaces were normalized to single spaces.

For instance, something like this:
Code:

<category>
    <pattern>TEXT </pattern>
    <template>TEXT  </template>
</category>

should come out like this:
Code:

<category><pattern>TEXT </pattern><template>TEXT </template></category>

I have several more operations to do on each tag set, but they can all be done more easily on a line-by-line basis, so I'd like to do the above first.
Anybody know any good one-liners?

brianmcgee 06-23-2008 01:39 PM

Code:

# xmllint --noblanks file.xml > unpretty.xml

Make it pretty again:

Code:

# xmllint --format unpretty.xml > pretty.xml

gnashley 06-23-2008 04:34 PM

I hadn't heard of xmllint. I tried it, but it doesn't do exactly what I want. It strips out all white space. I need to preserve white space, but only as single spaces.
Also, my goal is to transform the whole document into non-xml code.
Let me restate what I want:

Concatenate all text between <category> and </category> into a single line. It doesn't matter if the <category> and </category> tags are stripped off as well. Because the spacing of the original documents is quite irregular, I can't seem to come up with a dependable combination of substitutions using sed. I'm pretty sure that awk is gonna be the best for this. I'm still going to do several more steps on each line, which can be handled with (mostly) simple substitutions.
Here's another example of how messed up the input file can be:
Code:

<category>
    <pattern>TEXT </pattern>
    <template>TEXT  </template>
  </category>              <category>
<pattern>TEXT </pattern>
              <template>TEXT  </template>
</category>

Again, that should come out something like this:

Code:

<category> <pattern>TEXT </pattern> <template>TEXT </template> </category>
<category> <pattern>TEXT </pattern> <template>TEXT </template> </category>

Tabs and multiple spaces should be reduced to single spaces, but I can do that afterwards, too.
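
For illustration, here is a minimal awk sketch of the transformation described above. It is one possible approach, not a tested solution for real AIML files. It assumes that <category> elements are never nested and carry no attributes, and that the literal strings <category> and </category> do not appear inside the text itself. The function name flatten_categories is made up for this example. Anything before the first <category> is dropped from that output line, and anything left over after the last </category> is silently discarded.

```shell
# Hypothetical helper: print each <category>...</category> set on one line.
# Reads the named files, or standard input when no file is given.
flatten_categories() {
    awk '
    {
        # Join the physical lines, putting one space at each line break.
        buf = buf " " $0

        # Emit every complete tag set currently held in the buffer.
        while ((i = index(buf, "</category>")) > 0) {
            rec = substr(buf, 1, i + 10)    # text up to and including </category> (11 chars)
            buf = substr(buf, i + 11)       # keep the remainder for the next tag set

            s = index(rec, "<category>")
            if (s > 0)
                rec = substr(rec, s)        # drop anything before the opening tag

            gsub(/[ \t]+/, " ", rec)        # collapse runs of tabs and spaces
            print rec
        }
    }' "$@"
}
```

It could be called as "flatten_categories input.aiml > flat.txt", where both file names are placeholders. The script buffers input until a closing tag shows up instead of relying on a multi-character RS, since that is a gawk/mawk extension rather than POSIX awk behaviour.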

