I hadn't heard of xmllint. I tried it, but it doesn't do exactly what I want. It strips out all white space. I need to preserve white space, but only as single spaces.
Also, my goal is to transform the whole document into non-XML code.
Let me restate what I want:
Concatenate all text between <category> and </category> into a single line. It doesn't matter if the <category> and </category> tags are stripped off as well. Because the spacing of the original documents is quite irregular, I can't seem to come up with a dependable combination of substitutions using sed. I'm pretty sure that awk is gonna be the best for this. I'm still going to do several more steps on each line, which can be handled with (mostly) simple substitutions.
Here's another example of how messed up the input file can be:
Code:
<category>
<pattern>TEXT </pattern>
<template>TEXT </template>
</category> <category>
<pattern>TEXT </pattern>
<template>TEXT </template>
</category>
Again, that should come out something like this:
Code:
<category> <pattern>TEXT </pattern> <template>TEXT </template> </category>
<category> <pattern>TEXT </pattern> <template>TEXT </template> </category>
Tabs and multiple spaces should be reduced to single spaces, but I can do that afterwards, too.
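
To pin down the behaviour I'm after, here is a rough sketch in Python rather than awk. It is only an illustration of the desired result, not a finished tool: the function name flatten_categories is made up, and it assumes <category> blocks are never nested and that the tags appear exactly as written (no attributes).

```python
import re


def flatten_categories(text):
    """Return one single-spaced line per <category>...</category> block.

    Assumes blocks are not nested and the tags carry no attributes.
    """
    # Non-greedy match so each block stops at its own closing tag;
    # DOTALL lets "." span the line breaks inside a block.
    blocks = re.findall(r"<category>.*?</category>", text, flags=re.DOTALL)
    # split() with no argument breaks on any run of spaces, tabs, or
    # newlines, so joining with " " leaves single spaces only.
    return [" ".join(block.split()) for block in blocks]


if __name__ == "__main__":
    sample = (
        "<category>\n"
        "<pattern>TEXT </pattern>\n"
        "<template>TEXT </template>\n"
        "</category> <category>\n"
        "<pattern>TEXT </pattern>\n"
        "<template>TEXT </template>\n"
        "</category>\n"
    )
    for line in flatten_categories(sample):
        print(line)
```

Anything outside a <category> block is dropped by this sketch, which is fine for my files but may not suit others.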