Need help to strip XML & XSL tags from multiple files
If you're going to start by creating one big file, then you can feed the result to sed to strip out the tags. See man sed: its commands accept addresses that select which lines to process or skip, so you will have to figure out which approach works better for you.
In a shell script you can do all sorts of things like count lines with wc, create temp files, etc. That's what makes programming entertaining.
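As a rough illustration of that suggestion, here is a minimal sketch that merges two files into a temp file, strips one kind of line with sed, and counts lines with wc. The file names (part1.xsl, part2.xsl) and their contents are made up for the demo, and the pattern only handles an XML declaration that sits on a line of its own.

```shell
#!/bin/sh
# Hypothetical sample inputs, created in a scratch directory for the demo.
workdir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0"?>' '<a>one</a>' > "$workdir/part1.xsl"
printf '%s\n' '<?xml version="1.0"?>' '<b>two</b>' > "$workdir/part2.xsl"

# Merge the files into one big temp file, as suggested above.
merged=$(mktemp)
cat "$workdir/part1.xsl" "$workdir/part2.xsl" > "$merged"

# Feed the merged file to sed and delete every XML declaration line.
# In a basic regular expression the '?' is an ordinary character.
cleaned=$(sed '/^<?xml version/d' "$merged")

# Count lines before and after with wc.
lines_before=$(wc -l < "$merged" | tr -d ' ')
lines_after=$(printf '%s\n' "$cleaned" | wc -l | tr -d ' ')
echo "before: $lines_before, after: $lines_after"

# Clean up the temp files.
rm -rf "$workdir" "$merged"
```

The address in front of a command (a line number, a range, or a /pattern/) is what "which lines to process" refers to; here the /pattern/ address limits the d command to the declaration lines.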
Thank you for your help, bigearsbilly. Based on your example I was able to create my own version of xml_cleanup (included below):
Code:
# Remove all occurrences of the following lines from the merged file
s/<?xml version.*>//
s/<xsl:stylesheet.*>//
s/<\/xsl:stylesheet.*>//
# Add the following lines at the beginning of the merged file
1i\
<?xml version='1.0'?>\
<xsl:stylesheet version="1.0" xmlns:xsl="some url">
# Add the following line at the end of the merged file
$a\
</xsl:stylesheet>
# Remove leading blanks from each line (the square brackets contain a tab and a space)
s/^[	 ]*//
# Reduce strings containing multiple blanks to single blanks (each pair of square brackets contains a tab and a space)
s/[	 ][	 ]*/ /g
# Remove DOS line breaks (^M)
s/\r//
# Delete blank lines
/^$/d
Code:
# Remove all occurrences of the following lines from the merged file
# except the 1st 2 or the last, as the case may be
1,2!s,<?xml version.*>,,
1,2!s,<xsl:stylesheet.*>,,
$!s,</xsl:stylesheet.*>,,
# Reduce strings containing multiple blanks to single blanks
s,[[:blank:]][[:blank:]]*, ,g
# Remove any leading blank from each line
s,^ ,,
# Remove DOS line breaks (^M)
s,\r,,
# Delete blank lines
/^$/d
Notes
A bang after an address range "negates" it.
'[[:blank:]]' is the same as "a tab and a space". (I sometimes find it more cumbersome to type, but it is easier to understand & just as long to read -- shorter if you give credit for the deleted, no longer needed explanation.)
Finally, I don't believe "cat" is necessary anywhere here. "sed" operates on all files given to it as arguments, i.e. you might say it "self cats".
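A small sketch of those three notes together, using made-up files a.txt and b.txt and a placeholder HEADER line rather than the real stylesheet tags. By default sed treats all files named on the command line as one continuous stream, so line 1 means the first line of the first file only.

```shell
#!/bin/sh
# Hypothetical inputs: each file starts with the same HEADER line.
workdir=$(mktemp -d)
printf 'HEADER\nfirst   body\n' > "$workdir/a.txt"
printf 'HEADER\nsecond\t body\n' > "$workdir/b.txt"

# No cat needed: sed reads both files as a single stream.
#  1!s,HEADER,,            blank out HEADER everywhere except on line 1
#  [[:blank:]][[:blank:]]* collapse each run of tabs/spaces to one space
#  /^$/d                   drop the lines that are now empty
result=$(sed -e '1!s,HEADER,,' \
             -e 's,[[:blank:]][[:blank:]]*, ,g' \
             -e '/^$/d' \
             "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Note that the blank pattern is written as one blank followed by zero or more blanks. A bare '[[:blank:]]*' also matches the empty string, so with the g flag it would insert a space between every pair of characters.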
Your code is indeed shorter and more efficient. Thank you.
However, I still have a problem. I want to insert a comment after the 2nd line of the file, but when I uncomment the code below I get the following error message from sed: unknown command: `<'. I tested with different strings and found that whatever character appears at position 1 is automatically flagged as an "unknown command".
xslCleanup.sed
Code:
# Remove all occurrences of the following lines from the merged file
# except the 1st 2 or the last, as the case may be
1,2!s,<?xml version.*>,,
1,2!s,<xsl:stylesheet.*>,,
$!s,</xsl:stylesheet.*>,,
# Insert the following line after the 2nd line at the top of the file
##3i\
##<!-- THIS FILE IS GENERATED AUTOMATICALLY. DO NOT EDIT. -->
# Reduce strings containing multiple consecutive spaces (not tabs) to single spaces
s,  *, ,g
# Remove any leading blank from each line
s,^ ,,
# Remove DOS line breaks (^M)
s,\r,,
# Delete blank lines
/^$/d
Also, is it possible to insert the current date and time in a line using 'sed'? For example:
Code:
<!-- THIS FILE WAS GENERATED AUTOMATICALLY ON <date> AT <time>. DO NOT EDIT. -->
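One common approach, sketched here under assumptions, is to let the shell expand date(1) and pass the result into the sed script, since sed itself has no built-in notion of the current date. The variable names, the date formats, and the three-line sample input are all made up for the illustration.

```shell
#!/bin/sh
# Let the shell compute the timestamp before sed runs.
stamp_date=$(date '+%Y-%m-%d')
stamp_time=$(date '+%H:%M:%S')

comment="<!-- THIS FILE WAS GENERATED AUTOMATICALLY ON $stamp_date AT $stamp_time. DO NOT EDIT. -->"

# Hypothetical three-line input standing in for the merged stylesheet.
# Inside double quotes, "\\" reaches sed as a single backslash, so the
# script sed sees is "3i\" followed by a newline and the comment text.
output=$(printf '%s\n' 'line1' 'line2' 'line3' | sed "3i\\
$comment")
printf '%s\n' "$output"
```

Because the sed script is in double quotes, the shell substitutes $comment before sed parses it. This works as long as the inserted text contains no backslashes or newlines, which holds for the date formats chosen here.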
For the benefit of all, here is the final version of my 'sed' command file:
Code:
# Remove all occurrences of the following lines from the merged file
# except the 1st 2 lines
1,2!s,<?xml version.*>,,
1,2!s,<xsl:stylesheet.*>,,
s,</xsl:stylesheet.*>,,
# Insert the following line after the 2nd line at the top of the file
3i\
<!-- THIS FILE IS GENERATED AUTOMATICALLY. DO NOT EDIT. -->
# Add the following line at the end of the merged file
$a\
</xsl:stylesheet>
# Reduce strings containing multiple consecutive spaces (not tabs) to single spaces
s,  *, ,g
# Remove any leading blank from each line
s,^ ,,
# Remove DOS line breaks (^M)
s,\r,,
# Remove comments inserted automatically by Stylus Studio
/<!-- Stylus Studio/,/-->/D
# Delete blank lines
/^$/d