LinuxQuestions.org


dfrechet 10-09-2005 11:59 PM

Need help to strip XML & XSL tags from multiple files
 
Hello,

I want to write a Bash script that automatically merges multiple XSLT files into one, for faster upload to the client side.

Here is the logic I want to use:
1. Merge 2 or more .xsl files together using 'cat'.

2. Strip all occurrences of the following lines (they appear at the top and bottom of every .xsl file):
Code:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0" xmlns:xsl="some url">
</xsl:stylesheet>

3. Add back the following lines at the top of the merged file:
Code:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0" xmlns:xsl="some url">

4. Add back the following line at the end of the merged file:
Code:

</xsl:stylesheet>
I need help to write the find & replace commands (using sed, awk or whatever) needed to strip the unwanted lines.

(note: I had to remove the URL that appeared in the lines above so that this forum would accept my post)

Thank you
Daniel
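
The four steps above can be sketched as a small shell function. This is only one possible arrangement, not a tested production script: the name merge_xsl is hypothetical, and "some url" is the placeholder from the post, since the real namespace URL was removed there.

```shell
# merge_xsl OUT IN1 [IN2 ...]   (hypothetical helper name)
# Writes one stylesheet to OUT that wraps the bodies of all input files.
merge_xsl() {
    out=$1
    shift
    {
        # Step 3: put a single header back at the top.
        printf '%s\n' "<?xml version='1.0'?>"
        printf '%s\n' '<xsl:stylesheet version="1.0" xmlns:xsl="some url">'
        # Steps 1 and 2: concatenate the inputs and drop every wrapper line.
        # In a basic regular expression the ? is a literal character.
        cat "$@" | sed -e '/<?xml version/d' \
                       -e '/<xsl:stylesheet/d' \
                       -e '/<\/xsl:stylesheet/d'
        # Step 4: put a single closing tag back at the end.
        printf '%s\n' '</xsl:stylesheet>'
    } > "$out"
}
```

It would be called as, for example, merge_xsl merged.xsl a.xsl b.xsl. The sketch assumes each wrapper tag sits on a line of its own, as in the post.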

sgla1 10-10-2005 12:53 AM

edit files with sed
 
If you're going to start by creating one big file, you can then feed the result to sed to strip out the tags. See man sed: each command can take an address (a line number, a range of lines, or a pattern) that selects which lines it processes or skips. You will have to figure out which works better for you.

In a shell script you can do all sorts of things like count lines with wc, create temp files, etc. That's what makes programming entertaining.
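
As a rough illustration of those line-selecting arguments (sed calls them addresses), here is a small sketch on made-up three-line input. The sample function and the variable names are hypothetical and exist only for this demonstration.

```shell
# Hypothetical sample input: three short lines.
sample() { printf 'one\ntwo\nthree\n'; }

by_number=$(sample | sed '2d')      # a line number: delete line 2 only
by_range=$(sample | sed '1,2d')     # a range: delete lines 1 through 2
by_regex=$(sample | sed '/^th/d')   # a pattern: delete lines starting with "th"
last_only=$(sample | sed -n '$p')   # $ is the last line; -n turns off default printing
line_count=$(sample | wc -l)        # wc -l counts lines, useful when computing addresses

printf '%s\n' "$by_number"
```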

Please also see http://www.catb.org/~esr/faqs/smart-questions.html

bigearsbilly 10-10-2005 06:15 AM

cat *.xsl | xml_cleanup

xml_cleanup:
Code:

#!/bin/sed -f


# add to beginning

1i\
<?xml version='1.0'?>\
<xsl:stylesheet version="1.0" xmlns:xsl="some url">

# stick at end

$a\
</xsl:stylesheet>

# remove

/<xsl:stylesheet.*>/d
/<\/xsl:stylesheet.*>/d
/<?xml version/d


dfrechet 10-10-2005 01:39 PM

Thank you for your help, bigearsbilly. Based on your example I was able to create my own version of xml_cleanup (included below):

Code:

# Remove all occurrences of the following lines from the merged file
s/<?xml version.*>//
s/<xsl:stylesheet.*>//
s/<\/xsl:stylesheet.*>//

# Add the following lines at the beginning of the merged file
1i\
<?xml version='1.0'?>\
<xsl:stylesheet version="1.0" xmlns:xsl="some url">

# Add the following line at the end of the merged file
$a\
</xsl:stylesheet>

# Remove leading blanks from each line (the square brackets contain a tab and a space)
s/^[        ]*//

# Reduce strings containing multiple blanks to single blanks (each pair of square brackets contain a tab and a space)
s/[        ][        ]*/ /g

# Remove DOS line breaks (^M)
s/\r//

# Delete blank lines
/^$/d

Daniel

archtoad6 10-10-2005 05:29 PM

Some ways to make it shorter/prettier:
Code:

# Remove all occurrences of the following lines from the merged file
# except the 1st 2 or the last, as the case may be
1,2!s,<?xml version.*>,,
1,2!s,<xsl:stylesheet.*>,,
  $!s,</xsl:stylesheet.*>,,

# Reduce strings containing multiple blanks to single blanks
s,[[:blank:]][[:blank:]]*, ,g

# Remove any leading blank from each line
s,^ ,,

# Remove DOS line breaks (^M)
s,\r,,

# Delete blank lines
/^$/d

Notes
A bang after an address range "negates" it.

'[[:blank:]]' is the same as "a tab and a space". (I sometimes find it more cumbersome to type, but it is easier to understand & just as long to read -- shorter if you give credit for the deleted, no longer needed explanation.)

Finally, I don't believe "cat" is necessary anywhere here. "sed" operates on all files given to it as arguments -- i.e. you might say it "self cats".
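
These notes can be tried on throwaway files. The sketch below is only an illustration: the file names and contents are made up, and it relies on sed treating several file arguments as one continuous stream, so line 1 means the first line of the first file.

```shell
tmp=$(mktemp -d)
printf 'header\nbody A\n' > "$tmp/a.txt"
printf 'header\nbody B\n' > "$tmp/b.txt"

# No cat needed: sed reads both files in order as a single stream.
# "1!" negates the address, so the block runs on every line except line 1
# and only the second file's "header" line is deleted.
merged=$(sed '1!{/^header$/d;}' "$tmp/a.txt" "$tmp/b.txt")

# [[:blank:]] matches a space or a tab. Requiring at least one blank keeps
# the expression from matching the empty string between ordinary characters.
squeezed=$(printf 'a \t b\n' | sed 's,[[:blank:]][[:blank:]]*, ,g')

printf '%s\n' "$merged" "$squeezed"
rm -rf "$tmp"
```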

dfrechet 10-10-2005 09:01 PM

Your code is indeed shorter and more efficient. Thank you.

However, I still have a problem. I want to insert a comment after the 2nd line of the file, but when I uncomment the code below I get the following error message from sed: unknown command: `<'. I tested with different strings and found that whatever character appears at position 1 is automatically flagged as an "unknown command".

xslCleanup.sed
Code:

# Remove all occurrences of the following lines from the merged file
# except the 1st 2 or the last, as the case may be
1,2!s,<?xml version.*>,,
1,2!s,<xsl:stylesheet.*>,,
  $!s,</xsl:stylesheet.*>,,

# Insert the following line after the 2nd line at the top of the file
##3i\
##<!-- THIS FILE IS GENERATED AUTOMATICALLY. DO NOT EDIT. -->

# Reduce strings containing multiple consecutive spaces (not tabs) to single spaces
s,  *, ,g

# Remove any leading blank from each line
s,^ ,,

# Remove DOS line breaks (^M)
s,\r,,

# Delete blank lines
/^$/d

Also, is it possible to insert the current date and time in a line using 'sed'? For example:
Code:

<!-- THIS FILE WAS GENERATED AUTOMATICALLY ON <date> AT <time>. DO NOT EDIT. -->
How can this be done?

Daniel

bigearsbilly 10-11-2005 03:03 AM

hats off to archtoad!

Quote:

3i\
<!-- THIS FILE IS GENERATED AUTOMATICALLY. DO NOT EDIT. -->

This works OK for me. You haven't got a space or a DOS ^M
after the \, have you?

Inserting date, hmmm, don't reckon so; not in plain old sed.

One can also delete blank lines like:

Code:

#!/usr/bin/sed -nf

/./p

I.e. -n = no default print, then print any line with at least 1 character.
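
A quick way to see what that does, on made-up input: a line holding only a space still counts as having a character, so it survives. The second command is an alternative, not part of the post, that also drops whitespace-only lines.

```shell
# Hypothetical sample: "a", an empty line, a line with one space, "b".
sample() { printf 'a\n\n \nb\n'; }

# -n with /./p prints only lines that contain at least one character.
nonempty=$(sample | sed -n '/./p')

# Alternative: delete lines that are empty or consist only of blanks.
nonblank=$(sample | sed '/^[[:blank:]]*$/d')

printf '%s\n' "$nonempty"
```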

dfrechet 10-11-2005 07:13 AM

That was it. I removed the ^M character and everything worked.

Thank you.

Daniel
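
For anyone hitting the same error, one way to check a sed command file for DOS line endings and strip them is sketched below. The helper names has_cr and strip_cr are made up for this example.

```shell
# has_cr FILE: succeeds if FILE contains a carriage return (^M).
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: writes FILE to standard output without carriage returns.
strip_cr() {
    tr -d '\r' < "$1"
}
```

A typical use would be: has_cr xslCleanup.sed && strip_cr xslCleanup.sed > xslCleanup.unix.sed, then running sed -f on the cleaned copy. With a ^M after the trailing backslash of "3i\", the backslash is no longer the last character on the line, which would explain why sed tried to read the following text as a command.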

dfrechet 10-11-2005 08:14 AM

For the benefit of all, here is the final version of my 'sed' command file:

Code:

# Remove all occurrences of the following lines from the merged file
# except the 1st 2 lines
1,2!s,<?xml version.*>,,
1,2!s,<xsl:stylesheet.*>,,
s,</xsl:stylesheet.*>,,

# Insert the following line after the 2nd line at the top of the file
3i\
<!-- THIS FILE IS GENERATED AUTOMATICALLY. DO NOT EDIT. -->

# Add the following line at the end of the merged file
$a\
</xsl:stylesheet>

# Reduce strings containing multiple consecutive spaces (not tabs) to single spaces
s,  *, ,g

# Remove any leading blank from each line
s,^ ,,

# Remove DOS line breaks (^M)
s,\r,,

# Remove comments inserted automatically by Stylus Studio
/<!-- Stylus Studio/,/-->/D

# Delete blank lines
/^$/d


archtoad6 10-12-2005 06:52 AM

Thanks for the compliments.

Would:
Code:

4i <\!-- Created on  `date`  -->
solve your date stamping problem?
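
One caveat, with a sketch of a workaround: backticks are expanded by the shell, not by sed, so a line like that inside a command file read with sed -f would be inserted literally. A possible approach is to build the text in the shell and pass it to sed in a double-quoted expression. The function name stamp_after_header and the date format are assumptions made for this example.

```shell
# stamp_after_header: hypothetical filter that reads a stylesheet on standard
# input and inserts a dated comment before line 3, i.e. after the two header lines.
stamp_after_header() {
    stamp=$(date '+%Y-%m-%d AT %H:%M:%S')
    # Double quotes let the shell substitute $stamp before sed sees the text.
    # "\\" becomes the single backslash that the i command expects.
    sed "3i\\
<!-- THIS FILE WAS GENERATED AUTOMATICALLY ON $stamp. DO NOT EDIT. -->"
}

printf 'line 1\nline 2\nline 3\n' | stamp_after_header
```

The backslash before the ! in the post guards against history expansion in an interactive shell; inside a script file it should not be needed. The expression can also be combined with the command file, along the lines of sed -f xslCleanup.sed -e "...".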


All times are GMT -5.