LinuxQuestions.org
Old 03-26-2012, 04:19 AM   #1
corfuitl
Member
 
Registered: Mar 2012
Posts: 38

Rep: Reputation: Disabled
Strip HTML tags from XML file


Hi, I have multiple XML files and I want to strip the HTML tags from them. Does anyone have an idea, or know of a script that does this?

Thank you in advance!

Last edited by corfuitl; 03-26-2012 at 01:13 PM.
 
Old 03-26-2012, 04:21 AM   #2
acid_kewpie
Moderator
 
Registered: Jun 2001
Location: UK
Distribution: Gentoo, RHEL, Fedora, Centos
Posts: 43,417

Rep: Reputation: 1983
This doesn't really sound like a "clean up", more arbitrarily breaking it. Can you show us some sample data to clarify what you mean?
 
Old 03-26-2012, 01:13 PM   #3
corfuitl
Member
 
Registered: Mar 2012
Posts: 38

Original Poster
Rep: Reputation: Disabled
The XML files look like this:

Quote:
<?xml version="1.0" encoding="UTF-8" ?>

<Content>
- <p s:text-align="center" s:margin-left="0pt" s:margin-right="0pt" s:margin-top="0pt" s:margin-bottom="6pt">
<span s:font-style="italic" />
<span s:font-style="italic">Text Text Text Text Text Text </span>
</p>
- <p s:margin-left="0pt" s:margin-right="0pt" s:margin-top="0pt" s:margin-bottom="0pt">
<img src="" alt="" />
<br />
</p>
- <p s:text-align="right" s:margin-left="0pt" s:margin-right="0pt" s:margin-top="0pt" s:margin-bottom="6pt">
<span s:font-weight="bold">JURI(2010)1122_1</span>
<br />
</p>
- <p s:text-align="center" s:margin-left="0pt" s:margin-right="0pt" s:margin-top="0pt" s:margin-bottom="12pt">
<span s:font-weight="bold"> Text Text Text Text Text Text </span>
</p>
- <p s:text-align="center" s:margin-left="0pt" s:margin-right="0pt" s:margin-top="0pt" s:margin-bottom="12pt">
<span s:font-weight="bold"> Text Text Text Text Text Text </span>
</p>
</Content>
 
Old 03-26-2012, 02:44 PM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Well,
Code:
awk -v 'tags=p br img span' '
    BEGIN {
        FS = "[\t\n\v\f\r ]+"
        split("", taglist)
        split(tags, temp)
        for (i in temp) {
            tag = tolower(temp[i])
            taglist[tag] = 1
            taglist["/" tag] = 1
            tag = toupper(temp[i])
            taglist[tag] = 1
            taglist["/" tag] = 1
        }
        RS = "<"
        FS = ">"
    }

    (NF > 0) {
        tag = $1
        sub(/[\t\n\v\f\r ].*$/, "", tag)
        if (tag in taglist)
            printf("%s", $2)
        else
            printf("<%s>%s", $1, $2)
    }
' input-file > output-file
will strip the p, br, img, and span tags from input-file, keeping the text between them, and save the result to output-file. You can add any other element names you like to the tags list on the first line.

The logic of the scriptlet is simple:

The BEGIN rule generates an array of element names to skip (as keys; the value, =1, is completely irrelevant). Then it sets record (line) separator to <, and field (word) separator to >.
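
To see what those two separators do to the input, here is a minimal sketch. The helper name show_records and the sample string are made up for illustration; they are not part of the scriptlet above.

```shell
# show_records is a hypothetical helper, not part of the scriptlet above.
# It uses the same RS and FS and prints each record's first two fields in brackets.
show_records() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NF > 0 { printf("[%s] [%s]\n", $1, $2) }'
}

printf '<p a="1"><span>Text</span></p>' | show_records
# [p a="1"] []
# [span] [Text]
# [/span] []
# [/p] []
```

The first field is everything between the angle brackets, and the second is the text that follows, up to the next <. The empty record before the very first < has NF equal to 0, so it is skipped.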

The (NF > 0) rule is applied to each element and optional associated immediate content. tag will contain the element string, but with everything including and after the first whitespace removed -- thus, only the element name. If it is listed as a key in the taglist array, the element will be skipped, otherwise it is printed exactly as read from the input.
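
The name extraction can be tried on its own. The sketch below uses a hypothetical helper, tag_name, that applies the same sub() call to a single element string; the sample strings are chosen for illustration.

```shell
# tag_name is a hypothetical helper: it runs the same sub() call as the
# scriptlet on one element string and prints what is left.
tag_name() {
    printf '%s\n' "$1" |
    awk '{ tag = $0; sub(/[\t\n\v\f\r ].*$/, "", tag); print tag }'
}

tag_name 'span s:font-style="italic"'   # prints: span
tag_name 'br /'                         # prints: br
tag_name '/p'                           # prints: /p
tag_name 'br/'                          # prints: br/
```

The last case has no whitespace before the slash, so the result is br/ rather than br. That string is not a key in taglist, which suggests a tag written as <br/> would be left in the output by the scriptlet as written.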

Note that this will not correctly handle comments that contain < or > characters, CDATA sections, or DTDs.
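
A quick way to see the comment problem is the sketch below. It uses the same separators and prints the field count together with what the else branch of the scriptlet would emit; the helper name show_unlisted and the sample comment are made up for illustration.

```shell
# show_unlisted is a hypothetical helper. It prints NF and the same
# "<$1>$2" string the scriptlet emits for elements that are not in taglist.
show_unlisted() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NF > 0 { printf("NF=%d: <%s>%s\n", NF, $1, $2) }'
}

printf '<!-- a > b -->' | show_unlisted
# NF=3: <!-- a > b --
```

The > inside the comment splits the record into three fields, but only the first two are printed, so the final > is lost and the comment is never closed in the output.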
 
Old 03-26-2012, 03:44 PM   #5
corfuitl
Member
 
Registered: Mar 2012
Posts: 38

Original Poster
Rep: Reputation: Disabled
Thank you very much! Will it work for all the files in the directory? Is that Python or Perl?
 
Old 03-26-2012, 04:28 PM   #6
acid_kewpie
Moderator
 
Registered: Jun 2001
Location: UK
Distribution: Gentoo, RHEL, Fedora, Centos
Posts: 43,417

Rep: Reputation: 1983
Perl or Python? It's awk.

As the command line shows, it processes one file. If you want it to handle more, you can wrap it in a bash loop.

Code:
for FILE in *.xml
do
   awk [...] "$FILE" > "${FILE}_output"
done
for example
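
A filled-in version of that loop might look like the sketch below. It assumes the awk program from post #4 is kept in a shell variable (STRIP_PROG, a name chosen here) and wraps the loop in a hypothetical function, strip_all, that takes the directory as its argument. The output naming follows the loop above.

```shell
# STRIP_PROG holds the awk program from post #4 (the variable name is an assumption).
STRIP_PROG='
    BEGIN {
        FS = "[\t\n\v\f\r ]+"
        split("", taglist)
        split(tags, temp)
        for (i in temp) {
            tag = tolower(temp[i]); taglist[tag] = 1; taglist["/" tag] = 1
            tag = toupper(temp[i]); taglist[tag] = 1; taglist["/" tag] = 1
        }
        RS = "<"
        FS = ">"
    }
    (NF > 0) {
        tag = $1
        sub(/[\t\n\v\f\r ].*$/, "", tag)
        if (tag in taglist)
            printf("%s", $2)
        else
            printf("<%s>%s", $1, $2)
    }
'

# strip_all is a hypothetical wrapper: $1 is the directory holding the *.xml files.
strip_all() {
    for FILE in "$1"/*.xml
    do
        [ -e "$FILE" ] || continue    # the glob matched nothing
        awk -v 'tags=p br img span' "$STRIP_PROG" "$FILE" > "${FILE}_output"
    done
}
```

Running strip_all . would process every .xml file in the current directory and write each result next to its source with _output appended. Quoting "$FILE" keeps file names with spaces intact.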

Last edited by acid_kewpie; 03-26-2012 at 04:31 PM.
 
Old 03-26-2012, 04:39 PM   #7
corfuitl
Member
 
Registered: Mar 2012
Posts: 38

Original Poster
Rep: Reputation: Disabled
Thank you! I will try it and send you feedback.
 
  


All times are GMT -5.