LinuxQuestions.org
Old 10-03-2009, 03:23 AM   #1
brixtoncalling
Member
 
Registered: Jul 2008
Location: British Columbia
Distribution: Slackware current
Posts: 403

Rep: Reputation: 67
script to remove text from file


Hello,
Can anyone help me with a scripting problem?

I'd really really like to have a script that would remove everything between two xml tags but only if a certain string appears in the text between them.

tag 1: <w:fldChar w:fldCharType="begin"/>
tag 2: <w:fldChar w:fldCharType="end"/>
text somewhere between: XE

I've written scripts before but I'm useless when it comes to regexp and text manipulation... Help will be much much appreciated.

Cheers.
 
Old 10-03-2009, 03:35 AM   #2
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
I'd start with something like this:
Code:
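# Building blocks for the pattern (extended regex, hence sed -r)
not_tag="[^<>]*"               # a run of text containing no tag characters
open_tag="<${not_tag}[^/]>"    # a tag that does not end in "/>"
close_tag="<${not_tag}/>"      # a tag that ends in "/>"

# FINDME! stands for the string to look for between the two tags
to_match="($open_tag)${not_tag}FINDME!${not_tag}($close_tag)"

# Keep the two tags (\1 and \2) and drop everything between them
sed -r "s@$to_match@\1\2@"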
Kevin Barry

PS This, of course, doesn't account for nested sections, which is quite a bit more complicated (might require a more capable language than shell.)
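
For anyone who wants to try the pattern above on sample input, here is a minimal sketch. The function name strip_marked_text and the sample lines are made up for illustration, and -E is used in place of -r (both select extended regexes). Note that the pattern expects the first tag not to end in "/>" and expects no other tags between the two, so it would need adapting for the tags in the original post.

```shell
#!/bin/sh
# strip_marked_text is a hypothetical helper name, not from the thread.
# It reads stdin and, on each line, drops the text between a tag and a
# following self-closing tag when that text contains the marker ($1).
strip_marked_text() {
    marker=$1
    not_tag='[^<>]*'                 # text with no tag characters in it
    open_tag="<${not_tag}[^/]>"      # a tag that does not end in "/>"
    close_tag="<${not_tag}/>"        # a tag that ends in "/>"
    # Keep both tags (\1 and \2), discard what sits between them.
    sed -E "s@(${open_tag})${not_tag}${marker}${not_tag}(${close_tag})@\1\2@"
}

# The first sample line contains the marker XE, the second does not.
printf '%s\n' '<a>some XE text<b/>' '<a>other text<b/>' | strip_marked_text XE
```

The first sample line should come out as `<a><b/>` and the second should pass through unchanged. Only the first match on each line is rewritten, since the substitution has no g flag.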

Last edited by ta0kira; 10-03-2009 at 03:37 AM.
 
Old 10-03-2009, 05:01 AM   #3
lutusp
Member
 
Registered: Sep 2009
Distribution: Fedora
Posts: 835

Rep: Reputation: 101
Quote:
Originally Posted by brixtoncalling View Post
Hello,
Can anyone help me with a scripting problem?

I'd really really like to have a script that would remove everything between two xml tags but only if a certain string appears in the text between them.

tag 1: <w:fldChar w:fldCharType="begin"/>
tag 2: <w:fldChar w:fldCharType="end"/>
text somewhere between: XE

I've written scripts before but I'm useless when it comes to regexp and text manipulation... Help will be much much appreciated.

Cheers.
What language? I ask because this shouldn't be attempted in Bash.

Quote:
Originally Posted by brixtoncalling View Post
I'm useless when it comes to regexp and text manipulation
In that case, maybe it's too soon for you to be trying to do this. Do you really expect to get people in discussion groups to write your programs for you, line by line?
 
Old 10-03-2009, 06:19 AM   #4
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Where are the sed gurus? I'm sure one of them can whip out a huge, obscure sed one-liner to do this...
 
Old 10-03-2009, 06:32 AM   #5
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
The script submitted by ta0kira uses sed in a bash script and is not obscure.
The variable assignments break the obscure portions into small, understandable bits.

There is an O'Reilly book on regular expressions, and another called "sed & awk".
They would make excellent additions to one's programming library.
 
Old 10-03-2009, 07:04 AM   #6
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 15

Rep: Reputation: 233
Quote:
Originally Posted by ta0kira View Post
I'd start with something like this:
Code:
not_tag="[^<>]*"
open_tag="<$not_tag[^/]>"
close_tag="<$not_tag/>"

to_match="($open_tag)${not_tag}FINDME!$not_tag($close_tag)"

sed -r "s@$to_match@\1\2@"
Kevin Barry

PS This, of course, doesn't account for nested sections, which is quite a bit more complicated (might require a more capable language than shell.)
Regarding nested sections, I sense that it could probably be solved with a recursive call to a function in awk.

Last edited by konsolebox; 10-03-2009 at 07:06 AM.
 
Old 10-03-2009, 07:16 AM   #7
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
If this is being driven from a bash script, a program such as xsltproc would be a better choice for manipulating XML files; that is what such tools are there for. Perl and Python XML libraries exist as well. Bash on its own probably isn't the best choice for dealing with complex XML files.

The OP should indicate which scripting language is being used. And I hope this isn't a homework problem.
 
Old 10-03-2009, 07:27 AM   #8
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,557
Blog Entries: 28

Rep: Reputation: 1178
Quote:
Originally Posted by konsolebox View Post
Regarding nested sections, I sense that it could probably be solved with a recursive call to a function in awk.
There are probably a zillion ways to skin this particular cat; I'd do it using awk without a recursive function in a somewhat plodding style:
  1. In the BEGIN block, set array_index and string_found to 0.
  2. On finding any line, if array_index is not 0, increment array_index and copy line to array[array_index]; if array_index is 0 and the line is not an opening tag, print it unchanged.
  3. On finding opening tag, set array_index to 1 and copy line to array[1]; set string_found to 0.
  4. On finding the target string in a line, set string_found to 1.
  5. On finding end tag:
    1. if "string_found" is 1 then print array[1] and array[array_index].
    2. if "string_found" is 0 then print array[1] through array[array_index].
    3. Set array_index to 0.
ta0kira's sed solution is a lot more elegant!
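
The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions that are not in the thread: the opening tag, the end tag and the target string are matched as plain substrings, each tag sits on its own line, and the function name filter_sections is made up.

```shell
#!/bin/sh
# filter_sections OPEN CLOSE TARGET  (hypothetical helper name)
# Reads stdin line by line. A section runs from a line containing OPEN to
# the next line containing CLOSE. If TARGET appears inside the section,
# only the two tag lines are printed; otherwise the section is printed whole.
filter_sections() {
    awk -v open_tag="$1" -v close_tag="$2" -v target="$3" '
    BEGIN { n = 0; found = 0 }                       # step 1
    # Step 3: an opening tag starts a fresh buffer.
    n == 0 && index($0, open_tag) { n = 1; buf[1] = $0; found = 0; next }
    # Lines outside any section pass straight through.
    n == 0 { print; next }
    # Step 2: inside a section, buffer the line.
    { buf[++n] = $0 }
    # Step 4: remember that the target was seen.
    index($0, target) { found = 1 }
    # Step 5: at the end tag, print either the tag lines or the whole buffer.
    index($0, close_tag) {
        if (found) { print buf[1]; print buf[n] }
        else { for (i = 1; i <= n; i++) print buf[i] }
        n = 0
    }
    # A section that never closes is printed unchanged.
    END { for (i = 1; i <= n; i++) print buf[i] }
    '
}
```

It would be called along the lines of `filter_sections 'fldCharType="begin"' 'fldCharType="end"' XE < input.xml`, where input.xml is a placeholder file name. Real word-processor XML is often written with many tags on a single line, in which case a line-oriented approach like this one would not apply as is.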
 
Old 10-03-2009, 07:35 AM   #9
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Quote:
What language? I ask because this shouldn't be attempted in Bash.
Why not?
Quote:
I've written scripts before but I'm useless when it comes to regexp and text manipulation...
Then allow me to suggest learning it:
http://tldp.org ..get the Bash Guide for Beginners
http://www.grymoire.com/Unix/ ..there's a section on Regexes + the best SED tutorial on the planet
Quote:
Where are the sed gurus ? I'm sure one of them can whip out a huge, obscure sed one-liner to do this ...
I'll get right on it....

My first reaction is to do this in two passes:

1. Use something like SED address range to see if the keyword exists between the two tags.

2. Another pass (also with SED) to remove the range of lines.
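
A rough sketch of that two-pass idea, under stated assumptions: the file holds a single begin/end section, the tags sit on their own lines, and the begin and end patterns are simple regexes with no "/" in them. The function name strip_range_if_marked is made up for illustration.

```shell
#!/bin/sh
# strip_range_if_marked FILE BEGIN END MARKER  (hypothetical helper name)
# BEGIN and END are regexes without "/"; MARKER is a grep pattern.
strip_range_if_marked() {
    file=$1 begin=$2 end=$3 marker=$4
    # Pass 1: print only the begin..end range and look for the marker in it.
    if sed -n "/$begin/,/$end/p" "$file" | grep -q -- "$marker"; then
        # Pass 2: inside the range, delete every line that is neither the
        # begin line nor the end line, so the two tag lines survive.
        sed "/$begin/,/$end/{/$begin/!{/$end/!d;};}" "$file"
    else
        # Marker not found: emit the file unchanged.
        cat "$file"
    fi
}
```

With several sections in one file this would treat them all alike: if any section contains the marker, every section is emptied. Handling each section separately needs a buffer per section, as in the awk approach discussed earlier in the thread.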
 
Old 10-03-2009, 07:36 AM   #10
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 15

Rep: Reputation: 233
Quote:
Originally Posted by catkin View Post
There are probably a zillion ways to skin this particular cat; I'd do it using awk without a recursive function in a somewhat plodding style:
  1. In the BEGIN block, set array_index and string_found to 0.
  2. On finding any line, if array_index is not 0, increment array_index and copy line to array[array_index]; if array_index is 0 and the line is not an opening tag, print it unchanged.
  3. On finding opening tag, set array_index to 1 and copy line to array[1]; set string_found to 0.
  4. On finding the target string in a line, set string_found to 1.
  5. On finding end tag:
    1. if "string_found" is 1 then print array[1] and array[array_index].
    2. if "string_found" is 0 then print array[1] through array[array_index].
    3. Set array_index to 0.
ta0kira's sed solution is a lot more elegant!
I just don't know any other way. Using that, will it also work with more than two levels of nesting?
 
Old 10-03-2009, 10:09 AM   #11
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,557
Blog Entries: 28

Rep: Reputation: 1178
Quote:
Originally Posted by konsolebox View Post
I just don't know any other way. Using that, will it also work with more than two levels of nesting?
Oops! No, but it could be modified to use a two-dimensional array, with the second subscript used for the nesting level.
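
One way that modification might look, as a sketch only. Here the nesting level is the first subscript rather than the second, which makes no practical difference. The assumptions are again not from the thread: tags are matched as plain substrings, each tag sits on its own line, the target string only counts for the innermost section it appears in, and the name filter_nested is made up.

```shell
#!/bin/sh
# filter_nested OPEN CLOSE TARGET  (hypothetical helper name)
# Like the single-level version, but keeps one buffer per nesting level.
filter_nested() {
    awk -v open_tag="$1" -v close_tag="$2" -v target="$3" '
    # Append a line to the buffer of the current nesting level.
    function push(line) { buf[depth, ++cnt[depth]] = line }

    # An opening tag starts a new, deeper buffer.
    index($0, open_tag) {
        depth++; cnt[depth] = 0; found[depth] = 0
        push($0); next
    }
    # Lines outside any section pass straight through.
    depth == 0 { print; next }
    # An end tag closes the current level and hands its lines upward.
    index($0, close_tag) {
        push($0)
        lvl = depth; n = cnt[lvl]; depth--
        for (i = 1; i <= n; i++) {
            # If the target was seen, keep only the two tag lines.
            if (found[lvl] && i > 1 && i < n) continue
            if (depth == 0) print buf[lvl, i]
            else push(buf[lvl, i])      # goes into the parent buffer
        }
        next
    }
    # Any other line inside a section is buffered and checked.
    { push($0); if (index($0, target)) found[depth] = 1 }
    # Sections left open at end of input are printed unchanged.
    END {
        for (l = 1; l <= depth; l++)
            for (i = 1; i <= cnt[l]; i++) print buf[l, i]
    }
    '
}
```

Whether an outer section should also be emptied when only an inner one contains the target is a design choice; this sketch leaves the outer section alone.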
 
Old 10-03-2009, 10:31 AM   #12
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
It's pretty well-known that parsing things like XML or HTML with regular expressions is (1) hard, (2) error prone, (3) very liable to edge-case bugs, (4) often grossly inefficient. As a result, parsing via regular expressions is generally the wrong way to do it.

There are hundreds of well-made parsers for a variety of scripting and programming languages.

And yet, every time this comes up, people say the same things:
  1. I know exactly what the data looks like, and there aren't nested or malformed tags
  2. I just need to do this one thing quickly
  3. You can do it with regular expressions (in many cases), so why not?

I tend to think regexes are not the best tool for the job, and there are lots of excellent, easily available tools. Better to use one of them.

Edit: A good link with some discussion: http://stackoverflow.com/questions/701166/

Last edited by Telemachos; 10-03-2009 at 11:41 AM. Reason: Adjust second-to-last paragraph; add link
 
Old 10-03-2009, 12:24 PM   #13
brixtoncalling
Member
 
Registered: Jul 2008
Location: British Columbia
Distribution: Slackware current
Posts: 403

Original Poster
Rep: Reputation: 67
Quote:
Originally Posted by lutusp View Post
Do you really expect to get people in discussion groups to write your programs for you, line by line?
Hey, I'm not going to say no to anyone who wants to write me the script. But I think ta0kira gave me enough information to accomplish this one thing quickly (... I sure hope there are no nested tags). But I'm here to learn and I appreciate Telemachos's warnings, so let me ask, what XML tools should I be using?

Quote:
Originally Posted by jschiwal
And I hope this isn't a homework problem.
Not exactly. MS Word -- which I'm being forced to use -- corrupts .docx files under certain conditions too dull to describe here. This script will give me back my thesis.

Thanks for the help everyone.
 
Old 10-03-2009, 12:47 PM   #14
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
Quote:
Originally Posted by brixtoncalling View Post
But I'm here to learn and I appreciate Telemachos's warnings, so let me ask, what XML tools should I be using?
It's hard to say in the abstract. Do you know a programming language? Perl provides a ton of XML (and HTML) parsing modules at CPAN, for example.
 
Old 10-03-2009, 12:53 PM   #15
brixtoncalling
Member
 
Registered: Jul 2008
Location: British Columbia
Distribution: Slackware current
Posts: 403

Original Poster
Rep: Reputation: 67
Quote:
Originally Posted by Telemachos View Post
It's hard to say in the abstract. Do you know a programming language? Perl provides a ton of XML (and HTML) parsing modules at CPAN, for example.
I don't program anymore, but I knew several fairly well and I'll hack at code occasionally to make a minor change or two. I've no Perl, though, and probably wouldn't go there at this point.
 
  

