Advanced Search and Replace

Tinkster · 07-06-2006, 03:09 AM

Quote:

Originally Posted by patrokov

6. Token searching should be an addition to and complement to regular expression searching. It already works flawlessly in Notetab. Why should FOSS software be more limited? Well the answer resides in the ignorance expressed in the quote below.

Um... no offense, but that's like comparing a toddler's tricycle
to a Kawasaki and saying the Kawa is limited because one has
to learn how to shift gears.

/me shakes the head.

Cheers,
Tink

patrokov · 07-07-2006, 01:47 AM

Quote:

Originally Posted by Tinkster

that's like comparing a toddler's tricycle to a Kawasaki and saying the Kawa is limited because one has to learn how to shift gears.

Just smile at the nice moderator with the faulty analogy trying to argue against the straw man...
But seriously, what you're telling me is, "You already have a blowtorch; why do you want a soldering iron?"

Well, it's simpler to operate, less dangerous, and more appropriate for some tasks. But to go back to my original point, the regular expression search and replace in KDE and OpenOffice cannot do what simple little token searching can: replace text across multiple lines. Does no one see this as a problem other than me? Does it not bother anyone that you have to run a sed script just to do a multi-line search and replace? And even then, you have to use the 'N'ext command and build up a multiline pattern space (jschwial).

Perhaps a concrete example will help. Sometimes we get test banks from textbook publishers that look something like this:

Code:

____	1.	Which kidney function is most affected by the administration of diuretics?
a.
Cleansing of extracellular fluid (ECF)
b.
Excretion of metabolic wastes
c.
Maintenance of extracellular (ECF) volume
d.
Control of acid-base balance


____	2.	With the knowledge of where each class of diuretics works in the kidney, which agent would the nurse expect to produce the greatest volume of diuresis?
a.
Hydrochlorothiazide
b.
Furosemide
c.
Spironolactone
d.
Triamterene

In order to import them into our online testing system, I need to strip out the line numbers, remove the extra line breaks after a., b., c., d., and add an extra line break between the question and the first answer. In notetab, I would use a regular expression to remove the line numbers. Then I would search with tokens for "^pa.^p" and replace it with "^p^pa. " Then search for "^pb.^p" replace "^pb. " Repeat with c and d.

In less than three minutes I would have the entire file formatted for import looking like:

Code:

Which kidney function is most affected by the administration of diuretics?

a. Cleansing of extracellular fluid (ECF)
b. Excretion of metabolic wastes
c. Maintenance of extracellular (ECF) volume
d. Control of acid-base balance


With the knowledge of where each class of diuretics works in the kidney, which agent would the nurse expect to produce the greatest volume of diuresis?

a. Hydrochlorothiazide
b. Furosemide
c. Spironolactone
d. Triamterene

How would you suggest that I replicate this functionality? Because so far, it's thwarted all my efforts, both trial and error and searching for an existing answer.

Tinkster · 07-07-2006, 02:55 AM

Code:

$ sed -e 's/\(^__*\t*[0-9][0-9]*\.\t*\)\(.*\)/\2\n/g' -e '/[a-d]\./ {
N
s/ *\n/ /}' testing
Which kidney function is most affected by the administration of diuretics?

a. Cleansing of extracellular fluid (ECF)
b. Excretion of metabolic wastes
c. Maintenance of extracellular (ECF) volume
d. Control of acid-base balance


With the knowledge of where each class of diuretics works in the kidney, which agent would the nurse expect to produce the greatest volume of diuresis?

a. Hydrochlorothiazide
b. Furosemide
c. Spironolactone
d. Triamterene

Something like that? I admit, it took me 5 minutes to write that up and
test it (2 minutes of that I spent on trying to figure out why my original
" *" didn't match the whitespace in the first expression until I realised
that you were using TABs).
But it sits in my bash-history (or I could put it in a script or bash-function
if I felt so inclined) and the next hundred times I'd be finished before you
managed to start your first editing adventure in goatpad, which, as far as
I'm concerned, is a tremendous gain of efficiency.
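
As a rough sketch of the bash-function idea: this assumes GNU sed (for `\n` in the replacement text), the function name `fix_testbank` is made up for illustration, and `[[:blank:]]` is swapped in for `\t` so that spaces and TABs both match. The sample input is a shortened stand-in for the test bank.

```shell
# Hypothetical helper (the name fix_testbank is not from the thread).
# Assumes GNU sed, which accepts \n in the replacement text.
fix_testbank() {
    sed -e 's/^__*[[:blank:]]*[0-9][0-9]*\.[[:blank:]]*\(.*\)/\1\n/' \
        -e '/^[a-d]\.[[:blank:]]*$/ {
            N
            s/[[:blank:]]*\n/ /
        }' "$@"
}

# Tiny stand-in for the test bank; \t in the printf format produces real TABs.
result=$(printf '____\t1.\tQuestion one?\na.\nFirst\nb. \nSecond\n' | fix_testbank)
printf '%s\n' "$result"
```

Called with file names it reads those files; with no arguments it reads standard input, so it can also sit in a pipeline.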

And

Quote:

Just smile at the nice moderator with the faulty analogy trying to argue against the straw man...
But seriously, what you're telling me is, "You already have a blowtorch; why do you want a soldering iron?"

who are you to talk of bad analogies?! :D

Cheers,
Tink

theNbomr · 07-07-2006, 10:06 AM

Using nedit, which is a text editor, not a word processor, since this is just plain text:

search & replace
search for: ^([abcd]\.)\s*\n(.+$)
replace with: \1 \2

Once again, search & replace
search for: ^_+\s[0-9]+\.\s+(.+$)
replace with: \1\n

Result:

Which kidney function is most affected by the administration of diuretics?

a. Cleansing of extracellular fluid (ECF)
b. Excretion of metabolic wastes
c. Maintenance of extracellular (ECF) volume
d. Control of acid-base balance

With the knowledge of where each class of diuretics works in the kidney, which agent would the nurse expect to produce the greatest volume of diuresis?

a. Hydrochlorothiazide
b. Furosemide
c. Spironolactone
d. Triamterene

Okay, I needed two separate steps, so not as elegant as Tinkster's solution, but using regular expressions reduced the repetition of "replace it with "^p^pa. " Then search for "^pb.^p" replace "^pb. " Repeat with c and d." down to a single step, and could have been generalized to include steps 'e', 'f', ... 'z'.

Once I determined exactly what your objective was (a step that was automatic for you), this took all of 2 minutes, including cutting and pasting between browser and editor.

I will admit that I did try this experiment using the OpenOffice word processor, and the regex implementation there is, uh, primitive. Having now done that, I hope that someone does indeed lobby the developers to build in full regex search & replace support. I would hope that they would do that in lieu of any kind of orphan 'token-search' functionality, which is a non-standard idiom in Unix land.

I have used other visual style editors, the names of which escape me, that could also have done this. Certainly vi could also have been used, and being as ubiquitous as it is, might be the definitive tool, in terms of editors. A command line perl script would probably have been my weapon of choice, but that's just me.

--- rod.

edit: fix [benign] error in regex

patrokov · 07-07-2006, 05:15 PM

Quote:

Originally Posted by Tinkster

And who are you to talk of bad analogies?!

Hey, as long as we're all guilty...

I guess my main problem was expecting Kate's and OpenOffice's implementations of regular expressions to be more robust than they are. Now, I'll have to take half an hour to grok the respective solutions (unless SundialCVS wants to explain to the gentle readers again) and then several hours looking at text editor alternatives. (If only my parents had gotten me a Unix workstation instead of a Vic-20 as a kid.)

webazoid · 07-07-2006, 05:20 PM

On a side note, in MS Word, if you wanted to delete all instances of a word, such as "car" in a list, how do you set it up so that it inserts a backspace, so that the entire line is deleted and not just the word, and no blank line is left behind? i.e.

1. cat
2. car
3. cog
4. cup

how do i get it so the new list becomes:
1. cat
2. cog
3. cup
?

This is what i'd generally get if I leave the replace column blank:
1. cat
2.
3. cog
4. cup

Or, how do I optimize a page to remove empty lines w/o text, such as if someone triple spaced between paragraphs and I want to make it single spaced instead?

===
line1
line2
line3
return (empty space)
return (empty space)
return (empty space)
line4
line5
line6

desired format:
line1
line2
line3
return (empty space)--deleted two empty returns/paragraphs.
line4
line5
line6
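
For plain text outside Word, both tasks can be sketched with sed. This is only an illustration, not a Word answer: the sample lines stand in for the lists above, real empty lines stand in for the "return (empty space)" lines, and sed deletes the matching line but does not renumber the list.

```shell
# Delete every line that contains "car"; the whole line goes, so no blank line is left.
list=$(printf '1. cat\n2. car\n3. cog\n4. cup\n' | sed '/car/d')
printf '%s\n' "$list"

# Collapse each run of empty lines into a single empty line:
# N joins an empty line with the next one, D drops the first of two empties.
squeezed=$(printf 'line3\n\n\n\nline4\n' | sed '/^$/N;/\n$/D')
printf '%s\n' "$squeezed"
```

Note that `/car/` also matches words such as "carpet"; a stricter pattern would be needed if that matters.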

theNbomr · 07-07-2006, 05:56 PM

Quote:

Now, I'll have to take half an hour to grok the respective solutions

Here, I'll help, and explain mine...

Code:

search for: ^([abcd]\.)\s*\n(.+$)

Anchoring at beginning of line                    ^ 
find any lower case a,b,c or d                    [abcd] 
followed by a period                              '.' 
and save all of that as component 1               () 
Continue to find zero or more whitespace char's   \s* 
followed by a newline                             \n 
then save as component 2,                         () 
everything (non-empty) up to the end-of-line      .+$


replace with: \1 \2

Whatever was found as component 1                 \1
space
Whatever was found as component 2                 \2

That does the joining of lines. Next do the left alignment.

Code:

search for: ^_+\s[0-9]+\.\s+(.+$)

Anchoring at beginning of line                    ^
Find 1 or more underscores                        _+
Followed by a single whitespace (hmmm)            \s
Followed by 1 or more digits                      [0-9]+
Followed by one period                            \.
Followed by one or more whitespace char's         \s+
Save as component 1...                            ()
....the rest of the line                          .+$

replace with: \1\n

Component 1, newline

Hope this is tasting like cool-aid

--- rod.
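
The same two editor-style steps can be approximated on the command line. This is a sketch only, assuming GNU sed (`-E` for extended regexes, `-z` to read the whole file as one record so that a pattern can span line breaks). The sample input is a shortened stand-in for the test bank, and `[[:blank:]]` replaces `\s` so the pattern cannot swallow newlines by accident.

```shell
# Shortened stand-in for the test bank; \t in the printf format produces real TABs.
sample=$(printf '____\t1.\tQuestion one?\na.\nFirst\nb. \nSecond\n')

# Step 1 joins "a." style lines with the line that follows them.
# Step 2 strips the "____ 1." prefix and adds a blank line after the question.
result=$(printf '%s\n' "$sample" |
    sed -E -z \
        -e 's/(^|\n)([a-d]\.)[[:blank:]]*\n/\1\2 /g' \
        -e 's/(^|\n)_+[[:blank:]]+[0-9]+\.[[:blank:]]+([^\n]+)/\1\2\n/g')
printf '%s\n' "$result"
```

Because `-z` loads the whole input as one record, this suits files of modest size; the `N`-based approach works line by line instead.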

Tinkster · 07-07-2006, 07:29 PM

Quote:

Originally Posted by patrokov

Now, I'll have to take half an hour to grok the respective solutions (unless SundialCVS wants to explain to the gentle readers again) and then several hours looking at text editor alternatives.

Code:

  's/\(^__*\t*[0-9][0-9]*\.\t*\)\(.*\)/\2\n/g'

\(^__*\t*[0-9][0-9]*\.\t*\)
Group 1: search for lines that begin with at least one underscore,
followed by zero or more tabs, followed by a number of one or
more digits, a literal period and zero or more tabs,

\(.*\)
Group 2: mark the rest of the line up to the newline and put it in memory;

\2\n
replace the whole line with the bit in memory (group 2) and add a newline

Code:

  '/[a-d]\./ {
 N
 s/ *\n/ /}' testing

If a line has a, b, c or d followed by a literal period, append the
next line to the pattern space (N), then replace any trailing spaces
and the newline with a single space (effectively joining [a-d]. with
the line that follows it).
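
To see what `N` actually does here, a small sketch (the sample input is invented for illustration): the `l` command prints the pattern space unambiguously, showing the embedded newline as `\n` and the end of the pattern space as `$`.

```shell
# After N, the pattern space holds two input lines joined by a newline.
joined=$(printf 'a.\nCleansing\n' | sed -n '/^[a-d]\.$/ {
    N
    l
}')
printf '%s\n' "$joined"

# The substitution then turns that embedded newline into a single space.
merged=$(printf 'a.\nCleansing\n' | sed '/^[a-d]\.$/ {
    N
    s/ *\n/ /
}')
printf '%s\n' "$merged"
```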

I think that your greatest problem here is that you think
of interactive solutions to repetitive tasks, a very common
problem of windows-victims. They think they have freedom
if they are allowed to do chores (reasonably quickly). I
think I have freedom if I can solve issues like that with little
thought once off and have the computer do the rest ;}

Quote:

Originally Posted by patrokov

(If only my parents had gotten me a Unix workstation instead of a Vic-20 as a kid.)

Heh - I've only been doing Linux since 97, grew up on programmable
calculators and a C-64 ;P

Cheers,
Tink

theNbomr · 07-07-2006, 07:45 PM

Tink:

I notice your tendency to use '__*' and '[0-9][0-9]*' as a way of expressing 'one or more', where I always use '_+', which amounts to the same thing. Is there some desirable side effect that I don't know about, using your method, or is it just a personal tendency? Nice one-liner, BTW.

--- rod.

Tinkster · 07-07-2006, 08:37 PM

The effect you don't know about is that sed doesn't know about '+' ;}

Cheers,
Tink

spirit receiver · 07-08-2006, 12:41 AM

GNU sed does know about it:

Code:

 `\+'
      As `*', but matches one or more.  It is a GNU extension.

(from info sed)
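
A quick way to check that the different spellings of 'one or more' agree. This is a sketch: the `\+` form assumes GNU sed, `-E` assumes a sed with extended-regex support, and the sample line is made up.

```shell
line='____1. text'
posix=$(printf '%s\n' "$line" | sed 's/^__*/X/')         # BRE: one underscore, then zero or more
interval=$(printf '%s\n' "$line" | sed 's/^_\{1,\}/X/')  # BRE interval expression
gnu=$(printf '%s\n' "$line" | sed 's/^_\+/X/')           # GNU extension to BRE
ere=$(printf '%s\n' "$line" | sed -E 's/^_+/X/')         # extended regex
printf '%s\n' "$posix" "$interval" "$gnu" "$ere"
```

The first two forms are the ones to reach for when the script has to run on a sed other than GNU's.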

Tinkster · 07-08-2006, 12:58 AM

Quote:

Originally Posted by spirit receiver

GNU sed does know about it:

Code:

 `\+'
      As `*', but matches one or more.  It is a GNU extension.

(from info sed)

Which goes to show that one should read man and info pages
after an update ;} ... even if the tool is well familiar.

Cheers,
Tink

patrokov · 07-08-2006, 12:22 PM

Quote:

Originally Posted by Tinkster

I think that your greatest problem here is that you think
of interactive solutions to repetitive tasks, a very common
problem of windows-victims. They think they have freedom
if they are allowed to do chores (reasonably quickly). I
think I have freedom if I can solve issues like that with little
thought once off and have the computer do the rest ;}

In all fairness though, in Word, it's relatively easy to write/record a macro that does the repetitive tasks with a single keystroke. The real problem in my mind is that I'm used to the Microsoft mindset where I am literally searching for exact text including the nonprinting escape codes and then replacing those codes with new ones. In the *nix, you're searching for conceptual patterns of strings and not replacing them, but manipulating them. Unfortunately, for me, my first big foray into the *nix way of doing things used programs that are not fully cooked in their implementations. (For example, if you put \n into the replace box in kate with regex turned on, you get "\n" not a new line.)

But this conversation has definitely been enlightening. Thanks to everyone for the help.

And to rod...I always liked my cool-aid double strength, because if you make it according to the recipe, it tastes watered down. ;-)

Tinkster · 07-08-2006, 04:01 PM

Quote:

Originally Posted by patrokov

In all fairness though, in Word, it's relatively easy to write/record a macro that does the repetitive tasks with a single keystroke. The real problem in my mind is that I'm used to the Microsoft mindset where I am literally searching for exact text including the nonprinting escape codes and then replacing those codes with new ones.

I understand that :}

But in the unix way we could take this a notch further, and
write a milter; so if you received files of the same type from
the same people all the time you could have a program modify
the text, and add the modified version to the mail you receive ;}

NO interaction at all.

Cheers,
Tink

theNbomr · 07-08-2006, 04:31 PM

sed -e 's/milter/filter/g'

yes?

---