Advanced Search and Replace

patrokov · 07-02-2006, 02:05 AM

I've been using Linux at home for about a year now, and have recently come up against a limitation that frustrates me to no end: the lack of token searching and replacing in Linux text editors and Office applications.

What's a token? It's kind of like a regular expression except simpler. If you want to search for or insert a tab character, you simply type "^t". For a paragraph break "^p". For a line break "^l", etc. I have not been able to duplicate this kind of functionality simply in Linux. So far I've tried KWrite/Kate/Kedit, KWord, and OpenOffice Writer. After spending three hours trying to find an easy way to automatically insert a blank paragraph into repetitive text blocks, I found myself reading a tutorial on sed. It was at this point I rebelled. Why should I have to learn to program sed to duplicate what I could do in MS Word or Notetab with just Search "^pa.^t^p" Replace "^p^pa. "? (Replace the single paragraph return with two paragraph returns, replace the tab with a space, and get rid of the last paragraph break.) Moreover, sometimes I need to perform such operations on already formatted text, so sending it to plain text to be read by sed is not always an option.

And don't tell me that regular expressions will solve my problems, because they don't. The main reason is that none of the KDE programs will let you replace with regular expression character codes. Try searching for "a\t" with regular expressions turned on and replacing with "a\t\t". You will get "a\t\t" not "a{tab}{tab}". OpenOffice Writer will let you do it. Unfortunately, Writer has some strange quirks. Paragraph returns are found with "$" but replaced with "\n". Interestingly, to search for a line break you search for "\n" (tell me that's not crufty). I haven't figured out what code to use to replace with line breaks yet. Of course you can't search for "$text$text" or even "$$". That would be too simple.

From my reading, it seems to be a problem with the way that sed and other such programs work. They evaluate a file line by line, so trying to evaluate text that spans lines is impossible. (That's a KDE developer's statement, not mine, but I've definitely confirmed that it's a true statement in all the above programs.)

A secondary, albeit more important, reason that regular expressions are not the answer is that no one should have to become a programmer just to get rid of or add in extra tabs, paragraph breaks, or line breaks. It's just absurd.

So, in conclusion, when are OpenOffice, KDE developers, etc. going to realize that, to be accepted by MS power users (or heck, even the simplest Notetab user), we need token searching?

jschiwal · 07-02-2006, 03:42 AM

I don't see how entering \t and selecting the regular expression option to locate tabs in Kate is that much harder than entering [ctrl]t.

In vim, or the bash shell, you can enter a control character by entering [ctrl]v[ctrl]I or [ctrl]v[TAB KEY].

It is possible to edit across line boundaries in sed. You need to use the 'N'ext command and build up a multiline pattern space. Yes, this isn't very easy. Sed is useful, however, when you need to automate changes in a large number of documents.
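
For what it's worth, here is a rough sketch of that 'N' approach applied to the search and replace from the first post (find newline, "a.", tab, newline; replace with two newlines and "a. "). It assumes GNU sed, which accepts \n in the replacement text, and plain text with Unix line endings. The sample text and the variable names are made up for illustration.

```shell
# Sketch only: assumes GNU sed and Unix (LF) line endings.
tab=$(printf '\t')   # a literal tab, so the script does not rely on \t support

# Hypothetical sample: each "a." sits on its own line, followed by a tab.
sample=$(printf 'Question one\na.\t\nfirst answer\nQuestion two\na.\t\nsecond answer\n')

# :a / N / $!ba keeps appending the next line until the whole input is in
# the pattern space, so the substitution can see the embedded newlines.
result=$(printf '%s\n' "$sample" |
  sed -e ':a' -e 'N' -e '$!ba' \
      -e 's/\na\.'"$tab"'\n/\n\na. /g')

printf '%s\n' "$result"
```

Reading the whole file into the pattern space is fine for ordinary documents, but it is not how you would treat a very large file.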
Suppose that to clear up space on your hard drive, you backed up a large number of documents to DVD using k3b. Having saved the backup.k3b file, you can unzip it and use sed on the maindata.xml to obtain a list of the backed-up files and remove them, all on the same line:
ex: sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -n 1000 -0 rm

I installed Cygwin on my Windows machine at work, so I would have access to tools like sed. Converting DVD directories of backups and producing a master PDF catalog from them took only a few lines of bash script using tools like sed, cut, sort, enscript and ps2pdf.

Some programs will be different, and you may find some a little harder to use. But some things will be easier in Linux than in Windows.

sundialsvcs · 07-02-2006, 10:36 AM

Quote:

Originally Posted by jschiwal

It is possible to edit across line boundaries in sed. You need to use the 'N'ext command and build up a multiline pattern space. Yes, this isn't very easy. Sed is useful, however, when you need to automate changes in a large number of documents.
Suppose that to clear up space on your hard drive, you backed up a large number of documents to DVD using k3b. Having saved the backup.k3b file, you can unzip it and use sed on the main.xml to obtain a list of the backed-up files and remove them, all on the same line:

sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' | tr '\n' '\000' | xargs -n 1000 -0 rm

It might be useful for the Gentle Readers if we break down exactly what that cryptic bit of Unix-speak actually means, and does...

(0) What we actually have here are three commands that are piped together using the symbol "|". What that means is that all three commands will be "running" at the same time. The output of the first command will be "piped into" the second to become the second command's input, and likewise the output of the second command will be piped to the third. Whatever output there might be from the final command will appear on your console.

(Note: I would assume that there ought to be four commands here, the first being cat main.xml, to provide some input to be consumed by sed, but for clarity let's work right now with just what is provided...)

(1) The sed command is a stream editor. It takes an input-stream, uses (for instance) regular expressions to grab pieces out of it, and produces an output stream. Regular expression syntax is covered in places like man 7 regex.

(1a) Regular expressions are basically text patterns, which are "matched" against an input. When a particular chunk of input-bytes is found to match the given pattern, pieces of matching text can be extracted (using constructs like "\1") and used.

(2) The output of sed is then piped into the tr command, which translates characters, one to another. Any occurrences of "newline" (\n) become nulls.

(3) The output of tr is then once again piped into the command xargs, which takes the items it receives from the input-pipe (separated by nulls, because of the -0 option), adds up to 1000 of them at a time (-n 1000) to the end of whatever command is specified (in this case, rm), and executes that command once for each batch. (rm is, of course, "remove.")

(4) So what this sequence is ultimately going to do is to build and execute one or more rm commands that remove every file in the list.
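
If you would like to watch the pipeline work without deleting anything, a dry run along these lines may help. The XML fragment and file names are invented for the example (a real maindata.xml will contain much more), and printf stands in for rm so that the only effect is a listing.

```shell
# Dry run: hypothetical maindata.xml fragment, with printf in place of rm.
sample='<files>
<url>/home/user/My Documents/report one.odt</url>
<url>/home/user/notes.txt</url>
</files>'

listing=$(printf '%s\n' "$sample" |
  sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' |  # keep <url> lines, strip the tags
  tr '\n' '\000' |                                     # newline-separated -> null-separated
  xargs -0 -n 1000 printf 'would remove: %s\n')        # one printf call gets both names

printf '%s\n' "$listing"
```

Note that the file name containing spaces arrives as a single argument, which is the point of the null separators.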

Uh huh, it's "write-only code," at least until you become used to it, and Unix-wizards like jschiwal can usually still dazzle you with what it can do. But that's really one of the unique things about the Unix environment that has always attracted people ... okay, okay, geeks ... to it: you can do an amazing amount of work without "writing a program" at all.

jschiwal · 07-04-2006, 04:34 PM

Thank you for that great explanation and for spotting that I left out "maindata.xml" from the sed command.

I would add that the reason for converting newlines to nulls is to avoid problems if a filename contains whitespace. It will still have a problem with some "evil characters" like '&'. Some characters have special meanings in XML and may be escaped. Another sed command may be in order.

Linux/Unix programs tend to be more transparent than Windows programs. For example, the DVD backup program I use at work, to back up encoded videos, will save the backup selection in some unknown binary format. K3b saves the same information in an XML file. This makes it possible to extract information from this file and act on it. This is one of the philosophical differences between *Nix and Windows. This difference arises in part because Linux is built using C. Also, the history of Unix is that of a text (console) oriented system where everything is a file. Windows applications are built using a C++ framework. K3b uses a number of smaller programs in the background that are console based. Programs like Word or even Open Office rely on a C++ framework and tend to grow very large in size. At least Open Office uses a transparent method for saving its files.
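
As a sketch of what that extra sed step might look like: the example below assumes the file names are escaped with the five standard XML entities, which I have not checked against what k3b actually writes. The sample path is made up.

```shell
# Assumption: names in the file use the standard XML entities.
escaped='<url>/home/user/R&amp;D/plan &lt;draft&gt;.txt</url>'

plain=$(printf '%s\n' "$escaped" |
  sed -e 's/<url>\(.*\)<\/url>/\1/' \
      -e 's/&lt;/</g' -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
      -e 's/&amp;/\&/g')   # last, and \& because a bare & means "the match"

printf '%s\n' "$plain"
```

The &amp; rule goes last so that text such as &amp;lt; ends up as &lt; rather than being unescaped twice.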

I would highly recommend the book "The Art of Unix Programming" by Eric S. Raymond. ISBN: 0-13-142901-9
It explains the philosophy and culture of Unix very well.

theNbomr · 07-04-2006, 06:41 PM

What's a paragraph return? Sounds like some contrivance dreamed up by Microsoft. Maybe the original poster should drink the Kool-Aid and come all the way over to the light side. Just because you learned a concept that applies to Windows and its spawn doesn't mean anything that uses different concepts is broken.

Different programs use different methods. Get over it.

Oh, and once you've achieved some mastery of the usual open source tools, you'll find that, in fact, you won't need to know much about the likes of sed and its kin.

--- rod.

noranthon · 07-04-2006, 10:36 PM

Is 'sed' an acronym, an abbreviation, a nickname or a name and where does sed live? Can it be used with software like OpenOffice?

And for finding tabs, spaces and the like, what are

Quote:

the usual open source tools

?

Has anyone had more than occasional success with OpenOffice's "regular expressions"?

gilead · 07-05-2006, 12:13 AM

sed is an abbreviation for stream editor. It's a utility that lives in /usr/bin on my system.

For finding tabs, spaces, etc. you can use sed, awk and grep (and probably others as well). Depending on how fine-grained your search needs to be, you can use a character class to find them: grep '[[:space:]]' /etc/* would search everything in /etc for files containing white space (I realise it's not a practical example).
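
A slightly more practical sketch, using made-up sample text rather than real files: one search for lines that contain a tab, and one that counts lines ending in white space.

```shell
tab=$(printf '\t')   # a literal tab to hand to grep

# Hypothetical three-line sample; the last line ends with a space.
sample=$(printf 'no tabs here\none\ttab\ntrailing space \n.')
sample=${sample%.}   # the dot only protects the trailing space and newline

tab_lines=$(printf '%s' "$sample" | grep -n "$tab")              # numbered lines with a tab
trailing=$(printf '%s' "$sample" | grep -c '[[:space:]]$')       # count of lines ending in white space

printf '%s\n' "$tab_lines" "$trailing"
```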

I haven't used Open Office's regular expressions, but if you will be using a lot of regular expressions with several tools I'd recommend Mastering Regular Expressions from O'Reilly (http://www.amazon.com/gp/product/059...lance&n=283155)

jschiwal · 07-05-2006, 01:24 AM

If I use global replace in a word processor, I will always leave the verify option set so I can make sure I don't replace something I didn't want to. The shortcut is handy.

The global replace in vim can be handy. You can enter a line range. The command ":100,130s/^/   /" will insert 3 spaces at the beginning of each of lines 100 through 130.
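
The same line-range idea works in sed from the command line. A small sketch, with a four-line sample and the range shortened to lines 2 to 3 so the effect is easy to see:

```shell
# Indent only lines 2 and 3 by three spaces; other lines pass through unchanged.
indented=$(printf 'one\ntwo\nthree\nfour\n' | sed '2,3s/^/   /')
printf '%s\n' "$indented"
```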

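The command ":%s/^M//" will convert a DOS/Windows text file to a Unix text file. (Type the ^M as [ctrl]v[ctrl]m; it is a single carriage return character, not a caret followed by an M.)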
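
Outside the editor, one common way to do the same conversion is with tr. This sketch uses made-up text; note that it deletes every carriage return, not only the ones at the ends of lines.

```shell
# DOS-style sample text: each line ends with carriage return + line feed.
unix_text=$(printf 'first line\r\nsecond line\r\n' | tr -d '\r')
printf '%s\n' "$unix_text"
```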

Sed gets complicated when you may need to replace something that might be split up across two or more lines. The sed command 's/International Business Machines/IBM/' will replace the expression only if it's on the same line.
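
To give an idea of what "complicated" means, here is a sketch that keeps a sliding two-line window in the pattern space. It only copes with the phrase being broken once, at one of its two spaces, and it joins the two lines when it finds such a break. The sample text is invented.

```shell
text='Founded in 1911, International Business
Machines is often shortened.
International Business Machines appears here on one line.'

# $!N : append the next line (unless this is the last one)
# P   : print up to the first newline
# D   : delete what was printed and restart with the rest of the window
fixed=$(printf '%s\n' "$text" |
  sed -e '$!N' \
      -e 's/International\nBusiness Machines/IBM/' \
      -e 's/International Business\nMachines/IBM/' \
      -e 's/International Business Machines/IBM/g' \
      -e 'P' -e 'D')

printf '%s\n' "$fixed"
```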

noranthon · 07-05-2006, 02:15 AM

Thanks for the information.

vim - an encounter with that had me scurrying back to Midnight Commander's internal editor yesterday. (I apologise for mentioning a utility like MC in such company.)

Others may see the likes of vim, sed, awk and grep as challenges. To me, for the few times they are relevant, they are a deterrent. All that learning for such a small return.

I can, however, add something for anyone else looking for information:
http://www.regular-expressions.info/

patrokov · 07-05-2006, 04:12 PM

Most of the discussion on my post is quite intelligent** although it does not address the points I raised in my post.

1. Searching isn't so much the problem as replacing. Of all the programs I tested, only OpenOffice would replace with regular expressions as well as find, and even then its implementation is far from consistent, logical, or intuitive.
2. Token searching is simpler to perform than regular expression. Once you select that magic "regular expression" button, you have to escape every special character or you may get spectacularly wrong results.
3. Sometimes the document is already formatted, and you can't send it to sed or awk or other programs without losing hours of work.
4. Yes, I'm well aware of resources such as www.regular-expressions.info. Why do you think that my original post says that after three hours of searching and half an hour into a sed tutorial I rebelled? Not to mention that regular expressions often have unintended results and troubleshooting them can take a while. That's why resources such as regexlib.com/Default.aspx exist.
5. In this case, it is Windows that "doesn't get in the way" and "just lets me get my work done". Why should I (or anyone for that matter) have to read The Art of Unix Programming just to duplicate a simple search and replace for non-printing characters?
6. Token searching should be an addition and complement to regular expression searching. It already works flawlessly in Notetab. Why should FOSS software be more limited? Well, the answer resides in the ignorance expressed in the quote below.

Quote:

Originally Posted by theNbomr

What's a paragraph return? Sounds like some contrivance dreamed up by Microsoft. Maybe the original poster should drink the Kool-Aid and come all the way over to the light side. Just because you learned a concept that applies to Windows and its spawn doesn't mean anything that uses different concepts is broken.

Different programs use different methods. Get over it.

Oh, and once you've achieved some mastery of the usual open source tools, you'll find that, in fact, you won't need to know much about the likes of sed and its kin.

--- rod.

So as not to rub your nose in it too much, I'll just say that Unix systems traditionally format plain text with line feeds (CHR(10)). Windows and MS-DOS use a carriage return plus a line feed (CHR(13) + CHR(10)). Note that vim and vi have traditionally displayed the "extra" carriage return as "^M", which, as jschiwal was kind enough to point out, you can search for in vi/vim and strip out to convert a DOS text file to a Unix file. Hmmm, looks a lot like the token searching I was telling you about. The controversy over which way is better has raged for more than twenty years. I don't care which way is better; I just want to be able to search for them wherever and however they appear as a non-printing character and replace them with another non-printing character.

Now perhaps you will stop drowning and pull your head out of the Kool-Aid long enough to read some tutorials that show your Unix works the same way on text that Windows does. Here's a head start: http://www.itworld.com/Comp/2378/swol-0799-unix101/

As for "once I've achieved some mastery of the usual open source tools...": would that be vi or emacs? Oops, sorry, no one seems to know which one is worthy of my time. I'll just refer you to noranthon's post:

Quote:

Originally Posted by noranthon

vim - an encounter with that had me scurrying back to Midnight Commander's internal editor yesterday...Others may see the likes of vim, sed, awk and grep as challenges. To me, for the few times they are relevant, they are a deterrent. All that learning for such a small return.

I don't know if he was trying to be literal or figurative, but given the topic of conversation, I thought the "such a small return" comment hilarious.

**I couldn't agree more with jschiwal's comment that Unix programs tend to be a little more transparent. For years I've wondered (okay, I actually stopped caring years ago) why Windows music programs insisted on storing playlists as binary files that are completely unable to be ported to other computers, or even broken by something as simple as renaming the music folder.

jschiwal · 07-05-2006, 06:21 PM

There is another advantage to transparency. The documents or spreadsheets that you save today may be unusable in 10 or 20 years. Imagine if you need to recover old Lotus 1-2-3 spreadsheets. Even if you have the program itself backed up, you may not be able to run it on a Pentium or newer machine. I've read that there are some Excel spreadsheets that a newer version of Excel can't load. Also, Word documents may contain embedded COM objects that rely on connecting to another machine that has the right DLL library registered (via RPC). In 5 years, such a library may no longer exist.

theNbomr · 07-05-2006, 06:54 PM

So, what's the difference between a 'token' and any other set of 1 or more characters that can be unambiguously described using a prescribed notation?

If you want to search or search & replace, you are going to have to describe what to search for and what to replace it with. If you use regular expressions to do this, then that is just as valid as any other scheme. Moreover, once you have learned what is now a de facto standard, you will have learned something that can be used in a whole plethora of tools ranging from full-on programming languages, to scripting languages, to editing tools, to word processing and desktop publishing, and probably more. It is true that without understanding the operation of the tool, one can cause unexpected results. Such is the case with any tool.

Since you seemed to be referring to more of a word processing issue (paragraph breaks, OpenOffice...), I will mention that there is a distinct difference between what a word processor and a text editor do. Because much of the purpose of word processors is related to formatting and other visual aspects, the internal format of a document contains much more than the plain text content. It is probably inappropriate to attempt to use tools such as sed, awk, etc. for modifying such documents. Without knowing the specifics of a given file format, one cannot assume the purpose of any particular bytes within a document. Modifying it without retaining the format expected by the application that uses it may render it unreadable, or have far-reaching and unpredictable effects.

There are many office-oriented applications, OpenOffice among them as you have noted, and each has its relative merits, I suppose. Some are intended to be compatible on some level with Microsoft/Windows based tools. If their behavior is not adequate to your purpose, you do have some options. You can use the source, and make the requisite modifications (either yourself literally, or via a hired gun), or you can contact the author(s) of said program and request that they add or modify a feature in a future revision according to your specs. You can also use the tool of your choice in a Windows environment, possibly without even leaving Linux, by using one of the Windows [non]emulators.

You probably don't need to digest any programmer's bible to gain a useful understanding of concepts such as regular expressions, or any of a handful of other Unix-centric concepts. Most people who use Unix in a purposeful way, and who have invested a small amount of time to learn these concepts, have found it time well spent. How one does this can range from using online help systems that are built into most mature applications or are part of most Linux distributions, to web-based tutorials and references, to buying or borrowing printed literature. This is a little different from being led around by a dancing paperclip, but maybe if someone is motivated enough, we Unix users will be able to enjoy that treat, just like the Windows crowd.

--- rod.

patrokov · 07-05-2006, 08:29 PM

Quote:

Originally Posted by theNbomr

So, what's the difference between a 'token' and any other set of 1 or more characters that can be unambiguously described using a prescribed notation?

Precisely. There is no difference. It's just another tool. One that's simpler and easier to use than regular expressions. The main difference is that token searching is more literal, while regular expressions are more conceptual (more powerful and more complicated). With tokens, if I search for "^pa." then I will find every instance of "a." after a paragraph break (whether it's denoted in the Windows or the Unix way). With regular expressions, "$pa." will bring up vastly different results...and there's nothing wrong with that. I just want the ability to choose which one according to the situation. I have that choice in a closed-source, free "as in beer" text editor called NoteTab. Why should I expect less from the Unix/Linux world, which gives me seven different shell alternatives (at least)?

Rather than have Unix adepts tell me complicated workarounds because that kind of simple tool doesn't exist in the Unix world, it would be better to discuss the relative merits, shortcomings, and appropriate uses of "token" searching/replacing vs. regular expressions. (Another drawback to using regular expressions is that each engine is slightly different.)

As for how useful this exchange has been, when I make my requests to the KDE developers, I have learned that I should couch my request in terms of searching and replacing non-printing characters using tokens or "escape codes" in oldspeak. I should also make my request for regular expression replacing (as opposed to searching) as a separate issue. I've also learned that the next time I need to do this, vim will probably do what I need it to do, but that means learning vim. My time is probably better spent (i.e., it will take less time) trying to get Notetab to work on Wine. Then one day, at leisure, I can try and learn vim.

BTW, Notetab will automatically convert copied text into the appropriate regular expression for you if you paste it into the search box. And if its internal engine isn't strong enough, you can also run Gawk and Perl scripts on it from within the editor. Yes, the dark side does have some things to teach us. And take a look at Notetab's documentation. Definitely can teach the lightside a thing or two about documentation. Perhaps I should start another thread about that...www.fookes.com/ftp/other/notetabpdf.zip

noranthon · 07-05-2006, 11:45 PM

Be careful, patrokov. Anger leads to the dark side. You must be master of your thoughts and emotions.

theNbomr · 07-06-2006, 12:08 AM

Quote:

BTW, Notetab will automatically convert copied text into the appropriate regular expression for you if you paste it into the search box.

That's truly amazing. So if I copied the text '555-1212', it would know that I want to find &/or replace everything that is an arithmetic subtraction of integers, and NOT a simple phone number? Astounding. I wonder how it knows what regular expression is appropriate for me?

Quote:

And if its internal engine isn't strong enough, you can also run Gawk and Perl scripts on it from within the editor. Yes, the dark side does have some things to teach us.

Actually, regex 'strength' is usually pretty consistent. Perl invented a few obscure extensions that are helpful for context-sensitive searching. But to get a really strong engine, it needs to quaff a potion...

Quote:

And take a look at Notetab's documentation. Definitely can teach the lightside a thing or two about documentation. Perhaps I should start another thread about that...www.fookes.com/ftp/other/notetabpdf.zip

You have a point there. Linux documentation often sucks, if it even exists. OTOH, much of it is very good.

--- rod.