Old 09-22-2015, 07:18 PM   #1
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Rep: Reputation: Disabled
search replace multiple strings in a file


I bcp'd a table into a txt file. It has 8 GB of data, almost 48 million rows.
In each row I need to replace one 8-character text delimited by | with a 6-character text.
Once a line has been replaced, I don't need to recheck that line for further replacements.

I found the following command, which is taking almost 4 minutes for each replacement,
so it will take me almost 10 hours to run the sed command in a loop or something.
I have 400 one-to-one mappings to replace.

One thing I was thinking is to remove each line once it has been replaced and copy it to a different file, so that each pass takes less and less time.

But I am not able to figure out how to do that.

Any help is appreciated.

Code:
sed -i 's/JAM/BUTTER/g;s/BREAD/CRACKER/g;s/SCOOP/FORK/g;s/SPREAD/SPLAT/g' test.txt
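One way around running sed once per mapping is to put all 400 substitutions into a single sed script and make one pass over the file. A rough sketch, assuming a hypothetical mappings.txt holding one OLD|NEW pair per line (that file and the names below are illustrative, not from the post):

Code:
# Build one sed script from the mapping pairs, then run the 8 GB file
# through sed a single time instead of once per mapping.
# Assumes the strings contain no '/' or sed regex metacharacters.
awk -F'|' '{ printf "s/%s/%s/g\n", $1, $2 }' mappings.txt > all.sed
sed -f all.sed test.txt > replaced.txt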
 
Old 09-22-2015, 07:44 PM   #2
theAdmiral
Member
 
Registered: Oct 2008
Location: Boise, Idaho
Distribution: Debian GNU/Linux (Jessie) + KDE
Posts: 168

Rep: Reputation: 4
You might be able to do this more easily in LibreOffice Base. Databases in applications like Base have tables, and you can perform operations like the one you are describing on those tables. You might start with your original table, import it into a new database, and then work with it in the program interface. I have never used Base, but I have used Microsoft Access, and the two programs are similar; not the same, but similar. When I was using Access for work, once I got used to the program I was able to do all kinds of things with it.

What format was your original table in?
 
Old 09-22-2015, 08:23 PM   #3
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Original Poster
Rep: Reputation: Disabled
My database is in Sybase. Even with all indexes it is still taking 15 minutes per update, which is far more than sed (3 minutes),
so using the database does not give good performance for this, contrary to what one would assume.

bcp out takes 15 minutes and bcp in will take another 1.5 hours; I will still save a lot of time with file processing.
If we can figure out how to eliminate the already-processed rows from the file, that would give very good timing.
 
Old 09-22-2015, 08:39 PM   #4
theAdmiral
Member
 
Registered: Oct 2008
Location: Boise, Idaho
Distribution: Debian GNU/Linux (Jessie) + KDE
Posts: 168

Rep: Reputation: 4
If you are in Linux and you run

Code:
$ info sed
then toward the bottom of that page (in mine, the second line from the bottom) there is an entry for

* uniq -u:: Remove all duplicated lines

Would something like that help?
 
Old 09-22-2015, 08:44 PM   #5
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Original Poster
Rep: Reputation: Disabled
It's not duplicates per se, as I am just replacing 8 characters. If, after replacing, I could move that line to a different file, that would help.
But I am not able to figure out how to achieve that.

Last edited by fundoo.code; 09-22-2015 at 08:46 PM.
 
Old 09-22-2015, 08:49 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Try the neat hack by 'Sherlock' here: http://codeverge.com/sybase.ase.unix...le-scan/958771
Basically it pulls a single scan and splits it into multiple files in one go.

You can then run the sed commands in parallel.
Of course, if you can arrange the data into known order/groups, you would not have to match every sed expression against every file.

Alternatively, write a program in e.g. Perl that pulls out a subset and does the substitutions at the same time (Perl regexes are fast). Run multiple copies to parallelize the work.
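A rough shell sketch of that split-and-parallelize idea; the chunk size, file names, and the all.sed script (one s/OLD/NEW/g line per mapping) are illustrative assumptions, not from the linked post:

Code:
# Split the ~48M-row file into chunks, run one sed per chunk in the
# background, wait for all of them, then stitch the output back together
# (the glob keeps the chunk_aa, chunk_ab, ... order).
split -l 6000000 test.txt chunk_
for f in chunk_*; do
    sed -f all.sed "$f" > "$f.out" &
done
wait
cat chunk_*.out > replaced.txt
rm -f chunk_*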
 
Old 09-22-2015, 11:00 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4120
I understood the OP to be saying the substitutions are (in a sense) short-circuiting: once one matches, go to the next line. How about
Code:
awk '{ gsub(/JAM/, "BUTTER") || gsub(/BREAD/, "CRACKER") || gsub(/SCOOP/, "FORK") || gsub(/SPREAD/, "SPLAT") } 1' test.txt
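To scale that first-match-wins idea to all 400 mappings without hard-coding them, something along these lines might work. Sketch only, untested at this data size; the mappings.txt format (one OLD|NEW pair per line) is an assumption:

Code:
# replace.awk - load the mapping pairs from the first file, then apply
# at most one replacement per data line, using plain string search
# (index/substr) instead of regexes.
# Run as: awk -F'|' -f replace.awk mappings.txt test.txt > replaced.txt
FNR == NR { old[++n] = $1; rep[n] = $2; next }   # first file: mapping pairs
{
    for (i = 1; i <= n; i++) {
        p = index($0, old[i])
        if (p) {                                 # first hit wins, then stop
            $0 = substr($0, 1, p - 1) rep[i] substr($0, p + length(old[i]))
            break
        }
    }
    print
}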
 
Old 09-24-2015, 08:46 PM   #8
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Original Poster
Rep: Reputation: Disabled
That awk did not do what I was expecting; I had to kill it after 1 hour for 4 entries.
I am thinking of trying to split and merge the files and seeing how that performs.

If you have something similar in shell script, sed, or awk, please share; for some funny reason our production servers do not have Perl installed.
 
Old 09-24-2015, 09:08 PM   #9
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,627

Rep: Reputation: 2695
Try something new....

Might I suggest that you look into gsar? It is not installed by default, and MAY not be in your base repository, but it is worth looking up. It is insanely fast and has never failed me!
 
Old 09-24-2015, 09:12 PM   #10
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
In that case, yes, do the split and process the seds in parallel.

Re Perl: if this is a one-off, then there are 2 options:

1. if you have workstations that can access the DB remotely, you could do it that way
2. you could download the file, then split it and process it in parallel in Perl (if you find Perl easier than sed - I would).

If this is going to be a regular requirement, consider writing a program in the locally approved language, e.g. C, that can run on the DB server.
 
Old 09-24-2015, 09:31 PM   #11
Rinndalir
Member
 
Registered: Sep 2015
Posts: 733

Rep: Reputation: Disabled
open the orig file
open a new file
search and replace the text in each line of orig
write each line to new file
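In shell terms that is essentially one pass with the output going to a new file instead of editing in place, e.g. reusing the all.sed script idea from earlier in the thread (orig.txt and new.txt are placeholder names):

Code:
# One pass over the original file, every substitution applied; the result
# is written to a new file rather than modifying orig.txt with sed -i.
sed -f all.sed orig.txt > new.txt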
 
  

