Old 09-22-2015, 07:18 PM   #1
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Rep: Reputation: Disabled
search replace multiple strings in a file


I bcp'd a table into a txt file. It has 8 GB of data, almost 48 million rows.
In each row I need to replace one 8-character text delimited by | with a 6-character text.
Once a line has been replaced, I don't need to recheck that line for further replacements.

I found the following command, which is taking almost 4 minutes for each replacement,
so it will take me almost 10 hours to run the sed command in a loop or something.
I have 400 one-to-one mappings to replace.

One thing I was thinking is to remove each line once it has been replaced and copy it to a different file, so that each pass takes less and less time.

But I am not able to figure out how to do that.

Any help is appreciated.

Code:
sed -i 's/JAM/BUTTER/g;s/BREAD/CRACKER/g;s/SCOOP/FORK/g;s/SPREAD/SPLAT/g' test.txt
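One way around running sed once per mapping is to put all 400 substitutions into a single sed script and make one pass over the file. A rough sketch, assuming a hypothetical mappings.txt holding one OLD|NEW pair per line (that file and the names below are illustrative, not from the post):

Code:
# Build one sed script from the mapping pairs, then run the 8 GB file
# through sed a single time instead of once per mapping.
# Assumes the strings contain no '/' or sed regex metacharacters.
awk -F'|' '{ printf "s/%s/%s/g\n", $1, $2 }' mappings.txt > all.sed
sed -f all.sed test.txt > replaced.txt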
 
Old 09-22-2015, 07:44 PM   #2
theAdmiral
Member
 
Registered: Oct 2008
Location: Boise, Idaho
Distribution: Debian GNU/Linux (Jessie) + KDE
Posts: 168

Rep: Reputation: 4
You might be able to do this more easily in LibreOffice Base. Databases in applications like Base have tables, and you can perform operations like the one you are describing on those tables. You might start with your original table, import it into a new database, and then work with it in the program interface. I have never used Base, but I have used Microsoft Access, and the two programs are similar; not the same, but similar. When I was using Access for work, once I got used to the program I was able to do all kinds of things with it.

What format was your original table in?
 
Old 09-22-2015, 08:23 PM   #3
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Original Poster
Rep: Reputation: Disabled
My database is in Sybase. Even with all indexes it is still taking 15 minutes per update, which is far more than sed (3 minutes),
so using the database does not give good performance for this, contrary to what one would assume.

bcp out takes 15 minutes and bcp in will take another 1.5 hours; I will still save a lot of time with file processing.
If we can figure out how to eliminate the already-processed rows from the file, that would give very good timing.
 
Old 09-22-2015, 08:39 PM   #4
theAdmiral
Member
 
Registered: Oct 2008
Location: Boise, Idaho
Distribution: Debian GNU/Linux (Jessie) + KDE
Posts: 168

Rep: Reputation: 4
If you are in Linux and you run

Code:
$ info sed
then toward the bottom of that page (in mine, the second line from the bottom) there is an entry for

* uniq -u:: Remove all duplicated lines

Would something like that help?
 
Old 09-22-2015, 08:44 PM   #5
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Original Poster
Rep: Reputation: Disabled
It's not duplicates per se, as I am just replacing 8 characters. If, after replacing, I could move that line to a different file, that would help.
But I am not able to figure out how to achieve that.

Last edited by fundoo.code; 09-22-2015 at 08:46 PM.
 
Old 09-22-2015, 08:49 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Try the neat hack by 'Sherlock' here: http://codeverge.com/sybase.ase.unix...le-scan/958771
Basically it pulls a single scan and splits it into multiple files in one go.

You can then run the sed commands in parallel.
Of course, if you can arrange the data into known order/groups, you would not have to match every sed expression against every file.

Alternatively, write a program in e.g. Perl that pulls out a subset and does the substitutions at the same time (Perl regexes are fast). Run multiple copies to parallelize the work.
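A rough shell sketch of that split-and-parallelize idea; the chunk size, file names, and the all.sed script (one s/OLD/NEW/g line per mapping) are illustrative assumptions, not from the linked post:

Code:
# Split the ~48M-row file into chunks, run one sed per chunk in the
# background, wait for all of them, then stitch the output back together
# (the glob keeps the chunk_aa, chunk_ab, ... order).
split -l 6000000 test.txt chunk_
for f in chunk_*; do
    sed -f all.sed "$f" > "$f.out" &
done
wait
cat chunk_*.out > replaced.txt
rm -f chunk_*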
 
Old 09-22-2015, 11:00 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4120
I understood the OP to be saying the substitutions are (in a sense) short-circuiting: once one matches, go to the next line. How about
Code:
awk '{ gsub(/JAM/, "BUTTER") || gsub(/BREAD/, "CRACKER") || gsub(/SCOOP/, "FORK") || gsub(/SPREAD/, "SPLAT") } 1' test.txt
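To scale that first-match-wins idea to all 400 mappings without hard-coding them, something along these lines might work. Sketch only, untested at this data size; the mappings.txt format (one OLD|NEW pair per line) is an assumption:

Code:
# replace.awk - load the mapping pairs from the first file, then apply
# at most one replacement per data line, using plain string search
# (index/substr) instead of regexes.
# Run as: awk -F'|' -f replace.awk mappings.txt test.txt > replaced.txt
FNR == NR { old[++n] = $1; rep[n] = $2; next }   # first file: mapping pairs
{
    for (i = 1; i <= n; i++) {
        p = index($0, old[i])
        if (p) {                                 # first hit wins, then stop
            $0 = substr($0, 1, p - 1) rep[i] substr($0, p + length(old[i]))
            break
        }
    }
    print
}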
 
Old 09-24-2015, 08:46 PM   #8
fundoo.code
LQ Newbie
 
Registered: Oct 2011
Posts: 5

Original Poster
Rep: Reputation: Disabled
That awk did not do what I was expecting; I had to kill it after 1 hour for 4 entries.
I am thinking of trying to split and merge the files and seeing how that performs.

If you have something similar in shell script, sed, or awk, please share; for some funny reason our production servers do not have Perl installed.
 
Old 09-24-2015, 09:08 PM   #9
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,627

Rep: Reputation: 2695
Try something new....

Might I suggest that you look into gsar? It is not installed by default, and MAY not be in your base repository, but it is worth looking up. It is insanely fast and has never failed me!
 
Old 09-24-2015, 09:12 PM   #10
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
In that case, yes, do the split and process the seds in parallel.

Re Perl: if this is a one-off, then there are 2 options:

1. if you have workstations that can access the DB remotely, you could do it that way
2. you could download the file, then split it and process it in parallel in Perl (if you find Perl easier than sed - I would).

If this is going to be a regular requirement, consider writing a program in the locally approved language, e.g. C, that can run on the DB server.
 
Old 09-24-2015, 09:31 PM   #11
Rinndalir
Member
 
Registered: Sep 2015
Posts: 733

Rep: Reputation: Disabled
open the orig file
open a new file
search and replace the text in each line of orig
write each line to new file
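In shell terms that is essentially one pass with the output going to a new file instead of editing in place, e.g. reusing the all.sed script idea from earlier in the thread (orig.txt and new.txt are placeholder names):

Code:
# One pass over the original file, every substitution applied; the result
# is written to a new file rather than modifying orig.txt with sed -i.
sed -f all.sed orig.txt > new.txt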
 
  

