Linux - Newbie: This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-to's, this is the place!
04-27-2017, 03:39 AM | #1
LQ Newbie | Registered: Apr 2017 | Posts: 4
Execution of the bash script is too slow
I have used the following sed commands to replace characters
Code:
sed 's/"/\"/g' $CSVDIR/$tableName.csv > $CSVDIR/$tableName.tmp && mv $CSVDIR/$tableName.tmp $CSVDIR/$tableName.csv
sed 's/\~\^/"/g' $CSVDIR/$tableName.csv > $CSVDIR/$tableName.tmp && mv $CSVDIR/$tableName.tmp $CSVDIR/$tableName.csv
sed 's/,""/,"\\N"/g' $CSVDIR/$tableName.csv > $CSVDIR/$tableName.tmp && mv $CSVDIR/$tableName.tmp $CSVDIR/$tableName.csv
sed 's/"[ \t]*/"/g' $CSVDIR/$tableName.csv > $CSVDIR/$tableName.tmp && mv $CSVDIR/$tableName.tmp $CSVDIR/$tableName.csv
sed 's/,"N"/,"\\N"/g' $CSVDIR/$tableName.csv > $CSVDIR/$tableName.tmp && mv $CSVDIR/$tableName.tmp $CSVDIR/$tableName.csv
in my script, but it takes a long time when there is a bulk amount of data in the CSV file.
How can I optimize the execution speed of this script?
04-27-2017, 04:03 AM | #2
LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,718
sed can do several actions in one pass, in several ways. Here are two of them:
Code:
sed -e '...; ...; ...; ...;' file.txt > newfile.txt
sed -e '...;' -e '...;' -e '...;' -e '...;' file.txt > newfile.txt
So, try combining your expressions so that the rename only has to happen once.
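For instance, the single-pass form applied to two of the substitutions from the question might look like this (the file name and sample contents here are invented purely for illustration):

```shell
# Hypothetical sample data; two transformations done in ONE sed pass.
printf '%s\n' 'a~^b,""c' > sample.csv

# Expressions separated by semicolons in one invocation:
#   1) turn ~^ into a double quote
#   2) turn ,"" into ,"\N"
sed 's/\~\^/"/g; s/,""/,"\\N"/g' sample.csv > sample.tmp && mv sample.tmp sample.csv

cat sample.csv   # a"b,"\N"c
```

The file is read and rewritten once instead of once per substitution, which is where the savings come from.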
04-27-2017, 04:27 AM | #3
LQ Newbie | Registered: Apr 2017 | Posts: 4 | Original Poster
Quote:
Originally Posted by Turbocapitalist
sed can do several actions in one pass, in several ways. Here are two of them:
Code:
sed -e '...; ...; ...; ...;' file.txt > newfile.txt
sed -e '...;' -e '...;' -e '...;' -e '...;' file.txt > newfile.txt
So, try combining your expressions so that the rename only has to happen once.
I have combined these into a single line:
Code:
sed 's/"/\\"/g;s/\~\^/"/g;s/,""/,"\\N"/g;s/"[ \t]*/"/g;s/,"N"/,"\\N"/g' $CSVDIR/$tableName.csv > $CSVDIR/$tableName.tmp && mv $CSVDIR/$tableName.tmp $CSVDIR/$tableName.csv
but it still takes time because the CSV holds a bulk amount of data.
Is there any other way to do this operation faster?
04-27-2017, 06:27 AM | #4
LQ Guru | Registered: Sep 2009 | Location: Perth | Distribution: Arch | Posts: 10,036
I am not sure that combining is going to help much if the data is so large that it is slowing things down, but another alternative could be to put the commands in a file and call that.
You can also do away with the mv by using sed's -i switch (in testing I would also pass a backup suffix):
Code:
$ cat changes
s/"/\\"/g
s/~\^/"/g
s/(,""|,"N")/,"\\N"/g
s/"[ \t]*/"/g
$ sed -r -i.bak -f changes "$CSVDIR/$tableName.csv"
Two things to note:
1. I changed some of your commands, as the previous ones were not doing what you expected (namely, you need to escape \ with \ to get a single \ in the output, and commands in a -f script file are not shell-quoted).
2. On completion of the above you will see no output, but you will now have two files: "$CSVDIR/$tableName.csv" will be the one with all the changes in it, and "$CSVDIR/$tableName.csv.bak" will be a backup of the original file (just in case something went wrong)
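A self-contained sketch of the script-file approach, with invented file names and sample data (GNU sed is assumed for -i, which matters later in this thread):

```shell
# Commands kept in a separate file; note they are NOT shell-quoted there.
cat > changes.sed <<'EOF'
s/~\^/"/g
s/,""/,"\\N"/g
EOF

printf '%s\n' 'x~^y,""z' > data.csv

# -i.bak edits in place and keeps data.csv.bak as a backup (GNU sed).
sed -i.bak -f changes.sed data.csv

cat data.csv       # x"y,"\N"z
cat data.csv.bak   # x~^y,""z
```

This gives one read, one write, and no explicit mv.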
04-27-2017, 06:59 AM | #5
Senior Member | Registered: Mar 2004 | Location: UK | Distribution: CentOS 6/7 | Posts: 1,375
There seems to be something fundamentally wrong here. Why are you using sed to write to a separate .tmp file and then overwriting the original file, as opposed to doing in-place updates?
Code:
man sed
...
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
...
Not knowing your system, it is hard to make recommendations, but perhaps you could drop new records into a .new file. When you need to update, rotate the .new file out, process it with sed, and use >> to append to the end of the existing .csv.
Realistically, the best option would probably be to use an actual database system like MySQL or PostgreSQL and capture/change the characters using a stored procedure or trigger, but that may be a far bigger change than wanted.
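A tiny sketch of the rotate-and-append idea (file names and contents invented for illustration): only the small batch of new records is cleaned, and the big CSV is never re-read.

```shell
printf '%s\n' 'old,"\N",row' > table.csv    # already-clean existing data
printf '%s\n' 'new~^row,""' > table.new     # raw incoming records

# Clean only the small .new file, then append it to the big CSV:
sed 's/\~\^/"/g; s/,""/,"\\N"/g' table.new >> table.csv
rm table.new

wc -l table.csv   # 2 table.csv
```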
04-27-2017, 07:29 AM | #6
LQ Newbie | Registered: Apr 2017 | Posts: 4 | Original Poster
Quote:
Originally Posted by r3sistance
There seems to be something fundamentally wrong here. Why are you using sed to write to a separate .tmp file and then overwriting the original file, as opposed to doing in-place updates?
Code:
man sed
...
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
...
Not knowing your system, it is hard to make recommendations, but perhaps you could drop new records into a .new file. When you need to update, rotate the .new file out, process it with sed, and use >> to append to the end of the existing .csv.
Realistically, the best option would probably be to use an actual database system like MySQL or PostgreSQL and capture/change the characters using a stored procedure or trigger, but that may be a far bigger change than wanted.
This script runs on a Solaris system, and on Solaris sed's -i is not available; that's why I have to store the output in a temp file.
Is there any alternate solution to make it faster?
Last edited by sunil21oct; 04-27-2017 at 07:30 AM.
04-27-2017, 07:30 AM | #7
LQ Newbie | Registered: Apr 2017 | Posts: 4 | Original Poster
Is there any alternate solution to make it faster?
04-27-2017, 07:41 AM | #8
LQ Guru | Registered: Feb 2004 | Location: SE Tennessee, USA | Distribution: Gentoo, LFS | Posts: 11,149
Quote:
Originally Posted by sunil21oct
Is there any alternate solution to make it faster?
Sure! Use a real programming language!
I'm writing this off the top of my head, but in PHP, for instance, it would look something like this sketch:
Code:
#!/usr/bin/php
# The preceding ("#!shebang") line tells Bash that this "shell script"
# is written in PHP.
# It should be the first line in the file.
# Slurp the entire file into a string ...
$str = file_get_contents($filename);
$str = preg_replace('/\~\^/', "\"", $str);
$str = preg_replace('/,"N"/', ",\"\\\\N\"", $str);  # yields ,"\N"
# etc.
# Write it all out from the string.
file_put_contents($filename, $str);
rename( ... );
One "gotcha" to be aware of in some languages is that you might need to use double quotes to enclose the string if you want "interpolation" (of things like \n) to take place. Therefore you must "escape" any double-quote literals by preceding them with a backslash. (The LQ forum software apparently won't let me show you an example.)
Nevertheless: you are doing all of the string-twiddling in memory, then writing out the file-content once. Right now, you are laboriously reading the entire file, over and over and over again, just to do one thing to it.
If the file is too large to read into memory all at once (fairly unlikely, these days ...) you can also process the file "line by line" (into a different file), but once again applying all of the string-manipulations to each line all-at-once.
- - - - -
Generally speaking: While "bash scripting" is a sort-of-okay thing to do now and then, IMHO it usually isn't the right way to do "real" work. That scripting tool was designed for "knock-off work," at best. And, through the #!shebang feature, Bash allows you to write your scripts in any language of your choice. In Linux, you have "an embarrassment of riches™" of languages to choose from.
Last edited by sundialsvcs; 04-27-2017 at 07:49 AM.
04-27-2017, 08:28 AM | #9
LQ Guru | Registered: Sep 2009 | Location: Perth | Distribution: Arch | Posts: 10,036
Have to agree with sundialsvcs. Perl would have been my thought as it too is really good at processing large amounts of data.
04-27-2017, 08:37 AM | #10
LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,718
Perl can be run as a formal script or as a one-liner. The -i option can do in-place editing. The -p option wraps a loop around the code you put in with the -e option. See the manual page(s) for details.
Code:
man perlrun
man perlre
man perlfunc
So it could look like this:
Code:
perl -p -e 's/a/b/g; s/c/d/g; ...' file.txt > newfile.txt
perl -p -i.orig -e 's/a/b/g; s/c/d/g; ...' file.txt
However, perl also has proper modules for processing CSV and similar flat-files, such as Text::CSV_XS.
04-27-2017, 08:37 AM | #11
Member | Registered: Jan 2017 | Location: Manhattan, NYC NY | Distribution: Mac OS X, iOS, Solaris | Posts: 508
Quote:
Originally Posted by sundialsvcs
Sure! Use a real programming language!
I'm writing this off the top of my head, but in PHP for instance, something like this sketch:
Yep, but I'd suggest Perl over PHP. It was MADE for stuff like this (Perl = "Practical Extraction and Report Language").
04-27-2017, 08:46 AM | #12
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,363
I fail to see how simple substitution like this would be any faster in perl.
04-27-2017, 08:50 AM | #13
Senior Member | Registered: Mar 2004 | Location: UK | Distribution: CentOS 6/7 | Posts: 1,375
The answer, as I think I advised above in different words, is that the data should be pre-processed prior to being added to the CSV. It could be done with any language, though; it just requires writing the data to a different location and then processing it. A Perl cron job would be better than bash for this: while it could be done with bash, Perl should be more consistent and cross-platform.
04-27-2017, 09:19 AM | #14
Member | Registered: Jan 2017 | Location: Manhattan, NYC NY | Distribution: Mac OS X, iOS, Solaris | Posts: 508
Quote:
Originally Posted by syg00
I fail to see how simple substitution like this would be any faster in perl.
First, Perl is more efficient at regular expressions and, probably more importantly, the script would run in one process. You wouldn't have to fork() for every call to sed. You wouldn't need mv, since with Perl you can just save the altered data in a new directory and optionally delete the original with unlink().
Last edited by Laserbeak; 04-27-2017 at 09:21 AM.
04-27-2017, 09:42 AM | #15
LQ Guru | Registered: Apr 2010 | Location: Continental USA | Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro | Posts: 6,131
To extend upon the good advice above:
You are not using bash to do the substitutions. You are using bash to call sed MULTIPLE times to do the grunt work.
There is one disk read for the script, then one for every sed command, and one for every reference to the file. This adds up to a lot of I/O.
Engines like Perl do not need to make external calls to outside programs, so they only load from disk ONCE. If the data can all fit in memory, they also need only one massive read and one massive write for all of the file I/O. This reduces the total I/O delay greatly. While both sed and Perl are highly optimized for this, Perl is the more general and efficient choice for this particular case.
Even if you have to work on a block (or line) at a time, the principle of doing one read, making all of the substitutions on the buffer, then writing that out and reading the next would speed things up.
Other general languages with similar characteristics abound, but the principle is to use each tool for what it is best at. This case is just not optimal for bash and sed together.
1 member found this post helpful.