Splitting the source file by the code (first substring) in the 4th (AcctID) column

bishnumnnit2006 · 02-26-2018, 09:42 AM

Hi everyone, I need your help with an issue I am facing.

i) I have a source file (Src.txt) with 4 columns. Column 4 contains substrings separated by '-', as shown below.

ii) The first substring (like AAR, ABC, TRP...) in AcctID (column 4) is the client code.

ERUID|GroupID|GroupName|AcctID
6|ERU_MCSD_POS|ERU_MCSD_POS|AAR-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|generic Accounts|ABC-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Accounts|TRP-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Generic|TIA-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Generic Accounts|DEF-IAS-AR8F92001T2-C

I also have lookup files (lkp1.txt and lkp2.txt) with the client codes separated by spaces, as shown below.

cat lkp1.txt
AAR ABC TIA

cat lkp2.txt
TRP

Now I want to read the lookup files and the main file and compare the client codes.

The logic is:

1) From the main file, split the AcctID column into 5 substrings and take the first substring. This is the client code.

2) If the client code matches one of the client codes in lkp1.txt, send the record to one target file.

3) If the client code matches one of the client codes in lkp2.txt, send the record to the other target file.

I have developed the code, but it takes a long time to compare and split.

I know this can be done with awk and that it is pretty fast, but I am new to awk. Please help with a solution.

My code is below:

while read -r LINE
do
    CODE=`echo $LINE |awk -F '|' '{ split($4,a,"-"); print a[1]}'`
    while read -r LINE1
    do
        LKP_AMR1=`echo $LINE1`
        if [[ $CODE = $LKP_AMR1 ]]
        then
            `touch <Path>/<File_Name>
            `echo $LINE >> <Path>/<File_Name>
        fi
    done < ${XmrEtlSrcFiles}/lkp1.txt
    while read -r LINE2
    do
        LKP_AMR2=`echo $LINE2`
        if [[ $CODE = $LKP_AMR2 ]]
        then
            `touch <Path>/<File_Name>
            `echo $LINE >> <Path>/<File_Name>
        fi
    done < ${XmrEtlSrcFiles}/lkp2.txt
done < ${XmrEtlSrcFiles}/Src.txt

rtmistler · 02-26-2018, 09:47 AM

This is more of a programming question, so I am moving it to the Programming forum. A large number of members browse that forum and can help a great deal.

Please edit your original post and put the code within [code][/code] tags to separate it from your text and preserve the formatting to make it easier to read.

Do you know what section of this is taking the most time where you wish to improve the performance?

bishnumnnit2006 · 02-26-2018, 10:35 AM

Actually, my source file contains millions of records, so reading the data in a while loop takes a long time. AWK is very fast at reading data, but I don't know much about AWK. Please help.

scasey · 02-26-2018, 11:02 AM

Please do edit your OP to use code tags.
Am I reading it right here:
Loop the big file, parse the compare string into $CODE
Loop the lkp1.txt file (1), see if there's a match,
If a match, write to the output file (2)
Loop the lkp2.txt file (1), see if there's a match
If a match, write to the output file (2)
???
My comments:
(1) Why loop over the lookup files? If they are only one line each, just read them. I'd guess the loop requires two passes to find out it's at the end. Or better, read the values into an array and test against the array; that removes millions of disk I/Os.
(2) Why touch the output file before appending to it? The echo is going to update the timestamp and create the file if it doesn't exist.
Those extra steps on "millions of records" will add considerably to processing time, I'd think.

Just some thoughts.
Re awk: What have you tried?
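
For what it's worth, the array idea in (1) could be sketched in awk roughly like this. It is only a sketch, not tested against the real data: the output names target1.txt and target2.txt and the temporary directory are made up for illustration, and it assumes the lookup codes are separated by whitespace. The lookup files are read once into arrays, then Src.txt is read in a single pass, so no per-line subprocess is started.

```shell
#!/usr/bin/env bash
# Sketch only: target1.txt, target2.txt and the temp directory are assumed names.
workdir=$(mktemp -d)

# Sample data copied from the first post.
cat > "$workdir/Src.txt" <<'EOF'
ERUID|GroupID|GroupName|AcctID
6|ERU_MCSD_POS|ERU_MCSD_POS|AAR-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|generic Accounts|ABC-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Accounts|TRP-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Generic|TIA-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Generic Accounts|DEF-IAS-AR8F92001T2-C
EOF
echo 'AAR ABC TIA' > "$workdir/lkp1.txt"
echo 'TRP' > "$workdir/lkp2.txt"

awk -v out1="$workdir/target1.txt" -v out2="$workdir/target2.txt" '
    # Lookup files: default whitespace splitting, every code becomes an array key.
    FILENAME == ARGV[1] { for (i = 1; i <= NF; i++) set1[$i] = 1; next }
    FILENAME == ARGV[2] { for (i = 1; i <= NF; i++) set2[$i] = 1; next }
    # Main file: fields are split on "|" (set by the FS operand below).
    {
        split($4, part, "-")              # part[1] is the client code
        if (part[1] in set1) print > out1
        if (part[1] in set2) print > out2
    }
' "$workdir/lkp1.txt" "$workdir/lkp2.txt" FS='|' "$workdir/Src.txt"
```

The `FS='|'` operand placed between the file names changes the field separator only from Src.txt onward. Records whose code is in neither lookup file (DEF here, and the header line) are simply not written anywhere.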

BW-userx · 02-26-2018, 11:10 AM

this is assignment not comparison

Code:

if [[ $CODE = $LKP_AMR1 ]]

so it fails

finding substrings

Code:

string1="is in the hat without boots"
string2="is a cow without milk"
substring=is

if [[ $string1 =~ $substring ]] || [[ $string2 =~ $substring ]]; then echo "found"; fi

Both strings will match, whereas directly comparing the whole string against the substring will never find the substring you're looking for.

example

Code:

 
userx@slackwhere101:~
$ string="my hat is lost in the field of dreams"
 
 
$ if [[ "$string" == 'in' ]] ; then echo "hat" ; fi

# Returned nothing because it is a direct comparison,
# whereas substring matching:

 
$ if [[ "$string" =~ 'in' ]] ; then echo "hat" ; fi
hat

returns a hit.

keefaz · 02-26-2018, 11:17 AM

Quote:

Originally Posted by BW-userx

this is assignment not comparison

Code:

if [[ $CODE = $LKP_AMR1 ]]

so it fails

No, it's a comparison in shell language

Code:

a=2
[[ $a = 1 ]] || echo "nope..."

BW-userx · 02-26-2018, 11:37 AM

Quote:

Originally Posted by keefaz

No, it's a comparison in shell language

Code:

a=2
[[ $a = 1 ]] || echo "nope..."

are you serious?

Code:

userx@slackwhere101:~
$ a=1
 
$ [[ $a = 4 ]] && echo "poop" || echo "nope"
nope
 
$ [[ $a = 1 ]] && echo "poop" || echo "nope"
poop
 
$ [[ $string = 'os' ]] && echo "$string"

 
$ [[ $string =~ 'os' ]] && echo "$string"
my hat is lost in the field of dreams

I guess I have to stand corrected.

The point about substrings stays the same, though.

thanks .. saves my finger tips a little.

rtmistler · 02-26-2018, 12:11 PM

There's a difference between integer comparisons and string comparisons. And some other complications.

The recommendation I always go with is to do exactly what you've done BW-userx, which is to test/prove what I'm scripting, so as to be sure I'm making the correct comparison.

Even reading the bash documentation is confusing (for me) when it comes to this topic.
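
A small test along those lines, as a sketch in bash (the variable names are just for illustration): `=` inside `[[ ]]` compares strings, `-eq` compares integers, and an unquoted right-hand side of `=` is treated as a pattern.

```shell
#!/usr/bin/env bash
a=01

# String comparison: "01" and "1" are different strings.
if [[ $a = 1 ]]; then str_result=equal; else str_result=different; fi

# Integer comparison: -eq evaluates both sides as numbers.
if [[ $a -eq 1 ]]; then int_result=equal; else int_result=different; fi

# Unquoted right-hand side is a glob pattern; quoted, it is a literal string.
if [[ $a = 0* ]]; then glob_result=match; else glob_result=nomatch; fi
if [[ $a = "0*" ]]; then literal_result=match; else literal_result=nomatch; fi

echo "string: $str_result, integer: $int_result"
echo "pattern: $glob_result, quoted: $literal_result"
```

This prints "string: different, integer: equal" and "pattern: match, quoted: nomatch".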

pan64 · 02-26-2018, 12:20 PM

It would be nice if you posted the real code you tried to execute. Also, please use [code]here comes your script[/code] tags to keep the formatting.

Code:

`touch <Path>/<File_Name>

is syntactically incorrect: <Path> is invalid, and the closing backtick is missing.
Use $( command ) instead of ` command ` (backticks).

Code:

LKP_AMR2=`echo $LINE2`
# is extremely inefficient, use
LKP_AMR2="$LINE2"
# instead
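
The same idea applies to the `echo $LINE | awk` call that extracts the code for every record. A rough sketch of doing that step with bash built-ins only, so no external command is started per line (the field variable names are made up here, and the sample line is taken from the first post):

```shell
#!/usr/bin/env bash
LINE='6|ERU_MCSD_POS|ERU_MCSD_POS|AAR-IAS-AR8F92001T2-C'

# Split the record on "|" with the read builtin instead of piping to awk.
IFS='|' read -r eruid groupid groupname acctid <<< "$LINE"

# Remove everything from the first "-" onward; what is left is the client code.
CODE=${acctid%%-*}

echo "$CODE"
```

This prints AAR for the sample line.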

also please try to use shellcheck

by the way both requirements (2 and 3 in your first post) can be implemented with a single grep.

bishnumnnit2006 · 02-26-2018, 12:29 PM

Thanks, guys, for responding. Just to let you know, the code I shared is tested and works fine, but only for a few records. For a large volume of data it takes a long time to split. Please help me improve the performance, or suggest another way to solve this.

rtmistler · 02-26-2018, 12:30 PM

I agree with pan64 that I'd like to see more code.

The problem statement may be incorrectly typed in. Says to split the AcctID into 5 fields. But there are not 5 fields in that term.

Why not search for the client code in every line, and filter to destination files first, and then reprocess the resultant files to eliminate any outlying terms that shouldn't have been copied?

pan64 · 02-26-2018, 12:36 PM

Again, that is a single grep, nothing more:
fgrep '|AAR' inputfile > outfile
You need to modify it a bit if you want to read the expressions from a file; see man bash.
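
One possible way to read the expressions from the lookup files, as a sketch only: turn each code into an anchored pattern and pass the pattern file to grep with -f. The names pat1.txt, target1.txt and so on are invented for illustration, and it assumes the codes contain no regex metacharacters. Anchoring on the three leading fields keeps a code from matching in the GroupID or GroupName columns, which a plain '|AAR' could do.

```shell
#!/usr/bin/env bash
# Sketch only: pat*.txt, target*.txt and the temp directory are assumed names.
workdir=$(mktemp -d)

# Sample data copied from the first post.
cat > "$workdir/Src.txt" <<'EOF'
ERUID|GroupID|GroupName|AcctID
6|ERU_MCSD_POS|ERU_MCSD_POS|AAR-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|generic Accounts|ABC-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Accounts|TRP-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Generic|TIA-IAS-AR8F92001T2-C
6|ERU6_ARCGEN|Archive Generic Accounts|DEF-IAS-AR8F92001T2-C
EOF
echo 'AAR ABC TIA' > "$workdir/lkp1.txt"
echo 'TRP' > "$workdir/lkp2.txt"

for n in 1 2; do
    # One code per line, then wrap each code as: three "|"-terminated fields, code, "-".
    tr -s ' ' '\n' < "$workdir/lkp$n.txt" |
        sed 's/.*/^[^|]*|[^|]*|[^|]*|&-/' > "$workdir/pat$n.txt"
    # grep reads one basic regular expression per line from the pattern file.
    grep -f "$workdir/pat$n.txt" "$workdir/Src.txt" > "$workdir/target$n.txt"
done
```

With the sample data, target1.txt gets the AAR, ABC and TIA records and target2.txt gets the TRP record. This reads Src.txt once per lookup file.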

bishnumnnit2006 · 02-26-2018, 12:37 PM

Please ignore the first point, where it says to split into 5 substrings. Just think of it as extracting the first substring from the 4th column (that is the client code) and comparing it with the first lookup file; if it matches, send the record to one target file. Then compare it with the second lookup file and, if it matches, send the record to the other target file.

bishnumnnit2006 · 02-26-2018, 12:40 PM

@pan64 The code is not static; we can get any code. There are around 20 client codes, so we have a single file with multiple client codes. The lookup files can contain any of the codes, so our code should be smart enough to handle that.

BW-userx · 02-26-2018, 12:40 PM

perhaps even post a chunk of real data so one knows what you're really working with here.