Splitting the source file by the code (1st substring) in the 4th (AcctID) column
This is more of a programming question, so I'm moving it to the Programming forum. A larger number of members browse that forum, and they can help a great deal.
Please edit your original post and put the code within [code][/code] tags to separate it from your text and preserve the formatting to make it easier to read.
Do you know what section of this is taking the most time where you wish to improve the performance?
Actually, my source file contains millions of records, so reading the data in a while loop takes a long time. AWK is very fast at reading data, but I don't know much about AWK. Please help.
Please do edit your OP to use code tags.
Am I reading it right here:
Loop the big file, parse the compare string into $CODE
Loop the lkp1.txt file (1), see if there's a match,
If a match, write to the output file (2)
Loop the lkp2.txt file (1), see if there's a match
If a match, write to the output file (2)
???
My comments:
(1) Why loop the lookup files? If they are only one line each, just read them. I'd guess the loop requires two passes to find out it's at the end. Or better, read the values into an array and test against the array; that removes millions of disk I/Os.
(2) Why touch the output file before appending to it? The echo is going to update the date and create the file if it doesn't exist.
Those extra steps on "millions of records" will add considerably to processing time, I'd think.
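To illustrate the array suggestion in (1) and the point about output files in (2), here is a rough sketch, not the original poster's script. It assumes bash 4 or later, pipe-delimited records, one code per line in lkp1.txt and lkp2.txt, and a "-" separating the code from the rest of AcctID. The sample data and the names source.txt, target1.txt and target2.txt are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: the sample data, the "-" separator inside AcctID, and the
# names source.txt / target1.txt / target2.txt are assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' AAR BBX > lkp1.txt      # lookup file 1, one code per line (assumed)
printf '%s\n' CCZ > lkp2.txt          # lookup file 2
cat > source.txt <<'EOF'
1|x|y|AAR-001-A|z
2|x|y|CCZ-002-B|z
3|x|y|QQQ-003-C|z
EOF

# Read each lookup file exactly once, into an associative array.
declare -A set1 set2
while IFS= read -r code; do
    if [[ -n $code ]]; then set1[$code]=1; fi
done < lkp1.txt
while IFS= read -r code; do
    if [[ -n $code ]]; then set2[$code]=1; fi
done < lkp2.txt

# Open both output files once for the whole loop (descriptors 3 and 4)
# instead of touching and appending to them for every record.
while IFS= read -r line; do
    IFS='|' read -r _ _ _ acct _ <<< "$line"   # 4th column is AcctID
    code=${acct%%-*}                           # first substring of AcctID
    if [[ -z $code ]]; then continue; fi
    if [[ -n ${set1[$code]:-} ]]; then printf '%s\n' "$line" >&3; fi
    if [[ -n ${set2[$code]:-} ]]; then printf '%s\n' "$line" >&4; fi
done < source.txt 3> target1.txt 4> target2.txt
```

A bash read loop is still slow on millions of records, so this mainly shows how to avoid re-reading the lookup files for every line.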
# string1 is in the hat without boots
# string2 is a cow without milk
substring=is
if [[ "$string1" =~ "$substring" ]] || [[ "$string2" =~ "$substring" ]] ;
Both strings will return a match, whereas a direct (equality) comparison against the whole string will never find the substring you're looking for.
example
Code:
userx@slackwhere101:~
$ string="my hat is lost in the field of dreams"
$ if [[ "$string" == 'in' ]] ; then echo "hat" ; fi
# Returned nothing because == is a direct (whole-string) comparison,
# whereas a substring match with =~ succeeds:
$ if [[ "$string" =~ 'in' ]] ; then echo "hat" ; fi
hat
There's a difference between integer comparisons and string comparisons, and there are some other complications.
The recommendation I always go with is to do exactly what you've done, BW-userx, which is to test and prove what I'm scripting, so as to be sure I'm making the correct comparison.
Even reading the bash documentation is confusing (for me) when it comes to this topic.
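A small test script in the same spirit may make the distinction easier to see. This is only a sketch with made-up values; it contrasts an integer comparison, a string comparison, and a glob match inside bash's [[ ]].

```shell
#!/usr/bin/env bash
a=10
b=9

# -gt compares the values as integers: 10 is greater than 9.
if [[ $a -gt $b ]]; then int_result=greater; else int_result=not-greater; fi

# > inside [[ ]] compares the values as strings, character by character:
# "1" sorts before "9", so "10" is not greater than "9".
if [[ $a > $b ]]; then str_result=greater; else str_result=not-greater; fi

# == against an unquoted pattern does glob matching, which is another way
# to test for a substring without using a regular expression.
string="my hat is lost in the field of dreams"
if [[ $string == *in* ]]; then glob_result=match; else glob_result=no-match; fi

echo "integer: $int_result, string: $str_result, glob: $glob_result"
```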
Thanks for responding, guys. Just to let you know, the code I have shared is tested and works fine, but only for a small number of records. For a large volume of data it takes far too long to split. Please help me improve the performance, or suggest another way to solve this.
I agree with pan64 that I'd like to see more code.
The problem statement may be incorrectly typed in. Says to split the AcctID into 5 fields. But there are not 5 fields in that term.
Why not search for the client code in every line, and filter to destination files first, and then reprocess the resultant files to eliminate any outlying terms that shouldn't have been copied?
Again, that is a single grep, nothing more:
fgrep '|AAR' inputfile > outfile
You need to modify it a bit if you want to read the expressions from a file; see man bash.
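One way to read the expressions from a file is grep's own -f option together with -F (fixed strings, which is what fgrep does). The sketch below assumes lkp1.txt holds one code per line with no blank lines; the sample data and the names patterns1.txt and outfile1 are made up for the example.

```shell
# Sketch only: sample data and the names patterns1.txt / outfile1 are
# assumptions for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' AAR BBX > lkp1.txt
cat > inputfile <<'EOF'
1|x|y|AAR-001-A|z
2|x|y|CCZ-002-B|z
3|x|y|BBX-003-C|z
EOF

# Turn each code into the fixed string "|CODE", one pattern per line.
sed 's/^/|/' lkp1.txt > patterns1.txt

# -F: treat patterns as fixed strings; -f: read the patterns from a file.
grep -F -f patterns1.txt inputfile > outfile1
```

Note that, like the single-pattern version, this matches "|CODE" anywhere on the line, not only at the start of the 4th column, so it could pick up extra records if a code also appears at the start of another field.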
Please ignore the first point, where it says to split into 5 substrings. Just think of it this way: I am extracting the first substring from the 4th column, which is the code, and comparing it with the 1st lookup file. If it matches, the record goes to one target file. It is then compared with the 2nd lookup file, and if it matches, the record goes to the other target file.
@pan64 The code is not static; we can get any type of code. There are around 20 client codes, so we have a single file with multiple client codes. The lookup files can contain any of the codes at random, so our code should be smart enough to handle those.
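Since AWK was asked about, here is a minimal sketch of how the described routing might look in a single awk pass. It is written under assumptions rather than from the real data: pipe-delimited records (suggested by the '|AAR' pattern above), one code per line in each lookup file, a "-" between the code and the rest of AcctID, and made-up names source.txt, target1.txt and target2.txt.

```shell
# Sketch only: the sample data, the "-" separator inside AcctID, and the
# names source.txt / target1.txt / target2.txt are assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' AAR BBX > lkp1.txt
printf '%s\n' CCZ > lkp2.txt
cat > source.txt <<'EOF'
1|x|y|AAR-001-A|z
2|x|y|CCZ-002-B|z
3|x|y|QQQ-003-C|z
4|x|y|BBX-004-D|z
EOF

awk -F'|' '
    # Lookup files: remember every non-empty code, keyed by lookup file.
    FILENAME == ARGV[1] { if ($0 != "") set1[$0] = 1; next }
    FILENAME == ARGV[2] { if ($0 != "") set2[$0] = 1; next }
    # Source file: the first substring of column 4 is the client code.
    {
        split($4, part, "-")
        code = part[1]
        if (code in set1) print > "target1.txt"
        if (code in set2) print > "target2.txt"
    }
' lkp1.txt lkp2.txt source.txt
```

The lookup files are read once into arrays and the source file is read once, so nothing is re-scanned per record. If the real separator inside AcctID is something other than "-", only the split() call should need to change.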