Is this a limitation of shell scripts?
Hi Guys,
I think this may be a limitation of shell scripting, but I believe in Unix and shell scripts, so I am taking it to the next level, meaning asking on this forum. The task is to compare file1 (attached) with file2 (attached); interested_nos should be the file containing the numbers that are in file1 and not in file2 after the comparison. MS Excel accomplishes this in a fraction of a second, so why does the shell script take 2.5 hours? Moreover, it doesn't do what it is expected to do. Please note that, to meet the attachment size limit (no more than 256KB), I have truncated approximately 3K records from file1. I have also attached the shell script that is supposed to do the job, but it is futile: after 2.5 hours it had not produced the expected result. I am curious to know why that is the case :( Your assistance would be highly appreciated. P.S.: file1 and file2 are attached after being sorted. |
Looks to me like you're reading the entire file2 for every value in file1, which is 22k times.
There are better ways to do this, but you could read both files into arrays and then walk through them numerically. Since you reference two files that have "sorted" in the filename, you have no reason to re-read the low values of file2. Each time you find a match, short-circuit back to file1 and resume file2 from where you left off. Here's a link for part of what I said in case you need help with that: http://www.linuxquestions.org/questi...newbie-545840/ If you can, just use the diff program, or look at some implementations of it for all kinds of inspiration. |
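The sorted walk suggested above can be sketched roughly as follows. This is only a minimal sketch, not the original poster's script: `sorted_diff` is a made-up name, and it assumes one integer per line, both files sorted in ascending numeric order, Unix line endings, and a trailing newline on the last line. Each file is read exactly once, so the work grows with the combined length of the files instead of their product.

```shell
#!/bin/sh
# Hypothetical helper: print the numbers found in sorted file $1
# that do not appear in sorted file $2.
# Assumes integers, one per line, ascending numeric order, no \r.
sorted_diff() {
    {
        have_b=1
        IFS= read -r b <&4 || have_b=0      # first value of file2, if any
        while IFS= read -r a <&3; do
            # Skip file2 values that are smaller than the current file1 value.
            # file2 is never rewound: we continue from where we left off.
            while [ "$have_b" -eq 1 ] && [ "$b" -lt "$a" ]; do
                IFS= read -r b <&4 || have_b=0
            done
            # No match if file2 is exhausted or its current value differs.
            if [ "$have_b" -eq 0 ] || [ "$b" -ne "$a" ]; then
                printf '%s\n' "$a"
            fi
        done
    } 3<"$1" 4<"$2"                         # fd 3 = file1, fd 4 = file2
}
```

Usage would be something like `sorted_diff file1.sorted file2.sorted > interested_nos`, after any carriage returns have been stripped from both files.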
File2 has a carriage return (^M) at the end of each line because it was copied from a Windows system.
So first remove all carriage returns from file2, then run the script again: Code:
~$ awk '{gsub(/\r/,"",$0);print $0}' file2.sorted > /tmp/sorted.txt
Code:
#!/bin/bash |
Does this give the numbers you expect?
Code:
comm -23 file*|wc -l
Basically, test with known file diffs to check, but I think you'll find this works. Note that I had to convert file2.txt to Unix end-of-line format: MS uses \r\n, *nix uses \n. It's best to run the dos2unix command (http://linux.die.net/man/1/dos2unix) on all the files before using any *nix tools. Your method is slow because it does a lot of comparisons and shell script is an interpreted language; comm is compiled C.
Code:
ldd /usr/bin/comm |
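For reference, the convert-then-comm approach above could be wrapped up roughly like this. It is a sketch under assumptions: the function names `normalise` and `only_in_first` are invented for illustration, `tr -d '\r'` stands in for dos2unix, and the files are re-sorted with `LC_ALL=C sort` because comm expects its inputs in collation order, which is not the same as numeric order (for example, "100" sorts before "9").

```shell
#!/bin/sh
# Hypothetical helper: strip carriage returns and sort the way comm expects.
normalise() {
    tr -d '\r' < "$1" | LC_ALL=C sort
}

# Hypothetical helper: print lines that are in $1 but not in $2.
only_in_first() {
    tmp1=$(mktemp)
    tmp2=$(mktemp)
    normalise "$1" > "$tmp1"
    normalise "$2" > "$tmp2"
    # -2 suppresses lines unique to the second file,
    # -3 suppresses lines common to both, leaving lines unique to the first.
    LC_ALL=C comm -23 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}
```

Something like `only_in_first file1.txt file2.txt | wc -l` then gives the count without depending on how the inputs were sorted or which system they came from.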
I would agree that diff or comm is probably a better choice, but should you need to do more with the data, the following returned instantaneously on the current data: Code:
awk 'FNR==NR{_[$0];next}$0 in _' file2.txt file1.txt
Carriage-return removal can be added to that too if you will be working with Windows-based files regularly (let me know if required) |
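On what the awk one-liner above does: `FNR==NR` is true only while awk is reading the first file on the command line, so each line of file2.txt becomes a key in the array `_` and `next` skips to the following line; for file1.txt, `$0 in _` prints a line only if it was seen in file2.txt. As written, that prints the lines the two files have in common; negating the test gives the lines in file1 that are not in file2, which is what `comm -23` reports. A sketch of both variants, with made-up function names, an array renamed to `seen` for readability, and the assumption that neither file is empty:

```shell
#!/bin/sh
# Hypothetical helper: lines of $1 that also occur in $2
# (same logic as the one-liner, array renamed from _ to seen).
in_both() {
    awk 'FNR==NR { seen[$0]; next } $0 in seen' "$2" "$1"
}

# Hypothetical helper: lines of $1 that do NOT occur in $2.
# A trailing carriage return is stripped from every line first,
# so Windows-style files compare equal to Unix-style ones.
not_in_second() {
    awk '{ sub(/\r$/, "") }
         FNR==NR { seen[$0]; next }
         !($0 in seen)' "$2" "$1"
}
```

For example, `not_in_second file1.txt file2.txt > interested_nos` should match the comm result, apart from output order, since awk keeps the original order of file1.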
@chrism01
I did try comm the other day (after the files were sorted and converted to Unix format with the dos2unix command), but that didn't work. However, today I did the same and it worked wonderfully! There is essentially no need for that program; with comm it works beautifully. Thank you very much for the enlightenment :)
@grail May I ask what that command is doing? Code:
awk 'FNR==NR{_[$0];next}$0 in _' file2.txt file1.txt
I redirected the output to a third file and compared it with interested_nos, where interested_nos was obtained using the following command. Code:
comm -23 file.sorted file2.sorted >> interested_nos
Thank you. |
Glad it helped; add that to your bookmarks/mental list :)
|