LinuxQuestions.org


sysmicuser 02-14-2013 09:08 PM

Is that limitation of shell script?
 
3 Attachment(s)
Hi Guys,

I think this may be a limitation of shell scripting. But I believe in Unix and shell scripts, so I am taking it to the next level, meaning asking on this forum.

The task is to compare file1 (attached) with file2 (attached); interested_nos should be the file containing the numbers that are in file1 but not in file2 after the comparison.

This task is accomplished in MS Excel in a fraction of a second, so why does the shell script take 2.5 hours? Moreover, it doesn't do what it is expected to do.

Please note that, to meet the attachment size limit (no more than 256 KB), I have truncated approx. 3K records from file1.

I have also attached the shell script that is meant to do the job, but it is futile: after 2.5 hours it had not done what was expected.


I am curious to know why that is the case :(:(

Your assistance would be highly appreciated.

P.S.: I have attached file1 and file2 after sorting them.

stormpunk 02-14-2013 10:24 PM

Looks to me like you're reading the entire file2 for every value of file1, which is 22k times.

There are better ways to do this, but you could read both files into arrays and then go through them numerically. Since your script references two files that have "sorted" in the filename, you have no reason to re-read the low values of file2. Each time you find a match, short-circuit back to file1 and work through file2 from where you left off.

Here's a link for part of what I said in case you need help with that.
http://www.linuxquestions.org/questi...newbie-545840/

If you can, just use the diff program, or look at some implementations of it for all kinds of inspiration.
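
The "work file2 from where you left off" idea can be sketched roughly as below. This is only an illustration, not the script from the thread: the function name only_in_first is made up, it streams the two files through file descriptors instead of loading arrays, and it assumes one integer per line, no blank lines, Unix line endings, and both files sorted numerically in ascending order.

```shell
#!/bin/bash
# Sketch only: print the values that are in the first file but not in the second.
# Assumptions: one integer per line, no blank lines, no \r, both files sorted
# numerically ascending. only_in_first is a hypothetical name.
only_in_first() {
    local a b have_b=0
    # fd 3 reads the first file, fd 4 the second; each file is read exactly once
    exec 3< "$1" 4< "$2"
    IFS= read -r b <&4 && have_b=1
    while IFS= read -r a <&3; do
        # advance the second file past values smaller than the current one
        while [ "$have_b" -eq 1 ] && [ "$b" -lt "$a" ]; do
            have_b=0
            IFS= read -r b <&4 && have_b=1
        done
        # second file exhausted, or its current value differs: keep this one
        if [ "$have_b" -eq 0 ] || [ "$b" -ne "$a" ]; then
            printf '%s\n' "$a"
        fi
    done
    exec 3<&- 4<&-
}
```

It would be called as something like only_in_first file1.sorted file2.sorted > interested_nos. Because neither file is ever rewound, the work grows with the combined length of the two files rather than with their product.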

shivaa 02-14-2013 11:17 PM

file2 has a carriage return (^M) at the end of each line, because it was copied from a Windows system.
So, first remove all carriage returns from file2 and then run the script again:
Code:

~$ awk '{gsub(/\r/,"",$0);print $0}' file2.sorted  > /tmp/sorted.txt
-------- OR --------
~$ sed -e 's/\r//g' file2.sorted > /tmp/sorted.txt
~$ cat /tmp/sorted.txt > file2.sorted; rm /tmp/sorted.txt

Then run your script:
Code:

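#!/bin/bash
#set -xv
# For each number in file1.sorted, scan file2.sorted for an equal value.
while read -r rline
do
    f=0    # set to 1 once a match is found in file2.sorted
    while read -r cline
    do
        if [ "$rline" -eq "$cline" ] ; then
            f=1
            break
        fi
    done < file2.sorted
    # No match found: the number is only in file1.sorted
    if [ "$f" -ne 1 ] ; then
        echo "$rline" >> interested_nos
    fi
done < file1.sorted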
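
To check whether a file still contains carriage returns before running the script, something like the following might help. The helper name has_cr is made up for illustration. A stray \r matters because a value such as "5\r" is not a valid integer for the -eq test, so in bash the comparison typically fails with an "integer expression expected" error instead of matching.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) if the file contains
# at least one carriage return character.
has_cr() {
    local cr
    cr=$(printf '\r')   # command substitution strips trailing newlines, not \r
    grep -q "$cr" "$1"
}
```

For example, has_cr file2.sorted && echo "file2.sorted still has Windows line endings".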


chrism01 02-14-2013 11:22 PM

Does this give the numbers you expect?
Code:

comm -23 file*|wc -l
13636

comm -13 file*|wc -l
1981

comm http://linux.die.net/man/1/comm

Basically, test with known file diffs to check, but I think you'll find this works.

Note that I had to convert file2.txt to Unix end-of-line format.
MS uses \r\n, *nix uses \n.
Best to use the dos2unix http://linux.die.net/man/1/dos2unix command on all the files before using any *nix tools.

Your method is slow because it does a lot of comparisons (every line of file1 against the lines of file2) and shell script is an interpreted language.

comm is compiled C
Code:

ldd /usr/bin/comm
        linux-vdso.so.1 =>  (0x00007fff2ef59000)
        libc.so.6 => /lib64/libc.so.6 (0x00000033ece00000)
        /lib64/ld-linux-x86-64.so.2 (0x00000033eca00000)
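
For reference, comm prints three columns: lines only in the first file, lines only in the second, and lines common to both. The -23 option suppresses columns 2 and 3, leaving the lines unique to the first file; -13 leaves the lines unique to the second. comm expects its inputs sorted in the collating order of the current locale, which is not the same as a numeric sort. A rough sketch that normalises both files first is shown below; the function name comm_only_first and the use of tr instead of dos2unix are just choices for this example.

```shell
#!/bin/bash
# Sketch: print lines that occur only in the first file, using comm.
# comm_only_first is a hypothetical name, not something from the thread.
comm_only_first() {
    local t1 t2
    t1=$(mktemp) && t2=$(mktemp) || return 1
    # strip carriage returns, then sort in the plain byte order comm will use
    tr -d '\r' < "$1" | LC_ALL=C sort > "$t1"
    tr -d '\r' < "$2" | LC_ALL=C sort > "$t2"
    # -2 drops lines unique to the second file, -3 drops lines common to both
    LC_ALL=C comm -23 "$t1" "$t2"
    rm -f "$t1" "$t2"
}
```

Note that the output comes back in byte order (so 10 sorts before 2); pipe it through sort -n if numeric order is wanted.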


grail 02-15-2013 12:42 AM

I would agree that diff or comm are probably a better choice, but should you need to do more with the data, the following returned instantaneously on the current data:
Code:

awk 'FNR==NR{_[$0];next}$0 in _' file2.txt file1.txt

I chose this order of the files so the smaller one was read into the array. This was after the assumed line-ending problem was fixed, but awk can cope with that too if you will be working with Windows-based files regularly (let me know if required).
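
For anyone wondering what the awk one-liner does, here is an expanded sketch. The function names and the array name seen are made up for illustration. As written, the one-liner prints the lines of file1.txt that also occur in file2.txt; negating the final test gives the lines that are only in file1.txt, which is what the original task asks for. The sub() call is one assumed way of coping with Windows line endings.

```shell
#!/bin/bash
# Illustrative helpers; the function names are not from the thread.

# Lines of $1 that also occur in $2 (same logic as the one-liner).
lines_in_both() {
    awk '
        { sub(/\r$/, "") }            # drop a trailing carriage return, if any
        FNR == NR { seen[$0]; next }  # first file given to awk: store each line as a key
        $0 in seen                    # second file: print lines that were stored
    ' "$2" "$1"
}

# Lines of $1 that do not occur in $2 (the final test is negated).
lines_only_in_first() {
    awk '
        { sub(/\r$/, "") }
        FNR == NR { seen[$0]; next }
        !($0 in seen)
    ' "$2" "$1"
}
```

FNR is the line number within the current file and NR the line number overall, so FNR == NR holds only while awk is reading the first file on its command line. Usage would be along the lines of lines_only_in_first file1.txt file2.txt > interested_nos.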

sysmicuser 02-17-2013 06:05 PM

@chrism01

I did try comm the other day (after the files were sorted and converted to Unix format using the dos2unix command), but that didn't work.

However, today I did the same and it worked wonderfully! There is essentially no need for that script; with comm it works beautifully. Thank you very much for the enlightenment :)

@grail

May I ask what that command is doing?
Code:

awk 'FNR==NR{_[$0];next}$0 in _' file2.txt file1.txt

I redirected the output to a third file and compared it with interested_nos, where interested_nos was obtained using the following command.
Code:

comm -23 file1.sorted file2.sorted >> interested_nos

Please help me understand this.

Thank you.

chrism01 02-17-2013 06:14 PM

Glad it helped; add that to your bookmarks/mental list :)


All times are GMT -5.