LinuxQuestions.org


ahjiefreak 12-12-2007 07:20 AM

How to match element in a file using Bash
 
Hi,

I am trying to work through an example in shell: finding an item from one file and matching it in another file. I have two files initially, file A and file B.

A very simple example:-

In file A, I have columns of fields such that:-

aaa 107
bbb 108
ccc 109

In a file B, I have columns of fields such that:-
101 2 1
102 3 1
107 2 1
108 3 1
109 2 1

I would like to know how to extract, let's say, the first element of file A and compare it with the elements of file B.

If it is found, I would like to get the position of element 107 in file B; in this case it is on the 3rd line. Then I would like to perform some computation on the fields of the matching line in file B.

Next, I would read the second element of file A, which is "bbb 108", and open another file B which is quite similar to the first file B.

For now, I assume I have only one file A and one file B to compare.

I tried the following:-


#!/bin/bash

cat a.txt| while read LINE
do
#grab the second element of a.txt one line at a time
char=`echo "${LINE}"| awk '{print $2}'`

echo $char
#grab the line number of the element found in b.txt
i=`grep -n "^$char" b.txt|tr ":" " "|awk '{ print $1}'`
echo $i

#grab the particular line number fields in b.txt to do computation
cat b.txt|awk -v h=$i '{

count[$h]=$2+$3;

}
{
printf("Count position %d is %d\n",h,count[$h]);
}'
done

However, the weird thing is that the output I get is the sum of the last $2 and $3 for all the entries, which is

Count position 2 is 6 # which is 2+4
Count position 1 is 6 # which is 2+4
Count position 3 is 6 # which is 2+4

My desired output for each comparison would be something like:-

Count position 2 is 5 # which is 2+3

The same goes for the other matching patterns when I open another b.txt to compare with the second line of a.txt.

Could anyone tell me what is wrong with the above data structure?
Maybe I missed some important structures in the above.

Thanks.


-ahjiefreak

matthewg42 12-12-2007 07:33 AM

You can use the join command to do a line-by-line lookup of values between two files. You just have to specify which field is the join field using the -1 and -2 options:
For example, this command outputs a concatenation of fields where field number 2 in the first file is the same as field number 1 in the second file. (using your example files)
Code:

join -1 2 -2 1 fileA fileB
The output is:
Code:

107 aaa 2 1
108 bbb 3 1
109 ccc 2 1

You can read this into a shell "while read" loop and perform whatever operations you like:
Code:

join -1 2 -2 1 fileA fileB | while read a b c d; do
    echo "for join field $a : $c + $d = $(($c + $d))"
done

And the output:
Code:

for join field 107 : 2 + 1 = 3
for join field 108 : 3 + 1 = 4
for join field 109 : 2 + 1 = 3
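
One caveat worth adding here: join expects both inputs to be sorted on the join field. The example files happen to be in order already, but for files that are not, something like the sketch below could sort each side first. The helper name join_sorted is made up for illustration, and the temporary files rely on mktemp being available, which is an assumption about the system.

```shell
# join_sorted FILE_A FILE_B   (hypothetical helper, not from the thread)
# Joins field 2 of FILE_A against field 1 of FILE_B, sorting both inputs
# on their join field first because join expects sorted input.
join_sorted() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    sort -k2,2 "$1" > "$tmp_a"    # file A sorted on its 2nd field
    sort -k1,1 "$2" > "$tmp_b"    # file B sorted on its 1st field
    join -1 2 -2 1 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

It would be called as: join_sorted fileA fileB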

If there is a lot of input data you would be better off doing any line-by-line operations in Awk or Perl because the shell's read and arithmetic operators are not very efficient:
Code:

join -1 2 -2 1 fileA fileB |
  awk '{ print "for join field " $1 " : " $3 " + " $4 " = " $3 + $4; }'
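
For the position asked about in the original post, the whole lookup can also be done in a single awk process. The sketch below assumes the same two-file layout as the examples (key in column 2 of file A, key in column 1 of file B); the helper name lookup_sum is made up. It also sidesteps what looks like the problem in the original script: in awk, $h means "field number h", while plain h is the variable itself, so count[$h] there is indexed by a field's contents rather than by the line number, and the pattern-less second block prints for every line of b.txt.

```shell
# lookup_sum FILE_A FILE_B   (hypothetical helper, not from the thread)
lookup_sum() {
    awk '
        # First file: remember every key found in column 2.
        NR == FNR { wanted[$2] = 1; next }
        # Second file: FNR is the line number within the current file.
        $1 in wanted {
            printf "%s found at line %d: %d + %d = %d\n", $1, FNR, $2, $3, $2 + $3
        }
    ' "$1" "$2"
}
```

With the example files this would be run as: lookup_sum fileA fileB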


ahjiefreak 12-12-2007 04:30 PM

Hi Matthew,

I agree with you. But the problem is that the first element of the first file (A.txt) is looked up and compared with the second file, namely B1.txt.

Then the next (second) element in A.txt is looked up and compared with another file, namely B2.txt.

If we use join, that would mean we need to join two files. Can it be done in this case, where, while I read A.txt line by line,
I join the first element to the first field of B1.txt? Then I can perform operations on that.

I doubt we can do that, because join still matches every element of the first file against B1.txt. What I would like to do is take just one element from A.txt at a time and match it.

Please advise.
Thanks.

-ahjiefreak

matthewg42 12-12-2007 06:24 PM

Aha, I mis-read the OP a little.

Do you have files named B1.txt, B2.txt, B3.txt etc., where the numerical component increments by 1 each time, and presumably as many B files as there are lines in A.txt?

Well, you could do it with something like the approach you took in the OP. However, I think this will perform pretty badly if you have a lot of lines in A.txt, because you will have to invoke several new processes per line of A.txt. Personally I'd switch to Perl for something like this, although awk is also a good choice.

Here's how I'd do it:
Code:

#!/usr/bin/perl

use strict;
use warnings;

my $n = 1;

# Line N of A.txt is looked up in the file BN.txt.
open(my $afh, '<', 'A.txt') || die "cannot open A.txt : $!\n";
while (my $aline = <$afh>) {
    chomp $aline;
    my @a = split(/\s+/, $aline);
    my $bfile = "B$n.txt";
    open(my $bfh, '<', $bfile) || die "cannot open $bfile : $!\n";
    while (my $bline = <$bfh>) {
        chomp $bline;
        my @b = split(/\s+/, $bline);
        if ( $b[0] eq $a[1] ) {
            # $. is the line number of the handle read last, here $bfh
            printf "found %s in %s at line %d. %d + %d = %d\n",
                $a[1], $bfile, $., $b[1], $b[2], $b[1]+$b[2];
        }
    }
    close($bfh);
    $n++;
}
close($afh);


ahjiefreak 12-13-2007 12:56 AM

Hi Matthew.

Thanks for the reply.

I tried a silly method for this kind of problem:-


#!/bin/sh -x

cat a.txt|while read LINE
do

char=`echo "${LINE}"| awk '{print $2}'`

#echo $char
i=`grep -n "^$char" b.txt|awk '{print $2}'`
j=`grep -n "^$char" b.txt|awk '{print $3}'`
k=`grep -n "$char" c.txt|awk '{print $1}'`


q=`echo $j/\( $i +$j\) | bc`

echo $i
echo $j
echo $k
echo $q
done

But I still face a problem:-

in k=`grep -n "$char" c.txt|awk '{print $1}'`

it does not grep only the exact number.

For example, when I try to grep the number 108:

++ awk '{print $1}'
+ k='2:
3:'

It gives me two values.


Do you or anyone know how we can use awk (instead of echo) to simplify the whole process? I have been confused and getting a headache thinking about this problem for a couple of days.


Please advise. Thanks.

-Jason
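
A likely reason for the two values is that the pattern in grep -n "$char" c.txt is not anchored, so any line containing those characters anywhere will match. One way to get an exact match is to compare a whole field in awk. The sketch below assumes the number sits in the first column of c.txt (the thread does not show that file's layout), and the helper name line_of is made up.

```shell
# line_of KEY FILE   (hypothetical helper, not from the thread)
# Prints the line number(s) in FILE whose first field equals KEY.
# Assumption: the key is in column 1; adjust $1 if it is elsewhere.
line_of() {
    awk -v key="$1" '$1 == key { print FNR }' "$2"
}
```

For example, k=$(line_of "$char" c.txt) would take the place of the grep and awk pipeline. If the number can appear in any column, grep -nw "$char" c.txt (whole-word matching, available in GNU grep) may be closer to what is wanted.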

ghostdog74 12-13-2007 02:07 AM

let's say that I roughly understood the requirement...
sample input:
Code:

# more file
aaa 107
bbb 108
ccc 109
# more file1
101 2 1
102 3 1
107 2 1
108 3 1
107 10 1
# more file2
101 2 1
102 3 1
107 5 1
108 3 1
109 2 1
# more file3
101 2 1
102 3 1
107 5 1
108 3 1
109 6 1

GNU awk
Code:

awk  'BEGIN{ i=0 }
NR==FNR{
      store[++c] = $2
      next
    }
{
    ++i
    while ( (getline line < FILENAME )> 0 ) {
      if  ( line ~ store[i] ) {
            print "Now I can do something with this line:  " line  " from file: " FILENAME
      }
    }
    nextfile
}
' file file1 file2 file3


output:
Code:

# ./test.sh
Now I can do something with this line:  107 2 1 from file: file1
Now I can do something with this line:  107 10 1 from file: file1
Now I can do something with this line:  108 3 1 from file: file2
Now I can do something with this line:  109 6 1 from file: file3
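
Note that line ~ store[i] is a regular-expression match against the whole line, so a key such as 107 would also match 1107 or a 107 in another column. A variant that compares only the first field exactly, and also reports the line number, might look like the sketch below. It uses only FNR and FILENAME, without getline or nextfile; the helper name match_per_file is made up, and it assumes none of the data files is empty (an empty file would put the keys out of step).

```shell
# match_per_file KEYFILE DATAFILE...   (hypothetical helper, not from the thread)
# The key on line N of KEYFILE (column 2) is looked up in the Nth DATAFILE.
match_per_file() {
    awk '
        # Key file: store column 2 of line N as key[N].
        NR == FNR { key[++n] = $2; next }
        # First line of each later file: move on to the next key.
        FNR == 1 { i++ }
        # Exact comparison on the first field instead of a regex on the line.
        i <= n && $1 == key[i] { print FILENAME ": line " FNR ": " $0 }
    ' "$@"
}
```

With the sample input above it would be called as: match_per_file file file1 file2 file3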


ahjiefreak 12-13-2007 03:57 AM

Hi,

Thanks for the input. I haven't tried it on my Linux box yet, as I am currently using my friend's PC.

However, I do not quite understand it at first glance. Do I have to open any file first, or just start with awk 'BEGIN...? (From my understanding, the second field of the first file is read and stored in the store array.)

Second, I assume FILENAME should be my second file? How do you deal with the different numbers at the end of file1, file2, file3, etc. when opening and comparing them?

And one more thing: in if(line~store[i]), is the whole line, one line at a time, automatically compared with store[i], which holds element two of the first file?

Sorry, I am quite new to bash shell and it seems complicated for me to understand the details. If you don't mind, could you either clarify my doubts or comment the code?

Thanks a lot. I really appreciate it. I will let you know once I try it out. Thanks.



-ahjiefreak

All times are GMT -5.