[Solved] search for exact strings from one file to another and write if not exist
Hi everyone, I am building a script that needs to read every line from file A and check whether that exact string exists in file B; if it does not exist in B, then append it to the end of B.
Example file A: temp.txt
Quote:
Sophia
Jacob
Isabella
Emma
Olivia
Ava
Emily
Abigail
Madison
Alexander
Example file B: ok.txt
Quote:
Mia
Aiden
Chloe
Daniel
Elizabeth
Isabella
Anthony
Abigail
I need code that reads temp.txt line by line and checks whether that string exists in ok.txt.
If a string from temp.txt does not exist in ok.txt, then append it to ok.txt.
how is this related to your original question (start reading every line from file A and check if that exact string exists on File B, in case exact string does not exists on B then write on it at the end of last line)?
then use awk to search for duplicate lines and remove them
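Something along these lines, for example (a sketch only; it assumes the combined, de-duplicated result is meant to replace ok.txt, and the temporary name ok.new is just an illustration):
Code:
# concatenate both files, keep only the first occurrence of each line,
# then replace ok.txt with the result
cat ok.txt temp.txt | awk '!seen[$0]++' > ok.new && mv ok.new ok.txt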
That's one way. Another would be to use sort with the -u option to remove duplicates; see "man sort". Most methods I can think of rely on sorted data. Here's one with comm, which also needs sorted data:
Code:
# print the lines that appear in temp.txt but not in ok.txt (comm requires sorted input)
comm -13 <(sort ok.txt) <(sort temp.txt)
The process substitution works in the more capable shells: bash, ksh, zsh.
In other shells, you would have to pre-sort the data files first.
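In a plain POSIX sh that could look something like this (hypothetical temporary file names):
Code:
# pre-sort both files into temporary files, then print the lines
# that are only in temp.txt (no process substitution needed)
sort ok.txt > ok.sorted
sort temp.txt > temp.sorted
comm -13 ok.sorted temp.sorted
rm ok.sorted temp.sorted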
If the order of the original data is not important, sort -u is the best answer.
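For example (a minimal sketch; ok.new is just a hypothetical temporary name):
Code:
# merge both files, drop duplicate lines, and replace ok.txt with the sorted result
sort -u ok.txt temp.txt > ok.new && mv ok.new ok.txt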
If the order of the original data is important and there is a large amount of it, you can try this with Python:
It takes the first argument (sys.argv[1]) and uses that as the file to compare the others against.
For each remaining file (given as further arguments, or stdin if there are none), it reads each line and prints it if it is not in the set built from the comparison file.
Output and processing are interleaved.
Code:
#!/usr/bin/env python3
import sys

if len(sys.argv) < 2:
    raise ValueError('1st argument required: the file to compare against!')

# build a set of the lines already present in the comparison file
with open(sys.argv[1]) as compare_file:
    compare_data = {line.strip() for line in compare_file}

# remaining arguments are input files; otherwise read from stdin
if len(sys.argv) > 2:
    data = (line.strip()
            for name in sys.argv[2:]
            for line in open(name))
else:
    data = (line.strip() for line in sys.stdin)

# print each line that is not already known, remembering it so that
# duplicates within the input are only printed once
for line in data:
    if line not in compare_data:
        print(line)
        compare_data.add(line)
Code:
$ cat temp.txt | ./line_compare.py ok.txt
Sophia
Jacob
Emma
Olivia
Ava
Emily
Madison
Alexander
$ ./line_compare.py ok.txt temp.txt
Sophia
...
$ ./line_compare.py ok.txt < temp.txt
Sophia
...
$ ./line_compare.py ok.txt < temp.txt >> ok.txt # Very convenient!
$ cat ok.txt
Mia
Aiden
Chloe
Daniel
Elizabeth
Isabella
Anthony
Abigail
Sophia
Jacob
Emma
Olivia
Ava
Emily
Madison
Alexander
Extra:
I examined some of the other methods. They are fine with smaller sets of data, but there are issues with larger ones.
I duplicated temp.txt until it was 3.2 GB in size (redundant data) and ran the tests on a tmpfs.
Code:
time comm -2 --output-delimiter='' <(sort ok.txt) <(sort temp.txt); # Failed due to large (> 1 GB) writes to /tmp and memory usage > 7.8 GB
sort: write failed: /tmp/sortVOihEE: No space left on device
sort: write failed: /tmp/sort1hnMAt: No space left on device
Code:
time ./line_compare.py ok.txt temp.txt # Memory usage 3.9 MB
real 1m55.708s
Code:
cat ok.txt temp.txt > ok2.txt
time sort -u ok2.txt # Memory usage > 9 GB
Abigail
Abigail # Interesting double entry?
Aiden
Alexander
real 1m39.887s
Code:
time awk '!a[$0]++' ok2.txt # Memory usage 3.2 KB
real 1m5.616s
Awk seems to have come out on top; however, copying a file (especially a larger one) can be slow.
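One way to avoid rewriting the whole file is to append only the missing lines, for example with a two-pass awk (a sketch, not one of the commands benchmarked above):
Code:
# first pass (NR==FNR) records every line of ok.txt; second pass prints the
# lines of temp.txt that were not recorded and have not been printed yet,
# and the shell appends them to ok.txt
awk 'NR==FNR {seen[$0]=1; next} !seen[$0]++' ok.txt temp.txt >> ok.txt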
Code:
#!/usr/bin/perl
# read temp.txt and ok.txt into hashes, then append to ok.txt
# every line of temp.txt that is not already present in ok.txt
$file1 = "temp.txt";
$file2 = "ok.txt";

open FILE, $file1 or die "Can't open file $file1: $!\n";
while ($_ = <FILE>) {
    chomp;
    $hash{$_} = 1;
}
close FILE;

open FILE2, $file2 or die "Can't open file $file2: $!\n";
while ($_ = <FILE2>) {
    chomp;
    $hash2{$_} = 1;
}
close FILE2;

open FILE2, ">>$file2" or die "Can't open file $file2 for writing: $!\n";
for (keys %hash) {
    if (! exists $hash2{$_}) {
        print FILE2 $_, "\n";
    }
}
close FILE2;
exit(0);
I will say, tactfully, that you will get better answers when you post better questions.
Your original post contained two sample inputs (that's good) but no sample output.
Daniel B. Martin
I agree and support this feedback.
Post #4 seems to give the information which should have been offered in the first post.
I also get the impression that there are a number of highly similar questions from pedropt, all about scripts that search files in order to qualify text for substitution, replacement, or deletion.
@pedropt: It would also be more helpful to truly mark the thread as solved rather than changing the thread title. I see former threads of yours marked as solved, so I know you can do this. In one thread you edited the title to call it closed. Please use the Thread Tools menu to properly mark your threads as Solved. If you would like a "closed" option, please use the LQ Questions and Feedback forum to ask Jeremy for that type of feature.
Yes, some more guidance on the output and the handling of double (or more) entries would have been helpful.
My Perl program ignores duplicates, but could easily be modified to handle them if necessary.
It also would be memory-intensive if you have huge files, but that could be handled by tying a hash to a database such as a simple Berkeley DB, or to a SQL database like MySQL or even Oracle.
No intermediate file. Can there be a race condition with an unpredictable result?
--
After some brainstorming:
I think this always works, because fgrep needs to read ok.txt in full before it processes temp.txt and eventually writes anything.
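The one-liner itself is not quoted here; judging from the description, it would be something along the lines of the following (a reconstruction, not necessarily the exact command that was posted):
Code:
# append to ok.txt every line of temp.txt that does not already occur as an
# exact whole line in ok.txt; fgrep (grep -F) reads the whole pattern file
# ok.txt before it starts reading temp.txt, so the appended output cannot
# feed new patterns back into the match
fgrep -vxf ok.txt temp.txt >> ok.txt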
Last edited by MadeInGermany; 07-19-2017 at 01:12 PM.
Very clever one-liner! Congrats...
The only problem I see is with huge files: there seems to be no way to back it with database files if necessary. But for the example given, it's great.