LinuxQuestions.org
Old 07-19-2017, 02:05 AM   #1
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Rep: Reputation: Disabled
[Solved] Search for exact strings from one file in another, and write them if they do not exist


Hi everyone, I am building a script that needs to read every line from file A and check whether that exact string exists in file B; if it does not exist in B, append it after B's last line.

ex: file A : temp.txt
Quote:
Sophia
Jacob
Isabella
Emma
Olivia
Ava
Emily
Abigail
Madison
Alexander
ex: file B : ok.txt

Quote:
Mia
Aiden
Chloe
Daniel
Elizabeth
Isabella
Anthony
Abigail
I need code that reads temp.txt line by line and checks whether each string exists in ok.txt.
If a string from temp.txt does not exist in ok.txt, it should be added to ok.txt.

Does anyone have an idea how to do this?

Last edited by pedropt; 07-19-2017 at 08:07 AM.
 
Old 07-19-2017, 02:21 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,910

Rep: Reputation: 7318
I think grep already has that feature; check its man page.
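A minimal sketch of that grep approach, using the file names from the question (-F fixed strings, -x whole-line match, -v invert the match, -f read patterns from a file):

```shell
# Collect the lines of temp.txt that do not appear verbatim in ok.txt,
# then append them. The intermediate file avoids reading ok.txt and
# appending to it in the same pipeline.
grep -Fxv -f ok.txt temp.txt > new.txt
cat new.txt >> ok.txt
rm -f new.txt
```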
 
Old 07-19-2017, 03:03 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,328
Blog Entries: 3

Rep: Reputation: 3726
If the files are both sorted you might look at comm. See the options -1, -2, and -3, which suppress columns so you can isolate the lines unique to one of the files.

Code:
man comm
You might also have to clean up the output with sed to remove the leading tabs, or use --output-delimiter='' to suppress the column delimiter completely.
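A sketch with the file names from the question. comm needs sorted input; -1 and -3 suppress the columns for lines unique to the first file and lines common to both, leaving only the lines unique to the second file, with no leading tabs to clean up. This relies on bash process substitution:

```shell
# Lines that appear only in temp.txt, appended to ok.txt.
comm -13 <(sort ok.txt) <(sort temp.txt) >> ok.txt
```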
 
Old 07-19-2017, 08:03 AM   #4
pedropt
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 345

Original Poster
Rep: Reputation: Disabled
Well, I was expecting an example to guide me.

However, for those who may need it, the best way I found to do it is:

Append all the data from the first file to the final file (it does not matter if lines are repeated):

cat (your data) >> (final file)

Then use awk to find duplicate lines and remove them:

awk '!a[$0]++' (final file) > (temp file)

After this point all you have to do is replace the final file with the temp file:

mv (temp file) (final file)

Quote:
finalf=foo.txt
tempf=temp.txt
cat "$tempf" >> "$finalf"
awk '!a[$0]++' "$finalf" > temp1.txt
mv temp1.txt "$finalf"

Last edited by pedropt; 07-19-2017 at 08:10 AM.
 
Old 07-19-2017, 08:07 AM   #5
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,910

Rep: Reputation: 7318
How is this related to your original question (read every line from file A, check whether that exact string exists in file B, and if it does not, write it at the end of B)?
 
1 member found this post helpful.
Old 07-19-2017, 08:18 AM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,328
Blog Entries: 3

Rep: Reputation: 3726
Quote:
Originally Posted by pedropt View Post
then use awk to search for duplicate lines and remove them
That's one way. Another would be to use sort with the -u option to remove duplicates; see "man sort". Most methods I can think of rely on sorted data. Here's one with comm, which also needs sorted input:

Code:
comm -2 --output-delimiter='' <(sort a.data) <(sort b.data);
The process substitution works in shells such as bash, ksh, and zsh.
In other shells, you would have to pre-sort the data files.
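For completeness, a sketch of the sort -u route with the thread's file names. Note that the result is sorted, so the original order of ok.txt is not preserved:

```shell
# Merge both files, drop duplicate lines, and replace ok.txt.
sort -u ok.txt temp.txt > merged.txt
mv merged.txt ok.txt
```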
 
Old 07-19-2017, 10:02 AM   #7
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 634

Rep: Reputation: 316
If the order of the original data is not important, sort -u is the best answer.

If the order of the original data is important and there is a large amount of it, you can try this with Python.
It takes the first argument (sys.argv[1]) and uses it as the file you are comparing the others to.
For each other file (given either as arguments or on stdin), it reads each line and, if the line is not in the set built from the comparison file, prints it out.
Output and processing are interleaved.

Code:
#!/usr/bin/env python3

import sys

if len(sys.argv) < 2:
    raise ValueError('1st Argument required to compare files to!')
with open(sys.argv[1]) as compare_file:
    compare_data = {line.strip() for line in compare_file}
 
if len(sys.argv) > 2:
    data = (line.strip() 
            for data in sys.argv[2:] 
            for line in open(data)
           )
else:
    data = (line.strip()
            for line in sys.stdin
           )

for line in data:
    if line not in compare_data:
        print(line)
        compare_data.add(line)
Code:
$ cat temp.txt | ./line_compare.py ok.txt 
Sophia
Jacob
Emma
Olivia
Ava
Emily
Madison
Alexander

$ ./line_compare.py ok.txt temp.txt 
Sophia
...

$ ./line_compare.py ok.txt < temp.txt 
Sophia
...

$ ./line_compare.py ok.txt < temp.txt >> ok.txt  # Very convenient!
$ cat ok.txt
Mia
Aiden
Chloe
Daniel
Elizabeth
Isabella
Anthony
Abigail 
Sophia
Jacob
Emma
Olivia
Ava
Emily
Madison
Alexander
Extra:

I examined some of the other methods. While they are fine with smaller sets of data, there are issues with larger ones.
I duplicated temp.txt until it was 3.2 GB in size (redundant data), and I ran these on a tmpfs.

Code:
time comm -2 --output-delimiter='' <(sort ok.txt) <(sort temp.txt); # Failed due to large (> 1gb) writes to /tmp and Memory usage > 7.8Gb
sort: write failed: /tmp/sortVOihEE: No space left on device
sort: write failed: /tmp/sort1hnMAt: No space left on device

Code:
time ./line_compare.py ok.txt temp.txt # Memory usage 3.9M
real	1m55.708s
Code:
cat ok.txt temp.txt > ok2.txt
time sort -u ok2.txt # Memory usage > 9Gb
Abigail
Abigail # Interesting double entry?
Aiden
Alexander
real	1m39.887s
Code:
time awk '!a[$0]++' ok2.txt # 3.2k Memory usage
real	1m5.616s
Awk seems to have come out on top; however, copying a file (especially a larger one) can be slow.

Last edited by Sefyir; 07-19-2017 at 10:04 AM.
 
1 member found this post helpful.
Old 07-19-2017, 10:42 AM   #8
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 143
Perl is very good at such things:

Code:
#!/usr/bin/perl

$file1 = "temp.txt";
$file2 = "ok.txt";

open FILE, $file1 or die "Can't open file $file1: $!\n";

while ($_ = <FILE>) {
   chomp;
   $hash{$_} = 1;
}

close FILE;

open FILE2, $file2 or die "Can't open file $file2: $!\n";

while ($_ = <FILE2>) {
   chomp;
   $hash2{$_} = 1;
}

close FILE2;

open FILE2, ">>$file2" or die "Can't open file $file2 for writing: $!\n";

for (keys %hash) {
   if (! exists $hash2{$_}) {
      print FILE2 $_, "\n";
   }
}

close FILE2;
exit(0);
 
1 member found this post helpful.
Old 07-19-2017, 11:11 AM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by pedropt View Post
Well , i was expecting an example to guide me .
I will say, tactfully, that you will get better answers when you post better questions.

Your original post contained two sample inputs (that's good) but no sample output.

Daniel B. Martin
 
Old 07-19-2017, 11:19 AM   #10
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by pedropt View Post
Code:
finalf=foo.txt
tempf=temp.txt
cat "$tempf" >> "$finalf"
awk '!a[$0]++' "$finalf" > temp1.txt
mv temp1.txt "$finalf"
Consider this as a streamlined version of your method...
Code:
awk '!a[$0]++' $Temp $OK >$OutFile
Daniel B. Martin
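One detail worth noting: awk keeps the first occurrence of each line in the order the files are read, so if the intent is to keep ok.txt's existing lines in place and have only the new names follow them, listing ok.txt first may be closer to the original goal (a sketch with the thread's file names):

```shell
# First occurrence wins: ok.txt's lines keep their positions and
# any new lines from temp.txt are appended after them.
awk '!a[$0]++' ok.txt temp.txt > combined.txt
mv combined.txt ok.txt
```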
 
Old 07-19-2017, 11:31 AM   #11
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by danielbmartin View Post
I will say, tactfully, that you will get better answers when you post better questions.

Your original post contained two sample inputs (that's good) but no sample output.

Daniel B. Martin
I agree and support this feedback.

Post #4 seems to give the information which should have been offered in the first post.

I also get the impression that there are a number of highly similar questions from pedropt, all related to scripts that search files to qualify text for substitution, replacement, or deletion.

@pedropt: It would also be more helpful to truly mark the thread as solved rather than changing the thread title. I see former threads marked as solved, so I know you can do this. In one thread you edited the title to call it closed. Please use the Thread Tools menu to properly mark your threads as Solved, and if you would like a Closed option, please use the LQ Questions and Feedback forum to ask Jeremy for that type of feature.
 
Old 07-19-2017, 12:00 PM   #12
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 143Reputation: 143
Yes, some more guidance on the output and on the handling of double (or more) entries would have been helpful.

My Perl program ignores duplicates, but it could easily be modified to handle them if necessary.

It would also be memory intensive with huge files, but that could be handled by tying a hash to a database file such as a simple Berkeley DB, or even a SQL database like MySQL or Oracle.
 
Old 07-19-2017, 12:59 PM   #13
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,805

Rep: Reputation: 1206
The following works for me
Code:
fgrep -vxf ok.txt temp.txt >> ok.txt
No intermediate file. Can there be a race condition with an unpredictable result?
--
After some brainstorming:
I think this always works, because fgrep needs to read ok.txt in full before it processes temp.txt and eventually writes anything.

Last edited by MadeInGermany; 07-19-2017 at 01:12 PM.
 
1 member found this post helpful.
Old 07-19-2017, 01:30 PM   #14
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 143
Quote:
Originally Posted by MadeInGermany View Post
The following works for me
Code:
fgrep -vxf ok.txt temp.txt >> ok.txt
No intermediate file. Can there be a race condition with an unpredictable result?
--
After some brainstorming:
I think this always works, because fgrep needs to read ok.txt in full before it processes temp.txt and eventually writes anything.
Very clever one-liner! Congrats...

The only problem I see is with huge files: there seems to be no way to back it with database files if necessary. But for the example given, it's great.
 
  

