[Solved] search for exact strings from one file to another and write if not exist
Hi everyone, I am building a script that needs to read every line from file A and check whether that exact string exists in file B; if it does not exist in B, then append it to the end of B.
Example file A: temp.txt
Quote:
Sophia
Jacob
Isabella
Emma
Olivia
Ava
Emily
Abigail
Madison
Alexander
Example file B: ok.txt
Quote:
Mia
Aiden
Chloe
Daniel
Elizabeth
Isabella
Anthony
Abigail
I need code that reads temp.txt line by line and checks whether that string exists in ok.txt.
If a string from temp.txt does not exist in ok.txt, then append it to ok.txt.
how is this related to your original question (start reading every line from file A and check if that exact string exists on File B, in case exact string does not exists on B then write on it at the end of last line)?
then use awk to search for duplicate lines and remove them
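Something along these lines, for example (a sketch only; it assumes the combined, de-duplicated result is meant to replace ok.txt, and the temporary name ok.new is just an illustration):
Code:
# concatenate both files, keep only the first occurrence of each line,
# then replace ok.txt with the result
cat ok.txt temp.txt | awk '!seen[$0]++' > ok.new && mv ok.new ok.txt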
That's one way. Another would be to use sort with the -u option to remove duplicates; see "man sort". Most methods I can think of rely on sorted data. Here's one with comm, which also needs sorted data:
Code:
# print the lines that appear in temp.txt but not in ok.txt (comm requires sorted input)
comm -13 <(sort ok.txt) <(sort temp.txt)
The process substitution works in the more capable shells: bash, ksh, zsh.
In other shells, you would have to pre-sort the data files first.
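In a plain POSIX sh that could look something like this (hypothetical temporary file names):
Code:
# pre-sort both files into temporary files, then print the lines
# that are only in temp.txt (no process substitution needed)
sort ok.txt > ok.sorted
sort temp.txt > temp.sorted
comm -13 ok.sorted temp.sorted
rm ok.sorted temp.sorted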
If the order of the original data is not important, sort -u is the best answer.
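For example (a minimal sketch; ok.new is just a hypothetical temporary name):
Code:
# merge both files, drop duplicate lines, and replace ok.txt with the sorted result
sort -u ok.txt temp.txt > ok.new && mv ok.new ok.txt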
If the order of the original data is important and there is a large amount of it, you can try this with Python:
It takes the first argument (sys.argv[1]) and uses that as the file to compare the others against.
For each remaining file (given as further arguments, or stdin if there are none), it reads each line and prints it if it is not in the set built from the comparison file.
Output and processing are interleaved.
Code:
#!/usr/bin/env python3
import sys

if len(sys.argv) < 2:
    raise ValueError('1st argument required: the file to compare against!')

# build a set of the lines already present in the comparison file
with open(sys.argv[1]) as compare_file:
    compare_data = {line.strip() for line in compare_file}

# remaining arguments are input files; otherwise read from stdin
if len(sys.argv) > 2:
    data = (line.strip()
            for name in sys.argv[2:]
            for line in open(name))
else:
    data = (line.strip() for line in sys.stdin)

# print each line that is not already known, remembering it so that
# duplicates within the input are only printed once
for line in data:
    if line not in compare_data:
        print(line)
        compare_data.add(line)
Code:
$ cat temp.txt | ./line_compare.py ok.txt
Sophia
Jacob
Emma
Olivia
Ava
Emily
Madison
Alexander
$ ./line_compare.py ok.txt temp.txt
Sophia
...
$ ./line_compare.py ok.txt < temp.txt
Sophia
...
$ ./line_compare.py ok.txt < temp.txt >> ok.txt # Very convenient!
$ cat ok.txt
Mia
Aiden
Chloe
Daniel
Elizabeth
Isabella
Anthony
Abigail
Sophia
Jacob
Emma
Olivia
Ava
Emily
Madison
Alexander
Extra:
I examined some of the other methods. They are fine with smaller sets of data, but there are issues with larger ones.
I duplicated temp.txt until it was 3.2 GB in size (redundant data) and ran the tests on a tmpfs.
Code:
time comm -2 --output-delimiter='' <(sort ok.txt) <(sort temp.txt); # Failed due to large (> 1 GB) writes to /tmp and memory usage > 7.8 GB
sort: write failed: /tmp/sortVOihEE: No space left on device
sort: write failed: /tmp/sort1hnMAt: No space left on device
Code:
time ./line_compare.py ok.txt temp.txt # Memory usage 3.9 MB
real 1m55.708s
Code:
cat ok.txt temp.txt > ok2.txt
time sort -u ok2.txt # Memory usage > 9 GB
Abigail
Abigail # Interesting double entry?
Aiden
Alexander
real 1m39.887s
Code:
time awk '!a[$0]++' ok2.txt # Memory usage 3.2 KB
real 1m5.616s
Awk seems to have come out on top; however, copying a file (especially a larger one) can be slow.
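One way to avoid rewriting the whole file is to append only the missing lines, for example with a two-pass awk (a sketch, not one of the commands benchmarked above):
Code:
# first pass (NR==FNR) records every line of ok.txt; second pass prints the
# lines of temp.txt that were not recorded and have not been printed yet,
# and the shell appends them to ok.txt
awk 'NR==FNR {seen[$0]=1; next} !seen[$0]++' ok.txt temp.txt >> ok.txt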
Code:
#!/usr/bin/perl
# read temp.txt and ok.txt into hashes, then append to ok.txt
# every line of temp.txt that is not already present in ok.txt
$file1 = "temp.txt";
$file2 = "ok.txt";

open FILE, $file1 or die "Can't open file $file1: $!\n";
while ($_ = <FILE>) {
    chomp;
    $hash{$_} = 1;
}
close FILE;

open FILE2, $file2 or die "Can't open file $file2: $!\n";
while ($_ = <FILE2>) {
    chomp;
    $hash2{$_} = 1;
}
close FILE2;

open FILE2, ">>$file2" or die "Can't open file $file2 for writing: $!\n";
for (keys %hash) {
    if (! exists $hash2{$_}) {
        print FILE2 $_, "\n";
    }
}
close FILE2;
exit(0);
I will say, tactfully, that you will get better answers when you post better questions.
Your original post contained two sample inputs (that's good) but no sample output.
Daniel B. Martin
I agree and support this feedback.
Post #4 seems to give the information which should have been offered in the first post.
I also get the impression that there are a number of highly similar questions from pedropt, all about scripts that search files in order to qualify text for substitution, replacement, or deletion.
@pedropt: It would also be more helpful to truly mark the thread as solved rather than changing the thread title. I see former threads of yours marked as solved, so I know you can do this. In one thread you edited the title to call it closed. Please use the Thread Tools menu to properly mark your threads as Solved. If you would like a "closed" option, please use the LQ Questions and Feedback forum to ask Jeremy for that type of feature.
Yes, some more guidance on the output and the handling of double (or more) entries would have been helpful.
My Perl program ignores duplicates, but could easily be modified to handle them if necessary.
It also would be memory-intensive if you have huge files, but that could be handled by tying a hash to a database such as a simple Berkeley DB, or to a SQL database like MySQL or even Oracle.
No intermediate file. Can there be a race condition with an unpredictable result?
--
After some brainstorming:
I think this always works, because fgrep needs to read ok.txt in full before it processes temp.txt and eventually writes anything.
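The one-liner itself is not quoted here; judging from the description, it would be something along the lines of the following (a reconstruction, not necessarily the exact command that was posted):
Code:
# append to ok.txt every line of temp.txt that does not already occur as an
# exact whole line in ok.txt; fgrep (grep -F) reads the whole pattern file
# ok.txt before it starts reading temp.txt, so the appended output cannot
# feed new patterns back into the match
fgrep -vxf ok.txt temp.txt >> ok.txt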
Last edited by MadeInGermany; 07-19-2017 at 01:12 PM.
Very clever one-liner! Congrats...
The only problem I see is with huge files: there seems to be no way to back it with database files if necessary. But for the example given, it's great.