Bash script to find and remove similar lines from multiple files
Hello Folks!
I want to remove duplicate or similar lines from multiple files. That is, if I have four files, file1.txt, file2.txt, file3.txt, and file4.txt, I would like to find the similar lines across all of them and delete them, keeping only one copy of each.
I only know that uniq can be used to remove duplicate lines from a single sorted file. Is there anything like this for multiple files?
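(For reference, the single-file approach referred to above would be something like the following; deduped.txt is just an illustrative name.)
Code:
sort file1.txt | uniq > deduped.txt
# or, equivalently:
sort -u file1.txt > deduped.txt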
Try the comm command. It can compare two sorted files at a time, so you have to store the temporary results somewhere until you have compared the last file. See man comm for details.
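In case it helps, comm splits two sorted files into three sets: lines only in the first, lines only in the second, and lines common to both. A minimal sketch of one pairwise step (the temporary file names are made up):
Code:
sort file1.txt > f1.srt
sort file2.txt > f2.srt
comm -23 f1.srt f2.srt > only1.txt    # lines only in file1
comm -13 f1.srt f2.srt > only2.txt    # lines only in file2
comm -12 f1.srt f2.srt > shared.txt   # one copy of the lines in both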
But the comm command is useful only for two files, and it will not remove the duplicated lines from the files themselves. Is there some script (sed and awk?) to find and remove similar lines from multiple (more than 2) files?
Where do you want to store the unique lines? If you want to keep them in one of the files in which they appear, you have to establish a rule to decide from which files you want to delete them. If you want to store all the lines together in a new file, you can try something like this:
Code:
awk '{array[$0]=1} END{ for (i in array) print i}' file?.txt > output_file
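One thing to be aware of: for (i in array) does not preserve the original line order. If order matters, the classic awk idiom below prints only the first occurrence of each line, in input order (same file names assumed):
Code:
awk '!seen[$0]++' file?.txt > output_file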
You need to explain more; I don't get what you want. Try an example or something. I would think you could use sort on all the files and then pipe to uniq... but I'm still not sure what you really want.
I can't show a script yet, but I think I can tell you my concept.
The simple solution I know is to walk through every line of all the files; anything that matches after that line (within the current file and all the following files) gets deleted. Note that you no longer have to compare against the previous lines, since they are already unique. That's all I know.
There are many ways to apply this concept. You could use bash (with arrays), sed, or many other languages, but I guess awk is the best fit for this; see the sketch below.
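A minimal sketch of that concept in awk, assuming the four file names from the original question (the .new suffix is my own choice): every line is kept only the first time it appears anywhere, and each file is rewritten without the later repeats.
Code:
awk '
    FNR == 1 {                      # starting a new input file
        if (out != "") close(out)   # close the previous output file, if any
        out = FILENAME ".new"
        printf "" > out             # create the output even if every line is a repeat
    }
    !seen[$0]++ { print > out }     # keep a line only on its first appearance anywhere
' file1.txt file2.txt file3.txt file4.txt
# inspect the .new files, then move them over the originals, e.g.:
# for f in file?.txt; do mv "$f.new" "$f"; done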
Last edited by konsolebox; 06-07-2009 at 09:38 PM.
Reason: meaning
Where do you want to store the unique lines? If you want to keep them in one of the files in which they appear,
The unique lines may be stored in any of the original files. BUT a line in one file should not be repeated in the other files, and all such repeated lines should be deleted from all the files except any one of them.
I do not want to create a single output file, as it would be too large to hold all the data and would make later handling quite difficult.
If Perl works for you, you could write a script like this:
Code:
use strict;
use warnings;

my %linehash;                                  # lines already seen
open my $list, '<', 'filelist.txt' or die $!;  # a file with one input file name per line
open my $out,  '>', 'output.txt'   or die $!;
while ( my $file = <$list> ) {
    chomp $file;
    open my $current, '<', $file or die "$file: $!";
    while ( my $line = <$current> ) {
        if ( !$linehash{$line} ) {             # first time this line is seen
            $linehash{$line} = 1;
            print $out $line;
        }
    }
    close $current;
}
That's the general idea: a hash keeps track of which lines have already been seen, so each line is written only once.
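One possible way to drive it, assuming the script is saved as dedup.pl and using the file names from the original question (both names are mine):
Code:
printf '%s\n' file1.txt file2.txt file3.txt file4.txt > filelist.txt
perl dedup.pl    # unique lines are collected in output.txt
Note that this gathers everything into one output file, which may run up against the requirement above of not creating a single large file.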
Code:
#!/bin/bash
# get all lines which appear more than once
sort file*.txt |
awk '$0 == prev_line { print } { prev_line = $0 }' |
sort -u > dups.txt
# loop over the files
for f in file*.txt; do
    # remove dups within the file
    sort -u "$f" > "$f.tmp"
    # are there any dups done already?
    if test -f dups_done.txt; then
        # the "new" (not already done) dups
        comm -23 dups.txt dups_done.txt > dups_new.txt
        # the lines appearing only in this file
        comm -23 "$f.tmp" dups.txt > "$f.srt"
        # add the "new" dups to this file
        comm -12 "$f.tmp" dups_new.txt >> "$f.srt"
        # unique sort of the file
        sort -u "$f.srt" > "$f.out"
        # all dups in this file are done now
        comm -12 "$f.out" dups.txt > dups_done.tmp
        # add the formerly done dups
        cat dups_done.txt >> dups_done.tmp
        # create a sorted, unique done file
        sort -u dups_done.tmp > dups_done.txt
    else
        # no dups done yet: create the done file
        comm -12 "$f.tmp" dups.txt > dups_done.txt
        mv "$f.tmp" "$f.out"
    fi
done
# a little bit of cleanup
rm dups* *.tmp *.srt 2>/dev/null
exit 0
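If I read the script right, each cleaned file ends up sorted in file*.txt.out: lines unique to a file stay in that file, and each duplicated line survives only in the first file (in glob order) that contains it. You would presumably replace the originals afterwards, e.g.:
Code:
for f in file*.txt; do mv "$f.out" "$f"; done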