Bash script to find and remove similar lines from multiple files
Hello Folks!
I want to remove duplicate or similar lines from multiple files. That is, if I have four files, file1.txt, file2.txt, file3.txt, and file4.txt, I would like to find the similar lines across all of them and delete them, keeping only one copy of each.
I only know that uniq can be used to remove duplicate lines from a single sorted file. Is there anything like this for multiple files?
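(For reference, the single-file approach referred to above would be something like the following; deduped.txt is just an illustrative name.)
Code:
sort file1.txt | uniq > deduped.txt
# or, equivalently:
sort -u file1.txt > deduped.txt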
Try the comm command. It can compare two sorted files at a time, so you have to store the temporary results somewhere until you have compared the last file. See man comm for details.
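In case it helps, comm splits two sorted files into three sets: lines only in the first, lines only in the second, and lines common to both. A minimal sketch of one pairwise step (the temporary file names are made up):
Code:
sort file1.txt > f1.srt
sort file2.txt > f2.srt
comm -23 f1.srt f2.srt > only1.txt    # lines only in file1
comm -13 f1.srt f2.srt > only2.txt    # lines only in file2
comm -12 f1.srt f2.srt > shared.txt   # one copy of the lines in both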
But the comm command is useful only for two files, and it will not remove the duplicated lines from the files themselves. Is there some script (sed and awk?) to find and remove similar lines from multiple (more than 2) files?
Where do you want to store the unique lines? If you want to keep them in one of the files in which they appear, you have to establish a rule to decide from which files you want to delete them. If you want to store all the lines together in a new file, you can try something like this:
Code:
awk '{array[$0]=1} END{ for (i in array) print i}' file?.txt > output_file
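One thing to be aware of: for (i in array) does not preserve the original line order. If order matters, the classic awk idiom below prints only the first occurrence of each line, in input order (same file names assumed):
Code:
awk '!seen[$0]++' file?.txt > output_file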
You need to explain more; I don't get what you want. Try an example or something. I would think you could use sort on all the files and then pipe to uniq... but I'm still not sure what you really want.
I can't show a script yet, but I think I can tell you my concept.
The simple solution I know is to walk through every line of all the files; anything that matches after that line (within the current file and all the following files) gets deleted. Note that you no longer have to compare against the previous lines, since they are already unique. That's all I know.
There are many ways to apply this concept. You could use bash (with arrays), sed, or many other languages, but I guess awk is the best fit for this; see the sketch below.
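A minimal sketch of that concept in awk, assuming the four file names from the original question (the .new suffix is my own choice): every line is kept only the first time it appears anywhere, and each file is rewritten without the later repeats.
Code:
awk '
    FNR == 1 {                      # starting a new input file
        if (out != "") close(out)   # close the previous output file, if any
        out = FILENAME ".new"
        printf "" > out             # create the output even if every line is a repeat
    }
    !seen[$0]++ { print > out }     # keep a line only on its first appearance anywhere
' file1.txt file2.txt file3.txt file4.txt
# inspect the .new files, then move them over the originals, e.g.:
# for f in file?.txt; do mv "$f.new" "$f"; done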
Last edited by konsolebox; 06-07-2009 at 09:38 PM.
Reason: meaning
Where do you want to store the unique lines? If you want to keep them in one of the files in which they appear,
The unique lines may be stored in any of the original files. BUT a line in one file should not be repeated in the other files, and all such repeated lines should be deleted from all the files except any one of them.
I do not want to create a single output file, as it would be too large to hold all the data and would make later handling quite difficult.
If Perl works for you, you could write a script like this:
Code:
use strict;
use warnings;

my %linehash;                                  # lines already seen
open my $list, '<', 'filelist.txt' or die $!;  # a file with one input file name per line
open my $out,  '>', 'output.txt'   or die $!;
while ( my $file = <$list> ) {
    chomp $file;
    open my $current, '<', $file or die "$file: $!";
    while ( my $line = <$current> ) {
        if ( !$linehash{$line} ) {             # first time this line is seen
            $linehash{$line} = 1;
            print $out $line;
        }
    }
    close $current;
}
That's the general idea: a hash keeps track of which lines have already been seen, so each line is written only once.
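One possible way to drive it, assuming the script is saved as dedup.pl and using the file names from the original question (both names are mine):
Code:
printf '%s\n' file1.txt file2.txt file3.txt file4.txt > filelist.txt
perl dedup.pl    # unique lines are collected in output.txt
Note that this gathers everything into one output file, which may run up against the requirement above of not creating a single large file.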
Code:
#!/bin/bash
# get all lines which appear more than once
sort file*.txt |
awk '$0 == prev_line { print } { prev_line = $0 }' |
sort -u > dups.txt
# loop over the files
for f in file*.txt; do
    # remove dups within the file
    sort -u "$f" > "$f.tmp"
    # are there any dups done already?
    if test -f dups_done.txt; then
        # the "new" (not already done) dups
        comm -23 dups.txt dups_done.txt > dups_new.txt
        # the lines appearing only in this file
        comm -23 "$f.tmp" dups.txt > "$f.srt"
        # add the "new" dups to this file
        comm -12 "$f.tmp" dups_new.txt >> "$f.srt"
        # unique sort of the file
        sort -u "$f.srt" > "$f.out"
        # all dups in this file are done now
        comm -12 "$f.out" dups.txt > dups_done.tmp
        # add the formerly done dups
        cat dups_done.txt >> dups_done.tmp
        # create a sorted, unique done file
        sort -u dups_done.tmp > dups_done.txt
    else
        # no dups done yet: create the done file
        comm -12 "$f.tmp" dups.txt > dups_done.txt
        mv "$f.tmp" "$f.out"
    fi
done
# a little bit of cleanup
rm dups* *.tmp *.srt 2>/dev/null
exit 0
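If I read the script right, each cleaned file ends up sorted in file*.txt.out: lines unique to a file stay in that file, and each duplicated line survives only in the first file (in glob order) that contains it. You would presumably replace the originals afterwards, e.g.:
Code:
for f in file*.txt; do mv "$f.out" "$f"; done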