LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 06-05-2009, 12:19 PM   #1
linuxquestion1
LQ Newbie
 
Registered: Jun 2009
Posts: 3

Rep: Reputation: 0
Bash script to find and remove similar lines from multiple files


Hello Folks!

I want to remove duplicate (similar) lines across multiple files. For example, if I have four files, file1.txt, file2.txt, file3.txt and file4.txt, I would like to find the lines that are repeated across them and remove all but one copy of each.

I only know that uniq can remove repeated lines from a sorted file. Is there anything similar for multiple files?

Thanks in advance for your help.

--
A.
 
Old 06-05-2009, 12:22 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1981
Try the comm command. It compares only two sorted files at a time, so you have to store the intermediate results somewhere until you have compared the last file. See man comm for details.
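
A minimal sketch of the pairwise comm approach, assuming that losing the original line order is acceptable (comm needs sorted input) and that mktemp is available. The function name lines_only_in_second is made up for this example and is not part of comm.

```shell
# Hypothetical helper: print the lines of file "$2" that do not occur in file "$1".
lines_only_in_second() {
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
  # comm requires sorted input; -u also drops repeats within each file
  sort -u -- "$1" >"$tmp_a"
  sort -u -- "$2" >"$tmp_b"
  # -1 hides lines found only in the first file, -3 hides lines common to both,
  # so only the lines unique to the second file are printed
  comm -13 "$tmp_a" "$tmp_b"
  rm -f -- "$tmp_a" "$tmp_b"
}
```

For three or more files this would have to be called repeatedly, each time comparing the next file against everything kept so far, which is where the intermediate results come in.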
 
Old 06-05-2009, 01:00 PM   #3
linuxquestion1
LQ Newbie
 
Registered: Jun 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by colucix View Post
Try the comm command. It compares only two sorted files at a time, so you have to store the intermediate results somewhere until you have compared the last file. See man comm for details.
But comm works on only two files at a time, and it does not remove the repeated lines from the files themselves. Is there some script (sed or awk?) to find and remove similar lines from more than two files?

Thanks again!

--
A.
 
Old 06-05-2009, 02:31 PM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1981
Quote:
Originally Posted by linuxquestion1 View Post
keeping only one line from these similar lines.
Where do you want to store the unique lines? If you want to keep them in one of the files in which they appear, you have to establish a rule for deciding which files to delete them from. If you want to store all the lines together in a new file, you can try something like this:
Code:
awk '{array[$0]=1} END{ for (i in array) print i}' file?.txt > output_file
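
Note that awk's for (i in array) loop does not guarantee any particular order, so the one-liner above may print the lines in a different order than they appear in the files. If first-seen order matters, a common awk idiom can be sketched as follows; the wrapper name merge_unique is invented for illustration.

```shell
# Hypothetical wrapper: print each distinct line once, in first-seen order,
# across all files given as arguments.
merge_unique() {
  # seen[$0]++ evaluates to 0 the first time a line appears, so !seen[$0]++
  # is true only for the first occurrence; awk's default action prints it.
  awk '!seen[$0]++' "$@"
}
```

Usage would be along the lines of merge_unique file?.txt > output_file. Like the original one-liner, this keeps every distinct line in memory.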
 
Old 06-05-2009, 03:56 PM   #5
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1292
You need to explain more; I don't get what you want. Try giving an example. I would think you could run sort on all the files and pipe the result to uniq, but I'm still not sure what you really want.
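
As a rough sketch of the sort-then-uniq idea: uniq -d prints one copy of each line that occurs more than once in sorted input, which is a quick way to see what is duplicated before deleting anything. The function name list_duplicates is only illustrative.

```shell
# Hypothetical helper: list every line that occurs more than once,
# whether within one file or across several (one copy each, sorted).
list_duplicates() {
  sort -- "$@" | uniq -d
}
```

Something like list_duplicates file?.txt would then show the candidate lines; replacing uniq -d with plain uniq gives the merged, de-duplicated list instead.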
 
Old 06-07-2009, 10:33 PM   #6
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
I can't show a script yet, but I can describe the concept.

The simplest solution I know is to walk through every line of all the files; anything that matches the current line after it (in the rest of the current file and in all the following files) gets deleted. Note that you never have to compare against the previous lines, since they are already unique. That's all I know.

There are many ways to apply this concept. You can use bash (using arrays and the = string comparison), sed, or many other languages, but awk is probably the best fit.
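
One way to turn this concept into code, offered as a sketch under stated assumptions rather than a finished solution: awk reads the files in the order given and keeps only the first occurrence of each line, writing the surviving lines of each input file to a sibling file. The function name dedupe_across_files and the .dedup suffix are assumptions made for this example; the original files are left untouched.

```shell
# Hypothetical helper: for each input file, write FILE.dedup containing only
# the lines that have not appeared earlier (in this file or a previous one).
dedupe_across_files() {
  # Create empty outputs first, so a file whose lines were all seen before
  # still gets an (empty) .dedup file.
  for f in "$@"; do : >"$f.dedup"; done
  awk '
    # A new input file starts: close the previous output so that only
    # one output file is open at a time.
    FNR == 1 && out != "" { close(out) }
    { out = FILENAME ".dedup" }
    # First occurrence anywhere: keep it in the output for the current file.
    !seen[$0]++ { print > out }
  ' "$@"
}
```

After something like dedupe_across_files file?.txt, each .dedup file can be inspected and then moved over its original. All distinct lines are held in memory, so this assumes the set of unique lines fits in RAM.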

Last edited by konsolebox; 06-07-2009 at 10:38 PM. Reason: meaning
 
Old 06-08-2009, 06:15 AM   #7
linuxquestion1
LQ Newbie
 
Registered: Jun 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by colucix View Post
Where do you want to store the unique lines? If you want to keep them in one of the file in which they appear,
The unique lines may be stored in any of the original files. BUT a line in one file should not be repeated in the other files: all such repeated lines should be deleted from every file except one.

I do not want to create a single output file as it would be too large to hold all the data and would make later handling quite difficult.

Thanks in advance for your help.

--
A.
 
Old 06-08-2009, 06:21 AM   #8
tsk1979
LQ Newbie
 
Registered: Jun 2009
Posts: 9

Rep: Reputation: 1
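If Perl is an option, you could write a script along these lines:
Code:
#!/usr/bin/perl
use strict;
use warnings;

my %linehash;    # lines that have already been printed

# Assumed names: filelist.txt lists the input files, one per line;
# output.txt receives the first occurrence of every line.
open my $list, '<', 'filelist.txt' or die "filelist.txt: $!";
open my $out,  '>', 'output.txt'   or die "output.txt: $!";

while (my $name = <$list>) {
    chomp $name;
    open my $current, '<', $name or die "$name: $!";
    while (my $line = <$current>) {
        next if $linehash{$line}++;    # skip lines seen before
        print {$out} $line;
    }
    close $current;
}
close $out;
The file names are placeholders, but you get the drift.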
 
Old 06-08-2009, 04:16 PM   #9
jan61
Member
 
Registered: Jun 2008
Posts: 235

Rep: Reputation: 47
Moin,

a little bit long, but I think it works:

Code:
#! /bin/bash
# get all lines which appear more than once
# (NR > 1 avoids a false match of an empty first line against the unset prev_line)
sort file*.txt | \
  awk ' NR > 1 && $0 == prev_line { print $0; } { prev_line = $0 }; ' | \
  sort -u >dups.txt
# loop over the files
for f in file*.txt; do
  # remove dups from the file (quoted so that names with spaces work)
  sort -u "$f" >"${f}.tmp"
  # are there any dups done already?
  if test -f dups_done.txt; then
    # the "new" (not already done) dups
    comm -23 dups.txt dups_done.txt >dups_new.txt
    # the lines only appearing in file
    comm -23 "${f}.tmp" dups.txt >"${f}.srt"
    # add the "new" dups to file
    comm -12 "${f}.tmp" dups_new.txt >>"${f}.srt"
    # unique sort of file
    sort -u "${f}.srt" >"${f}.out"
    # all dups in this file are done now
    comm -12 "${f}.out" dups.txt >dups_done.tmp
    # add former done dups
    cat dups_done.txt >>dups_done.tmp
    # create a sorted, unique done file
    sort -u dups_done.tmp >dups_done.txt
  else
    # no former done dups: create the done file
    comm -12 "${f}.tmp" dups.txt >dups_done.txt
    mv "${f}.tmp" "${f}.out"
  fi
done
# a little bit cleanup
rm dups* *.tmp *.srt 2>/dev/null
exit 0
Jan
 
Old 07-13-2011, 02:45 AM   #10
sswam
LQ Newbie
 
Registered: Dec 2009
Posts: 10

Rep: Reputation: 1
a simple tool-based solution

I had already written a few useful tools (in perl) which can help to solve this problem.

jf - joins text files into a single file
sf - splits them apart again
uniqo - like uniq, but works on unsorted files, and preserves the order of lines

Using these, my solution is:

Code:
jf all.txt file1.txt file2.txt file3.txt file4.txt
<all.txt uniqo >all-uniq.txt
sf all-uniq.txt
rm all.txt all-uniq.txt
sf renames the original files to names like file1.txt~.

Those tools are here among several others:

http://sam.ai.ki/code/nipl-tools/bin/

Specific links:

http://sam.ai.ki/code/nipl-tools/bin/jf
http://sam.ai.ki/code/nipl-tools/bin/sf
http://sam.ai.ki/code/nipl-tools/bin/uniqo
 
  

