LinuxQuestions.org - Remove duplicates from file

Remove duplicates from file

I need help with the following.

I have a file that lists file names along with their directory paths:
cat /tmp/fileList.tmp
A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt
A/B/C1/D/E2.txt

using

Code:

awk '!_[$1]++' /tmp/fileList.tmp

I got the output below:

A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

But the directory paths are still listed. I need to remove the directory paths and keep only the file paths.
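
For anyone puzzled by the one-liner: `_` is just an awk array name, and `!_[$1]++` is true only the first time a given key is seen. Below is a minimal sketch of the same idea with a readable array name. `dedupe_lines` is a hypothetical helper name, and keying on `$0` (the whole line) instead of `$1` (the first whitespace-separated field) is an assumption made here so that paths containing spaces are compared in full.

```shell
# dedupe_lines: print each distinct input line once, keeping first-seen order.
# seen[$0]++ evaluates to 0 (false) the first time a line appears, so
# !seen[$0]++ is true exactly once per distinct line.
dedupe_lines() {
    awk '!seen[$0]++' "$@"
}

printf '%s\n' 'A/B/C' 'A/B/C/D/E.txt' 'A/B/C/D/E.txt' 'A/B/C' | dedupe_lines
```

Like the original command, this removes repeats only; it does nothing to tell files from directories.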

I don't know how you produced the file list, but if you can reproduce it by walking a directory structure, you could use a find command such as this:

Code:

find . \! -type d

it will exclude directories, or:

Code:

find . -type f

will only include "files".
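
The two commands differ for entries that are neither directories nor regular files, such as symbolic links. The sketch below shows this on a throwaway tree; the names `A/B`, `E.txt` and `link` are made up for the illustration.

```shell
# Build a throwaway tree: one directory, one regular file, one symlink.
tmp=$(mktemp -d)
mkdir -p "$tmp/A/B"
: > "$tmp/A/B/E.txt"
ln -s E.txt "$tmp/A/B/link"

# Everything that is not a directory: the regular file and the symlink.
not_dirs=$(cd "$tmp" && find . ! -type d | LC_ALL=C sort)

# Regular files only: the symlink is left out.
only_files=$(cd "$tmp" && find . -type f | LC_ALL=C sort)

printf 'not directories:\n%s\n' "$not_dirs"
printf 'regular files:\n%s\n' "$only_files"

rm -rf "$tmp"
```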

With the list you have now, how would you, as a human, recognize a non-directory? Do all files have extensions? If so, you could use a pattern such as:

Code:

(.+)\.(.+)$

to match one or more characters followed by a literal dot followed by one or more characters, at the end of the line, and so match only non-directories.
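
As a rough illustration, that pattern could be applied to the sample list with `grep -E`. This is only a sketch: the list is inlined instead of read from /tmp/fileList.tmp, and `sort -u` is used to drop repeats, which also reorders the lines.

```shell
# Sample data copied from the first post.
list='A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt
A/B/C1/D/E2.txt'

# Keep lines matching "something, a literal dot, something", then drop repeats.
with_ext=$(printf '%s\n' "$list" | grep -E '(.+)\.(.+)$' | LC_ALL=C sort -u)
printf '%s\n' "$with_ext"
```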

Quote:

Originally Posted by sandikaxp (Post 4706383)

And I don't know anything about your files, but if they all have an
extension of some sort ...

Code:

awk '/\./ && !_[$1]++' dupes
A/B/C/D/E.txt
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

Cheers,
Tink
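
One caveat with testing for a dot anywhere on the line: a directory whose name contains a dot (say a hypothetical `v1.2/bin`) would pass too. Here is a sketch of a stricter variant that only looks at the last path component. `files_with_ext` and the sample paths are invented for the illustration, and it still assumes every real file has an extension.

```shell
# files_with_ext: keep unique lines whose last "/"-separated component
# contains a dot. -F/ splits each line on slashes, so $NF is that component.
files_with_ext() {
    awk -F/ '$NF ~ /\./ && !seen[$0]++' "$@"
}

printf '%s\n' 'v1.2/bin' 'v1.2/bin/run.sh' 'v1.2/bin/run.sh' | files_with_ext
```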

Is using "awk" mandatory?
See the "one-liners" on the sed page:
http://sed.sourceforge.net/sed1line.txt

Quote:

Originally Posted by kakaka (Post 4706403)

This file list is generated by a FishEye query that gets the SVN change set of a JIRA ticket. The query output contains both the SVN file paths and the directory paths, and I need to get rid of the directory paths.

Quote:

Originally Posted by Tinkster (Post 4706417)

The problem is that this file list is generated by a FishEye query that gets the SVN change set of a JIRA ticket. The query output has the full SVN file path and the directory path (as two entries), and I need to get rid of the directory paths. I am not reading this from a file; I am trying to filter the output of the query command itself by piping.

Quote:

Originally Posted by John VV (Post 4706438)

Thanks for sharing, let me try this one...

As has already been stated, you would need to provide information about how to tell the difference between a file and a directory.

Quote:

Originally Posted by grail (Post 4706807)

The actual output looks something like the listing below, and I have no way of predefining the directory names, since these are code changes from an SVN repo.

Code:

branches/upgrade
branches/upgrade/Build
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Only three of the lines above are actual files (svnsbupdater, buildmenu and build); the other lines are duplicate entries and directories. For some weird reason FishEye treats each directory as another file and displays it in the query output.

What I'm trying to accomplish here is to write an automated script that merges the SVN changes from one branch to another by referring to a JIRA ticket.

Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.
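
If the query output did mark directories that way, the filter would be trivial. The sketch below works under that assumption only: the sample lines and the name `drop_dirs` are invented, and the FishEye output posted in this thread shows no trailing slashes.

```shell
# drop_dirs: discard lines ending in "/" and print each remaining line once.
drop_dirs() {
    grep -v '/$' "$@" | awk '!seen[$0]++'
}

printf '%s\n' 'branches/upgrade/' 'branches/upgrade/Build/' \
    'branches/upgrade/Build/build.sh' 'branches/upgrade/Build/build.sh' | drop_dirs
```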

Quote:

Originally Posted by grail (Post 4707015)

Can we use some string filtering? For example, given:
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

we can remove the line "branches/upgrade/Build/scripts"

because the file path will always contain the directory path in it.

Thanks to Anuradha I got this solved; posting the answer for others.

Code:

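#!/usr/bin/perl
use strict;
use warnings;

# Read every path from stdin (or from files named on the command line).
chomp(my @files = <>);

foreach my $tomatch (@files) {
    my $matchfound = 0;
    foreach my $fl (@files) {
        # A path is treated as a directory if a longer path starts with it.
        # \Q...\E keeps regex metacharacters in the path from being interpreted.
        if ($fl =~ /^\Q$tomatch\E/ && length($fl) != length($tomatch)) {
            $matchfound = 1;
            last;    # stop scanning once a longer path is found
        }
    }
    print "$tomatch\n" if !$matchfound;
}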

Well I must say I am curious how this script has met any of your requirements??

When run on the data from post #9 I get:

Code:

branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Now if I am not mistaken, this has neither removed duplicates nor listed only files?

Quote:

can we use some string filtering,

I do not see how any string filtering or manipulation will help, as you have no way of telling the difference between files and directories.
Your own example is flawed in that only a visual inspection of the data can tell you what is a file and what is a directory:

Quote:

branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

so we can remove the line "branches/upgrade/Build/scripts"

because always the file name will contain the dir path in it.

Not only will a file always contain a dir path but so will the directory??

Quote:

Originally Posted by grail (Post 4707561)

Below is how it worked. parse.pl contains the Perl code; by combining it with the awk command I was able to remove all the duplicates and the directories.

Code:

[san@san1 tmp]$ cat t.txt
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++'
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++' | ./parse.pl
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$

Thanks for helping me out...
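
For readers who would rather not keep a separate Perl script, the two stages above could probably be folded into a single awk program. This is only a sketch of the same "drop any path that is a parent of another path" idea, not the method used in the thread. `leaf_paths` is a hypothetical name, and the sketch shares the assumption that a path with nothing beneath it is a file.

```shell
# leaf_paths: print each distinct path that is not a parent of another path.
leaf_paths() {
    awk '
    NF && !seen[$0]++ {
        order[++n] = $0              # remember first-seen order
        p = $0
        # Strip one trailing "/component" at a time; every prefix that
        # remains is an ancestor directory of this path.
        while (sub("/[^/]*$", "", p) && p != "")
            isdir[p] = 1
    }
    END {
        for (i = 1; i <= n; i++)
            if (!(order[i] in isdir))
                print order[i]
    }' "$@"
}

leaf_paths <<'EOF'
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
EOF
```

Because ancestors are found by cutting at slashes, a path such as A/B/C is not mistaken for a parent of A/B/C1, which a plain string-prefix test would do.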

Well, I think it is important to note, for people who might search and find this solution, that it rests on the incorrect assumption that the longest path in each branch always ends in a file name. An easy example: assume blah is a directory in the following list:

Code:

branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/blah
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu

Your solution will return blah as a valid file path, when only by inspecting it manually would we know that it is in fact a directory.
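
The point can be reproduced on a real filesystem. The sketch below builds a throwaway tree in which `blah` is an empty directory (path names shortened from the example above), applies the "nothing beneath it, so it must be a file" heuristic to the path list, and compares the guess with what `find -type f` reports. With an SVN repository, the equivalent check would presumably be to ask the repository itself what kind of node each path is.

```shell
# Recreate the example as a real tree, where "blah" is an empty directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/compile/blah" "$tmp/toolmenu"
: > "$tmp/compile/build"
: > "$tmp/toolmenu/buildmenu"

# Path list with directories and files mixed, as in the query output.
paths=$(cd "$tmp" && find compile toolmenu | LC_ALL=C sort)

# Heuristic from the thread: a path that no other path extends is a file.
guessed=$(printf '%s\n' "$paths" | awk '
    { line[NR] = $0 }
    END {
        for (i = 1; i <= NR; i++) {
            leaf = 1
            for (j = 1; j <= NR; j++)
                if (index(line[j], line[i] "/") == 1) leaf = 0
            if (leaf) print line[i]
        }
    }')

# Ground truth: ask the filesystem which entries are regular files.
actual=$(cd "$tmp" && find compile toolmenu -type f | LC_ALL=C sort)

printf 'guessed files:\n%s\n' "$guessed"
printf 'actual files:\n%s\n' "$actual"

rm -rf "$tmp"
```

The guessed list includes compile/blah while the actual list does not, which is exactly the misclassification described above.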