Remove duplicates from file

sandikaxp · 06-18-2012, 03:39 PM

I need help with the following.

I have a file that lists file names together with their directory paths:
cat /tmp/fileList.tmp
A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt
A/B/C1/D/E2.txt

using

Code:

awk '!_[$1]++' /tmp/fileList.tmp

I got the output below:

A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

But the directory paths are still listed. I need to remove the directory paths and keep only the file paths.

rigor · 06-18-2012, 04:08 PM

I don't know how you produced the file list, but if you can reproduce it by going through a directory structure, then if you use a find command such as this:

Code:

find . \! -type d

it will exclude directories, or:

Code:

find . -type f

will only include "files".

With the list you have now, as a human, how do you recognize a non-directory? Do all files have extensions? If so, then you could use a pattern such as:

Code:

(.+)\.(.+)$

to match one or more characters followed by a literal dot followed by one or more characters, at the end of the line, and so match only non-directories.
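
As a rough sketch of how that pattern could be applied to the list, assuming every file has an extension and no directory name contains a dot (the function name files_with_extension is made up for this example):

```shell
# Hypothetical helper: keep lines that look like "name.ext" and drop repeats.
files_with_extension() {
    # grep -E enables the extended-regex syntax the pattern uses;
    # awk prints a line only the first time it is seen.
    grep -E '(.+)\.(.+)$' | awk '!seen[$0]++'
}

# Sample input taken from the first post.
printf '%s\n' A/B/C A/B/C/D A/B/C/D/E.txt A/B/C/D/E.txt | files_with_extension
# prints: A/B/C/D/E.txt
```

The pattern accepts a dot anywhere in the path, so a directory such as A/B.d/C would also pass. Restricting the match to the last path component, for example with [^/]+\.[^/]+$, would be a stricter variant.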

Tinkster · 06-18-2012, 04:33 PM

And I don't know anything about your files, but if they all have an
extension of some sort ...

Code:

awk '/\./ && !_[$1]++' dupes 
A/B/C/D/E.txt
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

Cheers,
Tink

John VV · 06-18-2012, 05:20 PM

Is using "awk" mandatory?
From the sed page, the "one-liners":
http://sed.sourceforge.net/sed1line.txt

sandikaxp · 06-18-2012, 11:39 PM

This file list is generated by a FishEye query that gets the SVN change set of a JIRA ticket. The query output has both the SVN file paths and the directory paths, and I need to get rid of the directory paths.

sandikaxp · 06-18-2012, 11:44 PM

The problem is that this file list is generated by a FishEye query that gets the SVN change set of a JIRA ticket. The query output has both the full SVN file path and the directory path (as two entries), and I need to get rid of the directory paths. I am not reading this from a file; I am trying to filter the output of the query command itself by piping.

sandikaxp · 06-18-2012, 11:54 PM

Quote:

Originally Posted by John VV

is using "awk" mandatory ?
from the sed page - the " one liners "
http://sed.sourceforge.net/sed1line.txt

Thanks for sharing, let me try this one...

grail · 06-19-2012, 05:03 AM

As has already been stated, you would need to provide information about how to tell the difference between a file and a directory.

sandikaxp · 06-19-2012, 10:35 AM

The actual output is something like the below, and I have no way of predefining the directory names, since these are code changes from an SVN repo.

Code:

branches/upgrade
branches/upgrade/Build
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

The only three files above are svnscripts/svnsbupdater, toolmenu/buildmenu and compile/build (highlighted in the original post); the other lines are duplicate entries and directories. For some weird reason FishEye treats the directories as files too and displays them in the query output.

What I'm trying to accomplish here is to write an automated script that merges the SVN changes from one branch to another by referring to a JIRA ticket.

grail · 06-19-2012, 11:00 AM

Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.
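
To illustrate the trailing-slash case only: the sketch below assumes, hypothetically, that the query marks every directory with a trailing slash, which the FishEye output shown in this thread does not do. The function name files_only_by_slash is invented for the example.

```shell
# Hypothetical helper: drop entries marked as directories, then drop repeats.
files_only_by_slash() {
    # grep -v removes lines ending in "/";
    # awk keeps the first copy of each remaining line.
    grep -v '/$' | awk '!seen[$0]++'
}

# Made-up input in which directories carry a trailing slash.
printf '%s\n' A/B/ A/B/C/ A/B/C/E.txt A/B/C/E.txt | files_only_by_slash
# prints: A/B/C/E.txt
```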

sandikaxp · 06-19-2012, 01:31 PM

Can we use some string filtering? For example:
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

so we can remove the line "branches/upgrade/Build/scripts"

because the file name will always contain the dir path in it.

sandikaxp · 06-19-2012, 02:42 PM

Thanks to Anuradha I got this solved; posting the answer for others.

Code:

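#!/usr/bin/perl
use strict;
use warnings;

# Read all paths from stdin (or from files named on the command line).
chomp(my @files = <>);

foreach my $tomatch (@files) {
    my $matchfound = 0;
    foreach my $fl (@files) {
        # $tomatch is treated as a directory if another entry extends it by "/...".
        # \Q...\E quotes regex metacharacters (such as ".") in the path.
        if ($fl =~ /^\Q$tomatch\E\//) {
            $matchfound = 1;
            last;    # Perl's loop exit is "last", not "break"
        }
    }
    # Print only the entries that no other entry extends.
    print "$tomatch\n" if !$matchfound;
}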
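
The same prefix idea can also be sketched in awk alone, with the de-duplication folded in. This is only a sketch under the thread's assumption that any path which is the parent of another listed path is a directory; the function name leaf_paths is hypothetical. Like the Perl version, it cannot tell an empty directory from a file.

```shell
# Hypothetical helper: print each distinct path that is not a parent of another listed path.
leaf_paths() {
    awk '
        # Remember each distinct path once, in first-seen order.
        !seen[$0]++ { order[++n] = $0 }
        END {
            for (i = 1; i <= n; i++) {
                p = order[i]
                # Strip one trailing component at a time; each ancestor is a directory.
                while (sub("/[^/]*$", "", p) && p != "") isdir[p] = 1
            }
            # Print only the paths that are not an ancestor of any listed path.
            for (i = 1; i <= n; i++) if (!(order[i] in isdir)) print order[i]
        }'
}

# Small made-up sample: a/b is a parent of a/b/c, and a/b/c is listed twice.
printf '%s\n' a/b a/b/c a/b/c a/b-c | leaf_paths
```

It reads the whole list before printing, so it can sit at the end of a pipe in place of the awk and Perl steps together.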

grail · 06-20-2012, 04:09 AM

Well I must say I am curious how this script has met any of your requirements??

When run on the data from post #9 I get:

Code:

branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Now if I am not mistaken, this has neither removed duplicates nor listed only files?

Quote:

can we use some string filtering,

I do not see how any string filtering or manipulation will help as you have no way of telling the difference between files and directories.
Your own example is flawed in the fact that only a visual look at the data can let you know what is a file or directory:

Quote:

branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

so we can remove the line "branches/upgrade/Build/scripts"

because always the file name will contain the dir path in it.

Not only will a file always contain a dir path but so will the directory??

sandikaxp · 06-20-2012, 10:02 PM

Below is how it worked. parse.pl contains the Perl code; by combining it with the awk command I was able to remove all the duplicates and the directories.

Code:

[san@san1 tmp]$ cat t.txt 
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++'
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++' | ./parse.pl 
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$

Thanks for helping me out...

grail · 06-21-2012, 08:55 AM

Well, I think it is important to note, for people who might search and find this solution, that it only works on the assumption that the longest match for a given path ends in a file name. An easy example, if we assume that blah is a directory and the list is as follows:

Code:

branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/blah
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu

Your solution will return branches/upgrade/Build/scripts/compile/blah as a valid file path, when only by manually viewing it would we know that it is in fact a directory.