LinuxQuestions.org - Remove duplicates from file

sandikaxp

06-18-2012 03:39 PM

Remove duplicates from file

I need help with the following.

I have a file that lists file paths together with their directory paths:
cat /tmp/fileList.tmp
A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt
A/B/C1/D/E2.txt

using

Code:

awk '!_[$1]++' /tmp/fileList.tmp

I got below

A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

But the directory paths are still listed. I need to remove the directory paths and keep only the file paths.
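
A side note on the one-liner above, for readers unfamiliar with the idiom: `!_[$1]++` uses an array named `_` as a counter keyed on the first whitespace-separated field, and prints a line only the first time its key is seen. A minimal sketch of the same idea with a more descriptive array name (`seen` and the function name `dedupe_lines` are made up for illustration, and it keys on the whole line `$0` so that paths containing spaces are compared in full):

```shell
# Print each input line only the first time it appears, keeping input order.
# seen[$0] is 0 (false) on first sight, so !seen[$0]++ is true exactly once.
dedupe_lines() {
  awk '!seen[$0]++' "$@"
}

printf '%s\n' A/B/C A/B/C/D/E.txt A/B/C/D/E.txt A/B/C1 | dedupe_lines
```

With no file arguments the function reads standard input, so it can sit at the end of a pipe. Like the original, it only removes duplicates; directories stay in the list.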

rigor

06-18-2012 04:08 PM

I don't know how you produced the file list, but if you can reproduce it by going through a directory structure, and you use a find command such as this:

Code:

find . \! -type d

it will exclude directories, or:

Code:

find . -type f

will only include "files".

With the list you have now, as a human, how do you recognize a non-directory? Do all files have extensions? If so, then you could use a pattern such as:

Code:

(.+)\.(.+)$

to match one or more characters followed by a literal dot followed by one or more characters, at the end of the line, and so match only non-directories.
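
As a rough sketch of how such a pattern could be applied to the existing list with grep, assuming every file name has an extension and no directory has a dot in its last component (both are assumptions about the data, and `has_extension` is a made-up helper name). The pattern here is tightened to the last path component, so that a dot in an earlier directory name does not count:

```shell
# Keep only lines whose last path component looks like name.ext:
# at least one non-slash character on each side of a dot, at end of line.
has_extension() {
  grep -E '(^|/)[^/]+\.[^/]+$' "$@"
}

printf '%s\n' A/B/C A/B/C/D A/B/C/D/E.txt A/B.d/C1 | has_extension
```

This is only as good as the extension assumption: an extensionless file is dropped, and a directory named like `conf.d` at the end of a path is kept.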

Tinkster

06-18-2012 04:33 PM

Quote:

Originally Posted by sandikaxp (Post 4706383)

And I don't know anything about your files, but if they all have an
extension of some sort ...

Code:

awk '/\./ && !_[$1]++' dupes
A/B/C/D/E.txt
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

Cheers,
Tink

John VV

06-18-2012 05:20 PM

Is using awk mandatory?
From the sed page, the "one-liners":
http://sed.sourceforge.net/sed1line.txt

sandikaxp

06-18-2012 11:39 PM

Quote:

Originally Posted by kakaka (Post 4706403)

This file list is generated by a FishEye query that gets the SVN change set of a JIRA ticket. The query output contains both the SVN file paths and the directory paths, and I need to get rid of the directory paths.

sandikaxp

06-18-2012 11:44 PM

Quote:

Originally Posted by Tinkster (Post 4706417)

The problem is that this file list is generated by a FishEye query that gets the SVN change set of a JIRA ticket. The query output has the full SVN file path and the directory path (as two entries), and I need to get rid of the directory paths. I am not reading this from a file; I am trying to filter the output of the query command itself through a pipe.

sandikaxp

06-18-2012 11:54 PM

Quote:

Originally Posted by John VV (Post 4706438)

Thanks for sharing, let me try this one...

grail

06-19-2012 05:03 AM

As has already been stated, you would need to provide information about how to tell the difference between a file and a directory.

sandikaxp

06-19-2012 10:35 AM

Quote:

Originally Posted by grail (Post 4706807)

The actual output is something like the list below, and I have no way of predefining the directory names, since these are code changes from an SVN repo.

Code:

branches/upgrade
branches/upgrade/Build
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Only three of the lines above are files (branches/upgrade/Build/scripts/svnscripts/svnsbupdater, branches/upgrade/Build/scripts/toolmenu/buildmenu and branches/upgrade/Build/scripts/compile/build); the other lines are duplicate entries and directories. For some weird reason FishEye treats a directory as just another file and displays it in the query output.

What I'm trying to accomplish here is to write an automated script that merges the SVN changes from one branch to another by referring to a JIRA ticket.

grail

06-19-2012 11:00 AM

Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.
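
If the directories did carry a trailing slash, the filter would be a one-liner. A sketch under that assumption (the sample data is invented, the function name `files_only` is hypothetical, and the FishEye output shown in this thread does not have the slashes):

```shell
# Drop every line that ends in "/", then remove duplicates, keeping order.
files_only() {
  grep -v '/$' "$@" | awk '!seen[$0]++'
}

printf '%s\n' branches/upgrade/ branches/upgrade/Build/ \
  branches/upgrade/Build/compile branches/upgrade/Build/compile | files_only
```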

sandikaxp

06-19-2012 01:31 PM

Quote:

Originally Posted by grail (Post 4707015)

Can we use some string filtering? Given
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

we can remove the line "branches/upgrade/Build/scripts",

because the file path will always contain the directory path.

sandikaxp

06-19-2012 02:42 PM

Thanks to Anuradha, I got this solved. Posting the answer for others.

Code:

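#!/usr/bin/perl
use strict;
use warnings;

# Read every path from stdin (or from files named on the command line).
chomp(my @files = <>);

foreach my $tomatch (@files) {
    my $matchfound = 0;
    foreach my $fl (@files) {
        # A path followed by "/" inside another path is taken to be a
        # directory. \Q...\E stops regex metacharacters in the path from
        # being interpreted, and the "/" keeps A/B/C from matching A/B/C1.
        if ($fl =~ /^\Q$tomatch\E\//) {
            $matchfound = 1;
            last;    # Perl exits a loop with "last", not "break"
        }
    }
    print "$tomatch\n" unless $matchfound;
}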
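
The same idea (treat any path that is a parent of another listed path as a directory) can also be sketched as a single awk pass that removes duplicates at the same time. This is not the script posted above but an alternative, with names chosen for illustration (`leaf_paths`, `isdir`, `order`); it rests on the same assumption that a path with nothing listed beneath it is a file:

```shell
# Print each distinct path that is not a parent of any other listed path.
leaf_paths() {
  awk '
    !seen[$0]++ { order[++n] = $0 }      # remember distinct lines in input order
    {
      p = $0
      # Strip the last "/component" repeatedly, marking each ancestor.
      while (sub("/[^/]*$", "", p)) isdir[p] = 1
    }
    END {
      for (i = 1; i <= n; i++)
        if (!(order[i] in isdir)) print order[i]
    }
  ' "$@"
}

printf '%s\n' \
  branches/upgrade/Build/scripts \
  branches/upgrade/Build/scripts/compile \
  branches/upgrade/Build/scripts/compile/build \
  branches/upgrade/Build/scripts/compile/build | leaf_paths
```

Each line is split only along its own path components, so the work grows with the number of lines times their depth rather than with the square of the number of lines.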

grail

06-20-2012 04:09 AM

Well I must say I am curious how this script has met any of your requirements??

When run on the data from post #9 I get:

Code:

branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Now if I am not mistaken, this has neither removed duplicates nor listed only files?

Quote:

can we use some string filtering,

I do not see how any string filtering or manipulation will help as you have no way of telling the difference between files and directories.
Your own example is flawed in the fact that only a visual look at the data can let you know what is a file or directory:

Quote:

branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

so we can remove the line "branches/upgrade/Build/scripts"

because always the file name will contain the dir path in it.

Not only will a file always contain a dir path but so will the directory??

sandikaxp

06-20-2012 10:02 PM

Quote:

Originally Posted by grail (Post 4707561)

Below is how it worked. parse.pl contains the Perl code; by combining it with the awk command I was able to remove all the duplicates and the directories.

Code:

[san@san1 tmp]$ cat t.txt
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++'
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++' | ./parse.pl
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$

Thanks for helping me out...

grail

06-21-2012 08:55 AM

Well, I think it is important to note, for people who might search and find this solution, that it relies on an assumption that does not always hold: that the longest path sharing a given prefix ends in a file name. An easy example, if we assume that blah is a directory:

Code:

branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/blah
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu

Your solution will return blah as a valid file path, when only by looking at it manually would we know that it is in fact a directory.

sandikaxp

06-21-2012 09:42 AM

Quote:

Originally Posted by grail (Post 4708544)

Fortunately, in my scenario a directory path will only appear if there is a file associated with it, since we are querying for SVN file changes. That said, I now see a bug with this: if someone creates only a directory and commits that change, I will get it back as a file. Thanks for pointing that out. Any thoughts on how to avoid this?

Tinkster

06-21-2012 04:36 PM

There is no programmatic way to do this based on the output alone. The only
solution would be to do the reporting against the file system, and generate
the report from there. ...

rigor

06-21-2012 06:58 PM

Rather than limiting the program to working just within the list of files and directories obtained from the JIRA ticket, I'd go the extra mile to be sure I had it right. I'd compare the list against the SVN repository. That would enable you to differentiate between files and directories. For example, compare a line (or a portion of a line) from the ticket against the output of something like:

Code:

svn list --depth files ...

which excludes directories. Naturally, the ... would be replaced with the proper args and repository designation, to suit the specifics of the situation.
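
A sketch of such a comparison, assuming a recursive listing of the branch has first been saved to a file (for example from `svn list -R`, which prints directories with a trailing `/`) and that its paths line up with the ones in the ticket output. The function name and the temporary file are hypothetical, and the sample data stands in for real repository output:

```shell
# $1: saved listing, one path per line, directories ending in "/".
# stdin: candidate paths from the ticket.
# Prints each candidate that the listing knows as a file, once.
filter_known_files() {
  awk '
    NR == FNR { if ($0 !~ /\/$/) isfile[$0] = 1; next }   # first input: listing
    ($0 in isfile) && !seen[$0]++                          # second input: stdin
  ' "$1" -
}

listing=$(mktemp)
printf '%s\n' compile/ compile/build compile/blah/ > "$listing"
printf '%s\n' compile compile/blah compile/build compile/build |
  filter_known_files "$listing"
rm -f "$listing"
```

Because the decision comes from the repository listing rather than from the shape of the paths, an empty directory such as `compile/blah` is no longer mistaken for a file.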

All times are GMT -5.