LinuxQuestions.org


sandikaxp 06-18-2012 03:39 PM

Remove duplicates from file
 
I need help with the following.

I have a file that lists file paths mixed in with directory paths:
cat /tmp/fileList.tmp
A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt
A/B/C1/D/E2.txt

Using
Code:

awk '!_[$1]++' /tmp/fileList.tmp
I got the following:

A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

The duplicates are gone, but the directory paths are still listed. I need to remove the directory paths and keep only the file paths.

rigor 06-18-2012 04:08 PM

I don't know how you produced the file list, but if you can reproduce it by walking a directory structure, a find command such as this:

Code:

find . \! -type d
will exclude directories, or:

Code:

find . -type f
will include only regular files.

With the list you have now, how do you, as a human, recognize a non-directory? Do all files have extensions? If so, you could use a pattern such as:

Code:

(.+)\.(.+)$
to match one or more characters followed by a literal dot followed by one or more characters, at the end of the line, and so match only non-directories.
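
As a rough sketch, the pattern above could be applied to the existing list with grep -E. This assumes, as the post does, that every file has an extension and that no directory name contains a dot. The function name only_with_extension is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: pass through only the lines matching the pattern above.
# Assumption: every file has an extension and no directory name contains a dot.
only_with_extension() {
    grep -E '(.+)\.(.+)$'
}

# Sample lines taken from the original post.
printf '%s\n' 'A/B/C' 'A/B/C/D' 'A/B/C/D/E.txt' 'A/B/C1/D/E1.txt' | only_with_extension
```

A directory such as A/B.d/C would slip through this filter, so it is only as reliable as the naming convention behind it.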

Tinkster 06-18-2012 04:33 PM

And I don't know anything about your files, but if they all have an
extension of some sort ...
Code:

awk '/\./ && !_[$1]++' dupes
A/B/C/D/E.txt
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt



Cheers,
Tink

John VV 06-18-2012 05:20 PM

Is using awk mandatory?
See the "one-liners" on the sed page:
http://sed.sourceforge.net/sed1line.txt

sandikaxp 06-18-2012 11:39 PM

This file list is generated by a FishEye query that gets the SVN change set for a JIRA ticket. The query output contains both the SVN file paths and the directory paths, and I need to get rid of the directory paths.

sandikaxp 06-18-2012 11:44 PM

The problem is that this file list is generated by a FishEye query that gets the SVN change set for a JIRA ticket. The query output contains both the full SVN file path and the directory path (as two entries), and I need to get rid of the directory paths. I am not reading this from a file; I am trying to filter the output of the query command itself through a pipe.

sandikaxp 06-18-2012 11:54 PM

Quote:

Originally Posted by John VV (Post 4706438)
Is using awk mandatory?
See the "one-liners" on the sed page:
http://sed.sourceforge.net/sed1line.txt

Thanks for sharing; let me try this one...

grail 06-19-2012 05:03 AM

As has already been stated, you would need to provide information about how to tell the difference between a file and a directory.

sandikaxp 06-19-2012 10:35 AM

Quote:

Originally Posted by grail (Post 4706807)
As has already been stated, you would need to provide information about how to tell the difference between a file and a directory.

The actual output looks something like the listing below, and I have no way of predefining the directory names, since these are code changes from an SVN repo.

Code:

branches/upgrade
branches/upgrade/Build
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Only three of the entries above are files (svnsbupdater, buildmenu, and build, which were highlighted in the original post); the other lines are duplicate entries and directories. For some weird reason FishEye treats each directory as another file and displays it in the query output.

What I'm trying to accomplish is to write an automated script that merges the SVN changes from one branch to another by referring to a JIRA ticket.

grail 06-19-2012 11:00 AM

Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.
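
If the listing did mark directories that way, the filter would be close to a one-liner. A hypothetical sketch, assuming a trailing slash on every directory entry (the FishEye output shown in this thread does not have one); the function name drop_dirs_and_dupes is made up for illustration:

```shell
#!/bin/sh
# Hypothetical: works only if every directory entry ends in "/".
drop_dirs_and_dupes() {
    # grep -v drops lines ending in "/"; awk keeps the first copy of each line.
    grep -v '/$' | awk '!seen[$0]++'
}

printf '%s\n' 'A/B/' 'A/B/C.txt' 'A/B/C.txt' 'A/D/' | drop_dirs_and_dupes
```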

sandikaxp 06-19-2012 01:31 PM

Quote:

Originally Posted by grail (Post 4707015)
Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.


Can we use some string filtering? For example, given
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

we could remove the line "branches/upgrade/Build/scripts",

because the file path will always contain the directory path.

sandikaxp 06-19-2012 02:42 PM

Thanks to Anuradha I got this solved. Posting the answer for others.

Code:

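#!/usr/bin/perl
use strict;
use warnings;

# Read all paths from stdin (or from files named on the command line).
chomp(my @files = <>);

foreach my $tomatch (@files) {
    my $matchfound = 0;
    foreach my $fl (@files) {
        # $tomatch is treated as a directory if another entry lies beneath it.
        # \Q...\E quotes regex metacharacters such as "." in the path, and the
        # trailing slash stops "A/B/C" from matching "A/B/C1".
        if ($fl =~ /^\Q$tomatch\E\//) {
            $matchfound = 1;
            last;    # Perl exits a loop with "last", not "break"
        }
    }
    # Print only entries that are not a parent of any other entry.
    print "$tomatch\n" unless $matchfound;
}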


grail 06-20-2012 04:09 AM

Well, I must say I am curious how this script has met any of your requirements.

When run on the data from post #9 I get:
Code:

branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu

Now, if I am not mistaken, this has neither removed the duplicates nor listed only files.

Quote:

Can we use some string filtering?
I do not see how any string filtering or manipulation will help, as you have no way of telling the difference between files and directories.
Your own example is flawed in that only a visual inspection of the data can tell you what is a file and what is a directory:
Quote:

branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

we could remove the line "branches/upgrade/Build/scripts",

because the file path will always contain the directory path.
Not only will a file path always contain a directory path, but so will a directory path.

sandikaxp 06-20-2012 10:02 PM

Below is how it worked. parse.pl contains the Perl code; by combining it with the awk command I was able to remove all the duplicates and the directories.

Code:

[san@san1 tmp]$ cat t.txt
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++'
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++' | ./parse.pl
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$

Thanks for helping me out...
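
The awk and Perl steps above could also be folded into a single awk program. The sketch below is not the poster's script: it removes duplicates and then drops every path that is an ancestor of another listed path. Like the Perl version, it assumes that a path with nothing listed beneath it is a file. The function name leaf_paths is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: de-duplicate stdin, then print only "leaf" paths,
# i.e. paths that are not a parent directory of any other listed path.
leaf_paths() {
    awk '
        # Keep the first occurrence of each line, preserving input order.
        !seen[$0]++ { paths[++n] = $0 }
        END {
            # Mark every ancestor of every path as a directory.
            for (i = 1; i <= n; i++) {
                p = paths[i]
                while (match(p, "/[^/]*$")) {
                    p = substr(p, 1, RSTART - 1)
                    is_dir[p] = 1
                }
            }
            # Print the paths that were never marked.
            for (i = 1; i <= n; i++)
                if (!(paths[i] in is_dir)) print paths[i]
        }
    '
}

# Shortened sample in the shape of the listing above.
printf '%s\n' \
    'branches/upgrade/Build/scripts' \
    'branches/upgrade/Build/scripts/svnscripts' \
    'branches/upgrade/Build/scripts/svnscripts/svnsbupdater' \
    'branches/upgrade/Build/scripts/svnscripts/svnsbupdater' \
    'branches/upgrade/Build/scripts/compile/build' | leaf_paths
```

Marking ancestors in an array keeps the work roughly proportional to the total path length, whereas comparing every line against every other line grows quadratically with the size of the list.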

grail 06-21-2012 08:55 AM

Well, I think it is important to note, for people who might search and find this solution, that it relies on the incorrect assumption that the longest match for a given path will end in a file name. An easy example, if we assume that blah is a directory:
Code:

branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/blah
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu

Your solution will return blah as a valid file path, when only by inspecting it manually would we know that it is in fact a directory.
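
The case described here is easy to reproduce. The sketch below uses a small awk filter that applies the same rule as the Perl script (drop any path that has another path beneath it). It is a stand-in written for illustration, not the script from the thread, and blah is assumed to be an empty directory.

```shell
#!/bin/sh
# Stand-in for the prefix filter: print a path only if no other
# listed path starts with "<path>/".
prefix_filter() {
    awk '
        { p[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                leaf = 1
                for (j = 1; j <= NR; j++)
                    if (index(p[j], p[i] "/") == 1) { leaf = 0; break }
                if (leaf) print p[i]
            }
        }
    '
}

printf '%s\n' \
    'branches/upgrade/Build/scripts/compile' \
    'branches/upgrade/Build/scripts/compile/blah' \
    'branches/upgrade/Build/scripts/compile/build' \
    'branches/upgrade/Build/scripts/toolmenu/buildmenu' | prefix_filter
```

The compile directory is dropped, but blah is printed alongside the two real files, because nothing beneath it appears in the list.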


All times are GMT -5.