LinuxQuestions.org
Old 06-18-2012, 03:39 PM   #1
sandikaxp
LQ Newbie
 
Registered: Jun 2012
Distribution: Fedora
Posts: 9

Rep: Reputation: Disabled
Remove duplicates from file


I need help with the following.

I have a file that lists file paths along with their directory paths:
cat /tmp/fileList.tmp
A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt
A/B/C1/D/E2.txt

Using
Code:
awk '!_[$1]++' /tmp/fileList.tmp
I got the following:

A/B/C
A/B/C/D
A/B/C/D/E.txt
A/B/C1
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt

But the directory paths are still listed. I need to remove the directory paths and keep only the file paths.
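
For readers unfamiliar with the idiom, here is a minimal sketch of what the awk one-liner does, spelled out with a named array. The function name dedupe_lines is made up for illustration, and it keys on the whole line ($0) rather than the first field ($1), a change that only matters if a path contains spaces.

```shell
# Hypothetical helper: keep the first occurrence of every line, in input order.
dedupe_lines() {
    awk '{
        # "seen" records every line printed so far; print a line only the first time.
        if (!($0 in seen)) { seen[$0] = 1; print }
    }'
}

# Demo on a few lines from the list above.
printf '%s\n' A/B/C A/B/C/D A/B/C/D/E.txt A/B/C/D/E.txt | dedupe_lines
```

As in the original command, the directory entries survive; only exact repeats are dropped.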
 
Old 06-18-2012, 04:08 PM   #2
kakaka
Member
 
Registered: Sep 2003
Posts: 382

Rep: Reputation: 87
I don't know how you produced the file list, but if you can reproduce it by going through a directory structure, you could use a find command such as this:

Code:
find . \! -type d
it will exclude directories, or:

Code:
find . -type f
will only include "files".

With the list you have now, how do you, as a human, recognize a non-directory? Do all files have extensions? If so, you could use a pattern such as:

Code:
(.+)\.(.+)$
to match one or more characters followed by a literal dot followed by one or more characters, at the end of the line, and so match only non-directories.
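
One way to apply such a pattern, sketched with grep -E. The helper name files_with_extension is hypothetical, and the pattern is tightened a little from the one above so that the dot must sit in the last path component. It still assumes that every file has an extension and that no listed directory name ends in one.

```shell
# Hypothetical helper: keep only lines whose last path component looks like name.ext.
files_with_extension() {
    # [^/]+ keeps both sides of the dot inside the final component.
    grep -E '[^/]+\.[^/]+$'
}

# Demo: the directory entries have no dot in their last component and are dropped.
printf '%s\n' A/B/C A/B/C/D A/B/C/D/E.txt | files_with_extension
```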
 
Old 06-18-2012, 04:33 PM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
And I don't know anything about your files, but if they all have an
extension of some sort ...
Code:
awk '/\./ && !_[$1]++' dupes 
A/B/C/D/E.txt
A/B/C1/D/E1.txt
A/B/C1/D/E2.txt


Cheers,
Tink
 
Old 06-18-2012, 05:20 PM   #4
John VV
LQ Muse
 
Registered: Aug 2005
Location: A2 area Mi.
Posts: 17,093

Rep: Reputation: 2474
Is using awk mandatory? If not, have a look at the sed one-liners page:
http://sed.sourceforge.net/sed1line.txt
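
For reference, the one-liner from that page that emulates uniq is sketched below, wrapped in a hypothetical function name. It only removes consecutive duplicate lines, and like the awk version it does nothing about the directory entries.

```shell
# Hypothetical wrapper around the "emulates uniq" entry from sed1line.txt.
drop_adjacent_duplicates() {
    # $!N  : append the next line unless we are on the last one
    # !P   : print the first line of the pair only when the two lines differ
    # D    : delete the first line and restart with the remainder
    sed '$!N; /^\(.*\)\n\1$/!P; D'
}

# Demo: the repeated E.txt line is printed once.
printf '%s\n' A/B/C/D A/B/C/D/E.txt A/B/C/D/E.txt | drop_adjacent_duplicates
```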
 
Old 06-18-2012, 11:39 PM   #5
sandikaxp
This file list is generated by a FishEye query that retrieves the SVN change set of a JIRA ticket. The query output contains both the SVN file paths and the directory paths, and I need to get rid of the directory paths.
 
Old 06-18-2012, 11:44 PM   #6
sandikaxp
The problem is that this file list is generated by a FishEye query that retrieves the SVN change set of a JIRA ticket. The query output contains both the full SVN file path and the directory path (as two separate entries), and I need to get rid of the directory paths. I am not reading this from a file; I am trying to filter the output of the query command itself by piping it.
 
Old 06-18-2012, 11:54 PM   #7
sandikaxp
Quote:
Originally Posted by John VV View Post
is using "awk" mandatory ?
from the sed page - the " one liners "
http://sed.sourceforge.net/sed1line.txt
Thanks for sharing, let me try this one...
 
Old 06-19-2012, 05:03 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
As has already been stated, you would need to provide information about how to tell the difference between a file and a directory.
 
Old 06-19-2012, 10:35 AM   #9
sandikaxp
Quote:
Originally Posted by grail View Post
As has already been stated, you would need to provide information about to tell the difference between a file and a directory.
The actual output looks something like the listing below. I have no way of predefining the directory names, since these are code changes from an SVN repo.

Code:
branches/upgrade
branches/upgrade/Build
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
Only three of the lines above are files (the paths ending in svnsbupdater, buildmenu and build); the other lines are duplicate entries and directories. For some weird reason FishEye treats each directory as another file and displays it in the query output.

What I'm trying to accomplish is to write an automated script that merges the SVN changes from one branch to another by referring to a JIRA ticket.
 
Old 06-19-2012, 11:00 AM   #10
grail
Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.
 
Old 06-19-2012, 01:31 PM   #11
sandikaxp
Quote:
Originally Posted by grail View Post
Unless you can find a way to differentiate between files and directories, you will be stuck with only removing the duplicates.

It could even be as simple as the directories all having a trailing slash.

Can we use some string filtering? For example, given:
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

we can remove the line "branches/upgrade/Build/scripts", because the file path will always contain the directory path in it.
 
Old 06-19-2012, 02:42 PM   #12
sandikaxp
Thanks to Anuradha I got this solved. Posting the answer for others.

Code:
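#!/usr/bin/perl
use strict;
use warnings;

# Read every path from stdin (or from files named on the command line).
chomp(my @files = <>);

foreach my $tomatch (@files) {
 my $matchfound = 0;
 foreach my $fl (@files) {
  # A path that starts with "$tomatch/" means $tomatch is a parent directory.
  # index() compares literally, so regex metacharacters in paths are harmless.
  if (index($fl, "$tomatch/") == 0) {
   $matchfound = 1;
   last;  # Perl exits a loop with "last", not "break"
  }
 }
 # Print only the paths that are not a parent of any other path.
 print "$tomatch\n" unless $matchfound;
}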
 
Old 06-20-2012, 04:09 AM   #13
grail
Well, I must say I am curious how this script has met any of your requirements?

When run on the data from post #9 I get:
Code:
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
Now, if I am not mistaken, this has neither removed the duplicates nor listed only files?

Quote:
can we use some string filtering,
I do not see how any string filtering or manipulation will help as you have no way of telling the difference between files and directories.
Your own example is flawed in that only a visual look at the data tells you what is a file and what is a directory:
Quote:
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts/svnsbupdater

so we can remove the line "branches/upgrade/Build/scripts"

because always the file name will contain the dir path in it.
Not only will a file path always contain a directory path, but so will a directory path.
 
Old 06-20-2012, 10:02 PM   #14
sandikaxp
Below is how it worked. parse.pl contains the Perl code; by combining it with the awk command I was able to remove all the duplicates and the directories.

Code:
[san@san1 tmp]$ cat t.txt 
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/toolmenu/buildmenu
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++'
branches/upgrade/Build/scripts
branches/upgrade/Build/scripts/svnscripts
branches/upgrade/Build/scripts/toolmenu
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$ cat t.txt | awk '!_[$1]++' | ./parse.pl 
branches/upgrade/Build/scripts/svnscripts/svnsbupdater
branches/upgrade/Build/scripts/toolmenu/buildmenu
branches/upgrade/Build/scripts/compile/build
[san@san1 tmp]$
Thanks for helping me out...
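
The two-stage pipeline above could also be collapsed into a single awk pass. This is only a sketch under the same assumption the thread relies on (a path is treated as a directory when some other listed path lies beneath it); the name leaf_paths is made up for illustration.

```shell
# Hypothetical helper: drop duplicates and every path that is a parent of another path.
leaf_paths() {
    awk '
        {
            # Remember each distinct line once, in input order.
            if (!($0 in seen)) { seen[$0] = 1; order[++n] = $0 }
            # Strip trailing components one at a time; each shorter prefix is a parent.
            path = $0
            while (sub("/[^/]*$", "", path)) parent[path] = 1
        }
        END {
            # Print the distinct paths that never showed up as a parent.
            for (i = 1; i <= n; i++)
                if (!(order[i] in parent)) print order[i]
        }
    '
}

# Demo on a shortened version of the list.
printf '%s\n' \
    branches/upgrade/Build/scripts \
    branches/upgrade/Build/scripts/compile \
    branches/upgrade/Build/scripts/compile/build \
    branches/upgrade/Build/scripts/compile/build | leaf_paths
```

Each line is visited once instead of being compared against every other line, and a name such as A/B/C is not mistaken for the parent of A/B/C1 because prefixes are only cut at a slash.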
 
Old 06-21-2012, 08:55 AM   #15
grail
I think it is important to note, for people who might search and find this solution, that it relies on the incorrect assumption that the longest match for a given path always ends in a file name. An easy example, if we assume that blah is a directory:
Code:
branches/upgrade/Build/scripts/compile
branches/upgrade/Build/scripts/compile/blah
branches/upgrade/Build/scripts/compile/build
branches/upgrade/Build/scripts/toolmenu/buildmenu
Your solution will return branches/upgrade/Build/scripts/compile/blah as a valid file path, when only manual inspection tells us that it is in fact a directory.
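
If a checked-out copy of the branch happens to be available locally (an assumption, nothing in the thread says so), the ambiguity can be settled by asking the filesystem instead of guessing from the strings. A rough sketch; keep_regular_files and the directory layout in the demo are made up for illustration.

```shell
# Hypothetical helper: read paths on stdin, print those that are regular files under "$1".
keep_regular_files() {
    while IFS= read -r path; do
        # -f is true only for regular files, so directories such as blah are skipped.
        [ -f "$1/$path" ] && printf '%s\n' "$path"
    done
    return 0
}

# Demo with a throwaway tree: blah is a directory, build is a file.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/compile/blah"
: > "$demo_root/compile/build"
printf '%s\n' compile compile/blah compile/build | keep_regular_files "$demo_root"
rm -rf "$demo_root"
```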
 
  


All times are GMT -5. The time now is 08:40 AM.
