11-04-2009, 07:27 PM
|
#1
|
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Rep:
|
Need to concatenate many files with the same name occurring in many subdirectories
Hi all,
I have a "master" directory containing many subdirectories. Each subdirectory contains many files named #####.fa where # is any integer. I'm trying to write a script that will search through all of the subdirectories, find all the files with the same name, and concatenate them together into one output file of the same name in the master directory. The tricky part is that each file name doesn't necessarily occur in every folder. Do I need to make a query list for this or is there a way to specify files of the same name occurring in different directories?
Thanks!
Kevin
|
|
|
11-04-2009, 07:52 PM
|
#2
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Untested ... something along those lines *should* work.
Code:
find . -iname '[0-9]*.fa' -print0 | xargs -0 -I {} awk '{newfile=gensub( /.*\/([^\/]+)/, "\\1",1,FILENAME); print $0 >> ("/path/to/aggregate/" newfile) }' {}
Cheers,
Tink
|
|
|
11-04-2009, 08:09 PM
|
#3
|
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Rep:
|
Tinkster,
That concatenates the contents of all of the files of interest into one file which is very close. I'm trying to only concatenate files with the same name. I'm sorry I was unclear. I feel like the first half of the script is what the doctor ordered. I'm trying to play around with xargs.
Thanks,
Kevin
|
|
|
11-04-2009, 08:22 PM
|
#4
|
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,415
|
I'd start with using find or ls -R to generate a list of all files in all dirs, pipe through sort -u (or uniq) to get a unique list, then use that to drive the concatenate process.
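A minimal sketch of that approach, not tested against the original poster's data. It assumes GNU find, sort and xargs (for -printf, -z and -0) and file names without newlines; the function name concat_by_name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: concat_by_name MASTER_DIR
# Writes MASTER_DIR/NNN.fa for every NNN.fa found in the subdirectories
# of MASTER_DIR. Assumes GNU find/sort/xargs and names without newlines.
concat_by_name() {
    local master=$1 name
    # Step 1: unique list of base names found below the subdirectories.
    find "$master" -mindepth 2 -type f -name '[0-9]*.fa' -printf '%f\n' |
        sort -u |
        while IFS= read -r name; do
            # Step 2: concatenate every file with that name, in sorted path order.
            find "$master" -mindepth 2 -type f -name "$name" -print0 |
                sort -z |
                xargs -0 cat > "$master/$name"
        done
}
```

Called as concat_by_name /path/to/master, it overwrites the output files rather than appending, so running it twice does not duplicate content. A name that is missing from some subdirectories needs no special handling, because only the directories that actually contain it are matched.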
|
|
|
11-04-2009, 08:38 PM
|
#5
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Quote:
Originally Posted by kmkocot
Tinkster,
That concatenates the contents of all of the files of interest into one file which is very close. I'm trying to only concatenate files with the same name. I'm sorry I was unclear. I feel like the first half of the script is what the doctor ordered. I'm trying to play around with xargs.
Thanks,
Kevin
|
Not in my installation. I just tested it (with an admittedly small
sample, two identical files of the same name each, in the main and a
sub-directory ... ) and it produces two concatenated files, each of
which matches the name of 1 pair.
Code:
$ find -name \*fa
./tmp/12345.fa
./tmp/432112345.fa
./12345.fa
./432112345.fa
My output directory is /tmp
Code:
ls -ltr /tmp/*fa
-rw-r--r-- 1 tink tink 874 2009-11-05 14:35 /tmp/432112345.fa
-rw-r--r-- 1 tink tink 1748 2009-11-05 14:35 /tmp/12345.fa
And that's running my snippet above unmodified.
What OS are you on, which version of awk are you running?
Cheers,
Tink
|
|
|
11-04-2009, 08:58 PM
|
#6
|
Senior Member
Registered: Aug 2006
Posts: 2,697
|
Code:
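#!/bin/bash
# List each match as "basename:fullpath", then append every file to /destination/<basename>.
find /path -type f -name "[0-9]*.fa" -printf "%f:%p\n" | awk -F":" 'BEGIN {master="/destination/"}
{
filename=$1
fullpath=substr($0, length($1) + 2)   # everything after the first ":", so paths containing ":" survive
while( (getline line < fullpath ) > 0 ){
print line >> (master filename)
}
close(fullpath)          # release the input file that was just read
close(master filename)   # release the output file so many names do not exhaust descriptors
}'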
|
|
|
11-05-2009, 07:09 PM
|
#7
|
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Rep:
|
Oops
Tinkster: My apologies... Operator error. I used FILENAME as the target of the print command at the end (instead of newfile) and it concatenated them all to that file. Now it is behaving as you said it should. Thank you!
ghostdog74: your script also works brilliantly.
Thank you, thank you, thank you!
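For anyone who hits the same mix-up: in awk, FILENAME holds the path of the input file exactly as it was passed in, while newfile in Tinkster's snippet holds only the base name. The sketch below makes the difference visible. It uses the portable sub() rather than gawk's gensub(), so it is a stand-in for the original expression, not a copy of it, and the function name show_names is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: show_names FILE...
# Prints, once per non-empty input file, awk's FILENAME next to the base
# name obtained by stripping everything up to the last "/".
show_names() {
    awk 'FNR == 1 {
        base = FILENAME
        sub(/.*\//, "", base)      # portable stand-in for the gensub() call
        print FILENAME " -> " base
    }' "$@"
}
```

For example, show_names ./tmp/12345.fa ./12345.fa prints one line per file, and both lines end in the same base name, which is why the two inputs land in the same aggregate file.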
|
|
|
12-12-2009, 08:14 PM
|
#8
|
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Rep:
|
Similar question...
I have a directory with many subdirectories each named like so: KOG0001, KOG0002, ...KOG9999.
Each of these subdirectories contains a variable number of two kinds of files (nuc and prot), named like so: Capitella_sp_nuc_hits.fasta (nuc) and Capitella_sp_prot_hits.fasta (prot). The Capitella_sp part represents the name of the species and varies from file to file.
I'm trying to write a script that will go through each subdirectory and concatenate the contents of all the _prot_hits.fasta files into one file in the main directory named like KOG0001.fasta, KOG0002.fasta, and so on. I think I have it figured out except how to reference the source files that I want. Can anyone help me out?
Code:
find . -maxdepth 2 -type f -name "KOG[0-9][0-9][0-9][0-9]*_prot_hits.fasta" -printf "%p\n" | awk -F"_" '{print $1}' | sed 's/$/&.fasta/g' | awk -F"/" 'BEGIN {filename=$3 while((getline line < **original_source_files** ) > 0) {print line >> filename} close(filename)}'
Thanks,
Kevin
Last edited by kmkocot; 12-14-2009 at 02:33 AM.
|
|
|
12-13-2009, 01:35 AM
|
#9
|
Senior Member
Registered: Aug 2006
Posts: 2,697
|
put your code in code tags
|
|
|
12-14-2009, 06:19 PM
|
#10
|
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Rep:
|
Sorry about that. Is there a way I can reference the original source files I want to concatenate if I set up the expression like this?
|
|
|
12-15-2009, 11:23 AM
|
#11
|
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Rep:
|
Nevermind... I was making it harder than it should be.
Code:
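# Unquoted, $i_ reads an unset variable named i_, so match on the common suffix instead.
for i in KOG????; do find "$i" -type f -name '*_prot_hits.fasta' -exec cat {} + >> "$i.fasta"; done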
|
|
|
12-15-2009, 08:21 PM
|
#12
|
Senior Member
Registered: Aug 2006
Posts: 2,697
|
Quote:
Originally Posted by kmkocot
Nevermind... I was making it harder than it should be.
|
yes you are
Quote:
Code:
for i in KOG????; do find "$i" -type f -name '*_prot_hits.fasta' -exec cat {} + >> "$i.fasta"; done
|
Why run the find command once for every KOG???? directory you iterate over? Let find do it for you (or have I missed something?)
Code:
find KOG???? -iname "KOG????_*_prot_hits.fasta" .......
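The command above is left unfinished in the post. One way a single find call could do the whole job is sketched below; this is an illustration and not ghostdog74's actual command. It assumes the file naming described in post #8 (Capitella_sp_prot_hits.fasta and similar), so it matches on the _prot_hits.fasta suffix only, and the function name concat_prot_hits is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: concat_prot_hits
# Run from the directory that holds KOG0001, KOG0002, ...
# One find call walks every KOG???? directory; the inline sh script
# appends each match to <top-level directory>.fasta.
concat_prot_hits() {
    find KOG???? -type f -name '*_prot_hits.fasta' -exec sh -c '
        for f in "$@"; do
            dir=${f%%/*}              # KOG0001/x_prot_hits.fasta -> KOG0001
            cat "$f" >> "$dir.fasta"
        done' sh {} +
}
```

Because the output is appended with >>, any KOG????.fasta files left over from an earlier run should be removed first, otherwise their content is duplicated.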
|
|
|
All times are GMT -5.