LinuxQuestions.org
Old 11-04-2009, 07:27 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
Need to concatenate many files with the same name occurring in many subdirectories


Hi all,

I have a "master" directory containing many subdirectories. Each subdirectory contains many files named #####.fa where # is any integer. I'm trying to write a script that will search through all of the subdirectories, find all the files with the same name, and concatenate them together into one output file of the same name in the master directory. The tricky part is that each file name doesn't necessarily occur in every folder. Do I need to make a query list for this or is there a way to specify files of the same name occurring in different directories?

Thanks!
Kevin
 
Old 11-04-2009, 07:52 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,066
Blog Entries: 11

Rep: Reputation: 910
Untested ... something along those lines *should* work.

Code:
find . -iname '[0-9]*.fa' | xargs -I {} awk '{newfile=gensub(/.*\/([^\/]+)/, "\\1", 1, FILENAME); print $0 >> ("/path/to/aggregate/" newfile)}' {}


Cheers,
Tink
 
Old 11-04-2009, 08:09 PM   #3
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Tinkster,

That concatenates the contents of all of the files of interest into one file which is very close. I'm trying to only concatenate files with the same name. I'm sorry I was unclear. I feel like the first half of the script is what the doctor ordered. I'm trying to play around with xargs.

Thanks,
Kevin
 
Old 11-04-2009, 08:22 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,241

Rep: Reputation: 2325
I'd start with using find or ls -R to generate a list of all files in all dirs, pipe through sort -u (or uniq) to get a unique list, then use that to drive the concatenate process.
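A rough sketch of that list-driven approach, tried only on a toy tree. `merge_by_name` and its two arguments are made-up names for illustration. It assumes GNU find (for `-printf`) and that the output directory sits outside the tree being searched, otherwise the merged files would be picked up as input on a later run.

```shell
# merge_by_name SRC DEST   (hypothetical helper, not from this thread)
# SRC:  top of the tree holding the #####.fa files
# DEST: output directory, assumed to lie outside SRC
merge_by_name() {
    mkdir -p "$2"
    # Step 1: list every base name once. GNU find's %f prints the
    # file name without its leading directories.
    find "$1" -type f -name '[0-9]*.fa' -printf '%f\n' | sort -u |
    while IFS= read -r name; do
        # Step 2: concatenate all files carrying that name.
        # The order in which find visits them is not guaranteed.
        find "$1" -type f -name "$name" -exec cat {} + > "$2/$name"
    done
}
```

Called as `merge_by_name /path/to/master /path/to/output`. A name that occurs in only one subdirectory simply ends up as a copy of that one file, so it does not matter that not every name is present in every folder.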
 
Old 11-04-2009, 08:38 PM   #5
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,066
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by kmkocot
Tinkster,

That concatenates the contents of all of the files of interest into one file which is very close. I'm trying to only concatenate files with the same name. I'm sorry I was unclear. I feel like the first half of the script is what the doctor ordered. I'm trying to play around with xargs.

Thanks,
Kevin
Not in my installation. I just tested it (with an admittedly small
sample, two identical files of the same name each, in the main and a
sub-directory ... ) and it produces two concatenated files, each of
which matches the name of 1 pair.

Code:
$ find -name \*fa
./tmp/12345.fa
./tmp/432112345.fa
./12345.fa
./432112345.fa
My output directory is /tmp
Code:
ls -ltr /tmp/*fa
-rw-r--r-- 1 tink    tink     874 2009-11-05 14:35 /tmp/432112345.fa
-rw-r--r-- 1 tink    tink    1748 2009-11-05 14:35 /tmp/12345.fa
And that's running my snippet above unmodified.

What OS are you on, which version of awk are you running?



Cheers,
Tink
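On the awk version question: gensub() is a gawk extension, so the one-liner in post #2 depends on gawk. A minimal sketch of the same idea that should run on any POSIX awk uses sub() on a copy of FILENAME instead. `merge_with_awk` and its arguments are hypothetical names, not something from this thread, and the output directory is again assumed to lie outside the searched tree.

```shell
# merge_with_awk SRC DEST   (hypothetical helper)
merge_with_awk() {
    mkdir -p "$2"
    find "$1" -type f -name '[0-9]*.fa' -exec awk -v dest="$2" '
        FNR == 1 {                      # first line of each input file
            if (out != "") close(out)   # release the previous output file
            name = FILENAME
            sub(/.*\//, "", name)       # strip the directory part, keep the base name
            out = dest "/" name
        }
        { print >> out }                # append the line to DEST/<base name>
    ' {} +
}
```

Because the output is opened in append mode, running it twice against the same output directory duplicates the contents, so clear the output directory between runs.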
 
Old 11-04-2009, 08:58 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Code:
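#!/bin/bash
# List each .fa file as "basename:fullpath", then append its lines to /destination/basename.
find /path -type f -name "[0-9]*.fa" -printf "%f:%p\n" | awk -F":" 'BEGIN {master="/destination/"}
{
        filename=$1
        fullpath=$2
        out=master filename
        while( (getline line < fullpath ) > 0 ){
                print line >> out
        }
        close(fullpath)   # close the file that was just read
        close(out)        # and the output, so open files do not pile up
}'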
 
Old 11-05-2009, 07:09 PM   #7
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Oops

Tinkster: My apologies... Operator error. I used FILENAME as the target of the print command at the end (instead of newfile) and it concatenated them all to that file. Now it is behaving as you said it should. Thank you!

ghostdog74: your script also works brilliantly.

Thank you, thank you, thank you!
 
Old 12-12-2009, 08:14 PM   #8
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Similar question...
I have a directory with many subdirectories each named like so: KOG0001, KOG0002, ...KOG9999.

Each of these subdirectories contains a variable number of two kinds of files (nuc and prot) named like so: Capitella_sp_nuc_hits.fasta (nuc) and Capitella_sp_prot_hits.fasta (prot). The Capitella_sp part represents the name of the species and varies from file to file.

I'm trying to write a script that will go through each subdirectory and concatenate the contents of all the _prot_hits.fasta files into one file in the main directory named like KOG0001.fasta, KOG0002.fasta, and so on. I think I have it figured out except how to reference the source files that I want. Can anyone help me out?

Code:
find . -maxdepth 2 -type f -name "KOG[0-9][0-9][0-9][0-9]*_prot_hits.fasta" -printf "%p\n" | awk -F"_" '{print $1}' | sed 's/$/&.fasta/g' awk -F"/" 'BEGIN {filename=$3 while((getline line < **original_source_files** ) > 0) {print line >> filename} close(filename)}'
Thanks,
Kevin

Last edited by kmkocot; 12-14-2009 at 02:33 AM.
 
Old 12-13-2009, 01:35 AM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
put your code in code tags
 
Old 12-14-2009, 06:19 PM   #10
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Sorry about that. Is there a way I can reference the original source files I want to concatenate if I set up the expression like this?
 
Old 12-15-2009, 11:23 AM   #11
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Nevermind... I was making it harder than it should be.

Code:
for i in KOG????;do find $i -name $i_*_prot_hits.fasta -exec cat {} >> $i.fasta \; ;done
 
Old 12-15-2009, 08:21 PM   #12
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by kmkocot
Nevermind... I was making it harder than it should be.
yes you are
Quote:
Code:
for i in KOG????;do find $i -name $i_*_prot_hits.fasta -exec cat {} >> $i.fasta \; ;done
Why execute the find command once for every KOG???? you iterate over? Let find do it for you (or have I missed something?)
Code:
find KOG???? -iname "KOG????_*_prot_hits.fasta" .......
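One thing worth knowing about the loop in post #11: the shell reads `$i_` as a variable named `i_` (normally unset), so the pattern find receives is just `*_prot_hits.fasta`. Writing `${i}_` would be needed to get the KOG prefix into the pattern, and quoting the pattern keeps the shell from expanding it first. Below is a sketch of the single-find approach, assuming the files are named as in post #8 (e.g. Capitella_sp_prot_hits.fasta) and every KOG directory sits directly under one top-level directory. `merge_prot_hits` is a made-up name, and this is only one possible way to fill in the part left open above.

```shell
# merge_prot_hits TOP   (hypothetical helper)
# TOP is assumed to hold the KOG0001, KOG0002, ... subdirectories.
merge_prot_hits() (
    cd "$1" || exit 1
    # A single find call walks every KOG directory.
    find KOG???? -type f -name '*_prot_hits.fasta' -exec sh -c '
        for f do
            kog=${f%%/*}               # leading path component, e.g. KOG0001
            cat "$f" >> "$kog.fasta"   # append into TOP/KOG0001.fasta
        done' sh {} +
)
```

The function body runs in a subshell, so the `cd` does not affect the calling shell. Output is appended, so remove any old KOG????.fasta files before running it again.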
 
  


All times are GMT -5. The time now is 07:16 AM.
