LinuxQuestions.org


jahndavik 03-13-2014 07:27 AM

Parsing folders picking files and concatenating them
 
Hi there,
I am writing a shell script to go through folders and pick out files whose names contain a specific signature (a fixed part of the file name). The code I have written so far is as follows:

Code:

for i in $(ls); do                             
  #echo item: $i                               
  if [ -d $i ]; then                           
    cd $i 
    echo folder:$i                                 
    for z in *.fastq; do                         
      #echo item: $z       
      n=0
      if echo "$z" | grep -q "R1_00.";then       
          echo $z
          # here I want to append the current file to the previous file
          # with the same signature (= R1_00.)
      fi
    done                                     
    cd ..                                       
  fi                                           
done

When I run this code, one of the folders produces the following output:

folder:SonNot24
31_GCCAAT_L005_R1_001.fastq
31_GCCAAT_L006_R1_001.fastq
33_ACAGTG_L003_R1_001.fastq
33_ACAGTG_L004_R1_001.fastq
35_TGACCA_L001_R1_001.fastq
35_TGACCA_L002_R1_001.fastq


Ultimately I would like to cat all these files into one with the name of the folder (SonNot24).

Any help is appreciated.

Thanks.
jahn

danielbmartin 03-13-2014 08:50 AM

Quote:

Originally Posted by jahndavik (Post 5133813)
... I am writing a shell script to go through folders and pick out files whose names contain a specific signature (a fixed part of the file name).

Maybe you need nothing more than the cat command. Just off the top of my head, I entered this on the command line:
Code:

cat /home/daniel/Desktop/LQfiles/*m1*.bin >/home/daniel/Desktop/LQfiles/hugefile.bin
It took all files in the folder /home/daniel/Desktop/LQfiles/ whose names met certain criteria and concatenated them into one new file called /home/daniel/Desktop/LQfiles/hugefile.bin. The selection criteria were these: the name contains the character string m1, and the file extension is .bin.
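
Applied to the folder layout in the original question, the same idea could look roughly like the sketch below. It assumes the sample folders sit directly under the current directory, that the wanted files match the glob *R1_00?.fastq, and that the combined file should be written next to the folder as <folder>.fastq. The function name concat_r1 and the output naming are illustrative assumptions, not something stated in the thread.

```shell
#!/bin/sh
# Sketch: for every sub-folder, concatenate its *R1_00?.fastq files
# into one file named after the folder (e.g. SonNot24.fastq).
# concat_r1 and the output location are assumptions made for illustration.
concat_r1() {
    for dir in */; do
        [ -d "$dir" ] || continue        # skips the literal "*/" when there are no folders
        dir=${dir%/}                     # strip the trailing slash
        set -- "$dir"/*R1_00?.fastq      # matching files, in sorted glob order
        [ -e "$1" ] || continue          # nothing matched in this folder
        cat -- "$@" > "$dir.fastq"       # write beside the folder, not inside it
    done
}
```

Writing the result beside the folder rather than inside it keeps the combined file from being picked up by a later *.fastq loop over the same folder.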

Daniel B. Martin

grail 03-13-2014 08:51 PM

Daniel's suggestion is valid, so I will just offer some advice on your question and on general coding:

1. Please use [code][/code] tags around code and data to maintain formatting

2. Do not use 'ls' to feed a for loop (or generally any type of loop), see here for more details

3. Even though the script is short and may not be around for long, using meaningful variable names helps readability

4. Get into the habit of quoting all variables

5. grep is overkill in this scenario ... Check here and search for regex on the page

6. In regexes (regular expressions), '.' matches any single character. Hence "R1_00." in your code looks for the string
"R1_00" followed by any single character. If you want the string followed by a literal period (.), you need to escape it
using either "\." or "[.]"
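
As a rough illustration of points 2, 4, 5 and 6 taken together, the loop from the question might be reshaped along these lines. This is only a sketch: list_r1_files is a made-up helper name, and the pattern assumes the signature is "R1_00", one arbitrary character, and then a literal ".fastq" at the end of the name.

```shell
#!/usr/bin/env bash
# Sketch only: list_r1_files is a hypothetical helper, not from the thread.
# Prints every *.fastq file, one folder level down, whose name matches
# the assumed signature.
list_r1_files() {
    local folder fastq_file
    local signature='R1_00.[.]fastq$'    # "." = any character, "[.]" = literal period
    for folder in */; do                 # a glob instead of $(ls)
        for fastq_file in "$folder"*.fastq; do
            [[ -f "$fastq_file" ]] || continue   # skip patterns that matched nothing
            if [[ "$fastq_file" =~ $signature ]]; then   # no grep needed
                printf '%s\n' "$fastq_file"
            fi
        done
    done
}
```

Because every expansion is quoted, folder names containing spaces are handled as single words.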

Hope some of that helps :)

jahndavik 03-21-2014 08:12 AM

I never got that regex to work. The grep -q approach works, though it may be overkill.

Re regex:
I look for the string 'R1.fastq.gz' in the file name and use:
Code:

if echo "$z" | grep -q "R1.fastq.gz";
I've tried
Code:

if [[ "$z" == R1"[.]"fastq"[.]"gz ]];
and
Code:

if [[ "$z" == R1[.]fastq[.]gz ]];
and
Code:

if [[ "$z" == R1"\."fastq"\."gz ]];
and
Code:

if [[ "$z" == R1\.fastq\.gz ]];
None of them works.

So, please, anyone :-)
Thanks.

jahn

grail 03-21-2014 10:29 AM

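I found the trick to regex in bash is to assign it to a variable using single quotes ('') and then use the bare, unquoted variable in the test (one of those times when quotes do not help, because quoting the right-hand side of =~ makes bash match it as a literal string).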
Code:

regex='R1[.]fastq[.]gz'

[[ "$z" =~ $regex ]] && echo we have a match
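
To see the effect of the quoting, the two forms can be put side by side. This is a small sketch for bash 3.2 or later; the helper names and the sample file name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: has_signature and has_signature_quoted are hypothetical helpers.
regex='R1[.]fastq[.]gz'

has_signature() {
    # Unquoted $regex is treated as a regular expression and may match
    # anywhere inside the string.
    [[ "$1" =~ $regex ]]
}

has_signature_quoted() {
    # Quoted "$regex" is compared as literal text, so the brackets
    # would have to appear verbatim in the string for this to succeed.
    [[ "$1" =~ "$regex" ]]
}

z='31_GCCAAT_L005_R1.fastq.gz'   # assumed sample name
has_signature "$z"        && echo "unquoted: match"
has_signature_quoted "$z" || echo "quoted: no match"
```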

Pointers on your tests:

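1. == - inside [[ ]] this tests whether the whole string matches the right-hand side (quoted parts literally, unquoted parts as a glob pattern); it does not search for a substring. Your tests fail because "$z" holds the complete file name, and the quoted "[.]" and "\." add literal characters that the file name does not contain

2. If you do want to compare one entire string against another, you can keep the test you have and just use the plain string: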
Code:

[[ "$z" == "R1.fastq.gz" ]]
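
If the aim is instead to check that the name merely ends with that text, the same kind of test can be kept by putting an unquoted * in front of the quoted string, since bash treats unquoted parts of the right-hand side of == as a glob pattern that has to match the whole string. A small sketch, with a made-up helper name and sample file names:

```shell
#!/usr/bin/env bash
# Sketch: ends_with_r1_gz and the sample names are assumptions.
ends_with_r1_gz() {
    # The unquoted * matches any prefix; the quoted part is literal text.
    [[ "$1" == *"R1.fastq.gz" ]]
}

ends_with_r1_gz '31_GCCAAT_L005_R1.fastq.gz' && echo "match"
ends_with_r1_gz '31_GCCAAT_L005_R2.fastq.gz' || echo "no match"
```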


All times are GMT -5.