LinuxQuestions.org

progchi 10-29-2012 05:25 PM

write to file under certain conditions
 
Hello,

I have a file with 2 columns looking like this:

>GENE1 ACGGTTAGAGCCCAGAGTTGAGACCCGTGGAG
>GENE2 NACCCCGATCGTACGRRSTVACCCGA
>GENE3 TGCGAGCNNTTTSSR
>GENE4 CGATGCTGCGCGATCTCTAGAGAGCCCAG

I want to obtain 2 files: one with the rows whose column 2 contains only A's, C's, T's or G's, and another with the rows whose column 2 also contains characters other than A's, C's, T's or G's.
So in this case:
File 1:
>GENE1 ACGGTTAGAGCCCAGAGTTGAGACCCGTGGAG
>GENE4 CGATGCTGCGCGATCTCTAGAGAGCCCAG

File 2:
>GENE2 NACCCCGATCGTACGRRSTVACCCGA
>GENE3 TGCGAGCNNTTTSSR


I really tried several things, but nothing worked :-(.

Thanks in advance!

rknichols 10-29-2012 10:38 PM

Code:

egrep '^[^ ]+ +[ACTG]+ *$' file > file1     # A, C, T, G only
egrep -v '^[^ ]+ +[ACTG]+ *$' file > file2  # lines not matching the above

The regular expression matches one or more non-space characters at the beginning of the line, followed by one or more spaces, followed by one or more of the characters ACTG, and optional trailing spaces up to the end of the line. The second command simply uses the "-v" option to invert the match. A shortcoming of that second command is that it also prints any line that doesn't fit the two-column format at all. A better, but more complex, command for that second case would be:
Code:

egrep '^[^ ]+ +[^ ]*[^ACTG ][^ ]* *$' file > file2

That one looks for a 2nd field that consists of any number of non-space characters, followed by one character that is neither a space nor one of ACTG, followed by any number of non-space characters. It prints only lines with exactly two fields where the 2nd field contains a character that is not ACTG. The space inside the bracket expression matters: without it, a doubled separator space or a trailing space could itself be taken as the "not ACTG" character, and a clean line would be printed as well.
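
For anyone who wants to try both patterns end to end, here is a minimal sketch using the sample rows from the first post. The file names genes.txt, clean.txt and ambiguous.txt are only placeholders, and it uses "grep -E", which is equivalent to egrep (newer GNU grep releases warn that egrep is obsolescent). The second pattern lists the space inside the bracket expression, so that a separator or trailing space cannot stand in for the "not ACTG" character.

```shell
# Sample input taken from the question; genes.txt is a placeholder name.
cat > genes.txt <<'EOF'
>GENE1 ACGGTTAGAGCCCAGAGTTGAGACCCGTGGAG
>GENE2 NACCCCGATCGTACGRRSTVACCCGA
>GENE3 TGCGAGCNNTTTSSR
>GENE4 CGATGCTGCGCGATCTCTAGAGAGCCCAG
EOF

# Rows whose 2nd field consists only of A, C, T and G.
grep -E '^[^ ]+ +[ACTG]+ *$' genes.txt > clean.txt

# Rows with exactly two fields whose 2nd field contains at least one
# character that is neither a space nor one of A, C, T, G.
grep -E '^[^ ]+ +[^ ]*[^ACTG ][^ ]* *$' genes.txt > ambiguous.txt
```

With this input, clean.txt should end up with the GENE1 and GENE4 rows and ambiguous.txt with the GENE2 and GENE3 rows. Both patterns assume the columns are separated by spaces, not tabs.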

colucix 10-30-2012 03:13 AM

Another suggestion using awk:
Code:

awk '{ if ($2 ~ /[^ACGT]/) print > "file2"; else print > "file1" }' file
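
If the input might contain lines that don't have exactly two columns, one possible variation is to guard the same test with NF == 2 so that such lines are written to neither file. This is only a sketch: the file names and the extra one-column line are made up for illustration, and the rest of the sample data is from the question.

```shell
# Sample rows from the question plus one hypothetical malformed line.
cat > genes.txt <<'EOF'
>GENE1 ACGGTTAGAGCCCAGAGTTGAGACCCGTGGAG
>GENE2 NACCCCGATCGTACGRRSTVACCCGA
>GENE_WITHOUT_SEQUENCE
>GENE3 TGCGAGCNNTTTSSR
>GENE4 CGATGCTGCGCGATCTCTAGAGAGCCCAG
EOF

# Only lines with exactly two fields are considered (NF == 2).
# A 2nd field with any character outside ACGT goes to mixed.txt,
# everything else goes to acgt_only.txt.
awk 'NF == 2 { if ($2 ~ /[^ACGT]/) print > "mixed.txt"; else print > "acgt_only.txt" }' genes.txt
```

Because awk splits fields on any run of blanks by default, this also works when the columns are separated by tabs or by several spaces.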

progchi 10-30-2012 10:48 AM

Thank you both very much. This helped me a lot!


All times are GMT -5.