LinuxQuestions.org


kmkocot 02-03-2011 12:51 PM

Shell script to search one file for contents of another and replace text?
 
Hi all,

Suppose I have a pair of files containing lists. The first file is called contigs.txt and it contains a list that looks like this:
Code:

Contig822
Contig826
Contig835
Contig841
Contig917
Contig968
GKUWSH001A21XE
GKUWSH001AK3B4
GKUWSH001AO6MK
GKUWSH001AQU52

The second file is called reads_in_contigs.txt. It consists of a row of *s, a Contig name, a second row of *s, and a list of the singletons that make up that contig. It looks like this:
Code:

*******************
Contig817
********************
GMNVR6W01AS7Z9
GKUWSH001D3EXR
GMNVR6W01AS7Z9
*******************
Contig818
********************
GMNVR6W01EOCOR
GMNVR6W01DFJYN
GMNVR6W01EOCOR
*******************
Contig819
********************
GMNVR6W01D7LDZ
GMNVR6W01DBITS
GMNVR6W01D7LDZ
GMNVR6W01AI41M
*******************
Contig820
********************
GMNVR6W01D351L
GMNVR6W01AJJQI
GLOIIK001C5DHC
*******************
Contig821
********************
GMNVR6W01EHWPG
GKUWSH001C70GV
GMNVR6W01EHWPG
GKUWSH001B7R9X
GMNVR6W01EHWPG
GKUWSH001B5O4D
GMNVR6W01EHWPG
GKUWSH001ALEE4
GKUWSH001DDMOQ
GKUWSH001DBEUT
GKUWSH001DDMOQ
*******************
Contig822
********************
GMNVR6W01BC1GA
GMNVR6W01BXQE5
GKUWSH001B1416
GKUWSH001A0VIX
GKUWSH001B1416
GKUWSH001DB04I
GLOIIK001B3O6Z
GKUWSH001DB04I
GKUWSH001E3NLP
GLOIIK001B3O6Z
GKUWSH001EV96A
GLOIIK001B3O6Z
GLOIIK001CAD8R
GLOIIK001B3O6Z
GLOIIK001EB88S
GLOIIK001B3O6Z
GKUWSH001DM3ZY
GLOIIK001EB88S
GLOIIK001DLAAT
GLOIIK001B3O6Z
GLOIIK001DDSHO
GLOIIK001B3O6Z
GKUWSH001BTEFS
GLOIIK001B3O6Z
GLOIIK001D9JSF
GLOIIK001CA9NU
GKUWSH001CH1QS
GKUWSH001A1T82
GKUWSH001CH1QS
GKUWSH001D4PWE
GKUWSH001EPER0
GKUWSH001D4PWE
*******************
Contig823
********************
GMNVR6W01D3V2S
GMNVR6W01EOZ66
*******************
Contig824
********************
GMNVR6W01AI64M
GLOIIK001AEI4W
GKUWSH001D18JE
GLOIIK001AEI4W

What I want to do is search through the reads_in_contigs.txt file and replace contig names with the singletons that make them up. For example, I would like to replace the text "Contig822" in the reads_in_contigs.txt file with the following:
Code:

GMNVR6W01BC1GA
GMNVR6W01BXQE5
GKUWSH001B1416
GKUWSH001A0VIX
GKUWSH001B1416
GKUWSH001DB04I
GLOIIK001B3O6Z
GKUWSH001DB04I
GKUWSH001E3NLP
GLOIIK001B3O6Z
GKUWSH001EV96A
GLOIIK001B3O6Z
GLOIIK001CAD8R
GLOIIK001B3O6Z
GLOIIK001EB88S
GLOIIK001B3O6Z
GKUWSH001DM3ZY
GLOIIK001EB88S
GLOIIK001DLAAT
GLOIIK001B3O6Z
GLOIIK001DDSHO
GLOIIK001B3O6Z
GKUWSH001BTEFS
GLOIIK001B3O6Z
GLOIIK001D9JSF
GLOIIK001CA9NU
GKUWSH001CH1QS
GKUWSH001A1T82
GKUWSH001CH1QS
GKUWSH001D4PWE
GKUWSH001EPER0
GKUWSH001D4PWE

Any suggestions would be greatly appreciated!

Thanks,
Kevin

goossen 02-03-2011 12:55 PM

And where do you use contigs.txt?

crts 02-03-2011 03:10 PM

Hi,

Do you mean something like this?
Code:

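#!/bin/bash
# Save stdout on fd 3, then send all output to the file 'result'.
exec 3>&1 1>result
while read -r line; do
  if [[ "$line" =~ Contig ]]; then
    # Find the contig's header line (anchored, so Contig82 does not match
    # Contig822), skip the star row, collect its reads up to the next star
    # row or end of file, print them, and quit with status 99 (GNU sed).
    sed -rn "/^$line\$/{n;n;h;:a n;/\*+/bb;$ {H;bb};H;ba;:b x;p;Q99}" /path/to/reads_in_contigs.txt
    [[ $? == 99 ]] && continue   # replacement was printed; skip the echo
  fi
  echo "$line"   # singleton, or a contig with no entry: keep unchanged
done < /path/to/contigs.txt
# Restore stdout and close fd 3.
exec 1>&3 3>&-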

It is not clear from your initial post how you want to handle non-existent replacements. E.g., Contig826 has no replacement according to your sample data. Do you want to keep it or replace it with nothing? The above script keeps it. The resulting file is named 'result'. Rename it if necessary.

grail 02-03-2011 08:43 PM

I am with goossen in that the following does not mention the contigs.txt file anywhere:
Quote:

Originally Posted by kmkocot
What I want to do is search through the reads_in_contigs.txt file and replace contig names with the singletons that make them up. For example, I would like to replace the text "Contig822" in the reads_in_contigs.txt file with the following:

Although I see crts' crystal ball is working better than mine ;)

kmkocot 02-04-2011 02:40 PM

crts, Nailed it! This really helped me out. Thank you. As you guessed, if a Contig# doesn't appear in reads_in_contigs.txt (e.g., Contig826) I want it to remain unchanged in contigs.txt. All of the Contig# names in contigs.txt should have been replaced with the singletons that make them up so if one remained it would serve as a red flag to me that I did something wrong.

grail and goossen, I'm sorry I wasn't very clear on what I was trying to do. If you care, it should have said: "What I want to do is search through the reads_in_contigs.txt file for matches to contig and singleton names in the contigs.txt file and replace contig names (e.g., Contig822) with the singletons that make them up." Does that make more sense?

Thanks a lot!
Kevin
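
The "red flag" check described above can be sketched with grep. This is only an illustration: the function name leftover_contigs is made up here, and it assumes contig names are always the word "Contig" followed by digits, alone on a line, and that the output file is the 'result' file written by crts' script.

```shell
#!/bin/bash
# leftover_contigs FILE
# Hypothetical helper: lists (with line numbers) any contig names that were
# NOT replaced. Assumes names look like "Contig" followed by digits.
# Exit status is 0 if leftovers were found, 1 if there were none.
leftover_contigs() {
  grep -n '^Contig[0-9]*$' "$1"
}
```

Possible usage: leftover_contigs result || echo "all contigs were expanded"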

grail 02-05-2011 01:50 AM

Cool ... in that case how about something like:
Code:

awk 'FNR == NR && NF{getline arr[$0];next}{for(x = 1;x <= NF;x++)if($x in arr)print arr[$x];else print $x}' RS="[\n]?[*]+\n" reads_in_contigs.txt contigs.txt

kmkocot 10-28-2011 02:09 PM

Hi all,

I'm trying to do something similar to the above but with a twist. Say we have a similar contigs.txt file that looks something like this:
Code:

Contig822
Contig826
Contig835

We also have a file containing DNA or amino acid sequences (sequences.txt) where each sequence spans two lines. The first line that begins with ">" is the header (some of these will correspond to entries in contigs.txt) and the second is the sequence itself.
Code:

>Contig822
GGAACAAAAACGGTTGGATGGCCTGAAAAATGGACAAATGTTTATTTATATAATTATATAAACATACAACAAGAGGTATCATTATATCCGTGTATAGTGTATAGTATACTGTATACGATAGTCTTGTCGTTCATTATGTTAAAATGAGAT
>Contig826
TACTTCTCTATTCGCACAGGCCTTCAAATAGTTCTACTGTTCCGCCATATGTATCTATTGTAATTGTATTGCATGTTTATTTTATGTTGCACATTAATCAATCAACCTCGTACAGTGTGTCGGGCAGTTGGAACACTATAGGGCCTATAGTATAAGCGCACTTTAAATCACCTCATTCATTCATACAAAACATCGACAGTCCCATTCTATTTTCATATGCTACCAGATAGTCAATGTTTCAGATGAGCAAC
>Contig830
TTTTTATATGTGTTTTTTTATTTATATATACAAAGCTTCATGTTTAGAATCGCAACTTCCTGAAGAACCACTTGTGTTTGCCAGTCTTGTATTTCGCCTCCATCTTGGACTTGATCTCTCGCGACGCTTTCCTCTTCCTGCCTGGATCCTTGAACGCCTCCTTGTTCACCACCTCTTTGTTGTAGTCCAAGTCTACGCAGTATCTGGTAGGCATGAGATGGTTGTAATTGTACAGCTTGATGAACGGGCGGATCTTGGAGCGCT
>Contig835
TGTCGAACACTCAGAAAATGTCGAGTAAATCTAATGAGAAAGGAGGAGCTGAGGAGATTCCCATCAAGAAATGTAAGACTGACCCATCCACACCGACAAAAGCTGATAGCAGTGGTGGTGG

I want to extract the headers and sequences in the sequences.txt file and send them to a new file if the header (not including the ">") appears in the contigs.txt file. I thought an awk command might work well but I can't figure out how to use getline to get multiple lines. Also, it won't print the ">" that I want to add in. Any help would be greatly appreciated! Here's what I have so far:
Code:

awk 'FNR == NR && NF{getline arr[$0];next}{for(x = 1;x <= NF;x++)if($x in arr)print ">"arr[$x];else print $x}' RS=">" sequences.txt contigs.txt
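
One possible approach to the question above, offered as a sketch rather than a tested solution for the real data: read contigs.txt first to collect the wanted names, then walk sequences.txt with the default line-by-line records and switch printing on or off at each header. This avoids getline and a custom RS altogether, and the ">" is kept because the header line is printed as it is. The function name extract_seqs and the names want and keep are made up for this example; it assumes the name is the first word after ">" on the header line.

```shell
#!/bin/bash
# extract_seqs CONTIGS_FILE SEQUENCES_FILE
# Hypothetical helper: prints each header whose name is listed in
# CONTIGS_FILE, together with the sequence line(s) that follow it.
extract_seqs() {
  awk '
    # First file: remember every wanted name (blank lines are skipped).
    FNR == NR { if (NF) want[$1]; next }
    # Header line: strip the leading ">" and check the name against the list.
    /^>/ { keep = (substr($1, 2) in want) }
    # Print the header and any following sequence lines while keep is true.
    keep
  ' "$1" "$2"
}
```

Possible usage: extract_seqs contigs.txt sequences.txt > selected_sequences.txt

Because the keep flag stays set until the next header, this should also cope with sequences that wrap over several lines.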

