LinuxQuestions.org
02-03-2011, 01:51 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
Shell script to search one file for contents of another and replace text?


Hi all,

Suppose I have a pair of files containing lists. The first file is called contigs.txt and it contains a list that looks like this:
Code:
Contig822
Contig826
Contig835
Contig841
Contig917
Contig968
GKUWSH001A21XE
GKUWSH001AK3B4
GKUWSH001AO6MK
GKUWSH001AQU52
The second file is called reads_in_contigs.txt. It consists of a row of *s, a Contig name, a second row of *s, and a list of the singletons that make up that contig. It looks like this:
Code:
*******************
Contig817
********************
GMNVR6W01AS7Z9
GKUWSH001D3EXR
GMNVR6W01AS7Z9
*******************
Contig818
********************
GMNVR6W01EOCOR
GMNVR6W01DFJYN
GMNVR6W01EOCOR
*******************
Contig819
********************
GMNVR6W01D7LDZ
GMNVR6W01DBITS
GMNVR6W01D7LDZ
GMNVR6W01AI41M
*******************
Contig820
********************
GMNVR6W01D351L
GMNVR6W01AJJQI
GLOIIK001C5DHC
*******************
Contig821
********************
GMNVR6W01EHWPG
GKUWSH001C70GV
GMNVR6W01EHWPG
GKUWSH001B7R9X
GMNVR6W01EHWPG
GKUWSH001B5O4D
GMNVR6W01EHWPG
GKUWSH001ALEE4
GKUWSH001DDMOQ
GKUWSH001DBEUT
GKUWSH001DDMOQ
*******************
Contig822
********************
GMNVR6W01BC1GA
GMNVR6W01BXQE5
GKUWSH001B1416
GKUWSH001A0VIX
GKUWSH001B1416
GKUWSH001DB04I
GLOIIK001B3O6Z
GKUWSH001DB04I
GKUWSH001E3NLP
GLOIIK001B3O6Z
GKUWSH001EV96A
GLOIIK001B3O6Z
GLOIIK001CAD8R
GLOIIK001B3O6Z
GLOIIK001EB88S
GLOIIK001B3O6Z
GKUWSH001DM3ZY
GLOIIK001EB88S
GLOIIK001DLAAT
GLOIIK001B3O6Z
GLOIIK001DDSHO
GLOIIK001B3O6Z
GKUWSH001BTEFS
GLOIIK001B3O6Z
GLOIIK001D9JSF
GLOIIK001CA9NU
GKUWSH001CH1QS
GKUWSH001A1T82
GKUWSH001CH1QS
GKUWSH001D4PWE
GKUWSH001EPER0
GKUWSH001D4PWE
*******************
Contig823
********************
GMNVR6W01D3V2S
GMNVR6W01EOZ66
*******************
Contig824
********************
GMNVR6W01AI64M
GLOIIK001AEI4W
GKUWSH001D18JE
GLOIIK001AEI4W
What I want to do is search through the reads_in_contigs.txt file and replace contig names with the singletons that make them up. For example, I would like to replace the text "Contig822" in the reads_in_contigs.txt file with the following:
Code:
GMNVR6W01BC1GA
GMNVR6W01BXQE5
GKUWSH001B1416
GKUWSH001A0VIX
GKUWSH001B1416
GKUWSH001DB04I
GLOIIK001B3O6Z
GKUWSH001DB04I
GKUWSH001E3NLP
GLOIIK001B3O6Z
GKUWSH001EV96A
GLOIIK001B3O6Z
GLOIIK001CAD8R
GLOIIK001B3O6Z
GLOIIK001EB88S
GLOIIK001B3O6Z
GKUWSH001DM3ZY
GLOIIK001EB88S
GLOIIK001DLAAT
GLOIIK001B3O6Z
GLOIIK001DDSHO
GLOIIK001B3O6Z
GKUWSH001BTEFS
GLOIIK001B3O6Z
GLOIIK001D9JSF
GLOIIK001CA9NU
GKUWSH001CH1QS
GKUWSH001A1T82
GKUWSH001CH1QS
GKUWSH001D4PWE
GKUWSH001EPER0
GKUWSH001D4PWE
Any suggestions would be greatly appreciated!

Thanks,
Kevin
 
02-03-2011, 01:55 PM   #2
goossen
Member
 
Registered: May 2006
Location: Bayern, Germany
Distribution: Many
Posts: 224

Rep: Reputation: 41
And where do you use contigs.txt?
 
02-03-2011, 04:10 PM   #3
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,606

Rep: Reputation: 448
Hi,

Do you mean something like this?
Code:
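#!/bin/bash
# Send stdout to the file 'result'; keep the original stdout on fd 3.
exec 3>&1 1>result
while read -r line; do
  if [[ "$line" =~ Contig ]]; then
    # Anchor the name (^...$) so that e.g. Contig82 cannot match Contig822.
    # sed prints the reads between this contig's header and the next row
    # of *s, then quits with exit status 99 to signal a replacement.
    sed -rn "/^$line\$/{n;n;h;:a n;/\*+/bb;$ {H;bb};H;ba;:b x;p;Q99}" /path/to/reads_in_contigs.txt
    [[ $? == 99 ]] && continue
  fi
  # No replacement found (or not a contig name): keep the line unchanged.
  echo "$line"
done < /path/to/contigs.txt
# Restore stdout and close fd 3.
exec 1>&3 3>&-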
It is not clear from your initial post how you want to handle non-existent replacements. E.g., Contig826 has no replacement according to your sample data. Do you want to keep it or replace it with nothing? The above script keeps it. The resulting file is named 'result'. Rename it if necessary.

Last edited by crts; 02-03-2011 at 04:33 PM. Reason: refinement
 
02-03-2011, 09:43 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,255

Rep: Reputation: 2686
I am with goossen in that the following does not mention the contigs.txt file anywhere:
Quote:
Originally Posted by kmkocot
What I want to do is search through the reads_in_contigs.txt file and replace contig names with the singletons that make them up. For example, I would like to replace the text "Contig822" in the reads_in_contigs.txt file with the following:
Although I see crts' crystal ball is working better than mine.
 
02-04-2011, 03:40 PM   #5
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15

crts, nailed it! This really helped me out. Thank you. As you guessed, if a Contig# doesn't appear in reads_in_contigs.txt (e.g., Contig826), I want it to remain unchanged in contigs.txt. All of the Contig# names in contigs.txt should have been replaced with the singletons that make them up, so if one remained it would serve as a red flag to me that I did something wrong.

grail and goossen, I'm sorry I wasn't very clear on what I was trying to do. If you care, it should have said: "What I want to do is search through the reads_in_contigs.txt file for matches to contig and singleton names in the contigs.txt file and replace contig names (e.g., Contig822) with the singletons that make them up." Does that make more sense?

Thanks a lot!
Kevin
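
The "red flag" check described in this post can be automated. What follows is only a minimal sketch: it assumes the output is the 'result' file written by the script in post #3 and that every contig name starts with "Contig". The function name leftover_contigs is made up for illustration.

```shell
#!/bin/bash
# leftover_contigs FILE: list (with line numbers) the lines of FILE that
# still begin with "Contig", i.e. names that were not replaced by reads.
# grep exits with status 0 when at least one such line is found.
leftover_contigs() {
  grep -n '^Contig' "$1"
}
```

It could be used as `leftover_contigs result && echo "some contigs were not expanded" >&2`.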
 
02-05-2011, 02:50 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,255

Rep: Reputation: 2686
Cool ... in that case how about something like:
Code:
awk 'FNR == NR && NF{getline arr[$0];next}{for(x = 1;x <= NF;x++)if($x in arr)print arr[$x];else print $x}' RS="[\n]?[*]+\n" reads_in_contigs.txt contigs.txt
 
10-28-2011, 03:09 PM   #7
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15

Hi all,

I'm trying to do something similar to the above but with a twist. Say we have a similar contigs.txt file that looks something like this:
Code:
Contig822
Contig826
Contig835
We also have a file containing DNA or amino acid sequences (sequences.txt) where each sequence spans two lines. The first line that begins with ">" is the header (some of these will correspond to entries in contigs.txt) and the second is the sequence itself.
Code:
>Contig822
GGAACAAAAACGGTTGGATGGCCTGAAAAATGGACAAATGTTTATTTATATAATTATATAAACATACAACAAGAGGTATCATTATATCCGTGTATAGTGTATAGTATACTGTATACGATAGTCTTGTCGTTCATTATGTTAAAATGAGAT
>Contig826
TACTTCTCTATTCGCACAGGCCTTCAAATAGTTCTACTGTTCCGCCATATGTATCTATTGTAATTGTATTGCATGTTTATTTTATGTTGCACATTAATCAATCAACCTCGTACAGTGTGTCGGGCAGTTGGAACACTATAGGGCCTATAGTATAAGCGCACTTTAAATCACCTCATTCATTCATACAAAACATCGACAGTCCCATTCTATTTTCATATGCTACCAGATAGTCAATGTTTCAGATGAGCAAC
>Contig830
TTTTTATATGTGTTTTTTTATTTATATATACAAAGCTTCATGTTTAGAATCGCAACTTCCTGAAGAACCACTTGTGTTTGCCAGTCTTGTATTTCGCCTCCATCTTGGACTTGATCTCTCGCGACGCTTTCCTCTTCCTGCCTGGATCCTTGAACGCCTCCTTGTTCACCACCTCTTTGTTGTAGTCCAAGTCTACGCAGTATCTGGTAGGCATGAGATGGTTGTAATTGTACAGCTTGATGAACGGGCGGATCTTGGAGCGCT
>Contig835
TGTCGAACACTCAGAAAATGTCGAGTAAATCTAATGAGAAAGGAGGAGCTGAGGAGATTCCCATCAAGAAATGTAAGACTGACCCATCCACACCGACAAAAGCTGATAGCAGTGGTGGTGG
I want to extract the headers and sequences in the sequences.txt file and send them to a new file if the header (not including the ">") appears in the contigs.txt file. I thought an awk command might work well but I can't figure out how to use getline to get multiple lines. Also, it won't print the ">" that I want to add in. Any help would be greatly appreciated! Here's what I have so far:
Code:
awk 'FNR == NR && NF{getline arr[$0];next}{for(x = 1;x <= NF;x++)if($x in arr)print ">"arr[$x];else print $x}' RS=">" sequences.txt contigs.txt
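
This last question received no reply in the thread. One possible approach, offered as a sketch and not as a tested answer for the real data: read contigs.txt first to remember the wanted names, then walk sequences.txt and use a flag instead of getline, switching the flag at each ">" header line. Because the original lines are printed unchanged, the ">" is preserved. It assumes each header is a single word after ">", and the function name filter_fasta and the output name selected.txt are made up for illustration.

```shell
#!/bin/bash
# filter_fasta IDS FASTA: print the records of FASTA whose header
# (without the leading ">") is listed in IDS, one name per line.
filter_fasta() {
  awk '
    NR == FNR { if (NF) want[$1]; next }          # first file: remember each wanted name
    /^>/      { keep = (substr($1, 2) in want) }  # header line: decide for this record
    keep                                          # print header and sequence lines when keep is set
  ' "$1" "$2"
}
```

Usage would be `filter_fasta contigs.txt sequences.txt > selected.txt`. With the sample data above, Contig830 would be dropped because it is not in contigs.txt. The flag also carries over to sequences that wrap across several lines.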
 
  


All times are GMT -5.
