LinuxQuestions.org

kmkocot 08-21-2009 06:38 PM

Need help with script to replace certain text in file with part of the file's name
 
Hi all,

I have a directory with about 16,000 files with this format:

>LGIG|175428
MSIIIAQTPITYFGSDIQKSLGSLHGFRWAKYPGEKPLPGHNYTGPGISEDKLTALESKL
SDDSEIQKQIVAIQQQLINVVDKTQLQNLSSLISNLDDKITKQKKDLKQLIDNINPGISE
DKLQRELTKFTTELQKEIKNIDDSVIQQQITTINNEVLKQEKNIAALEKNLKEENKSYFN
LPFRNLRDENASISYNIDKSRESEYEKYGITANIIEFFRIQISISKPKAYLMVIVYHIYI
SYTGKIILHKDNIKEIKRSKVGKGTELLKKINIYTGRNCYIPTDGNCFIKCVNHVLNKDL
TNEFKNFIINFPKVNRKRVMTTARINEFNKKCETSFQIHTLKNRNLRPRDVKRELDWVLY
LHNSHFCLIRRNEKNLGIKEIEDNYEQVWKTCRDDNVVTQVSPLKLNVFSNMSDDT
>HROB|174996
MIVAHAPKTYFGSGDIQKSLGSLPGFPWAKYPGEKHLPGHNYTGRGTRLDLRLDENNKPK
PGEEPVNRVDAAALKHDILYRNKDIKFRHEADKQMIIELENIPNPTFKERMERALIIKLL
KAKMKLGTDCIDQMLQRLGKVDQKRLTLISHNGSGFDNWIALQNVKKLTQCPLVVDNKIL
SFPLSNPYTEERLQKKWKRQKEIMSNSNYLQNISFTCSFIHQSTSLAAWGNSSNLPMNLK
KITDVNIAKFTKETWESLRPE

In some of the files there are more or fewer sequences but the definition line always begins with a > symbol. The files are all named like "Moll_10000.fasta", "Moll_10001.fasta" and so on...

I am trying to write a script that reads the name of each file, strips out the number portion of the name ($NUMBER), and replaces all instances of ">" with ">$NUMBER|".

Here is what I tried, but it didn't work. Can anyone point me in the right direction? Thanks!!!

Code:

COUNTER=10000
FILES=*.fasta

for i in $FILES
do
sed 's/>/>|$COUNTER/g'
COUNTER=COUNTER+1
done


colucix 08-21-2009 06:52 PM

The sed command is incomplete. It should be something like:
Code:

sed -i "s/>/>${COUNTER}|/g" "$i"
The -i option edits the file in place, which is very dangerous without testing. The file name is passed as the argument "$i", and double quotes are used so that the shell substitutes the variable COUNTER with its actual value. Test it on some copies of the original files before modifying them.
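To see why the quoting matters, here is a minimal sketch that runs both forms on a single header line taken from the question. The variable names single, double and next are only illustrative. It also shows arithmetic expansion for incrementing the counter, since a plain COUNTER=COUNTER+1 assigns the literal string "COUNTER+1" in bash rather than adding one.

```shell
#!/bin/bash
COUNTER=10000
header='>LGIG|175428'

# Single quotes: the shell leaves $COUNTER alone, so sed inserts it literally.
single=$(printf '%s\n' "$header" | sed 's/>/>$COUNTER|/')

# Double quotes: the shell expands ${COUNTER} before sed ever sees the script.
double=$(printf '%s\n' "$header" | sed "s/>/>${COUNTER}|/")

# Arithmetic expansion is needed to actually add one to the counter.
next=$((COUNTER + 1))

printf '%s\n' "$single" "$double" "$next"
```

Nothing here touches a file, so it is safe to paste into a terminal before trying the -i version.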

Edit: a simpler version of your script could be:
Code:

#!/bin/bash
for file in *.fasta
do
  #
  # extract the five-digit number from the file name
  #
  counter=$(echo "$file" | grep -Eo '[0-9]{5}')
  #
  # edit the file in place, inserting the number after each ">"
  #
  sed -i "s/>/>${counter}|/g" "$file"
done

Again, test it before running it on the original files.
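One way to do that test run is sketched below. It assumes GNU sed (for -i without a suffix) and bash; the function name prefix_headers, the scratch directory and the shortened sample sequences are made up for illustration. The loop is the same idea as above, except that the digits are taken from the base name only, so digits in the directory path cannot interfere, and the substitution is anchored to the start of the line.

```shell
#!/bin/bash
# Hypothetical helper: prefix each FASTA header in DIR/*.fasta with the
# five digits from the file name (Moll_10000.fasta -> 10000).
prefix_headers() {
  local dir=$1 file counter
  for file in "$dir"/*.fasta; do
    [ -e "$file" ] || continue          # glob matched nothing
    counter=$(basename "$file" | grep -Eo '[0-9]{5}' | head -n 1)
    [ -n "$counter" ] || continue       # no five-digit number in the name
    sed -i "s/^>/>${counter}|/" "$file" # GNU sed in-place edit
  done
}

# Build a throwaway directory with one small sample file
# (sequence lines shortened for the example).
scratch=$(mktemp -d)
printf '%s\n' '>LGIG|175428' 'MSIIIAQTPITYFGSD' \
              '>HROB|174996' 'MIVAHAPKTYFGSGDI' > "$scratch/Moll_10000.fasta"

prefix_headers "$scratch"
cat "$scratch/Moll_10000.fasta"
```

For the real data, copy a handful of the .fasta files into the scratch directory instead of writing the sample, run prefix_headers on it, and compare against the originals with diff before touching the full set. Remove the scratch directory with rm -rf "$scratch" when done.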

kmkocot 08-23-2009 04:06 PM

Thanks! Your alternative is much more versatile. I really appreciate the help!

Kevin

