LinuxQuestions.org - [SOLVED] Need help with script to replace certain text in file with part of the file's name


Hi all,

I have a directory with about 16,000 files with this format:

>LGIG|175428
MSIIIAQTPITYFGSDIQKSLGSLHGFRWAKYPGEKPLPGHNYTGPGISEDKLTALESKL
SDDSEIQKQIVAIQQQLINVVDKTQLQNLSSLISNLDDKITKQKKDLKQLIDNINPGISE
DKLQRELTKFTTELQKEIKNIDDSVIQQQITTINNEVLKQEKNIAALEKNLKEENKSYFN
LPFRNLRDENASISYNIDKSRESEYEKYGITANIIEFFRIQISISKPKAYLMVIVYHIYI
SYTGKIILHKDNIKEIKRSKVGKGTELLKKINIYTGRNCYIPTDGNCFIKCVNHVLNKDL
TNEFKNFIINFPKVNRKRVMTTARINEFNKKCETSFQIHTLKNRNLRPRDVKRELDWVLY
LHNSHFCLIRRNEKNLGIKEIEDNYEQVWKTCRDDNVVTQVSPLKLNVFSNMSDDT
>HROB|174996
MIVAHAPKTYFGSGDIQKSLGSLPGFPWAKYPGEKHLPGHNYTGRGTRLDLRLDENNKPK
PGEEPVNRVDAAALKHDILYRNKDIKFRHEADKQMIIELENIPNPTFKERMERALIIKLL
KAKMKLGTDCIDQMLQRLGKVDQKRLTLISHNGSGFDNWIALQNVKKLTQCPLVVDNKIL
SFPLSNPYTEERLQKKWKRQKEIMSNSNYLQNISFTCSFIHQSTSLAAWGNSSNLPMNLK
KITDVNIAKFTKETWESLRPE

In some of the files there are more or fewer sequences but the definition line always begins with a > symbol. The files are all named like "Moll_10000.fasta", "Moll_10001.fasta" and so on...

I am trying to write a script that reads the name of each file, strips out the number portion of the name ($NUMBER), and replaces all instances of ">" with ">$NUMBER|".

Here is what I tried, but it didn't work. Can anyone point me in the right direction? Thanks!!!

Code:

COUNTER=10000
FILES=*.fasta

for i in $FILES
do
  sed 's/>/>|$COUNTER/g'
  COUNTER=COUNTER+1
done

The sed command is incomplete. It should be something like:

Code:

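sed -i "s/>/>${COUNTER}|/g" "$i"

The -i option edits the file in place, which is very dangerous without testing. The file name is passed as the argument "$i", and the double quotes around the sed expression let the shell substitute the variable COUNTER with its actual value. Note also that COUNTER=COUNTER+1 stores the literal string "COUNTER+1"; in bash the increment should be written COUNTER=$((COUNTER+1)). Test it on copies of the original files before modifying them.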
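One way to act on the "test first" advice is to leave out -i, so that sed prints the result to standard output and the file stays untouched. The following is only a minimal sketch, assuming bash and a scratch directory; the sample file and its contents are made up for illustration. It also compares single and double quotes around the sed expression.

```shell
#!/bin/bash
# Hypothetical scratch data; nothing here touches the real files.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '>LGIG|175428\nMSIIIAQ\n' > "$tmpdir/Moll_10000.fasta"

COUNTER=10000

# Single quotes: the shell does not expand $COUNTER, so sed inserts it literally.
single=$(sed 's/>/>$COUNTER|/g' "$tmpdir/Moll_10000.fasta" | head -n 1)

# Double quotes: the shell substitutes the value before sed runs.
# Without -i the result goes to standard output and the file is unchanged.
double=$(sed "s/>/>${COUNTER}|/g" "$tmpdir/Moll_10000.fasta" | head -n 1)

echo "$single"
echo "$double"

# Shell arithmetic uses $(( )).
COUNTER=$((COUNTER + 1))
echo "$COUNTER"
```

Once the double-quoted output looks right, the same expression can be run with -i on a copy of the data.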

Edit: a simpler version of your script could be:

Code:

#!/bin/bash
for file in *.fasta
do
  # extract the five-digit part of the file name
  counter=$(echo "$file" | grep -Eo '[0-9]{5}')
  # edit the file in place
  sed -i "s/>/>${counter}|/g" "$file"
done

Again, test it before running it on the original files.
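A possible variation, offered only as a sketch: the number can be taken from the file name with bash parameter expansion instead of grep, and sed can keep a backup of each file while editing. It assumes names of the form Moll_<digits>.fasta and a sed that accepts a suffix attached to -i (GNU sed does). The function name tag_headers and the sample data are hypothetical, and the pattern is anchored with ^ so only a ">" at the start of a line is changed.

```shell
#!/bin/bash
# tag_headers is a hypothetical helper name, not something from the thread.
# Assumes file names of the form Moll_<digits>.fasta.
tag_headers() {
  local file=$1
  local base=${file##*/}        # drop any leading directory
  local number=${base%.fasta}   # Moll_10001.fasta -> Moll_10001
  number=${number##*_}          # Moll_10001 -> 10001
  # -i.bak edits in place but keeps the original as <file>.bak
  sed -i.bak "s/^>/>${number}|/" "$file"
}

# Try it on a scratch copy first (made-up sample data).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '>LGIG|175428\nMSIIIAQ\n>HROB|174996\nMIVAHAP\n' > "$workdir/Moll_10001.fasta"
tag_headers "$workdir/Moll_10001.fasta"
cat "$workdir/Moll_10001.fasta"
```

If the result is wrong, the .bak file next to each edited file still holds the original content.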

Thanks! Your alternative is much more versatile. I really appreciate the help!

Kevin