Need a script to find/replace numbers with names in 1 file using another as the guide
Hi all,
I am trying to find or write a shell script that will go through a file organized like this (but with thousands of lines)...

93,5.00,"contig00002",169,83,"jgi|Brafl1|100379|fgenesh2_pg.scaffold_359000019"
579,1.00,"contig00003",3,380,"jgi|Brafl1|114745|estExt_fgenesh2_pm.C_1200006"
450,5.00,"contig00007",2,352,"jgi|Brafl1|274326|estExt_GenewiseH_1.C_8420008"

...and check the region of each line between the second and third pipes (the 6-digit numbers) against the values in the first column of a separate text file in CSV format like this...

274326,"Wnt family of developmental regulators"
114745,"FOG: Hormone receptors"
100379,"Transcription factor tinman/NKX2-3, contains HOX domain"

...and when they match, replace the value to the right of the third pipe (e.g., fgenesh2_pg.sca...) with the value in the second column of the CSV file associated with that number.

I'm new at scripting, but I'm sitting here with Burtch's Linux Shell Scripting with Bash trying to figure out where to start. If anyone can point me to a publicly available script that would be a good starting point, or has some suggestions, I would really appreciate it.
|
Use an awk-based script or a sed-based script.
|
See here for a small example.
|
Quote:
Here are the commands used below (it would be a good idea to run man on each of them): cut, grep, sed, and sh or bash.

Read the input file line by line. Say it's named input.txt; you can do that with a loop like this:

    while IFS= read -r line
    do
        ..... process one line at a time ....
    done < input.txt

Inside the loop, you first need to retrieve the region code. You can use cut for that, taking the 3rd field of the '|'-delimited line:

    region=$(printf '%s\n' "$line" | cut -d'|' -f3)

Then use grep to look for that region number at the start of a line in your CSV file. If grep returns a line, retrieve the text with cut again, this time using ',' as the delimiter and taking everything from the second field on (some descriptions contain commas themselves):

    text=$(grep "^$region," csv.txt | cut -d',' -f2-)

Finally, substitute the text for the region number and write the result to another file (which will eventually replace your input.txt file). The '#' delimiter is used because the descriptions can contain '/'; this still assumes they contain no '#' or '&':

    if [ -n "$text" ]; then
        line=$(printf '%s\n' "$line" | sed "s#$region#$text#")
    fi
    printf '%s\n' "$line" >> output.txt
|
This is definitely an awk job; no other tools should be necessary. Since awk is field-based, it's almost trivial to design a script that compares one field to a value and changes another field based on the result.
Even I could probably design a basic script, just a simple if test, replace, and print, but I'm not sure how you'd go about searching through the values in a separate file for a match. Check out the awk tutorial at the unix grymoire for help here. It takes some time to work through, but it will be worth it for jobs like this. Edit: check out ghostdog's link above. That's exactly what I'm talking about. |
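For the "searching through a separate file" part, one common awk idiom is to read the CSV first and keep it in an array, then look each ID up while reading the second file. Below is a minimal sketch of that idea, not a tested solution for the real data. The function name annotate and the file names in the usage note are made up for illustration, and it assumes every line of the first file looks like the samples in the question: the ID is the third '|'-separated piece, and the line ends with a closing double quote right after the model name.

```shell
#!/bin/sh
# annotate LOOKUP_CSV HITS_FILE
# (hypothetical helper name; both file layouts are assumed to match the
# samples posted in the question)
annotate() {
    awk '
        # Pass 1, the CSV guide: remember the description for each ID.
        NR == FNR {
            comma = index($0, ",")        # split at the FIRST comma only,
            if (comma == 0) next          # descriptions may contain commas
            id   = substr($0, 1, comma - 1)
            desc = substr($0, comma + 1)
            gsub(/^"|"$/, "", desc)       # strip the surrounding quotes
            name[id] = desc
            next
        }
        # Pass 2, the hits file: cut each line into pieces at the pipes.
        {
            n = split($0, part, "|")
            if (n >= 4 && (part[3] in name)) {
                # Replace everything right of the third pipe and put the
                # closing quote back (assumed to end every line).
                print part[1] "|" part[2] "|" part[3] "|" name[part[3]] "\""
            } else {
                print                     # no match: leave the line as is
            }
        }
    ' "$1" "$2"
}
```

It would be called as something like: annotate lookup.csv hits.csv > annotated.csv. Lines whose ID is not in the CSV are printed unchanged, so the output has the same number of lines as the input. Because the CSV is loaded into memory once, each input line costs a single array lookup instead of a separate grep over the whole CSV.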