[SOLVED] replacing characters only within a string of length 30 in multiple files

kdo · 02-28-2014, 09:15 AM

Hello,
I need help replacing 0 and 1 with a and c in strings of length 30 characters in multiple files. The content of one file is:
[Data]
[[Samples]]

#Number of independent chromosomes: 1
#Total number of polymorphic sites: 30
#Reporting status of a maximum of 30 sites
# 30 polymorphic positions on chromosome 1
#1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

SampleName="Sample 1"
SampleSize=27
SampleData= {
1_1 1 000001000000000010000000100010
000110001100000000000000000001
1_2 1 010000010000010000001000100000
010000000000010000001100100000
1_3 1 010000000000010000001000100000
010000010000010000001000100000
1_4 1 000110001101000000000000000000
010000000000010000001100100000

I want to replace the 0 and 1 with a and c respectively, but only in the 30-character strings, for example: 000001000000000010000000100010

Here is what I have tried:
awk 'length($1) == 30 { print $1 }' trial_1_1.arp | sed -i 's/0/a/g' trial_1_1.arp

awk 'length($1) == 30 { print $1 }' trial_1_1.arp | sed -i 's/1/c/g' trial_1_1.arp
But this changes every 0 and 1 in the file called trial_1_1.arp instead of only those in the 30-character strings. I hope I can get help with this.

schneidz · 02-28-2014, 09:26 AM

Not a full solution, but here's a hint:

Code:

[schneidz@hyper sd-bak-04.04.2013]$ cat kdo.txt | while read line
> do
>  echo number of chars = `echo $line | wc -c`
>  echo $line | tr '01' 'ac'
> done
number of chars = 7
[Data]
number of chars = 12
[[Samples]]
number of chars = 1

number of chars = 38
#Number of independent chromosomes: c
number of chars = 39
#Total number of polymorphic sites: 3a
number of chars = 43
#Reporting status of a maximum of 3a sites
number of chars = 43
# 3a polymorphic positions on chromosome c
number of chars = 111
#c, 2, 3, 4, 5, 6, 7, 8, 9, ca, cc, c2, c3, c4, c5, c6, c7, c8, c9, 2a, 2c, 22, 23, 24, 25, 26, 27, 28, 29, 3a
number of chars = 1

number of chars = 22
SampleName="Sample c"
number of chars = 14
SampleSize=27
number of chars = 14
SampleData= {
number of chars = 37
c_c c aaaaacaaaaaaaaaacaaaaaaacaaaca
number of chars = 31
aaaccaaaccaaaaaaaaaaaaaaaaaaac
number of chars = 37
c_2 c acaaaaacaaaaacaaaaaacaaacaaaaa
number of chars = 31
acaaaaaaaaaaacaaaaaaccaacaaaaa
number of chars = 37
c_3 c acaaaaaaaaaaacaaaaaacaaacaaaaa
number of chars = 31
acaaaaacaaaaacaaaaaacaaacaaaaa
number of chars = 37
c_4 c aaaccaaaccacaaaaaaaaaaaaaaaaaa
number of chars = 31
acaaaaaaaaaaacaaaaaaccaacaaaaa

awk and sed can probably do it too, but this came to mind first.

Somewhere in the loop there will need to be an if, something like:

Code:

# wc -c also counts the newline added by echo, so a 30-character line reports 31
if [ $(echo "$line" | wc -c) -eq 31 ]
then
 echo "$line" | tr '01' 'ac'
else
 echo "$line"
fi

allend · 02-28-2014, 10:42 AM

Building on schneidz's suggestion, you could read the file (test.txt in this code) line by line in a bash while loop and test the line length using bash parameter expansion.

Code:

#!/bin/bash

while read line; do 
  if [[ ${#line} == 30 ]]; then
    echo "$line" | tr  '01' 'ac';
  else
    echo "$line";
  fi;
done < test.txt

The output is

Code:

[Data]
[[Samples]]

#Number of independent chromosomes: 1
#Total number of polymorphic sites: 30
#Reporting status of a maximum of 30 sites
# 30 polymorphic positions on chromosome 1
#1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

SampleName="Sample 1"
SampleSize=27
SampleData= {
1_1 1 000001000000000010000000100010
aaaccaaaccaaaaaaaaaaaaaaaaaaac
1_2 1 010000010000010000001000100000
acaaaaaaaaaaacaaaaaaccaacaaaaa
1_3 1 010000000000010000001000100000
acaaaaacaaaaacaaaaaacaaacaaaaa
1_4 1 000110001101000000000000000000
acaaaaaaaaaaacaaaaaaccaacaaaaa

jpollard · 02-28-2014, 10:55 AM

Personally, I would use perl -

Code:

#!/usr/bin/perl

while(<>) {
    if (/^\d*$/) {
        s/0/a/g; s/1/c/g;
    }
    print;
}

This matches only lines composed entirely of digits (assuming they could be 0-9 rather than just 0/1). If they ARE only 0/1, then replace the \d with [01] (the pattern would then look like /^[01]*$/).

This has the benefit of not looking at the comments which may also have 30 characters on a line.

I think it could even be reduced to a single line:

Code:

perl -ne 'if (/^[\d]+$/) { s/0/a/g; s/1/c/g;} print;' <inputdatafile >outputdatafile

grail · 02-28-2014, 11:19 AM

@jpollard - the downside of the perl script is that you will also need to allow for whitespace, as currently your script returns the original file intact.

kdo · 02-28-2014, 12:04 PM

Dear all,

Thanks for the effort to help me solve my problem. Unfortunately, all the answers provided
return my file intact without changing the 0 and 1 within the strings of 30 characters.
Below is the output I am looking for (in several files):

[Data]
[[Samples]]

#Number of independent chromosomes: 1
#Total number of polymorphic sites: 30
#Reporting status of a maximum of 30 sites
# 30 polymorphic positions on chromosome 1
#1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

SampleName="Sample 1"
SampleSize=27
SampleData= {
1_1 1 aaaaacaaaaaaaaaacaaaaaaacaaaca
aaaccaaaaccaaaaaaaaaaaaaaaaaaac
1_2 1 acaaaaacaaaaacaaaaaacaaacaaaaa
acaaaaaaaaaaacaaaaaaccaacaaaaa
1_3 1 aaaaaaaaaaaaacaaaaaacaaacaaaaa
acaaaaacaaaaaaaaaaaacaaacaaaaa
1_4 1 aaaccaaaccacaaaaaaaaaaaaaaaaaaaa
acaaaaaaaaaaacaaaaaaccaacaaaaa

Thank you

jpollard · 02-28-2014, 03:42 PM

Well, the following seems to work for me:

Code:

#!/usr/bin/perl

while(<>) {
    if (! /^#/) {
        @v = split;
        if (30 == length($v[$#v])) {
            if ($v[$#v] =~ /^[01]*$/) {
                $v[$#v] =~ s/0/a/g;
                $v[$#v] =~ s/1/c/g;
            }
        }
        print join(' ',@v),"\n";
    } else {
        print;
    }
}

It is longer, but having to handle parts of a record is a bit trickier.

Now I am still assuming the 30 character number is at the end of a line...

so that last line of your sample output has more than 30 digits...
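
For anyone who prefers to stay in the shell, roughly the same logic can be sketched with awk. This is not the perl script above, only an approximation of it under the same assumption that the 30-digit string is the last whitespace-separated field of the line. The function name convert_last_field is made up for illustration. Unlike the perl version, lines that are not converted are printed exactly as they were read.

```shell
#!/bin/bash
# Hypothetical helper (name invented for this sketch): convert 0/1 to a/c in
# the last field of each non-comment line, but only when that field is
# exactly 30 characters long and contains nothing but 0 and 1.
convert_last_field() {
  awk '
    !/^#/ && $NF ~ /^[01]+$/ && length($NF) == 30 {
      gsub(/0/, "a", $NF)   # changing $NF rebuilds the line with single spaces
      gsub(/1/, "c", $NF)
    }
    { print }
  ' "$@"
}

# Example call: reads the file and writes the converted text to standard output
# convert_last_field trial_1_1.arp
```

Because the line is rebuilt when the last field changes, any leading whitespace on a converted line is dropped, which is the same effect the split/join in the perl script has.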

metaschima · 02-28-2014, 04:14 PM

If the perl script doesn't do it, would you take a solution in C?

kdo · 02-28-2014, 04:40 PM

Hello Jpollard,
Your script worked for me. Thanks so much. The only
thing remaining is that I want the changes to save
to the file. I will play with your scripts to see
how I can do this. Thanks once again.

kdo · 02-28-2014, 04:46 PM

Hello Metaschima,
The perl script worked. Thanks for offering to help.

jpollard · 02-28-2014, 08:05 PM

Quote:

Originally Posted by kdo

Hello Jpollard,
Your script worked for me. Thanks so much. The only
thing remaining is that I want the changes to save
to the file. I will play with your scripts to see
how I can do this. Thanks once again.

The simplest way is to redirect input from the file and send the output to a new file.
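
For several files, that redirection could be wrapped in a loop along the lines of the sketch below. The function name save_converted, the .new suffix and the script name convert.pl in the usage comment are all assumptions for illustration; the filter can be any program that reads standard input and writes standard output. Writing straight back to the input file with > would truncate it before it is read, which is why the output goes to a second file first.

```shell
#!/bin/bash
# Hypothetical helper (name invented for this sketch).
# Usage: save_converted FILTER FILE...
# Runs FILTER once per file, sending the result to FILE.new, and replaces
# FILE with it only if FILTER exited successfully.
save_converted() {
  local filter=$1 f
  shift
  for f in "$@"; do
    "$filter" < "$f" > "$f.new" && mv -- "$f.new" "$f"
  done
}

# Example call, assuming the perl script was saved as an executable ./convert.pl
# save_converted ./convert.pl trial_*.arp
```

Dropping the mv step leaves the originals untouched and keeps the converted copies beside them as *.new files.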

grail · 02-28-2014, 11:12 PM

You will also find that if you use [code][/code] tags around code or data, it will preserve the formatting and help people understand the format of the data better.

Just as a quick alternative:

Code:

sed -r '/^[[:space:]]*[01]{30}$/{s/0/a/g;s/1/c/g}' file

Once you are happy with the output, simply add the -i option.
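
Applied to several files at once, the in-place variant might look like the sketch below. It assumes GNU sed; the function name convert_in_place and the .bak suffix are choices made for this example, and -E is used as an equivalent spelling of -r.

```shell
#!/bin/bash
# Hypothetical helper (name invented for this sketch): edit every file given
# as an argument in place, keeping each original beside it as FILE.bak.
convert_in_place() {
  sed -E -i.bak '/^[[:space:]]*[01]{30}$/{s/0/a/g;s/1/c/g;}' "$@"
}

# Example call
# convert_in_place trial_*.arp
```

As written, the pattern only selects lines that contain nothing but the 30 digits after optional leading whitespace, so a line that carries an identifier in front of the digits, such as one starting with 1_1 1, would be left unchanged.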

jpollard · 03-01-2014, 05:37 AM

Drat. Didn't think of the {30} construct...

But that still would modify comments.

grail · 03-01-2014, 09:02 AM

Quote:

But that still would modify comments.

I fail to see how as the sed encompasses the entire line (^$)? Unless whitespace prior to the digits signifies a comment??
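
A small experiment makes the point about the anchors concrete. This is only a sketch, assuming GNU sed; the function names and the sample comment line are invented for the test. The first function uses the anchored address from the command above, the second drops ^ and $ for comparison.

```shell
#!/bin/bash
# Both functions read standard input; the names are made up for this sketch.
anchored()   { sed -E '/^[[:space:]]*[01]{30}$/{s/0/a/g;s/1/c/g;}'; }
unanchored() { sed -E '/[01]{30}/{s/0/a/g;s/1/c/g;}'; }

# A made-up comment line that happens to contain a 30-digit run
sample='# see 010000000000010000001100100000'

echo "$sample" | anchored     # printed unchanged: the line is not only digits
echo "$sample" | unanchored   # the digits inside the comment are rewritten
```

Only the unanchored version touches the comment, because its address matches a 30-digit run anywhere on the line rather than requiring the whole line to be digits.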

jpollard · 03-01-2014, 02:05 PM

Quote:

Originally Posted by grail

I fail to see how as the sed encompasses the entire line (^$)? Unless whitespace prior to the digits signifies a comment??

You are right. I'm an idiot.