LinuxQuestions.org

kdo 02-28-2014 09:15 AM

replacing characters only within a string of length 30 in multiple files
 
Hello,
I need help replacing 0 and 1 with a and c in strings of length 30 characters in multiple files. The content of one file is:
[Data]
[[Samples]]

#Number of independent chromosomes: 1
#Total number of polymorphic sites: 30
#Reporting status of a maximum of 30 sites
# 30 polymorphic positions on chromosome 1
#1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

SampleName="Sample 1"
SampleSize=27
SampleData= {
1_1 1 000001000000000010000000100010
000110001100000000000000000001
1_2 1 010000010000010000001000100000
010000000000010000001100100000
1_3 1 010000000000010000001000100000
010000010000010000001000100000
1_4 1 000110001101000000000000000000
010000000000010000001100100000

I want to replace the 0 and 1 only in the 30-character strings (for example, 000001000000000010000000100010) with a and c respectively.

Here is what I have tried:
awk 'length($1) == 30 { print $1 }' trial_1_1.arp | sed -i 's/0/a/g' trial_1_1.arp

awk 'length($1) == 30 { print $1 }' trial_1_1.arp | sed -i 's/1/c/g' trial_1_1.arp
but this changes all 0 and 1 in the file called trial_1_1.arp without restricting it to only the 30 character string. I hope I can get help on this.
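
A note on why that attempt touches everything: sed -i with a file name reads that file directly and ignores whatever awk sends down the pipe, so the length test never limits anything. One way to keep the test and the replacement together is to do both in awk. This is only a sketch, assuming the strings of interest are whitespace-separated tokens made up solely of 0 and 1; translate_tokens is a made-up helper name, not something from the thread.

```shell
# Hypothetical helper: translate 0->a and 1->c only in whitespace-separated
# tokens that are exactly 30 characters long and contain nothing but 0 and 1.
# Reads the named files, or standard input when no file is given.
translate_tokens() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if (length($i) == 30 && $i ~ /^[01]+$/) {
        gsub(/0/, "a", $i)
        gsub(/1/, "c", $i)
      }
    }
    print
  }' "$@"
}
```

Called as translate_tokens trial_1_1.arp, it prints the converted text to standard output and leaves the file itself untouched. Note that awk rebuilds any line it changes with single spaces between the fields.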

schneidz 02-28-2014 09:26 AM

Not a full solution, but here's a hint:
Code:

[schneidz@hyper sd-bak-04.04.2013]$ cat kdo.txt | while read line
> do
>  echo number of chars = `echo $line | wc -c`
>  echo $line | tr '01' 'ac'
> done
number of chars = 7
[Data]
number of chars = 12
[[Samples]]
number of chars = 1

number of chars = 38
#Number of independent chromosomes: c
number of chars = 39
#Total number of polymorphic sites: 3a
number of chars = 43
#Reporting status of a maximum of 3a sites
number of chars = 43
# 3a polymorphic positions on chromosome c
number of chars = 111
#c, 2, 3, 4, 5, 6, 7, 8, 9, ca, cc, c2, c3, c4, c5, c6, c7, c8, c9, 2a, 2c, 22, 23, 24, 25, 26, 27, 28, 29, 3a
number of chars = 1

number of chars = 22
SampleName="Sample c"
number of chars = 14
SampleSize=27
number of chars = 14
SampleData= {
number of chars = 37
c_c c aaaaacaaaaaaaaaacaaaaaaacaaaca
number of chars = 31
aaaccaaaccaaaaaaaaaaaaaaaaaaac
number of chars = 37
c_2 c acaaaaacaaaaacaaaaaacaaacaaaaa
number of chars = 31
acaaaaaaaaaaacaaaaaaccaacaaaaa
number of chars = 37
c_3 c acaaaaaaaaaaacaaaaaacaaacaaaaa
number of chars = 31
acaaaaacaaaaacaaaaaacaaacaaaaa
number of chars = 37
c_4 c aaaccaaaccacaaaaaaaaaaaaaaaaaa
number of chars = 31
acaaaaaaaaaaacaaaaaaccaacaaaaa

awk and sed can probably do it too, but this came to mind first.

Somewhere in the loop there will need to be an if, something like:
Code:

# wc -c also counts the newline that echo adds, so a 30-character line reports 31
if [ $(echo "$line" | wc -c) -eq 31 ]
then
 echo "$line" | tr '01' 'ac'
else
 echo "$line"
fi


allend 02-28-2014 10:42 AM

Building on schneidz's suggestion, you could read the file (in this code I used test.txt) line by line in a bash while loop and test the line length using a bash parameter expansion.
Code:

#!/bin/bash

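# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line; do
  # ${#line} is the length of the line in characters
  if [[ ${#line} -eq 30 ]]; then
    echo "$line" | tr '01' 'ac'
  else
    echo "$line"
  fi
done < test.txt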

The output is
Code:

[Data]
[[Samples]]

#Number of independent chromosomes: 1
#Total number of polymorphic sites: 30
#Reporting status of a maximum of 30 sites
# 30 polymorphic positions on chromosome 1
#1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

SampleName="Sample 1"
SampleSize=27
SampleData= {
1_1 1 000001000000000010000000100010
aaaccaaaccaaaaaaaaaaaaaaaaaaac
1_2 1 010000010000010000001000100000
acaaaaaaaaaaacaaaaaaccaacaaaaa
1_3 1 010000000000010000001000100000
acaaaaacaaaaacaaaaaacaaacaaaaa
1_4 1 000110001101000000000000000000
acaaaaaaaaaaacaaaaaaccaacaaaaa
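
In that output the lines that start with a label such as 1_1 1 are unchanged, because the whole line is 36 characters long, not 30. A possible variation, sketched here under the assumption that the 0/1 string is always the last space-separated word on the line, tests only that last word. translate_last_word is a hypothetical name, and the loop sticks to POSIX sh features so it is not tied to bash.

```shell
# Hypothetical helper: read lines from standard input and translate only the
# last space-separated word, and only if it is exactly 30 characters of 0/1.
translate_last_word() {
  while IFS= read -r line || [ -n "$line" ]; do
    last=${line##* }          # text after the final space (whole line if none)
    prefix=${line%"$last"}    # everything up to and including that space
    case $last in
      *[!01]*) ;;             # contains something other than 0 or 1: leave it
      *)
        if [ "${#last}" -eq 30 ]; then
          last=$(printf '%s' "$last" | tr '01' 'ac')
        fi
        ;;
    esac
    printf '%s%s\n' "$prefix" "$last"
  done
}
```

It would be run as translate_last_word < test.txt, in the same way as the loop above.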


jpollard 02-28-2014 10:55 AM

Personally, I would use perl -
Code:

#!/usr/bin/perl

while(<>) {
    if (/^\d*$/) {
        s/0/a/g; s/1/c/g;
    }
  print;
}

This looks only for lines composed entirely of digits (assuming they could be 0-9 as well as just 0/1). If it IS only 0/1 then replace the \d with [01] (it would then look like /^[01]*$/).

This has the benefit of not looking at the comments which may also have 30 characters on a line.

I think it could even be reduced to a single line:
Code:

perl -ne 'if (/^[\d]+$/) { s/0/a/g; s/1/c/g;} print;' <inputdatafile >outputdatafile

grail 02-28-2014 11:19 AM

@jpollard - the downside with the perl script is that you will also need to allow for white space, as currently your script returns the original file intact.

kdo 02-28-2014 12:04 PM

Dear all,

Thanks for the effort to help me solve my problem. Unfortunately, all the answers provided
return my file intact without changing the 0 and 1 within the strings of 30 characters. Below is the output
I am looking for (in several files):


[Data]
[[Samples]]

#Number of independent chromosomes: 1
#Total number of polymorphic sites: 30
#Reporting status of a maximum of 30 sites
# 30 polymorphic positions on chromosome 1
#1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

SampleName="Sample 1"
SampleSize=27
SampleData= {
1_1 1 aaaaacaaaaaaaaaacaaaaaaacaaaca
aaaccaaaaccaaaaaaaaaaaaaaaaaaac
1_2 1 acaaaaacaaaaacaaaaaacaaacaaaaa
acaaaaaaaaaaacaaaaaaccaacaaaaa
1_3 1 aaaaaaaaaaaaacaaaaaacaaacaaaaa
acaaaaacaaaaaaaaaaaacaaacaaaaa
1_4 1 aaaccaaaccacaaaaaaaaaaaaaaaaaaaa
acaaaaaaaaaaacaaaaaaccaacaaaaa

Thank you

jpollard 02-28-2014 03:42 PM

Well, the following seems to work for me:
Code:

#!/usr/bin/perl

while(<>) {
    if (! /^#/) {
        @v = split;
        if (30 == length($v[$#v])) {
            if ($v[$#v] =~ /^[01]*$/) {
                $v[$#v] =~ s/0/a/g;
                $v[$#v] =~ s/1/c/g;
            }
        }
        print join(' ',@v),"\n";
    } else {
        print;
    }
}

It is longer, but having to handle parts of a record is a bit trickier.

Now I am still assuming the 30 character number is at the end of a line...

so that last line of your sample output has more than 30 digits...

metaschima 02-28-2014 04:14 PM

If the perl script doesn't do it, would you take a solution in C?

kdo 02-28-2014 04:40 PM

Hello Jpollard,
Your script worked for me. Thanks so much. The only
thing remaining is that I want the changes to save
to the file. I will play with your scripts to see
how I can do this. Thanks once again.

kdo 02-28-2014 04:46 PM

Hello Metaschima,
The perl script worked. Thanks for offering to help.

jpollard 02-28-2014 08:05 PM

Quote:

Originally Posted by kdo (Post 5126735)
Hello Jpollard,
Your script worked for me. Thanks so much. The only
thing remaining is that I want the changes to save
to the file. I will play with your scripts to see
how I can do this. Thanks once again.

The simplest is to redirect input from the file, and output to a new file.
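
A rough sketch of that for several files at once, assuming the 0/1 strings are whitespace-separated tokens of exactly 30 characters and that keeping a .bak copy of each original is acceptable. The filter here is awk rather than the perl script above, purely to keep the example to shell tools. convert_in_place is an invented name, and mktemp is assumed to be available (it is on most Linux systems).

```shell
# Hypothetical helper: convert each named file via a temporary file, keep the
# original as FILE.bak, then overwrite FILE with the converted text.
convert_in_place() {
  for f in "$@"; do
    tmp=$(mktemp) || return 1
    # Output goes to a new (temporary) file, never straight onto the input.
    if awk '{
          for (i = 1; i <= NF; i++)
            if (length($i) == 30 && $i ~ /^[01]+$/) {
              gsub(/0/, "a", $i)
              gsub(/1/, "c", $i)
            }
          print
        }' "$f" > "$tmp"
    then
      # Back up first; cat into the original keeps its permissions and owner.
      cp -p "$f" "$f.bak" && cat "$tmp" > "$f"
    fi
    rm -f "$tmp"
  done
}
```

Something like convert_in_place trial_*.arp would then process every matching file in the current directory. The redirect never writes onto the file it is still reading, which is why the temporary file is needed.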

grail 02-28-2014 11:12 PM

You will also find that if you use [code][/code] tags around code or data it will preserve the formatting and help people understand the format of the data better :)

Just as a quick alternative:
Code:

sed -r '/^[[:space:]]*[01]{30}$/{s/0/a/g;s/1/c/g}' file
Once you are happy with the output, simply add the -i option.
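
One thing to watch with the pattern above: it only matches lines that consist of the 30 digits (plus optional leading white space), so a line such as 1_1 1 000001000000000010000000100010 from the sample is left alone. A possible extension, offered as a sketch only, saves the line in the hold space, translates just the last 30 characters with y, and glues the untouched prefix back on. It assumes the digit string is always at the end of the line, and translate_tail is a hypothetical name.

```shell
# Hypothetical helper: translate only a trailing run of exactly 30 0/1
# characters (at line start or after white space), leaving any prefix alone.
translate_tail() {
  sed -E '
    /(^|[[:space:]])[01]{30}$/ {
      # save the original line in the hold space
      h
      # keep only the last 30 characters and translate them
      s/.*(.{30})$/\1/
      y/01/ac/
      # swap: pattern space = original line, hold space = translated tail
      x
      # drop the old tail, append the translated one, remove the joining newline
      s/.{30}$//
      G
      s/\n//
    }
  ' "$@"
}
```

Usage would be translate_tail file, which prints to standard output so the result can be checked before anything is written back.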

jpollard 03-01-2014 05:37 AM

drat. Didn't think of the {30} construct...

But that still would modify comments.

grail 03-01-2014 09:02 AM

Quote:

But that still would modify comments.
I fail to see how as the sed encompasses the entire line (^$)? Unless whitespace prior to the digits signifies a comment??

jpollard 03-01-2014 02:05 PM

Quote:

Originally Posted by grail (Post 5126990)
I fail to see how as the sed encompasses the entire line (^$)? Unless whitespace prior to the digits signifies a comment??

You are right. I'm an idiot.

