Adding a line

sam@ · 02-16-2014, 09:31 PM

Hi

I have a file whose contents are as follows:

Code:

sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

I want it to look like this:

Code:


sorce1       LEN   predictive    695     3570    0.770047        -       .       ID=f000001;source_id=A.off_LEN_10008424;
sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   predictive    8859    11328   0.628724        +       .       ID=f000002;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   predictive    350567    361011    0.628724        +       .       ID=f000012;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

Basically, I want to add a line whose third column is predictive and whose ID holds only the ID name, without anything after the dot.
So for every assumption line, I need to add a predictive line.

So I used this command:
sed 's/\(.*\)assumption\(.*\)\(ID=[^.]*\)[^;]*\(;.*\)/\1predictive\2\3\4\n&/' file

However, my file has some cases where the ID name has variants. For example, one variant of an ID is f000012.1 and the other is f000012.2.
The command above works perfectly for IDs that have no variants, but where there are variants I get multiple predictive lines for the same ID.

Result of the command:

Code:

sorce1       LEN   predictive    695     3570    0.770047        -       .       ID=f000001;source_id=A.off_LEN_10008424;
sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   predictive    8859    11328   0.628724        +       .       ID=f000002;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   predictive   354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN  predictive     350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

whereas what I need is a single line like this:
sorce1 LEN predictive 350567 361011 0.628724 + . ID=f000012;source_id=A.off_LEN_10008425;

Is there a way to add only a single predictive line, using the earliest start point and the farthest end point across the variants? The ID name should not carry the variant suffix.

Thanks in advance.

pan64 · 02-17-2014, 12:47 AM

I have probably missed something: your sed works exactly as you described, and I could not reproduce the problem.

colucix · 02-17-2014, 02:48 AM

A solution in awk (GNU awk, since it uses gensub) that checks whether the ID has already been used:

Code:

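/assumption/ {
  line = $0
  # i = base ID, i.e. the ID value without the ".N" variant suffix
  i = gensub(/^.*ID=([^.]+)\.[^;]+;.*$/,"\\1","g",line)
  if ( ! _[i] ) {
    # first time this base ID is seen: print a predictive copy of the line
    sub(/assumption/,"predictive",line)
    line = gensub(/^(.*ID=[^.]+)\.[^;]+(;.*$)/,"\\1\\2","g",line)
    print line
  }
  _[i]++
}
# print every input line unchanged
1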

pan64 · 02-17-2014, 02:57 AM

Now I think I understand: you need only one line for each ID, and it should be printed at the first occurrence of that ID? Is that right?
That can be solved only in two passes: first parse the input file (looking for all the possible IDs) and calculate the lines, then print the result.
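The two passes described above could be sketched roughly as follows. This is only an illustration under some assumptions, not a tested solution for the real file: it assumes the columns are whitespace-separated, that column 9 starts with ID=, and that the new predictive line may be joined with tabs (the other lines are printed untouched). The names add_predictive, base_id, lo, hi and done are made up for this example. Because the file is read twice, it has to be a regular file, not a pipe.

```shell
#!/bin/sh
# Hypothetical helper: print the file with one predictive line per base ID.
add_predictive() {
    awk '
    # Return the ID value without the variant suffix (f000012.1 -> f000012).
    # Assumes the attribute column begins with ID=.
    function base_id(attr,   s) {
        s = attr
        sub(/^ID=/, "", s)
        sub(/[.;].*$/, "", s)
        return s
    }
    BEGIN { OFS = "\t" }   # separator for the new line only (an assumption)

    # Pass 1: remember the smallest start and largest end for each base ID.
    NR == FNR {
        if ($3 == "assumption") {
            id = base_id($9)
            if (!(id in lo) || $4 + 0 < lo[id] + 0) lo[id] = $4
            if (!(id in hi) || $5 + 0 > hi[id] + 0) hi[id] = $5
        }
        next
    }

    # Pass 2: before the first assumption line of each base ID,
    # print one predictive line covering the whole range.
    $3 == "assumption" {
        id = base_id($9)
        if (!(id in done)) {
            done[id] = 1
            attr = $9
            sub(/^ID=[^;]*/, "ID=" id, attr)
            print $1, $2, "predictive", lo[id], hi[id], $6, $7, $8, attr
        }
    }
    { print }              # every original line is printed unchanged
    ' "$1" "$1"
}

# Usage on a small sample taken from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
sorce1 LEN assumption 695 3570 0.770047 - . ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1 LEN descriptive 3334 3570 . - 0 Parent=f000001.1;

sorce1 LEN assumption 354569 361011 0.628724 + . ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1 LEN descriptive 354600 360111 . + 0 Parent=f000012.1;

sorce1 LEN assumption 350567 354686 0.628724 + . ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1 LEN descriptive 350567 353321 . + 0 Parent=f000012.2;
EOF
add_predictive "$sample"
rm -f "$sample"
```

For the two f000012 variants this prints a single predictive line spanning 350567 to 361011, placed before the f000012.1 line. The score, strand and remaining attributes are copied from the first variant seen, which is a design choice of this sketch rather than something stated in the thread.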

sam@ · 02-18-2014, 06:14 PM

@colucix
I used the command:

Code:

awk '/assumption/ {
  line = $0
  i = gensub(/^.*ID=([^.]+)\.[^;]+;.*$/,"\\1","g",line)
  if ( ! _[i] ) {
    sub(/assumption/,"predictive",line)
    line = gensub(/^(.*ID=[^.]+)\.[^;]+(;.*$)/,"\\1\\2","g",line)
    print line
  }
  _[i]++
}
1
' infile > outfile

It is somehow changing the format of the file.