LinuxQuestions.org


sam@ 02-16-2014 09:31 PM

Adding a line
 
Hi

I have a file whose contents are as follows:
Code:

sorce1      LEN  assumption  695    3570    0.770047        -      .      ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1      LEN  descriptive    3334    3570    .      -      0      Parent=f000001.1;

sorce1      LEN  assumption    8859    11328  0.628724        +      .      ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive    8859    9032    .      +      0      Parent=f000002.1;

sorce1      LEN  assumption    354569    361011  0.628724        +      .      ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive        354600    360111    .      +      0      Parent=f000012.1;

sorce1      LEN  assumption    350567    354686    0.628724        +      .      ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive    350567    353321    .      +      0                      Parent=f000012.2;

I want it to look like this:
Code:


sorce1      LEN  predictive    695    3570    0.770047        -      .      ID=f000001;source_id=A.off_LEN_10008424;
sorce1      LEN  assumption  695    3570    0.770047        -      .      ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1      LEN  descriptive    3334    3570    .      -      0      Parent=f000001.1;

sorce1      LEN  predictive    8859    11328  0.628724        +      .      ID=f000002;source_id=A.off_LEN_10008425;
sorce1      LEN  assumption    8859    11328  0.628724        +      .      ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive    8859    9032    .      +      0      Parent=f000002.1;

sorce1      LEN  predictive    350567    361011    0.628724        +      .      ID=f000012;source_id=A.off_LEN_10008425;
sorce1      LEN  assumption    354569    361011  0.628724        +      .      ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive        354600    360111    .      +      0      Parent=f000012.1;

sorce1      LEN  assumption    350567    354686    0.628724        +      .      ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive    350567    353321    .      +      0                      Parent=f000012.2;

Basically, I want to add a line whose third column is predictive and whose ID contains only the ID name, without anything after the dot.
So for every assumption line, I need to add a predictive line.

So I used this command:
Code:

sed 's/\(.*\)assumption\(.*\)\(ID=[^.]*\)[^;]*\(;.*\)/\1predictive\2\3\4\n&/' file


However, my file contains some IDs that have variants. For example, one variant of an ID is f000012.1 and another is f000012.2.
The command above works perfectly well for IDs with no variants. For IDs with variants, though, I get multiple predictive lines for the same ID.
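
To see which IDs are affected, a quick check along the following lines might help. This is only a sketch: the function name list_variant_ids is made up for illustration, and it assumes whitespace-separated columns with the type in column 3, as in the sample above.

```shell
# Hypothetical helper (not from the thread): list_variant_ids FILE
# Prints each base ID that has more than one assumption line, with its count.
list_variant_ids() {
  awk '
    $3 == "assumption" {
      id = $0
      sub(/^.*ID=/, "", id)     # drop everything up to and including "ID="
      sub(/[.;].*$/, "", id)    # drop the ".N" variant suffix and the rest
      n[id]++
    }
    END {
      for (id in n)
        if (n[id] > 1) print id, n[id]
    }
  ' "$1"
}
# Usage: list_variant_ids infile
```

For the sample data this should report only f000012, since it is the only ID with two variants.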


Result of the command:
Code:

sorce1      LEN  predictive    695    3570    0.770047        -      .      ID=f000001;source_id=A.off_LEN_10008424;
sorce1      LEN  assumption  695    3570    0.770047        -      .      ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1      LEN  descriptive    3334    3570    .      -      0      Parent=f000001.1;

sorce1      LEN  predictive    8859    11328  0.628724        +      .      ID=f000002;source_id=A.off_LEN_10008425;
sorce1      LEN  assumption    8859    11328  0.628724        +      .      ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive    8859    9032    .      +      0      Parent=f000002.1;

sorce1      LEN  predictive  354569    361011  0.628724        +      .      ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1      LEN  assumption    354569    361011  0.628724        +      .      ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive        354600    360111    .      +      0      Parent=f000012.1;

sorce1      LEN  predictive    350567    354686    0.628724        +      .      ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1      LEN  assumption    350567    354686    0.628724        +      .      ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1      LEN  descriptive    350567    353321    .      +      0                      Parent=f000012.2;

Whereas what I need for that ID is a single line like this:
sorce1 LEN predictive 350567 361011 0.628724 + . ID=f000012;source_id=A.off_LEN_10008425;


Is there a way to add only a single predictive line per ID, using the earliest start point and the farthest end point among its variants? The ID name in that line shouldn't include the variant suffix.

Thanks in advance.

pan64 02-17-2014 12:47 AM

I have probably missed something: your sed works exactly as you described, and I could not reproduce the problem.

colucix 02-17-2014 02:48 AM

A solution in awk (gensub requires GNU awk) that checks whether the ID has already been used:
Code:

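# For every assumption line, print a predictive copy once per base ID.
/assumption/ {
  line = $0
  # Extract the base ID: the part of ID=... before the first dot.
  i = gensub(/^.*ID=([^.]+)\.[^;]+;.*$/,"\\1","g",line)
  if ( ! _[i] ) {
    # First time this base ID is seen: build the predictive line.
    sub(/assumption/,"predictive",line)
    # Drop the ".N" variant suffix from the ID.
    line = gensub(/^(.*ID=[^.]+)\.[^;]+(;.*$)/,"\\1\\2","g",line)
    print line
  }
  _[i]++
}
# Print every input line unchanged.
1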


pan64 02-17-2014 02:57 AM

Now I think I understand: you need only the last line containing the same ID, and it should be printed at the first occurrence? Is that right?
That can be solved only in two passes: first parse the input file (looking for all the possible IDs) and calculate the lines, then print the result.
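
A rough sketch of such a two-pass approach, assuming POSIX awk and whitespace-separated columns laid out as in the sample (type in column 3, start in 4, end in 5, attributes in 9). The function name add_predictive is made up for illustration, and the file is simply read twice. Note that the added line is written tab-separated, and that its score, strand and phase are copied from the first variant, which may or may not be what is wanted.

```shell
# Hypothetical helper (not from the thread): add_predictive FILE
# Assumes whitespace-separated columns: $3 = type, $4 = start, $5 = end, $9 = attributes.
add_predictive() {
  awk '
    # Return the ID= value of line s without its ".N" variant suffix.
    function base_id(s) {
      sub(/^.*ID=/, "", s)
      sub(/[.;].*$/, "", s)
      return s
    }

    # Pass 1 (first read of the file): track min start and max end per base ID.
    NR == FNR {
      if ($3 == "assumption") {
        id = base_id($0)
        if (!(id in lo) || $4 + 0 < lo[id]) lo[id] = $4 + 0
        if (!(id in hi) || $5 + 0 > hi[id]) hi[id] = $5 + 0
      }
      next
    }

    # Pass 2 (second read): emit one predictive line before the first variant.
    $3 == "assumption" {
      id = base_id($0)
      if (!(id in seen)) {
        seen[id] = 1
        attr = $9
        sub(/^ID=[^;]*/, "ID=" id, attr)
        # Assumption: tab-separated output; score, strand and phase
        # are copied from the first variant of this ID.
        printf "%s\t%s\tpredictive\t%d\t%d\t%s\t%s\t%s\t%s\n", $1, $2, lo[id], hi[id], $6, $7, $8, attr
      }
    }

    # Every original line is printed unchanged.
    { print }
  ' "$1" "$1"
}
# Usage: add_predictive infile > outfile
```

Because the input is named twice on the awk command line, this needs a real file rather than a pipe.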

sam@ 02-18-2014 06:14 PM

reply
 
@colucix
I used the command:
Code:

awk '/assumption/ {
  line = $0
  i = gensub(/^.*ID=([^.]+)\.[^;]+;.*$/,"\\1","g",line)
  if ( ! _[i] ) {
    sub(/assumption/,"predictive",line)
    line = gensub(/^(.*ID=[^.]+)\.[^;]+(;.*$)/,"\\1\\2","g",line)
    print line
  }
  _[i]++
}
1
' infile > outfile

It's somehow changing the format of the file.


All times are GMT -5.