LinuxQuestions.org
03-06-2014, 08:46 PM   #1
sam@
Member
Registered: Sep 2013
Posts: 31

inner join in sed/awk


Hi,
I have a file that looks like this:


Quote:
source1 LEN predictive 392879 394347 0.955489 + . Name=sa000003.1;ID=sa000003;Alias=sa121751.1;
source1 LEN descriptive_1 391082 392878 . . . Parent=sa000003.1;supp_id=.1805.1;


source1 LEN predictive 347320 348798 0.485927 - . Name=sa000006.1;ID=sa000006;Alias=sa121750.;
source1 LEN descriptive_1 348799 351449 . . . Parent=sa000006.1;supp_id=1800.1;
source1 LEN descriptive_3 347154 347319 . . . Parent=sa000006.1;supp_id=1800.1;
I need to adjust the start and end locations (columns 4 and 5) of each predictive line so that they cover its descriptive_1 and descriptive_3 lines.

If the predictive start is greater than the start of a descriptive_1 or descriptive_3 line, the predictive start has to be replaced by that descriptive start.
If the predictive end is less than the end of a descriptive line, the predictive end has to be replaced by that descriptive_1 or descriptive_3 end.
In other words, every descriptive range should fall within the start and end of its predictive line.
Note that the Parent of each descriptive line is the same as the Name of its predictive line.
The end result should look like this:



Quote:
source1 LEN predictive 391082 394347 0.955489 + . Name=sa000003.1;ID=sa000003;Alias=sa121751.1;
source1 LEN descriptive_1 391082 392878 . . . Parent=sa000003.1;supp_id=.1805.1;

source1 LEN predictive 347154 351449 0.485927 - . Name=sa000006.1;ID=sa000006;Alias=sa121750.;
source1 LEN descriptive_1 348799 351449 . . . Parent=sa000006.1;supp_id=1800.1;
source1 LEN descriptive_3 347154 347319 . . . Parent=sa000006.1;supp_id=1800.1;


thanks!
 
03-07-2014, 10:31 PM   #2
allend
Senior Member
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,433

I think you need a scripted solution. I have called this one change.sc:
Code:
#!/bin/bash

while read -a aline; do
  # initialise variables if blank line
  if  [[ "${aline[0]}" == "" ]]; then
    start1=999999
    start3=999999
    end1=0
    end3=0
  # test for descriptive_3
  elif [[ "${aline[2]}" == "descriptive_3" ]]; then
    start3=${aline[3]}
    end3=${aline[4]}
  # test for descriptive_1
  elif [[ "${aline[2]}" == "descriptive_1" ]]; then
    start1=${aline[3]}
    end1=${aline[4]}
  # test for predictive
  elif [[ "${aline[2]}" == "predictive" ]]; then
    # set replacement values
    start=$(( $start1<$start3?$start1:$start3 ))
    end=$(( $end1>$end3?$end1:$end3 ))
    # change line to be output
    if (( ${aline[3]} > $start )) ; then
      aline[3]=$start
    fi
    if (( ${aline[4]} < $end )); then
      aline[4]=$end
    fi
  fi
  echo "${aline[@]}"
done
You will need to run it like this (the input is reversed so that the descriptive lines are read before their predictive line, and the output is reversed back):
Code:
tac yourfile | ./change.sc | tac
Add appropriate redirection to capture the output in a file.

Note: For proper operation, this code assumes that there is a blank line between sections with the same Parent/Name, as well as a blank line at the end of the file.
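
A minimal sketch of one way to make sure the input ends with a blank line before running the script. The file here is a throwaway created with mktemp purely for the demonstration, and the variable names are made up; point infile at your own data instead.

```shell
#!/bin/sh
# Hypothetical demo file standing in for the real input (no trailing blank line).
infile=$(mktemp)
printf 'source1 LEN predictive 1 2\nsource1 LEN descriptive_1 1 2\n' > "$infile"

# Append an empty line only when the last line is not already empty.
# Command substitution strips the newline, so an empty last line yields "".
if [ -n "$(tail -n 1 "$infile")" ]; then
  echo >> "$infile"
fi

last_line=$(tail -n 1 "$infile")
line_count=$(wc -l < "$infile" | tr -d ' ')
echo "lines: $line_count, last line: '$last_line'"
rm -f "$infile"
```

Because of the test on the last line, running the snippet a second time on the same file does not add another blank line.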

Last edited by allend; 03-08-2014 at 01:39 AM.
 
03-09-2014, 12:45 PM   #3
sam@
Member
Registered: Sep 2013
Posts: 31

Original Poster

When I ran it on this file:

Code:
source1 LEN predictive 392879 394347 0.955489 + . Name=sa3.1;ID=sa3;source_id=shLEN_10020387;identical_supp_id=.;p_supp_id=CUFF14.1805.1,CUFF14.1805.1;Alias=sa121751.1;
source1 LEN descriptive_5 391082 392878 . . . Parent=sa3.1;supp_id=CUFF14.1805.1;
source1 LEN predictive 408863 416522 0.999853 - . Name=sa4.1;ID=sa4;source_id=shLEN_10020388;identical_supp_id=CUFF14.1812.2;p_supp_id=.;
source1 LEN descriptive_5 416523 416853 . . . Parent=sa4.1;supp_id=CUFF14.1812.2;
source1 LEN descriptive_3 408716 408862 . . . Parent=sa4.1;supp_id=CUFF14.1812.3;
source1 LEN predictive 159485 170778 0.999006 - . Name=sa5.1;ID=sa5;source_id=shLEN_10020383;identical_supp_id=.;p_supp_id=CUFF14.1731.2,CUFF14.1731.2;
source1 LEN descriptive_5 170779 174239 . . . Parent=sa5.1;supp_id=CUFF14.1731.2;
source1 LEN predictive 347320 348798 0.485927 - . Name=sa6.1;ID=sa6;source_id=shLEN_10020386;identical_supp_id=CUFF14.1800.1;p_supp_id=.;Alias=sa121750.1,sa57152.1;
source1 LEN malign 347320 348798 0.485927 - . Name=sa6.1;ID=sa6.1;source_id=shLEN_10020386;identical_supp_id=CUFF14.1800.1;p_supp_id=.;Alias=sa121750.1,sa57152.1;
source1 LEN descriptive_5 348799 351449 . . . Parent=sa6.1;supp_id=CUFF14.1800.1;
source1 LEN Cancer 347320 348798 . - 0 Parent=sa6.1;
source1 LEN descriptive_3 347154 347319 . . . Parent=sa6.1;supp_id=CUFF14.1800.1;
it gave me this:
Code:
./change.sc: line 21: 347900<?347900: : syntax error: operand expected (error token is "?347900: ")
source1 LEN descriptive_1 347900 348600 . . . Parent=sa000007.1;supp_id=1800.1
Could you please advise?
 
03-09-2014, 02:14 PM   #4
millgates
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 840

I guess it's because one of the variables in the arithmetic context (probably start3) is not set. This brings me to an interesting issue: most of the time you hear people say that you don't need the '$' sign in bash arithmetic expansion and that the following are identical:

Code:
$(($a + 1))
$((a + 1))
This gives the impression that the dollar sign in an arithmetic context is optional and makes no difference, i.e. that it is simply discarded by the (( )) operator. That is not the case. In the first form, the "$a" token is replaced with its value by bash parameter expansion before the expression is passed to the arithmetic evaluator. I would say it is the equivalent of what some languages call "passing by value". That means you don't use '$' with an lvalue:
Code:
$(( $a = 1 ))
will not assign 1 to the variable a (* see below).
Also, if the variable a is unset or empty, code like

Code:
$(( $a < 1 ? 1 : 0 ))
means that the expression the arithmetic evaluator receives will be

Code:
< 1 ? 1 : 0
which is syntactically incorrect. If you omit the '$' sign, the a token is not substituted by parameter expansion; instead it is resolved by the arithmetic evaluator as a reference to a variable, and an unset or empty variable referenced this way evaluates to 0.
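
A small sketch of the two safe ways to write that test when a may be unset: leave the '$' off, or keep it and supply a default through parameter expansion. The variable names r1 and r2 are only for illustration.

```shell
#!/bin/sh
unset a

# Without '$': the arithmetic evaluator looks up a itself; unset counts as 0.
r1=$(( a < 1 ? 1 : 0 ))

# With '$' and a default: ${a:-0} expands to 0 when a is unset or empty,
# so the evaluator receives the complete expression "0 < 1 ? 1 : 0".
r2=$(( ${a:-0} < 1 ? 1 : 0 ))

echo "$r1 $r2"   # prints: 1 1
```

Applied to the script in post #2, the same idea would mean writing start1, start3, end1 and end3 without '$' inside $(( )), or giving each of them a ${...:-default}.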


(*) Note that using a dollar sign with an lvalue actually seems to work as yet another form of indirection in bash:

Code:
foo=bar
(( $foo = 4 ))
echo $foo
bar
echo $bar
4
Compare with
Code:
foo=bar
(( foo = 4 ))
echo $foo
4
echo $bar
 
03-09-2014, 07:05 PM   #5
allend
Senior Member
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,433

As millgates pointed out, the problem arises from a variable being unset.
With the file from post #3, if I run the script with
Code:
echo "" >> input1.txt ; tac input1.txt | ./change.sc | tac
then the error goes away.
As noted in post #2, the script expects a blank line at the end of the file, which causes the variables to be initialised.

Having seen a further example of the data you are dealing with, I note that the other assumption I made, i.e. that there is a blank line between sections with the same Parent/Name, does not hold.

I have modified the script to initialise the variables on startup and to re-initialise them after a 'predictive' line has been processed. This should work OK provided that there is only one 'predictive' line per section with the same Parent/Name.

Code:
#!/bin/bash

set_variables() {
    start1=999999
    start3=999999
    end1=0
    end3=0
}

# initialise on startup (a function is called without parentheses)
set_variables

while read -a aline; do
  # test for descriptive_3
  if [[ "${aline[2]}" == "descriptive_3" ]]; then
    start3=${aline[3]}
    end3=${aline[4]}
  # test for descriptive_1
  elif [[ "${aline[2]}" == "descriptive_1" ]]; then
    start1=${aline[3]}
    end1=${aline[4]}
  # test for predictive
  elif [[ "${aline[2]}" == "predictive" ]]; then
    # set replacement values
    start=$(( $start1<$start3?$start1:$start3 ))
    end=$(( $end1>$end3?$end1:$end3 ))
    # change line to be output
    if (( ${aline[3]} > $start )) ; then
      aline[3]=$start
    fi
    if (( ${aline[4]} < $end )); then
      aline[4]=$end
    fi
    # start over for the next section
    set_variables
  fi
  echo "${aline[@]}"
done
Unfortunately, I have just noticed another problem. The second data example shows the lines being labelled 'descriptive_3' and 'descriptive_5', which will cause the amended script to fail. Are these names important, or could they be changed to 'descriptive_1' and 'descriptive_3'? Do you have other data with further changes in the names?
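
If the names cannot be changed, one possible direction is to match any descriptive_ suffix with a case pattern and keep a single running minimum start and maximum end. This is only a sketch under assumptions: the sample is the sa4 group from post #3 with its attributes shortened, already in the reversed order that tac would produce, and min_start and max_end are made-up names.

```shell
#!/bin/sh
# Shortened sa4 group from post #3, in reversed (tac) order.
sample='source1 LEN descriptive_3 408716 408862 . . . Parent=sa4.1;
source1 LEN descriptive_5 416523 416853 . . . Parent=sa4.1;
source1 LEN predictive 408863 416522 0.999853 - . Name=sa4.1;'

result=$(printf '%s\n' "$sample" | {
  min_start=999999   # same sentinel as the script above
  max_end=0
  while read -r src len type start end rest; do
    case $type in
      descriptive_*)   # any suffix: _1, _3, _5, ...
        [ "$start" -lt "$min_start" ] && min_start=$start
        [ "$end" -gt "$max_end" ] && max_end=$end
        ;;
      predictive)
        # widen the predictive range, then reset for the next section
        [ "$start" -gt "$min_start" ] && start=$min_start
        [ "$end" -lt "$max_end" ] && end=$max_end
        min_start=999999
        max_end=0
        ;;
    esac
    echo "$src $len $type $start $end $rest"
  done
})
printf '%s\n' "$result"
```

Like the script above, this writes the fields back separated by single spaces, and it still relies on each section having exactly one predictive line.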

Last edited by allend; 03-09-2014 at 08:25 PM.
 
03-09-2014, 07:42 PM   #6
millgates
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 840

An alternative using awk:

Code:
tac inFile | awk '
    BEGIN { s = 1e9; e = 0; }
    $3 ~ /descriptive_.*/ { if ($4 < s) s = $4; if ($5 > e) e = $5; }
    $3 ~ /predictive/ { if ($4 > s) $4 = s; if ($5 < e) $5 = e; s = 1e9; e = 0; }
    1' | tac
 
03-09-2014, 08:32 PM   #7
allend
Senior Member
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,433

awk FTW! At least my general logic was OK.
 
03-11-2014, 01:44 AM   #8
sam@
Member
Registered: Sep 2013
Posts: 31

Original Poster

Thanks, all, for your valuable input!

I used this code:
Code:
tac inFile | awk '
    BEGIN { s = 1e9; e = 0; }
    $3 ~ /descriptive_.*/ { if ($4 < s) s = $4; if ($5 > e) e = $5; }
    $3 ~ /predictive/ { if ($4 > s) $4 = s; if ($5 < e) $5 = e; s = 1e9; e = 0; }
    1' | tac
But it changed the format of the file.

My infile is tab-separated, not space-separated:
Code:
 
source1 LEN predictive 392879 394347 0.955489 + . Name=sa000003.1;ID=sa000003;Alias=sa121751.1;
E.g. the words source1 and LEN are separated by a tab, and that separator is lost in the output.


Could anyone please advise?
 
03-11-2014, 03:55 AM   #9
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Here is another one that doesn't require tac. To preserve the original tab-separated format, you have to change the default output field separator (OFS), which is usually done in the BEGIN block.
Code:
BEGIN {
  FS = OFS = "\t"
}

/predictive/ {
  if ( c > 0 ) {
    new = $0
    $0 = line[1]
    $4 = start
    $5 = end
    print
    for ( i = 2; i <= c; i++ )
      print line[i]
    $0 = new
  }
  c = 0
  start = $4
  end = $5
  line[++c] = $0
}

/descriptive_1|descriptive_3/ {
  if ( $4 < start )
    start = $4
  if ( $5 > end )
    end = $5
  line[++c] = $0
}

!/predictive|descriptive_1|descriptive_3/ {
  line[++c] = $0
}

END {
  if ( c > 0 ) {
    $0 = line[1]
    $4 = start
    $5 = end
    print
    for ( i = 2; i <= c; i++ )
      print line[i]
  }
}
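
To see why the BEGIN block matters, here is a small sketch on a single made-up, tab-separated line: assigning to a field makes awk rebuild the record using OFS, which is a single space unless you change it.

```shell
#!/bin/sh
# One tab-separated record, built with printf so the tabs are real.
line=$(printf 'source1\tLEN\tpredictive\t392879\t394347')

# Default OFS: assigning $4 rebuilds $0 joined with spaces, so the tabs are lost.
default_out=$(printf '%s\n' "$line" | awk '{ $4 = 391082; print }')

# FS = OFS = "\t": fields are split on tabs and the rebuilt record keeps them.
tab_out=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "\t" } { $4 = 391082; print }')

printf '%s\n%s\n' "$default_out" "$tab_out"
```

The first output line comes back space-separated, the second one tab-separated, which is the difference between the one-liner in post #6 and the program above.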
 
03-11-2014, 03:37 PM   #10
sam@
Member
Registered: Sep 2013
Posts: 31

Original Poster

Thanks for all the input.

However, the code in reply #9 didn't make the required changes to the infile: it didn't change the start and end locations of the predictive lines.
The infile can be seen in post #3.

I ran the awk command followed by the code, like this:
$ awk ' code infile

Am I missing something?
 
03-11-2014, 06:13 PM   #11
colucix
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Sam, pardon me, but it's not clear what your awk skills really are. You asked for sed/awk help, which gave me the impression that you knew what you were talking about. Am I wrong? The code I posted in #9 was written for the input sample given in post #1 (my fault). What puzzles me is that a very slight modification adapts the code to the sample in post #3 (the patterns now match descriptive_ with any suffix, and the tab separator setting is commented out), and I am surprised you didn't show any effort to do this yourself. Anyway, here we go:
Code:
BEGIN {
  #
  #  Uncomment this if TAB separated I/O
  #
  #FS = OFS = "\t"
  nope  # placeholder expression that does nothing; keeps the block non-empty
}

/predictive/ {
  if ( c > 0 ) {
    new = $0
    $0 = line[1]
    $4 = start
    $5 = end
    print
    for ( i = 2; i <= c; i++ )
      print line[i]
    $0 = new
  }
  c = 0
  start = $4
  end = $5
  line[++c] = $0
}

/descriptive_/ {
  if ( $4 < start )
    start = $4
  if ( $5 > end )
    end = $5
  line[++c] = $0
}

!/predictive|descriptive_/ {
  line[++c] = $0
}

END {
  if ( c > 0 ) {
    $0 = line[1]
    $4 = start
    $5 = end
    print
    for ( i = 2; i <= c; i++ )
      print line[i]
  }
}
Regarding the changes in the original file, this is the way awk works: it doesn't modify its input. sed, by contrast, can edit a file in place with the -i option, but with awk you have to redirect the output to a new file and then, if you want, rename the new file to the original name.
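
A minimal sketch of that redirect-and-rename step. The data file and the one-line awk program are stand-ins invented for the demonstration; both files are created with mktemp.

```shell
#!/bin/sh
# Hypothetical tab-separated data file standing in for the real infile.
infile=$(mktemp)
printf 'a\tb\n' > "$infile"

# awk writes to a separate temporary file; the original is replaced
# only if awk exits successfully.
tmpfile=$(mktemp)
awk 'BEGIN { FS = OFS = "\t" } { $2 = "changed"; print }' "$infile" > "$tmpfile" &&
  mv "$tmpfile" "$infile"

content=$(cat "$infile")
printf '%s\n' "$content"
rm -f "$infile"
```

For the program above, saved for example as fix.awk (a made-up name), the same pattern would read awk -f fix.awk infile > infile.new && mv infile.new infile.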
 
03-11-2014, 11:01 PM   #12
sam@
Member
Registered: Sep 2013
Posts: 31

Original Poster
Sorry, but I had formatting problems in my infile, which was produced by parsing the output of another program.
That's what caused the error at my end.
 
  

