LinuxQuestions.org


kswapnadevi 12-02-2010 01:46 PM

shell scripting
 
Sir, I have a big data file in the format given below. I need to extract the lines that start with ‘Chr’ followed by a number, such as ‘Chr5’ or ‘Chr25’, and build a range for each line by subtracting 10 from the number after the hyphen (left bound) and adding 10 to it (right bound). Put another way: remove the lines that start with miR, let, CH, or a bare number, and the lines that contain ‘Chr’ with no number (shown in blue). For example, Chr5-26236044 has to be displayed as Chr5:26236034-26236054. If the same chromosome appears n times with the same range in the output, n-1 of those entries must be removed; entries with different ranges must all be kept. In the output below, Chr5 appears twice with the same range, so one of the two has to be removed (shown in red). Chr2 appears three times with different ranges, so all three should stay. The required output is given below. The output format must be Chrno:number1-number2 (no spaces). A shell script for this would be highly appreciated. Thanks in advance.


Quote:


Chr

Chr

Chr5-26236044
Chr25-2622227
Chr10-23813153
ChrX-62081599
miR-1-1-3p;Chr13:55237544-55237619
Chr18-31139230
miR-2331-3p;Chr19:15308148-15308218
CH240-242E2-CH240-416P12-96217

Chr2-66268692
miR-2379-5p;Chr23:30788153-30788230
Chr13-3857984
Chr23-29971922
let-7a-2-5p;Chr15:33347557-33347652
Chr4-120427453
Chr2-119023403
miR-2347-3p;Chr19:51593973-51594031
Chr25-21194342
miR-449b-5p;Chr20:23967269-23967366
Chr25-9506360
Chr2-66270795
Chr5-26236044
miR-2484-5p;ChrX:20461131-20461206
93748382

Output required:

Quote:

Chr5:26236034-26236055
Chr25:2622217-2622237
Chr10:23813143-26813163
ChrX:62081589-62081609
Chr18:31139220-31139240
Chr2:66268682-66268702
Chr13:3857974-3857994
Chr23:29971912-29971932
Chr4:120427443-120427463
Chr2:119023393-119023413
Chr25:32391131-32391151
Chr25:9506350-9506370
Chr2:66270785-66270805
Chr5:26236034-26236055

GrapefruiTgirl 12-02-2010 01:58 PM

Can it be done with awk?

I don't understand the significance of the red coloring of the final line of your output, but here's an awk snippet which seems to produce the output you want (including correctly adding 10 to the right, not 11):
Code:

awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' input_filename
EDIT: Sorry, I see the part about the dupes now. The above code won't get rid of dupes (yet!).

Here's an update:
Code:

awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' filename |\
  cat -n | sort -k2 | uniq -u | sort -k1 | awk '{print $2}'

However, I'm still unclear about this line in the output, because of the "X" in it...:
Code:

ChrX:62081589-62081609

colucix 12-02-2010 02:06 PM

Code:

awk '/^Chr.+/
{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' file


GrapefruiTgirl 12-02-2010 02:13 PM

Heh, I'm not doing well today; here's a *working* version:
Code:

awk -F"-" '/^Chr[0-9]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

I kept missing that duplicate in red - I think the red threw me off ;)

Output:
Code:

Chr5:26236034-26236054
Chr25:2622217-2622237
Chr10:23813143-23813163
Chr18:31139220-31139240
Chr2:66268682-66268702
Chr13:3857974-3857994
Chr23:29971912-29971932
Chr4:120427443-120427463
Chr2:119023393-119023413
Chr25:21194332-21194352
Chr25:9506350-9506370
Chr2:66270785-66270805
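
A note on the pipeline in the post above: it follows a "decorate, sort, dedupe, undecorate" pattern. Each line is tagged with its line number, sorted on the payload so duplicates become adjacent, deduplicated with `uniq -f1` (which skips the line-number field when comparing), then put back into input order. Below is a minimal sketch of that pattern, not the poster's exact command. The function name `dedupe_with_sort` is hypothetical, it uses a stable sort (`-s`, available in GNU and BSD sort) so the first occurrence survives, and it assumes input lines contain no whitespace.

```shell
#!/bin/sh
# Sketch: remove duplicate lines while keeping the original order,
# using only sort/uniq (assumes lines contain no whitespace).
dedupe_with_sort() {
  awk '{ print NR, $0 }' |          # decorate: "<line number> <line>"
    LC_ALL=C sort -s -k2,2 |        # group identical payloads; -s keeps input order within a group
    uniq -f1 |                      # compare everything after field 1, keep first of each group
    sort -n -k1,1 |                 # restore original order by line number
    cut -d' ' -f2-                  # undecorate: drop the line number
}

printf '%s\n' 'Chr5:10-30' 'Chr2:5-25' 'Chr5:10-30' | dedupe_with_sort
```

With the three sample lines, the second Chr5 entry is dropped and Chr5 still comes before Chr2, as in the input.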


TB0ne 12-02-2010 02:14 PM

Is it just me, or does the OP only seem to want folks to write scripts for him?

kswapnadevi 12-02-2010 03:28 PM

Shell scripting
 
Sir, ChrX:62081589-62081609 is also required in the output. Lines where Chr is followed by either a number or a letter are required, sir.


GrapefruiTgirl 12-02-2010 03:34 PM

Code:

awk -F"-" '/^Chr[^\n]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

Perhaps this adjustment to the regex will help with that. Or maybe this:
Code:

awk -F"-" '/^Chr[[:alnum:]]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

Or this:
Code:

awk -F"-" '/^Chr./{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

And please, that's Miss, not Sir. ;)

colucix 12-02-2010 03:52 PM

Nice code, but you don't really need the sort and uniq stuff if you can do all in awk, as shown in post #3. Here you can simply print a line if it has not been printed before, otherwise ignore it. :)
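
For readers unfamiliar with the idiom: "print a line only if it has not been printed before" is usually written in awk with an associative array used as a set. A minimal sketch follows; the function name `dedupe_keep_first` and the array name `seen` are made up for illustration. The expression `!seen[$0]++` is true only the first time a given line is seen, because the post-increment returns the old count (0 the first time) and then bumps it.

```shell
#!/bin/sh
# Sketch: keep the first occurrence of every line, preserving input order.
dedupe_keep_first() {
  # seen[$0]++ evaluates to the previous count; "!" makes the pattern
  # true only when that count was 0, and the default action prints the line.
  awk '!seen[$0]++'
}

printf '%s\n' b a b c a | dedupe_keep_first
# prints: b, a, c (one per line)
```

No sorting is needed, so the output keeps the order of the input.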

GrapefruiTgirl 12-02-2010 03:57 PM

Ehh, very true ;) but when someone else does your homework, you cannot always count on getting the most efficient results. :D

BTW, your code doesn't work for me, but I have not investigated:
Code:

sasha@reactor: awk '/^Chr.+/
{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' filename
:-10-10
Chr:-10-10
Chr5-26236044
Chr5:26236034-26236054
Chr25-2622227
Chr25:2622217-2622237
Chr10-23813153
Chr10:23813143-23813163
ChrX-62081599
ChrX:62081589-62081609
miR:-9-11
Chr18-31139230
Chr18:31139220-31139240
miR:2321-2341
CH240:24190-24210
Chr2-66268692
Chr2:66268682-66268702
miR:2369-2389
Chr13-3857984
Chr13:3857974-3857994
Chr23-29971922
Chr23:29971912-29971932
let:-3-17
Chr4-120427453
Chr4:120427443-120427463
Chr2-119023403
Chr2:119023393-119023413
miR:2337-2357
Chr25-21194342
Chr25:21194332-21194352
miR:439-459
Chr25-9506360
Chr25:9506350-9506370
Chr2-66270795
Chr2:66270785-66270805
Chr5-26236044
miR:2474-2494
93748382:-10-10
sasha@reactor:

:scratch:

GrapefruiTgirl 12-02-2010 04:08 PM

This appears to work also, and is nicer than all that sorting stuff:
Code:

awk -F"-" '/^Chr[[:alnum:]]/{if(!arr[$0]){arr[$0]++; print $1 ":" $2-10 "-" $2+10}}' filename
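
As a worked check of the one-liner above, here is a sketch that wraps the same idea in a shell function and runs it on a few lines from the original post. The function name `chr_windows` is hypothetical. It spells the character class as `[0-9A-Za-z]` instead of `[[:alnum:]]` on the assumption that some older awk builds lack POSIX classes, and it adds parentheses around the arithmetic for readability only.

```shell
#!/bin/sh
# Sketch: Chr<id>-<pos>  ->  Chr<id>:<pos-10>-<pos+10>, first occurrence only.
chr_windows() {
  awk -F'-' '
    # Match "Chr" followed by at least one letter or digit, and skip
    # any input line that has already been seen.
    /^Chr[0-9A-Za-z]/ && !seen[$0]++ {
      print $1 ":" ($2 - 10) "-" ($2 + 10)
    }' "$@"
}

printf '%s\n' \
  'Chr' \
  'Chr5-26236044' \
  'miR-1-1-3p;Chr13:55237544-55237619' \
  'ChrX-62081599' \
  'Chr5-26236044' \
  '93748382' | chr_windows
# prints:
# Chr5:26236034-26236054
# ChrX:62081589-62081609
```

The bare `Chr` line, the `miR` line, and the bare number are filtered out by the pattern, and the repeated Chr5 line is dropped by the `seen` check.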

colucix 12-02-2010 04:21 PM

That's strange: I split my test one-liner across multiple lines (for readability) and indeed it doesn't work. But if I repeat my test as a one-liner, it works:
Code:

awk '/^Chr.+/{split($0,_,"-");str=sprintf("%s:%d-%d",_[1],_[2]-10,_[2]+10);if (! arr[str]++) print str}' file
I can't see the difference right now. Anyway, your code in the post above is better!


Edit: oops... it appears I mistakenly put the opening brace on a new line, so the code is split into two rules: one is just a pattern with no action (which prints the whole matching record), the other is an action with no pattern, which is applied to all records. Sorry... :redface:
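
The pitfall described in the edit above can be reproduced on a two-line input. This is a small sketch with made-up function names (`brace_on_next_line`, `brace_on_same_line`). In awk a newline ends a rule, so a pattern alone on a line gets the default action (print the record), and a brace block that follows on the next line becomes a separate rule that runs for every record.

```shell
#!/bin/sh
# Opening brace on its own line: awk parses TWO rules.
#   rule 1: /^Chr/          -> default action, prints matching records
#   rule 2: { print ... }   -> no pattern, runs for every record
brace_on_next_line() {
  awk '/^Chr/
{ print "action: " $0 }'
}

# Opening brace on the pattern line: ONE rule, as intended.
brace_on_same_line() {
  awk '/^Chr/ {
  print "action: " $0
}'
}

printf '%s\n' Chr5-100 miR-1 | brace_on_next_line
# prints:
# Chr5-100
# action: Chr5-100
# action: miR-1

printf '%s\n' Chr5-100 miR-1 | brace_on_same_line
# prints:
# action: Chr5-100
```

This matches the symptom in the earlier transcript, where every input line was processed and each Chr line also appeared unmodified.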

GrapefruiTgirl 12-02-2010 04:27 PM

Yup, it does work, and as multi-line also; your code below, but with the initial "{" moved to directly after the /.../

Code:

awk '/^Chr.+/{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10);
  if ( ! arr[str]++ ) print str
}' filename


kswapnadevi 12-02-2010 04:43 PM

Shell scripting
 
Respected Miss,
The output should be in sorted order, madam, like: Chr2, Chr3, ... ChrX, ChrY, ChrZ. Thanks in advance.



GrapefruiTgirl 12-02-2010 05:08 PM

That was not a requirement of the original problem. However, `sort -V` will do it if you pipe the output of the code you're currently using into it:
Code:

awk '/^Chr.+/{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10);
  if ( ! arr[str]++ ) print str
}' filename | sort -V
Chr2:66268682-66268702
Chr2:66270785-66270805
Chr2:119023393-119023413
Chr4:120427443-120427463
Chr5:26236034-26236054
Chr10:23813143-23813163
Chr13:3857974-3857994
Chr18:31139220-31139240
Chr23:29971912-29971932
Chr25:2622217-2622237
Chr25:9506350-9506370
Chr25:21194332-21194352
ChrX:62081589-62081609

No guarantees this is reliable - after all, these are not really version numbers we're sorting. It would be better to pre-sort the output appropriately while still inside the awk, but for the time being, I'll leave that as an exercise for you or someone else.

Hey, what do I get if you graduate? :)
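
If `sort -V` is unavailable or feels too loose, one alternative is to build explicit sort keys. The following is only a sketch under stated assumptions, not something proposed in the thread: the function name `sort_regions` is hypothetical, input lines are assumed to look like `Chr<id>:<start>-<end>` with no spaces, numeric chromosomes are placed before lettered ones, and ties are broken by the numeric start position.

```shell
#!/bin/sh
# Sketch: order regions by chromosome (numeric ones first, then letters),
# then by start position, using decorate / sort / undecorate.
sort_regions() {
  awk -F'[:-]' '{
    chrom = substr($1, 4)                 # text after the leading "Chr"
    if (chrom ~ /^[0-9]+$/)
      print 0, chrom, "-", $2, $0         # class 0: numeric chromosome
    else
      print 1, 0, chrom, $2, $0           # class 1: lettered chromosome (X, Y, ...)
  }' |
    LC_ALL=C sort -k1,1n -k2,2n -k3,3 -k4,4n |
    awk '{ print $5 }'                    # undecorate: keep the original line
}

printf '%s\n' \
  'Chr10:23813143-23813163' \
  'ChrX:62081589-62081609' \
  'Chr2:119023393-119023413' \
  'Chr2:66268682-66268702' | sort_regions
# prints:
# Chr2:66268682-66268702
# Chr2:119023393-119023413
# Chr10:23813143-23813163
# ChrX:62081589-62081609
```

The explicit keys make the ordering rules visible, at the cost of a longer command than `sort -V`.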

GrapefruiTgirl 12-02-2010 07:03 PM

sorting with awk (sorting arrays)
 
In case anyone is (still?) interested, here's an awk script that sorts the output according to requirements. Note that my awk skills are not as good as those of some others around here (I did this to see if I could), and so the way I've implemented the sorting might not be even close to the "best" way.

It doesn't appear to me that awk has a whole whack of super sorting capability, so I have separated the input data into 3 arrays, according to the character(s) following the leading "Chr": uppercase, lowercase, or digits. This allows the contents of each array to be sorted against similar types of characters, because otherwise awk doesn't produce the desired (numeric, lower, upper) order of sorted output.

Also, you'll note that in the middle of the script, I temporarily pad leading zeroes onto any digit(s) following the "Chr", because awk doesn't seem to sort integers in natural numeric order unless they are the same length. I may have overlooked something there (maybe the characters following the digits are causing trouble?), but here's what I'm talking about:
Code:

18:31139220-31139240
23:29971912-29971932
25:21194332-21194352
25:2622217-2622237
25:9506350-9506370
2:119023393-119023413
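
A likely explanation for the order above is that the values are compared as strings, character by character: in byte order ":" sorts after "5", so "2:..." lands after "25:...". The sketch below shows the effect with plain `sort` rather than awk's `asort`, and shows how zero-padding the leading number makes string order agree with numeric order. The helper name `pad_chrom` is hypothetical, and `LC_ALL=C` is set so that byte order is used.

```shell
#!/bin/sh
# Sketch: zero-pad the number before the first ":" to three digits.
pad_chrom() {
  awk -F: '{ printf "%03d:%s\n", $1, $2 }'
}

# String order: "25:..." sorts before "2:..." because "5" < ":" in byte order.
printf '%s\n' '25:21194332-21194352' '2:119023393-119023413' | LC_ALL=C sort
# prints:
# 25:21194332-21194352
# 2:119023393-119023413

# After padding, string order matches numeric order.
printf '%s\n' '25:21194332-21194352' '2:119023393-119023413' | pad_chrom | LC_ALL=C sort
# prints:
# 002:119023393-119023413
# 025:21194332-21194352
```

Three digits of padding corresponds to the 0 to 999 range mentioned in the script's edit note.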

Here's the code:
Code:

#!/usr/bin/awk -f
# Note: asort() is a GNU awk (gawk) extension, so awk must be gawk here.

BEGIN{
        FS = "-"
}


{
        if(/^Chr[[:alnum:]]/){
                gsub("^Chr","",$1)
                if(!LWR[$0] && !UPP[$0] && !NUM[$0]){
                        if($1 ~ /^[[:upper:]]/){
                                UPP[$0] = $1 ":" $2-10 "-" $2+10
                                next
                        }
                        if($1 ~ /^[[:lower:]]/){
                                LWR[$0] = $1 ":" $2-10 "-" $2+10
                                next
                        }
                        if($1 ~ /^[[:digit:]]/){
                                $1 = sprintf("%03u",$1)
                                NUM[$0] = $1 ":" $2-10 "-" $2+10
                        }
                }
        }
}


END{
        UPPS = asort(UPP); LWRS = asort(LWR); NUMS = asort(NUM)
        for(x=1 ; x<=NUMS ; x++){
                if(NUM[x]){
                        gsub("^0|^00","",NUM[x])
                        print "Chr" NUM[x]
                }
        }
        for(x=1; x<=LWRS; x++){
                if(LWR[x]){
                        print "Chr" LWR[x]
                }
        }
        for(x=1; x<=UPPS; x++){
                if(UPP[x]){
                        print "Chr" UPP[x]
                }
        }
}

Output for me now:
Code:

./awksorting input_filename
Chr2:119023393-119023413
Chr2:66268682-66268702
Chr2:66270785-66270805
Chr4:120427443-120427463
Chr5:26236034-26236054
Chr10:23813143-23813163
Chr13:3857974-3857994
Chr18:31139220-31139240
Chr23:29971912-29971932
Chr25:21194332-21194352
Chr25:2622217-2622237
Chr25:9506350-9506370
ChrgS:62081589-62081609
ChrgZ:62081589-62081609
Chry:894371222-894371242
ChrC:62081589-62081609
ChrP:93847555-93847575
ChrX:62081589-62081609
ChrZ:72081589-72081609
sasha@reactor:

I call my script "awksorting"; whatever you call yours, execute it by name and give the filename of the input file as the argument, as shown above.

EDITS:
1) Fix so script reliably supports input lines starting with "ChrN" where: (0 <= N) && (N <= 999)


All times are GMT -5.