LinuxQuestions.org - [SOLVED] shell scripting

Sir, I have a big data file in the format given below. I have to extract the lines that start with ‘Chr’ followed by a number, like ‘Chr5’, ‘Chr25’, etc., and build a range for each line by subtracting 10 from the number for the left end and adding 10 for the right end. (Put another way: remove the lines starting with miR, let, CH, a bare number, or Chr without a number; these were shown in blue.) For example, Chr5-26236044 has to be displayed as Chr5:26236034-26236054. If the same Chr5 range occurs n times in the output file, n-1 of those entries are to be removed; otherwise every Chr5 entry must be present. In the given output, Chr5 with the same range occurs twice, so one has to be removed (shown in red). Similarly, Chr2 occurs 3 times with different ranges, so all three should stay. I have given the required output. The output format must be Chrno:number1-number2 (no spaces). A shell script for this would be highly appreciated. Thanks in advance.

Quote:

Chr

Chr
Chr5-26236044
Chr25-2622227
Chr10-23813153
ChrX-62081599
miR-1-1-3p;Chr13:55237544-55237619
Chr18-31139230
miR-2331-3p;Chr19:15308148-15308218
CH240-242E2-CH240-416P12-96217
Chr2-66268692
miR-2379-5p;Chr23:30788153-30788230
Chr13-3857984
Chr23-29971922
let-7a-2-5p;Chr15:33347557-33347652
Chr4-120427453
Chr2-119023403
miR-2347-3p;Chr19:51593973-51594031
Chr25-21194342
miR-449b-5p;Chr20:23967269-23967366
Chr25-9506360
Chr2-66270795
Chr5-26236044
miR-2484-5p;ChrX:20461131-20461206
93748382

Output required:

Quote:

Chr5:26236034-26236055
Chr25:2622217-2622237
Chr10:23813143-23813163
ChrX:62081589-62081609
Chr18:31139220-31139240
Chr2:66268682-66268702
Chr13:3857974-3857994
Chr23:29971912-29971932
Chr4:120427443-120427463
Chr2:119023393-119023413
Chr25:21194332-21194352
Chr25:9506350-9506370
Chr2:66270785-66270805
Chr5:26236034-26236055

Can it be done with awk?

I don't understand the significance of the red coloring of the final line of your output, but here's an awk snippet which seems to produce the output you want (including correctly adding 10 to the right, not 11):

Code:

awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' input_filename
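To see what that one-liner does, here is a minimal sketch that feeds it a few lines of the sample data on standard input instead of a file. With `-F"-"` the line `Chr5-26236044` splits into `$1 = Chr5` and `$2 = 26236044`; the `miR...`, bare `Chr`, and bare number lines never match `/^Chr[0-9]/` and are skipped. The `sample` and `out` variable names are only for illustration.

```shell
# A few lines from the sample input (illustrative subset).
sample='Chr
Chr5-26236044
miR-1-1-3p;Chr13:55237544-55237619
Chr25-2622227
93748382'

# -F"-" splits each record on "-". $2-10 and $2+10 are evaluated as numbers
# first, then concatenated with $1, ":" and "-" into the output string.
out=$(printf '%s\n' "$sample" |
  awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }')

printf '%s\n' "$out"
```

This should print `Chr5:26236034-26236054` and `Chr25:2622217-2622237`, one per line.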

EDIT: Sorry, I see the part about the dupes now. The above code won't get rid of dupes (yet!).

Here's an update:

Code:

awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' filename |\
  cat -n | sort -k2 | uniq -u | sort -k1 | awk '{print $2}'

However, I'm still unclear about this line in the output, because of the "X" in it...:

Code:

ChrX:62081589-62081609

Code:

awk '/^Chr.+/
{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' file

Heh, I'm not doing well today; here's a *working* version:

Code:

awk -F"-" '/^Chr[0-9]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

I kept missing that duplicate in red - I think the red threw me off ;)
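The idea behind that pipeline is: tag each converted line with its input line number, sort so that identical ranges sit next to each other, drop the repeats while ignoring the tag, then sort back by line number and strip the tag. Below is a minimal sketch of the same idea, assuming POSIX `sort`, `uniq`, and `cut`. It is a variation, not the exact command above: it sorts on the range first and on the line number second (numerically), so that `uniq` keeps the earliest occurrence of each range. The `sample` and `out` names are illustrative.

```shell
sample='Chr5-26236044
Chr25-2622227
Chr2-66268692
miR-2379-5p;Chr23:30788153-30788230
Chr2-119023403
Chr5-26236044'

# 1) awk:   decorate, i.e. prefix each converted line with its line number
# 2) sort:  group identical ranges, earliest line number first in each group
# 3) uniq:  skip field 1 (the line number) when comparing, keep the first
# 4) sort:  restore the original input order
# 5) cut:   undecorate, i.e. drop the line number again
out=$(printf '%s\n' "$sample" |
  awk -F"-" '/^Chr[0-9]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' |
  sort -k2,2 -k1,1n |
  uniq -f1 |
  sort -n |
  cut -d' ' -f2)

printf '%s\n' "$out"
```

With this input the second `Chr5` line is dropped and the remaining four ranges come out in their original order.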

Output:

Code:

Chr5:26236034-26236054
Chr25:2622217-2622237
Chr10:23813143-23813163
Chr18:31139220-31139240
Chr2:66268682-66268702
Chr13:3857974-3857994
Chr23:29971912-29971932
Chr4:120427443-120427463
Chr2:119023393-119023413
Chr25:21194332-21194352
Chr25:9506350-9506370
Chr2:66270785-66270805

Is it just me, or does the OP only seem to want folks to write scripts for him?

Sir, ChrX:62081589-62081609 is also required in the output. Lines with Chr followed by either a number or a letter are required, sir.

Quote:

Originally Posted by GrapefruiTgirl (Post 4178632)

However, I'm still unclear about this line in the output, because of the "X" in it...:

Code:

ChrX:62081589-62081609

Code:

awk -F"-" '/^Chr[^\n]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

Perhaps this adjustment to the regex will help with that. Or maybe this:

Code:

awk -F"-" '/^Chr[[:alnum:]]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

Or this:

Code:

awk -F"-" '/^Chr./{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'

And please, that's Miss, not Sir. ;)

Nice code, but you don't really need the sort and uniq stuff if you can do it all in awk, as shown in post #3. There you can simply print a line if it has not been printed before, and otherwise ignore it. :)

Ehh, very true ;) but when someone else does your homework, you cannot always count on getting the most efficient results. :D

BTW, your code doesn't work for me, but I have not investigated:

Code:

sasha@reactor: awk '/^Chr.+/
{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' filename
:-10-10
Chr:-10-10
Chr5-26236044
Chr5:26236034-26236054
Chr25-2622227
Chr25:2622217-2622237
Chr10-23813153
Chr10:23813143-23813163
ChrX-62081599
ChrX:62081589-62081609
miR:-9-11
Chr18-31139230
Chr18:31139220-31139240
miR:2321-2341
CH240:24190-24210
Chr2-66268692
Chr2:66268682-66268702
miR:2369-2389
Chr13-3857984
Chr13:3857974-3857994
Chr23-29971922
Chr23:29971912-29971932
let:-3-17
Chr4-120427453
Chr4:120427443-120427463
Chr2-119023403
Chr2:119023393-119023413
miR:2337-2357
Chr25-21194342
Chr25:21194332-21194352
miR:439-459
Chr25-9506360
Chr25:9506350-9506370
Chr2-66270795
Chr2:66270785-66270805
Chr5-26236044
miR:2474-2494
93748382:-10-10
sasha@reactor:

:scratch:

This appears to work also, and is nicer than all that sorting stuff:

Code:

awk -F"-" '/^Chr[[:alnum:]]/{if(!arr[$0]){arr[$0]++; print $1 ":" $2-10 "-" $2+10}}' filename
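The de-duplicating one-liners in this thread all rely on the same awk idiom: an associative array used as a "have I seen this already?" set. Here is a minimal sketch of it. The array name `seen` is illustrative, and `[0-9A-Za-z]` is written out in place of `[[:alnum:]]` on the assumption that some older awk builds lack POSIX character classes.

```shell
sample='Chr5-26236044
ChrX-62081599
Chr2-66268692
Chr5-26236044
Chr'

# seen[$0]++ yields the count *before* incrementing: 0 (false) the first time
# a line appears and non-zero afterwards, so !seen[$0]++ is true only once.
out=$(printf '%s\n' "$sample" |
  awk -F"-" '/^Chr[0-9A-Za-z]/ && !seen[$0]++ { print $1 ":" $2-10 "-" $2+10 }')

printf '%s\n' "$out"
```

The repeated `Chr5` line and the bare `Chr` line produce no output; `ChrX` is kept because the character class also accepts letters.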

That's strange: I split my test one-liner into multiple lines (for readability) and indeed it doesn't work. But if I repeat my test as a one-liner it works:

Code:

awk '/^Chr.+/{split($0,_,"-");str=sprintf("%s:%d-%d",_[1],_[2]-10,_[2]+10);if (! arr[str]++) print str}' file

I cannot see the difference right now. Anyway, your code in the post above is better!

Edit: oops... it appears I erroneously put the opening brace on a new line, so the code is split into two rules: one that is just a pattern with no action (so awk prints the whole matching record), and one that is an action with no pattern, which is applied to every record. Sorry... :redface:
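That pitfall is easy to reproduce. In awk, a pattern followed by a newline is a complete rule whose default action is to print the record, and a `{ ... }` block on its own line is a separate rule that runs for every record. A small sketch with made-up two-line input (the `broken` and `fixed` names are only for illustration):

```shell
sample='Chr5-26236044
miR-1-1-3p'

# Brace on a new line: two independent rules.
#   rule 1: /^Chr.+/          default action, prints the matching line as-is
#   rule 2: { print "hit" }   no pattern, so it runs for every input line
broken=$(printf '%s\n' "$sample" | awk '/^Chr.+/
{ print "hit" }')

# Brace on the same line as the pattern: one rule, action only on matches.
fixed=$(printf '%s\n' "$sample" | awk '/^Chr.+/{ print "hit" }')

printf 'broken:\n%s\nfixed:\n%s\n' "$broken" "$fixed"
```

The broken form echoes the `Chr5` input line and prints `hit` twice (once per input line); the fixed form prints `hit` once.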

Yup, it does work, and as multi-line also; your code below, but with the initial "{" moved to directly after the /.../

Code:

awk '/^Chr.+/{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10);
  if ( ! arr[str]++ ) print str
}' filename

Respected Miss,
The output should be in sorted order, madam, like: Chr2, Chr3, ..., ChrX, ChrY, ChrZ. Thanks in advance.

Quote:

Originally Posted by kswapnadevi (Post 4178798)

Respected Miss,
The output should be in sorted order, madam, like: Chr2, Chr3, ..., ChrX, ChrY, ChrZ. Thanks in advance.

That was not a requirement of the original problem. However, `sort -V` will do it, if you pipe the output of the code you're currently using into it:

Code:

awk '/^Chr.+/{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10);
  if ( ! arr[str]++ ) print str
}' filename | sort -V

Chr2:66268682-66268702
Chr2:66270785-66270805
Chr2:119023393-119023413
Chr4:120427443-120427463
Chr5:26236034-26236054
Chr10:23813143-23813163
Chr13:3857974-3857994
Chr18:31139220-31139240
Chr23:29971912-29971932
Chr25:2622217-2622237
Chr25:9506350-9506370
Chr25:21194332-21194352
ChrX:62081589-62081609

No guarantees this is reliable - after all, these are not really version numbers we're sorting. It would be better to pre-sort the output appropriately while still inside the awk, but for the time being, I'll leave that as an exercise for you or someone else.
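One possible way to avoid depending on `sort -V` (a GNU extension) is to have awk emit explicit sort keys and let a plain `sort` order on them. The sketch below is only one approach and makes assumptions of its own: the name after `Chr` is either all digits or starts with a letter, numbered chromosomes come first, and ranges within one chromosome are ordered by their start position. The key layout and the variable names (`cls`, `num`, `start`, `end`, `seen`, `sample`, `tab`, `out`) are invented for illustration.

```shell
sample='Chr5-26236044
ChrX-62081599
Chr25-2622227
Chr2-66268692
Chr2-119023403
Chr10-23813153
Chr5-26236044'

tab=$(printf '\t')

out=$(printf '%s\n' "$sample" |
  awk -F"-" -v OFS='\t' '
    /^Chr[0-9A-Za-z]/ && !seen[$0]++ {
      name  = substr($1, 4)        # text after "Chr", e.g. "25" or "X"
      isnum = (name ~ /^[0-9]+$/)
      cls   = isnum ? 0 : 1        # numbered chromosomes sort first
      num   = isnum ? name : 0     # numeric key (0 for lettered names)
      start = $2 - 10
      end   = $2 + 10
      # Four sort keys first, the finished output line last.
      print cls, num, name, start, $1 ":" start "-" end
    }' |
  sort -t "$tab" -k1,1n -k2,2n -k3,3 -k4,4n |
  cut -f5)

printf '%s\n' "$out"
```

The sort keys are thrown away by `cut`, so only the formatted ranges remain, ordered Chr2, Chr5, Chr10, Chr25, ChrX.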

Hey, what do I get if you graduate? :)

sorting with awk (sorting arrays)

In case anyone is (still?) interested, here's an awk script that sorts the output according to requirements. Note that my awk skills are not as good as those of some others around here (I did this to see if I could), so the way I've implemented the sorting might not be even close to the "best" way.

It doesn't appear to me that awk has a whole whack of super sorting capability, so I have separated the input data into 3 arrays, according to the character(s) following the leading "Chr": uppercase, lowercase, or digits. This allows each array's contents to be sorted against similar types of characters, because otherwise awk doesn't produce the desired (numeric, lower, upper) order of sorted output.

Also, you'll note that in the middle of the script I temporarily pad leading zeroes onto any digit(s) following the "Chr", because awk doesn't seem to sort integers in natural numeric order unless they are the same length. I may have overlooked something there (maybe the characters following the digits are causing trouble?), but here's what I'm talking about:

Code:

18:31139220-31139240
23:29971912-29971932
25:21194332-21194352
25:2622217-2622237
25:9506350-9506370
2:119023393-119023413
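A likely explanation for that ordering, offered as an assumption rather than something verified against the script: values such as `25:2622217-2622237` are not numbers, so awk compares them as strings, character by character, and `2:` lands after `25:` because `:` sorts after the digits in ASCII. The same effect shows up with plain comparisons; the sketch below forces the C locale so that the comparison is byte-wise.

```shell
out=$(LC_ALL=C awk 'BEGIN {
  a = "25:2622217-2622237"
  b = "2:119023393-119023413"

  # Both operands are non-numeric strings, so "<" compares them as strings.
  r1 = (a < b) ? "as strings: 25:... comes first" : "as strings: 2:... comes first"

  # Adding 0 converts the leading digits to a number: 25 versus 2.
  r2 = ((a + 0) < (b + 0)) ? "as numbers: 25 comes first" : "as numbers: 2 comes first"

  # Zero-padding to a fixed width makes string order match numeric order.
  p2  = sprintf("%03d", 2)
  p25 = sprintf("%03d", 25)
  r3 = (p2 < p25) ? "zero-padded: 002 comes first" : "zero-padded: 025 comes first"

  print r1; print r2; print r3
}')

printf '%s\n' "$out"
```

That is why padding the chromosome number to a fixed width before sorting, and stripping the zeroes afterwards, gives the expected order.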

Here's the code:

Code:

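#!/usr/bin/awk -f
# Note: asort() is a gawk extension, so /usr/bin/awk must be gawk here.

BEGIN{
        FS = "-"
}

{
        if(/^Chr[[:alnum:]]/){
                # Strip the leading "Chr"; it is added back when printing.
                gsub("^Chr","",$1)
                # The arrays are keyed by the (rebuilt) input line, so a
                # repeated line ends up stored only once.
                if(!LWR[$0] && !UPP[$0] && !NUM[$0]){
                        if($1 ~ /^[[:upper:]]/){
                                UPP[$0] = $1 ":" $2-10 "-" $2+10
                                next
                        }
                        if($1 ~ /^[[:lower:]]/){
                                LWR[$0] = $1 ":" $2-10 "-" $2+10
                                next
                        }
                        if($1 ~ /^[[:digit:]]/){
                                # Zero-pad to 3 digits so that string sorting
                                # matches numeric order.
                                $1 = sprintf("%03u",$1)
                                NUM[$0] = $1 ":" $2-10 "-" $2+10
                        }
                }
        }
}

END{
        # asort() sorts each array by value, re-indexes it from 1 and
        # returns the element count. Empty entries created by the lookups
        # above sort first and are skipped by the if() tests below.
        UPPS = asort(UPP); LWRS = asort(LWR); NUMS = asort(NUM)
        for(x=1 ; x<=NUMS ; x++){
                if(NUM[x]){
                        # Remove the padding zeroes again.
                        gsub("^0|^00","",NUM[x])
                        print "Chr" NUM[x]
                }
        }
        for(x=1; x<=LWRS; x++){
                if(LWR[x]){
                        print "Chr" LWR[x]
                }
        }
        for(x=1; x<=UPPS; x++){
                if(UPP[x]){
                        print "Chr" UPP[x]
                }
        }
}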

Output for me now:

Code:

./awksorting input_filename
Chr2:119023393-119023413
Chr2:66268682-66268702
Chr2:66270785-66270805
Chr4:120427443-120427463
Chr5:26236034-26236054
Chr10:23813143-23813163
Chr13:3857974-3857994
Chr18:31139220-31139240
Chr23:29971912-29971932
Chr25:21194332-21194352
Chr25:2622217-2622237
Chr25:9506350-9506370
ChrgS:62081589-62081609
ChrgZ:62081589-62081609
Chry:894371222-894371242
ChrC:62081589-62081609
ChrP:93847555-93847575
ChrX:62081589-62081609
ChrZ:72081589-72081609
sasha@reactor:

I call my script "awksorting"; whatever you call yours, execute it by name and give the filename of the input file as the argument, as shown above.

EDITS:
1) Fix so script reliably supports input lines starting with "ChrN" where: (0 <= N) && (N <= 999)