LinuxQuestions.org
Old 12-02-2010, 02:46 PM   #1
kswapnadevi
LQ Newbie
 
Registered: Oct 2010
Posts: 16

Rep: Reputation: 0
shell scripting


Sir, I have a large data file in the format shown below. I need to extract the lines that start with 'Chr' followed by a number, such as 'Chr5' or 'Chr25', and turn each one into a range by subtracting 10 from the number for the left end and adding 10 for the right end. (Put another way: remove the lines starting with miR, let, CH, a bare number, or Chr without a number; these were shown in blue.) For example, Chr5-26236044 has to be displayed as Chr5:26236034-26236054. If the same Chr entry with the same range appears n times in the output, n-1 of them must be removed; entries with different ranges must all be kept. In the required output below, Chr5 with the same range appears twice, so one has to be removed (shown in red). Similarly, Chr2 appears 3 times with different ranges, so all three should stay. The output format must be Chrno:number1-number2 (no spaces). A shell script for this would be highly appreciated. Thanks in advance.


Quote:

Chr

Chr

Chr5-26236044
Chr25-2622227
Chr10-23813153
ChrX-62081599
miR-1-1-3p;Chr13:55237544-55237619
Chr18-31139230
miR-2331-3p;Chr19:15308148-15308218
CH240-242E2-CH240-416P12-96217

Chr2-66268692
miR-2379-5p;Chr23:30788153-30788230
Chr13-3857984
Chr23-29971922
let-7a-2-5p;Chr15:33347557-33347652
Chr4-120427453
Chr2-119023403
miR-2347-3p;Chr19:51593973-51594031
Chr25-21194342
miR-449b-5p;Chr20:23967269-23967366
Chr25-9506360
Chr2-66270795
Chr5-26236044
miR-2484-5p;ChrX:20461131-20461206
93748382
Output required:

Quote:
Chr5:26236034-26236055
Chr25:2622217-2622237
Chr10:23813143-26813163
ChrX:62081589-62081609
Chr18:31139220-31139240
Chr2:66268682-66268702
Chr13:3857974-3857994
Chr23:29971912-29971932
Chr4:120427443-120427463
Chr2:119023393-119023413
Chr25:32391131-32391151
Chr25:9506350-9506370
Chr2:66270785-66270805
Chr5:26236034-26236055
 
Old 12-02-2010, 02:58 PM   #2
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Can it be done with awk?

I don't understand the significance of the red coloring of the final line of your output, but here's an awk snippet which seems to produce the output you want (including correctly adding 10 to the right, not 11):
Code:
awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' input_filename
EDIT: Sorry, I see the part about the dupes now. The above code won't get rid of dupes (yet!).

Here's an update:
Code:
awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' filename |\
  cat -n | sort -k2 | uniq -u | sort -k1 | awk '{print $2}'
However, I'm still unclear about this line in the output, because of the "X" in it...:
Code:
ChrX:62081589-62081609

Last edited by GrapefruiTgirl; 12-02-2010 at 03:08 PM. Reason: Update - I hadn't read closely enough, so fixed code.
 
Old 12-02-2010, 03:06 PM   #3
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
Code:
awk '/^Chr.+/ 
{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' file
 
Old 12-02-2010, 03:13 PM   #4
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Heh, I'm not doing well today; here's a *working* version:
Code:
awk -F"-" '/^Chr[0-9]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'
I kept missing that duplicate in red - I think the red threw me off

Output:
Code:
Chr5:26236034-26236054
Chr25:2622217-2622237
Chr10:23813143-23813163
Chr18:31139220-31139240
Chr2:66268682-66268702
Chr13:3857974-3857994
Chr23:29971912-29971932
Chr4:120427443-120427463
Chr2:119023393-119023413
Chr25:21194332-21194352
Chr25:9506350-9506370
Chr2:66270785-66270805
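
For later readers wondering why the pipeline in post #2 kept the duplicate: `cat -n` numbers every line, so no two lines are identical any more and `uniq -u` has nothing to drop (and `uniq -u` would discard every copy of a repeated line rather than keep one). The version above instead tells `uniq` to skip the number field with `-f1`. Below is a minimal sketch of that decorate, sort, uniq, undecorate idea. The function name `dedupe_keep_order` and the sample lines are made up for illustration, and it assumes the lines contain no spaces.

```shell
# Hypothetical helper: reads lines on stdin, drops repeated lines,
# and prints the survivors in their original order.
# Assumes the lines themselves contain no spaces.
dedupe_keep_order() {
  # decorate: prefix each line with its line number
  awk '{ print NR " " $0 }' |
    # group identical text together, earliest copy first
    sort -k2 -k1,1n |
    # compare lines while skipping field 1 (the line number)
    uniq -f1 |
    # restore the original input order
    sort -k1,1n |
    # undecorate: strip the line number again
    cut -d" " -f2-
}

printf '%s\n' Chr5:1-21 Chr2:5-25 Chr5:1-21 | dedupe_keep_order
# prints:
# Chr5:1-21
# Chr2:5-25
```

Because the secondary sort key is the line number, the copy that survives is the first one in the input.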
 
Old 12-02-2010, 03:14 PM   #5
TB0ne
Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 15,077

Rep: Reputation: 2713
Is it just me, or does the OP only seem to want folks to write scripts for him?

Last edited by TB0ne; 12-02-2010 at 03:15 PM.
 
Old 12-02-2010, 04:28 PM   #6
kswapnadevi
LQ Newbie
 
Registered: Oct 2010
Posts: 16

Original Poster
Rep: Reputation: 0
Shell scripting

Sir, ChrX:62081589-62081609 is also required in the output. Lines where Chr is followed by either a number or a letter are required, sir.

 
Old 12-02-2010, 04:34 PM   #7
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Code:
awk -F"-" '/^Chr[^\n]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'
Perhaps this adjustment to the regex will help with that. Or maybe this:
Code:
awk -F"-" '/^Chr[[:alnum:]]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'
Or this:
Code:
awk -F"-" '/^Chr./{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
  sort -rk2 | uniq -f1 | sort -gk1 | awk '{print $2}'
And please, that's Miss, not Sir.

Last edited by GrapefruiTgirl; 12-02-2010 at 04:51 PM.
 
Old 12-02-2010, 04:52 PM   #8
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
Nice code, but you don't really need the sort and uniq stuff if you can do it all in awk, as shown in post #3. There you can simply print a line if it has not been printed before, and ignore it otherwise.
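
The "print only if not printed before" test is a common awk idiom. A minimal sketch on its own, with a made-up wrapper name (`first_occurrences`) and an arbitrary array name (`seen`):

```shell
# Hypothetical wrapper around the idiom; "seen" is an arbitrary array name.
first_occurrences() {
  # seen[$0]++ evaluates to 0 (false) the first time a line appears and to
  # a positive count afterwards, so !seen[$0]++ is true only once per
  # distinct line. A pattern with no action prints the record.
  awk '!seen[$0]++'
}

printf '%s\n' Chr5:1-21 Chr2:5-25 Chr5:1-21 | first_occurrences
# prints:
# Chr5:1-21
# Chr2:5-25
```

Unlike the sort and uniq route, this keeps the input order without any extra bookkeeping, at the cost of holding every distinct line in memory.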
 
Old 12-02-2010, 04:57 PM   #9
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Ehh, very true but when someone else does your homework, you cannot always count on getting the most efficient results.

BTW, your code doesn't work for me, but I have not investigated:
Code:
sasha@reactor: awk '/^Chr.+/
{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' filename
:-10-10
Chr:-10-10
Chr5-26236044
Chr5:26236034-26236054
Chr25-2622227
Chr25:2622217-2622237
Chr10-23813153
Chr10:23813143-23813163
ChrX-62081599
ChrX:62081589-62081609
miR:-9-11
Chr18-31139230
Chr18:31139220-31139240
miR:2321-2341
CH240:24190-24210
Chr2-66268692
Chr2:66268682-66268702
miR:2369-2389
Chr13-3857984
Chr13:3857974-3857994
Chr23-29971922
Chr23:29971912-29971932
let:-3-17
Chr4-120427453
Chr4:120427443-120427463
Chr2-119023403
Chr2:119023393-119023413
miR:2337-2357
Chr25-21194342
Chr25:21194332-21194352
miR:439-459
Chr25-9506360
Chr25:9506350-9506370
Chr2-66270795
Chr2:66270785-66270805
Chr5-26236044
miR:2474-2494
93748382:-10-10
sasha@reactor:
 
Old 12-02-2010, 05:08 PM   #10
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
This appears to work also, and is nicer than all that sorting stuff:
Code:
awk -F"-" '/^Chr[[:alnum:]]/{if(!arr[$0]){arr[$0]++; print $1 ":" $2-10 "-" $2+10}}' filename
 
Old 12-02-2010, 05:21 PM   #11
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
That's strange: I split my test one-liner into multiple lines (for readability) and indeed it doesn't work. But if I repeat my test as a one-liner, it works:
Code:
awk '/^Chr.+/{split($0,_,"-");str=sprintf("%s:%d-%d",_[1],_[2]-10,_[2]+10);if (! arr[str]++) print str}' file
I cannot see the difference right now. Anyway, your code in the post above is better!


Edit: oops... it appears I mistakenly put the opening brace on a new line, so the code is split into two rules: one that is just a pattern with no action (which prints the whole matching record), and one that is an action with no pattern, which is applied to all records. Sorry...
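
A small side-by-side sketch of that brace placement issue, using made-up function names (`same_line`, `next_line`) and two sample records:

```shell
# Brace on the same line as the pattern: one rule, so the action runs
# only for records that match /^Chr/.
same_line() {
  awk '/^Chr/ {
    print "match: " $0
  }'
}

# Brace on the next line: awk sees two rules. The bare pattern prints
# matching records unchanged, and the pattern-less action runs for
# every record, matching or not.
next_line() {
  awk '/^Chr/
  {
    print "match: " $0
  }'
}

printf '%s\n' Chr5-100 miR-1 | same_line
# prints:
# match: Chr5-100

printf '%s\n' Chr5-100 miR-1 | next_line
# prints:
# Chr5-100
# match: Chr5-100
# match: miR-1
```

This is the same pattern visible in the output in post #9: each matching input line echoed once, followed by a converted line for every record, including the miR and CH240 ones.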

Last edited by colucix; 12-02-2010 at 05:26 PM.
 
Old 12-02-2010, 05:27 PM   #12
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Yup, it does work, and as multi-line also; your code below, but with the initial "{" moved to directly after the /.../

Code:
awk '/^Chr.+/{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10)
  if ( ! arr[str]++ ) print str
}' filename
 
Old 12-02-2010, 05:43 PM   #13
kswapnadevi
LQ Newbie
 
Registered: Oct 2010
Posts: 16

Original Poster
Rep: Reputation: 0
Shell scripting

Respected Miss,
The output should be in sorted order, madam, like: Chr2, Chr3, ... ChrX, ChrY, ChrZ. Thanks in advance.

 
Old 12-02-2010, 06:08 PM   #14
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
Quote:
Originally Posted by kswapnadevi View Post
Respected Miss,
The output should be in sorted order, madam, like: Chr2, Chr3, ... ChrX, ChrY, ChrZ. Thanks in advance.
That was not a requirement of the original problem. However, `sort -V` will do it if you pipe the output of the code you're currently using into it:
Code:
awk '/^Chr.+/{
  split($0, _, "-")
  str = sprintf("%s:%d-%d", _[1], _[2]-10, _[2]+10);
  if ( ! arr[str]++ ) print str
}' filename | sort -V
Chr2:66268682-66268702
Chr2:66270785-66270805
Chr2:119023393-119023413
Chr4:120427443-120427463
Chr5:26236034-26236054
Chr10:23813143-23813163
Chr13:3857974-3857994
Chr18:31139220-31139240
Chr23:29971912-29971932
Chr25:2622217-2622237
Chr25:9506350-9506370
Chr25:21194332-21194352
ChrX:62081589-62081609
No guarantees this is reliable - after all, these are not really version numbers we're sorting. It would be better to pre-sort the output appropriately while still inside the awk, but for the time being, I'll leave that as an exercise for you or someone else.
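
If `sort -V` is not available, or its version-number rules feel too loose for this data, one possible alternative is to have awk prepend explicit sort keys, sort on those with plain `sort`, and cut them off again. This is only a sketch: `sort_chr_ranges` is a hypothetical name, and it assumes every line looks like ChrNAME:START-END with no spaces and non-negative numbers.

```shell
# Hypothetical helper: sorts lines of the form ChrNAME:START-END.
# Numeric names come first in numeric order, then lettered names in
# alphabetical order; ties are broken by START, compared numerically.
sort_chr_ranges() {
  awk '{
    split($0, part, /[-:]/)      # part[1] = "Chr2", part[2] = START
    name = substr(part[1], 4)    # text after the leading "Chr"
    # Emit four sort keys in front of the original line:
    # class, numeric name, lettered name, START.
    if (name ~ /^[0-9]+$/) print 0, name, "-", part[2], $0
    else                   print 1, 0, name, part[2], $0
  }' |
    sort -t ' ' -k1,1n -k2,2n -k3,3 -k4,4n |
    cut -d ' ' -f5
}

printf '%s\n' Chr10:23813143-23813163 Chr2:119023393-119023413 \
  ChrX:62081589-62081609 Chr2:66268682-66268702 | sort_chr_ranges
# prints:
# Chr2:66268682-66268702
# Chr2:119023393-119023413
# Chr10:23813143-23813163
# ChrX:62081589-62081609
```

It would be used the same way as `sort -V` above, by piping the awk output into it.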

Hey, what do I get if you graduate?
 
Old 12-02-2010, 08:03 PM   #15
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550
sorting with awk (sorting arrays)

In case anyone is (still?) interested, here's an awk script that sorts the output according to the requirements. Note that my awk skills are not as good as those of some others around here (I did this to see if I could), so the way I've implemented the sorting might not be even close to the "best" way. It doesn't appear to me that awk has a whole whack of super sorting capability, so I have separated the input data into 3 arrays according to the character(s) following the leading "Chr": uppercase, lowercase, or digits. This lets each array's contents be sorted against similar types of characters, because otherwise awk doesn't produce the desired (numeric, lower, upper) order of sorted output.

Also, you'll note that in the middle of the script I temporarily pad leading zeroes onto any digit(s) following the "Chr", because awk doesn't seem to sort integers in natural numeric order unless they are the same length. I may have overlooked something there (maybe the characters following the digits are causing trouble?), but here's what I'm talking about:
Code:
18:31139220-31139240
23:29971912-29971932
25:21194332-21194352
25:2622217-2622237
25:9506350-9506370
2:119023393-119023413
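
What that listing shows is most likely plain string comparison at work: the values contain ":" and "-", so they are compared character by character rather than as numbers, and "2" sorts after "1". A small demonstration using awk's comparison operators, with a made-up function name (`compare_demo`) and made-up variable names (`a`, `b`, `pa`, `pb`):

```shell
# Hypothetical demo of string comparison versus zero-padded comparison.
compare_demo() {
  awk 'BEGIN {
    a = "2:119023393"; b = "18:31139220"

    # Both values are strings, so they compare character by character:
    # "2" is greater than "1", so b sorts first.
    first = (a < b) ? "a first" : "b first"
    print first

    # Zero-padding the leading number to a fixed width makes string
    # order agree with numeric order. (a + 0 keeps only the leading
    # digits, so it is 2, and b + 0 is 18.)
    pa = sprintf("%03d", a + 0) ":119023393"
    pb = sprintf("%03d", b + 0) ":31139220"
    first = (pa < pb) ? "a first" : "b first"
    print first
  }'
}

compare_demo
# prints:
# b first
# a first
```

The padding only needs to be as wide as the largest number expected, which is where the 3-digit limit in the script comes from.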
Here's the code:
Code:
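#!/usr/bin/awk -f
# Note: asort() is a gawk extension, so /usr/bin/awk must be gawk.

BEGIN{
        FS = "-"
}


{
        if(/^Chr[[:alnum:]]/){
                # Strip the "Chr" prefix; assigning to $1 also rebuilds $0 as "$1 $2".
                gsub("^Chr","",$1)
                # Skip keys already stored in any of the three arrays.
                if(!LWR[$0] && !UPP[$0] && !NUM[$0]){
                        if($1 ~ /^[[:upper:]]/){
                                UPP[$0] = $1 ":" $2-10 "-" $2+10
                                next
                        }
                        if($1 ~ /^[[:lower:]]/){
                                LWR[$0] = $1 ":" $2-10 "-" $2+10
                                next
                        }
                        if($1 ~ /^[[:digit:]]/){
                                # Zero-pad to 3 digits so string sorting matches numeric order.
                                # The padded key is what gets stored, so repeats overwrite one entry.
                                $1 = sprintf("%03u",$1)
                                NUM[$0] = $1 ":" $2-10 "-" $2+10
                        }
                }
        }
}


END{
        # asort() sorts the values, re-indexes each array from 1, and returns the count.
        UPPS = asort(UPP); LWRS = asort(LWR); NUMS = asort(NUM)
        for(x=1 ; x<=NUMS ; x++){
                if(NUM[x]){
                        # Strip the zero padding again before printing.
                        gsub("^0|^00","",NUM[x])
                        print "Chr" NUM[x]
                }
        }
        for(x=1; x<=LWRS; x++){
                if(LWR[x]){
                        print "Chr" LWR[x]
                }
        }
        for(x=1; x<=UPPS; x++){
                if(UPP[x]){
                        print "Chr" UPP[x]
                }
        }
}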
Output for me now:
Code:
./awksorting input_filename
Chr2:119023393-119023413
Chr2:66268682-66268702
Chr2:66270785-66270805
Chr4:120427443-120427463
Chr5:26236034-26236054
Chr10:23813143-23813163
Chr13:3857974-3857994
Chr18:31139220-31139240
Chr23:29971912-29971932
Chr25:21194332-21194352
Chr25:2622217-2622237
Chr25:9506350-9506370
ChrgS:62081589-62081609
ChrgZ:62081589-62081609
Chry:894371222-894371242
ChrC:62081589-62081609
ChrP:93847555-93847575
ChrX:62081589-62081609
ChrZ:72081589-72081609
sasha@reactor:
I call my script "awksorting"; whatever you call yours, execute it by name and give the filename of the input file as the argument, as shown above.

EDITS:
1) Fix so script reliably supports input lines starting with "ChrN" where: (0 <= N) && (N <= 999)

Last edited by GrapefruiTgirl; 12-03-2010 at 10:22 AM. Reason: See EDITS above.
 
  

