shell scripting
Sir, I have a big data file in the given format. I have to extract the lines that start with ‘Chr’ followed by a number, like ‘Chr5’, ‘Chr25’, etc., and build a range for each line by subtracting 10 from the number for the left end and adding 10 to it for the right end. (Put another way: remove the lines starting with miR, let, CH, a number, or Chr without a number; these are shown in blue.) For example, Chr5-26236044 has to be displayed as Chr5:26236034-26236054. If the same ‘Chr5’ range occurs n times in the output file, n-1 of the entries are to be removed; otherwise all Chr5 entries must be kept. In the given output, Chr5 with the same range occurs twice, so one has to be removed (shown in red). Similarly, Chr2 occurs 3 times with different ranges, so all three should be there. I have given the required output. The output format must be Chrno:number1-number2 (no spaces). A shell script for this would be highly appreciated. Thanks in advance.
Can it be done with awk?
I don't understand the significance of the red coloring of the final line of your output, but here's an awk snippet which seems to produce the output you want (including correctly adding 10 to the right, not 11):
Code:
awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' input_filename
Here's an update:
Code:
awk -F"-" '/^Chr[0-9]/{ print $1 ":" $2-10 "-" $2+10 }' filename |\
Code:
ChrX:62081589-62081609
Code:
awk '/^Chr.+/ |
Heh, I'm not doing well today; here's a *working* version:
Code:
awk -F"-" '/^Chr[0-9]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
Output:
Code:
Chr5:26236034-26236054
Is it just me, or does the OP only seem to want folks to write scripts for him?
Shell scripting
Sir, in the output, ChrX:62081589-62081609 is also required. Lines with Chr followed by either a number or a letter are required, sir.
Code:
awk -F"-" '/^Chr[^\n]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
Code:
awk -F"-" '/^Chr[[:alnum:]]/{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
Code:
awk -F"-" '/^Chr./{ print NR " " $1 ":" $2-10 "-" $2+10 }' filename |\
Nice code, but you don't really need the sort and uniq stuff if you can do all in awk, as shown in post #3. Here you can simply print a line if it has not been printed before, otherwise ignore it. :)
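The "print a line only if it has not been printed before" idea can be sketched roughly as follows. This is only an illustration on made-up sample data: the `sample` variable, the `dedupe_in_awk` function, and the `seen` array are hypothetical names, not taken from any post in this thread. Unlike a sort-then-uniq pipeline, it keeps the lines in the order in which they first appear.

```shell
#!/bin/sh
# Hypothetical sample in the thread's "ChrN-position" input format.
sample='Chr5-26236044
Chr2-1000
miR-100
Chr5-26236044
Chr2-2000'

# Convert each matching line to "ChrN:start-end" and print it only the
# first time that exact range is seen. seen[] is an illustrative name.
dedupe_in_awk() {
  printf '%s\n' "$sample" | awk -F- '/^Chr[0-9A-Za-z]/ {
    range = sprintf("%s:%d-%d", $1, $2 - 10, $2 + 10)
    if (!seen[range]++) print range   # counter is 0 on first sight only
  }'
}
```

Calling `dedupe_in_awk` on this sample should print Chr5:26236034-26236054, Chr2:990-1010 and Chr2:1990-2010, each once, in that order.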
Ehh, very true ;) but when someone else does your homework, you cannot always count on getting the most efficient results. :D
BTW, your code doesn't work for me, but I have not investigated: Code:
sasha@reactor: awk '/^Chr.+/ |
This appears to work also, and is nicer than all that sorting stuff:
Code:
awk -F"-" '/^Chr[[:alnum:]]/{if(!arr[$0]){arr[$0]++; print $1 ":" $2-10 "-" $2+10}}' filename
That's strange: I split my test one-liner into multiple lines (for readability) and indeed it doesn't work. But if I repeat my test as a one-liner it works:
Code:
awk '/^Chr.+/{split($0,_,"-");str=sprintf("%s:%d-%d",_[1],_[2]-10,_[2]+10);if (! arr[str]++) print str}' file
Edit: oops... it appears I erroneously put the opening brace on a new line, so that the code was split into two rules: one being just a pattern with no action (printing the whole matching record), the other being an action with no pattern, applied to all records. Sorry...
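For anyone who wants to see the brace-placement effect for themselves, here is a small sketch on made-up data. The `sample` variable and the `good` and `bad` function names are hypothetical and only serve the demonstration; the awk body is a simplified variant of the one-liner, without the duplicate check.

```shell
#!/bin/sh
# Hypothetical sample data, not the OP's real file.
sample='Chr5-26236044
miR-100
Chr5-26236044'

# Brace on the same line as the pattern: a single rule, so the action
# runs only for records that match /^Chr.+/.
good() {
  printf '%s\n' "$sample" | awk '/^Chr.+/{
    split($0, part, "-")
    printf "%s:%d-%d\n", part[1], part[2] - 10, part[2] + 10
  }'
}

# Brace on the next line: awk now sees two separate rules.
#   1) /^Chr.+/ alone   -> default action, prints the matching record as-is
#   2) { ... } alone    -> no pattern, runs for every record (miR-100 too)
bad() {
  printf '%s\n' "$sample" | awk '/^Chr.+/
  {
    split($0, part, "-")
    printf "%s:%d-%d\n", part[1], part[2] - 10, part[2] + 10
  }'
}
```

With this sample, `good` prints Chr5:26236034-26236054 twice, while `bad` prints five lines: each Chr5 record unchanged followed by its converted form, plus a stray miR:90-110 from the unwanted line.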
Yup, it does work, and as multi-line also; your code below, but with the initial "{" moved to directly after the /.../
Code:
awk '/^Chr.+/{ |
Shell scripting
Respected Miss,
The output is required in sorted order, madam, like: Chr2, Chr3, ..., ChrX, ChrY, ChrZ. Thanks in advance.
Quote:
Code:
awk '/^Chr.+/{
Hey, what do I get if you graduate? :)
sorting with awk (sorting arrays)
In case anyone is (still?) interested, here's an awk script that sorts the output according to requirements. Note that my awk skills are not as good as those of some others around here (I did this to see if I could), so the way I've implemented the sorting might not be even close to the "best" way. It doesn't appear to me that awk has a whole whack of super sorting capability, so I have separated the input data into 3 arrays, according to the character(s) following the leading "Chr": uppercase, lowercase, or digits. This allows each array's contents to be sorted against similar types of characters, because otherwise awk doesn't produce the desired (numeric, lower, upper) order of sorted output.
Also, you'll note that in the middle of the script, I temporarily pad leading zeroes onto any digit(s) following the "Chr", because awk doesn't seem to sort integers in their natural numeric order unless they are the same length. I may have overlooked something there (maybe the characters following the digits are causing trouble?), but here's what I'm talking about:
Code:
18:31139220-31139240
Code:
#!/usr/bin/awk -f
Code:
./awksorting input_filename
EDITS: 1) Fix so script reliably supports input lines starting with "ChrN" where: (0 <= N) && (N <= 999)
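As an alternative to sorting inside awk, here is a rough sketch that swaps in a different technique: awk only builds sort keys, and the sort utility does the ordering (decorate, sort, undecorate). It is not the script from this post. The `sort_ranges` function name and the key layout are assumptions made for illustration, and it expects already-converted lines such as Chr5:26236034-26236054 on standard input. Because the chromosome number is passed to sort as a numeric key, no zero padding is needed.

```shell
#!/bin/sh
# Hypothetical helper: orders "ChrN:start-end" lines as
# numeric names, then lowercase names, then uppercase names.
sort_ranges() {
  tab=$(printf '\t')
  awk -F'[-:]' '/^Chr[0-9A-Za-z]/ {
      name = substr($1, 4)              # text after the leading "Chr"
      if (name ~ /^[0-9]+$/)   { grp = 0; num = name }  # digits first
      else if (name ~ /^[a-z]/) { grp = 1; num = 0 }    # then lowercase
      else                      { grp = 2; num = 0 }    # then uppercase
      # Decorate: group, numeric name, text name, range start, original line.
      printf "%d\t%d\t%s\t%d\t%s\n", grp, num, name, $2, $0
  }' "$@" |
  LC_ALL=C sort -t "$tab" -k1,1n -k2,2n -k3,3 -k4,4n |
  cut -f5                               # undecorate: keep the original line
}
```

For example, feeding it ChrX:62081589-62081609, Chr18:31139220-31139240, Chr2:500-520, Chr5:26236034-26236054 and Chr2:90-110 should give Chr2:90-110, Chr2:500-520, Chr5:26236034-26236054, Chr18:31139220-31139240, ChrX:62081589-62081609.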