Inserting Blank Spaces-Line command

sawdusted · 05-28-2013, 03:26 PM

Hey guys, I hope you can help me with a simple newbie question. I would like to insert blank lines so that the scores line up sequentially with the line numbers. This is a tab-delimited file with only 2 columns. Could you help me with a command to insert the blank lines? Or, even better, to insert a score of 0 on the lines that do not have a score. Some of my files contain thousands of lines, so ideally this would be done with a script rather than by filling in the 0s manually.

Thanks!

Example:
------------------------------
Score Line
2 0
1 1
1 2
2 5
2 6
1 7
3 8
2 16
1 18
1 19
1 24
1 25

Want it to look like :
Score Line number
2 0
1 1
1 2
0 3
0 4
2 5
2 6
1 7
3 8
0 9
1 10
3 11
1 12
3 13
1 14
2 15
2 16
0 17
1 18
1 19
0 20
0 21
0 22
0 23
1 24
1 25

chrism01 · 05-28-2013, 06:10 PM

1. please use code tags https://www.linuxquestions.org/quest...do=bbcode#code
2. Can you expand on that? The 2nd file just looks like a longer version of the first one. I don't get your question.

Beryllos · 05-28-2013, 09:37 PM

chrism01, In the first file, line numbers are omitted when the score is zero. He wants those lines put back in so the full range of line numbers is shown for each file.

sawdusted, Where are these score files coming from? What code is used to generate them? Rather than patching up the files after the fact, why don't you rewrite the originating program or script so it includes all the lines (doesn't skip lines with score of zero)?

By the way, is this a homework question?

The way I would approach it is to declare an array and initialize all scores to zero. Then read in the file to insert the non-zero scores wherever they may occur. Then write all lines back to the output file (as the original scoring program should have done).
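A minimal sketch of that approach in awk, assuming the counts file has one `<score> <bin>` pair per line (the layout shown in the first post). The function name `fill_bins` and its two arguments are made up for illustration, and the highest bin number has to be supplied by the caller:

```shell
# fill_bins COUNTS_FILE MAX_BIN
# Prints "score bin" for every bin from 0 to MAX_BIN, using 0 for bins
# that do not appear in COUNTS_FILE.
# Assumption: each data line is "<score> <bin>" separated by blanks.
fill_bins() {
    awk -v max="$2" '
        # Remember the scores that are present; a header line such as
        # "Score Line" is skipped because its 2nd field is not a number.
        $2 ~ /^[0-9]+$/ { score[$2 + 0] = $1 }
        END {
            for (i = 0; i <= max; i++) {
                s = (i in score) ? score[i] : 0
                print s, i
            }
        }
    ' "$1"
}
```

For example, `fill_bins scd.nelf-ctl.10bp-bin.counts 2143` would write all bins 0 to 2143 to standard output.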

Try writing your own script to do that. If you get stuck or have specific questions, give us a holler.

chrism01 · 05-29-2013, 12:11 AM

Now I see it.
I agree, fixing up the generating program makes more sense than patching the output after the fact.

sawdusted · 05-29-2013, 02:33 PM

Thanks for the replies, guys. No, this is not homework.

The original files were generated by counting hits that fall within a particular numbered region. If there were no hits in that region, no count/score was written for it.

I'm not sure how I can modify my original script. Maybe you can help?

Basically, I first start off by extracting lines that fall within a particular region of a chromosome:

Code:

 grep -w chr10 nelf-ctl.bowtie | gawk '$4>102104816 && $4<102126247'> scd.nelf-ctl

Then I cut out a column and count and sort the hits:

Code:

 cut -f4 scd.nelf-ctl | gawk '{print int(($1-102104816)/10)}'| sort | uniq -c | sort -k2,2n > scd.nelf-ctl.10bp-bin.counts

The output is as I described in the original post.

I have tried to create a sequentially numbered file and to join the counts file with it, but it doesn't always work. Sometimes it joins only up to line 100, sometimes only up to line 90, and sometimes it skips lines 100-999.

Code:

 gawk 'BEGIN {for (i=0; i<=( 21431/10); i++) print i}' > scd-10bp.allbins
 join -1 1 -2 2 -a 1 scd-10bp.allbins scd.nelf-ctl.10bp-bin.counts > scd.nelf-ctl.10bp-bin.allbins
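A likely cause of the partial joins is that `join` expects both inputs to be sorted on the join field in lexical (collation) order, whereas both files here are in numeric order, so "10" following "9" looks out of sequence to `join`. The sketch below sorts both inputs lexically, joins them, and restores numeric order at the end. The function name `join_all_bins` is made up for illustration, and the counts file is assumed to be in the `uniq -c` layout (`<score> <bin>`):

```shell
# join_all_bins ALLBINS_FILE COUNTS_FILE
# Left-joins the list of bin numbers against the counts file and prints
# "score bin", with 0 for bins that have no count.
join_all_bins() {
    tmp_bins=$(mktemp)
    tmp_counts=$(mktemp)
    # join needs both inputs in lexical order on the join field;
    # the "b" modifier ignores the leading blanks that uniq -c emits.
    LC_ALL=C sort -k 1b,1 "$1" > "$tmp_bins"
    LC_ALL=C sort -k 2b,2 "$2" > "$tmp_counts"
    # -a 1      keep bins that have no match in the counts file
    # -e 0      fill the missing score with 0
    # -o 2.1,0  print the score (file 2, field 1), then the join field
    LC_ALL=C join -1 1 -2 2 -a 1 -e 0 -o 2.1,0 "$tmp_bins" "$tmp_counts" |
        sort -k2,2n
    rm -f "$tmp_bins" "$tmp_counts"
}
```

With the files from this post, that would be `join_all_bins scd-10bp.allbins scd.nelf-ctl.10bp-bin.counts > scd.nelf-ctl.10bp-bin.allbins`.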

Sorry for this long post.

Thank you for your help.
Julian

chrism01 · 05-29-2013, 06:32 PM

I'd pick a language, e.g. awk (in my case Perl, of course), and do the whole thing in that language.
This way, you can init the score to zero for every record/match before you start and just overwrite it if you get a non-zero 'score'.
If you call lots of other tools, it makes it harder to preserve multiple values, unless you use a lot of temp files.
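As a rough sketch of what that could look like in awk, the function below filters, bins, counts, and prints every bin (including the empty ones) in a single pass. The name `bin_counts` and its arguments are made up for illustration. It also assumes a tab-separated input with the chromosome name in column 3 and the position in column 4; the thread only confirms that the position is in column 4, so adjust the column test if your file differs:

```shell
# bin_counts FILE CHR START END [WIDTH]
# Counts hits with START < position < END on chromosome CHR, in bins of
# WIDTH bp (default 10), and prints "score bin" for every bin.
# Assumptions: tab-separated input, chromosome in column 3, position in column 4.
bin_counts() {
    awk -F '\t' -v chr="$2" -v start="$3" -v end="$4" -v width="${5:-10}" '
        $3 == chr && $4 > start && $4 < end {
            count[int(($4 - start) / width)]++
        }
        END {
            # Largest offset that passes the filter is end - start - 1.
            nbins = int((end - start - 1) / width)
            for (i = 0; i <= nbins; i++)
                printf "%d %d\n", count[i], i   # an unset count prints as 0
        }
    ' "$1"
}
```

For the region in this thread the call would be `bin_counts nelf-ctl.bowtie chr10 102104816 102126247`, which yields bins 0 to 2143 with no temp files.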

Re Perl: bio isn't my area, but I know there are a lot of Perl modules for it; see search.cpan.org.
A couple of examples: http://search.cpan.org/~cjfields/Bio...01/Bio/Perl.pm, https://en.wikipedia.org/wiki/Bioperl