Help finding gaps in a sequence

mikeleigh · 08-20-2009, 08:34 AM

Hi,

I have a text file (CSV) which has missing ranges and overlapping ranges. What I am looking for is the easiest method to find these gaps and overlapping ranges from the command line so I can use in another script.

Below is an example of a missing range
0,3
4,10
11,14
17,23

As you can see I am missing 15,16

Below is an example of an overlapping range
0,3
4,10
8,10
11,14

As you can see 8,10 is overlapping with 4,10 or 4,10 is overlapping with 8,10

What I would like to do is have a script to output the missing ranges to STDOUT and another to output the overlapping to STDOUT.

Has anyone done anything similar or can point me along the right path.

Thanks

Mike

johnsfine · 08-20-2009, 08:53 AM

Sure looks like a homework question.

For homework questions, please make a real try at it yourself and ask specific questions about specific details on which you get stuck. Don't ask us to do your homework for you.

Meanwhile: You showed the input sorted. Is the actual input sorted? In that case the problem seems trivial.

If the actual input isn't sorted, it would be easiest to first sort it, then the rest is trivial. I suggest sorting primarily by the first number of the range and in case of a tie by the second number.

So 1,9 would be before 2,3 but after 1,8

mikeleigh · 08-20-2009, 08:56 AM

hhmm thanks for that. Its not a homework question. I don't know the best way to go around tackling this. I have created a php script to look for the missing ranges and this works but is slow. I was hoping not to load all data into an array and then process it as this could and will lead to memory issues. I was looking to see if anyone can point me in the path of a better solution using sed or awk or something along those lines.

Thanks

Mike

johnsfine · 08-20-2009, 09:23 AM

Quote:

Originally Posted by mikeleigh

I was hoping not to load all data into an array and then process it as this could and will lead to memory issues. I was looking to see if anyone can point me in the path of a better solution using sed or awk or something along those lines.

You still didn't say whether it starts out sorted.

If you want to get the job done (rather than turn in a homework answer) and you don't want to read all the data into an array, then your shell script should call a sorting program before it calls your processing program, so the data will get pre sorted.

As you read the sorted data, remember the largest value seen so far.

When you read a pair, if the small value of the pair is less than or equal to the largest value already seen then you can report the overlap.

If the small value of the pair is more than one greater than the largest value seen you can report the gap.

If the large value of the pair is more than the largest value seen, you must adjust the largest value seen.

mikeleigh · 08-20-2009, 09:30 AM

Thanks for the reply,

Yes the data has already been sorted using sort.

I was hoping something was already available as an option to sort or something within awk. Like I said before I already have a php script that I wrote to do this but it is rather clunky. My examples gave a simple way of seeing this data but in reality I have millions of lines of data to go through, hence the faster or better method.

So it looks like I will need to rewrite the php into bash to get the job done.

Oh and on re-reading my first post it does look somewhat like a homework question

Thanks for your help.

kbp · 08-20-2009, 09:39 AM

Hi Mike,

I put together a quick one just for the practice, it only checks for missing ranges but you should be able to mod it for the second test
- return code is the error count

cheers

kbp

----------------------------------------
#!/usr/bin/perl

use strict;
use warnings;

my ( $first, $second, $index, $line );
my $error = 0;

if ( $#ARGV != 0 ) {
print "Usage: $0 filename\n";
exit(1);
}

my $file = shift;

open ( DATA, "$file" ) || die "Could not open file.";

$line = 1;

while (<DATA>) {
chomp $_;
if ( $_ =~ /^$/ ) {
$line++;
next;
}
( $first, $second ) = split (",", $_ );
print "First: " . $first . " - Second: " . $second . "\n";
if ( $first > $second ) {
print "Invalid data line: $line\n";
$error++;
exit($error);
}
if ( $line == 1 && $first != 0 ) {
print "Missing range at line: $line\n";
$error++;
}
if ( $line > 1 ) {
if ( $first != $index+1 ) {
print "Missing range at line: $line\n";
$error++;
}
}
$index = $second;
$line++;
}
close(DATA);
exit($error);
--------------------------------------

mikeleigh · 08-20-2009, 09:42 AM

Thanks for the script.

It resembles the one I wrote in php. At least I know I have taken a similar approach to yourself.

Thanks

Mike

johnsfine · 08-20-2009, 09:48 AM

Gaps are easy.
How much info do you need on overlaps, especially multiple overlaps (or are those impossible?)

1,9
3,7
4,12

It's easy to say the 3,7 overlaps something and a bit harder to say it overlaps the 1,9.
It's easy to say the 4,9 portion of 4,12 overlaps something and harder to say it overlaps the 1,9 and much harder to say the 4,7 portion of it also overlaps the 3,7.

If you need complicated reporting of unlimited multiple overlaps, maybe you need a program in a more powerful language, such as C++.

If you need fewer details reported, the task is so simple that language shouldn't matter. Reading the file takes the time and processing it isn't significant.

mikeleigh · 08-20-2009, 10:04 AM

Yes the gaps are the easy part.

I don't know if multiple overlaps are possible or not from the data. I know I have overlaps though.

I think it is easy to say the 3,7 overlaps 1,9 but harder to say 3,7 overlaps 4,12. If we assume I will be reading the file sequentially and then looking at the previous values.

I think I will have to analyze the data further to see how many overlaps I have before scripting something to report on the output. I have a feeling that my overlaps will be as simple as your 1,9 and 3,7 example, without the 4,12. I will analyze and see what I have.

Thanks for the info.

Mike

johnsfine · 08-20-2009, 10:12 AM

Quote:

Originally Posted by kbp

Code:

                if ( $first != $index+1 ) {
                        print "Missing range at line: $line\n";
                        $error++;
                }
        }
        $index = $second;

That is wrong in a few ways. It will report a missing range when there is actually just an overlap.
I think it should be

Code:

                if ( $first > $index+1 ) {
                        print "Missing range at line: $line\n";
                        $error++;
                }
        }
        if ( $index < $second ) {
                $index = $second;
        }

kbp · 08-20-2009, 05:13 PM

Thanks johnsfine, shouldn't be scripting when tired

kbp