LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 08-20-2009, 08:34 AM   #1
mikeleigh
LQ Newbie
 
Registered: Feb 2009
Posts: 6

Rep: Reputation: 0
Help finding gaps in a sequence


Hi,

I have a text file (CSV) which has missing ranges and overlapping ranges. What I am looking for is the easiest method to find these gaps and overlapping ranges from the command line so I can use in another script.

Below is an example of a missing range
0,3
4,10
11,14
17,23

As you can see I am missing 15,16

Below is an example of an overlapping range
0,3
4,10
8,10
11,14

As you can see 8,10 is overlapping with 4,10 or 4,10 is overlapping with 8,10

What I would like to do is have a script to output the missing ranges to STDOUT and another to output the overlapping to STDOUT.

Has anyone done anything similar or can point me along the right path.

Thanks

Mike
 
Old 08-20-2009, 08:53 AM   #2
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
Sure looks like a homework question.

For homework questions, please make a real try at it yourself and ask specific questions about specific details on which you get stuck. Don't ask us to do your homework for you.

Meanwhile: You showed the input sorted. Is the actual input sorted? In that case the problem seems trivial.

If the actual input isn't sorted, it would be easiest to first sort it, then the rest is trivial. I suggest sorting primarily by the first number of the range and in case of a tie by the second number.

So 1,9 would be before 2,3 but after 1,8

Last edited by johnsfine; 08-20-2009 at 08:58 AM.
 
Old 08-20-2009, 08:56 AM   #3
mikeleigh
LQ Newbie
 
Registered: Feb 2009
Posts: 6

Original Poster
Rep: Reputation: 0
hhmm thanks for that. Its not a homework question. I don't know the best way to go around tackling this. I have created a php script to look for the missing ranges and this works but is slow. I was hoping not to load all data into an array and then process it as this could and will lead to memory issues. I was looking to see if anyone can point me in the path of a better solution using sed or awk or something along those lines.

Thanks

Mike
 
Old 08-20-2009, 09:23 AM   #4
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
Quote:
Originally Posted by mikeleigh View Post
I was hoping not to load all data into an array and then process it as this could and will lead to memory issues. I was looking to see if anyone can point me in the path of a better solution using sed or awk or something along those lines.
You still didn't say whether it starts out sorted.

If you want to get the job done (rather than turn in a homework answer) and you don't want to read all the data into an array, then your shell script should call a sorting program before it calls your processing program, so the data will get pre sorted.

As you read the sorted data, remember the largest value seen so far.

When you read a pair, if the small value of the pair is less than or equal to the largest value already seen then you can report the overlap.

If the small value of the pair is more than one greater than the largest value seen you can report the gap.

If the large value of the pair is more than the largest value seen, you must adjust the largest value seen.
 
Old 08-20-2009, 09:30 AM   #5
mikeleigh
LQ Newbie
 
Registered: Feb 2009
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks for the reply,

Yes the data has already been sorted using sort.

I was hoping something was already available as an option to sort or something within awk. Like I said before I already have a php script that I wrote to do this but it is rather clunky. My examples gave a simple way of seeing this data but in reality I have millions of lines of data to go through, hence the faster or better method.

So it looks like I will need to rewrite the php into bash to get the job done.

Oh and on re-reading my first post it does look somewhat like a homework question

Thanks for your help.
 
Old 08-20-2009, 09:39 AM   #6
kbp
Senior Member
 
Registered: Aug 2009
Posts: 3,790

Rep: Reputation: 653Reputation: 653Reputation: 653Reputation: 653Reputation: 653Reputation: 653
Hi Mike,

I put together a quick one just for the practice, it only checks for missing ranges but you should be able to mod it for the second test
- return code is the error count

cheers

kbp

----------------------------------------
#!/usr/bin/perl

use strict;
use warnings;


my ( $first, $second, $index, $line );
my $error = 0;

if ( $#ARGV != 0 ) {
print "Usage: $0 filename\n";
exit(1);
}

my $file = shift;

open ( DATA, "$file" ) || die "Could not open file.";

$line = 1;

while (<DATA>) {
chomp $_;
if ( $_ =~ /^$/ ) {
$line++;
next;
}
( $first, $second ) = split (",", $_ );
print "First: " . $first . " - Second: " . $second . "\n";
if ( $first > $second ) {
print "Invalid data line: $line\n";
$error++;
exit($error);
}
if ( $line == 1 && $first != 0 ) {
print "Missing range at line: $line\n";
$error++;
}
if ( $line > 1 ) {
if ( $first != $index+1 ) {
print "Missing range at line: $line\n";
$error++;
}
}
$index = $second;
$line++;
}
close(DATA);
exit($error);
--------------------------------------

Last edited by kbp; 08-20-2009 at 05:10 PM. Reason: Shouldnt be scripting when tired...
 
Old 08-20-2009, 09:42 AM   #7
mikeleigh
LQ Newbie
 
Registered: Feb 2009
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks for the script.

It resembles the one I wrote in php. At least I know I have taken a similar approach to yourself.

Thanks

Mike
 
Old 08-20-2009, 09:48 AM   #8
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
Gaps are easy.
How much info do you need on overlaps, especially multiple overlaps (or are those impossible?)

1,9
3,7
4,12

It's easy to say the 3,7 overlaps something and a bit harder to say it overlaps the 1,9.
It's easy to say the 4,9 portion of 4,12 overlaps something and harder to say it overlaps the 1,9 and much harder to say the 4,7 portion of it also overlaps the 3,7.

If you need complicated reporting of unlimited multiple overlaps, maybe you need a program in a more powerful language, such as C++.

If you need fewer details reported, the task is so simple that language shouldn't matter. Reading the file takes the time and processing it isn't significant.
 
Old 08-20-2009, 10:04 AM   #9
mikeleigh
LQ Newbie
 
Registered: Feb 2009
Posts: 6

Original Poster
Rep: Reputation: 0
Yes the gaps are the easy part.

I don't know if multiple overlaps are possible or not from the data. I know I have overlaps though.

I think it is easy to say the 3,7 overlaps 1,9 but harder to say 3,7 overlaps 4,12. If we assume I will be reading the file sequentially and then looking at the previous values.

I think I will have to analyze the data further to see how many overlaps I have before scripting something to report on the output. I have a feeling that my overlaps will be as simple as your 1,9 and 3,7 example, without the 4,12. I will analyze and see what I have.

Thanks for the info.

Mike
 
Old 08-20-2009, 10:12 AM   #10
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
Quote:
Originally Posted by kbp View Post
Code:
                if ( $first != $index+1 ) {
                        print "Missing range at line: $line\n";
                        $error++;
                }
        }
        $index = $second;
That is wrong in a few ways. It will report a missing range when there is actually just an overlap.
I think it should be
Code:
                if ( $first > $index+1 ) {
                        print "Missing range at line: $line\n";
                        $error++;
                }
        }
        if ( $index < $second ) {
                $index = $second;
        }
 
Old 08-20-2009, 05:13 PM   #11
kbp
Senior Member
 
Registered: Aug 2009
Posts: 3,790

Rep: Reputation: 653Reputation: 653Reputation: 653Reputation: 653Reputation: 653Reputation: 653
Thanks johnsfine, shouldn't be scripting when tired

kbp
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
HTML - IE giving gaps between lines, Firefox looks perfect fcdev Programming 7 03-31-2009 07:00 PM
Gaps between fonts getBoa Linux - Desktop 1 02-13-2008 09:35 AM
Finding gaps with 'ls' or some other command mijohnst Linux - General 5 01-09-2006 04:59 PM
Dependecy gaps between udev 0.074 and linux-image 2.6.12 TheOneKEA Debian 3 11-18-2005 05:40 PM
CD audio gaps problem cscott Linux - Software 1 08-27-2004 09:15 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 12:33 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration