blog.apolloner.eu

Interval coalescence in SQL

Recently I needed to coalescence intervals in SQL. There are already existing solutions to this problem, but for my taste neither of them was really nice or easy to understand (At least none of them had good explanations). So here is my solution for this problem -- and hopefully explained well enough.

Assume you have a database table like this (x is the start time and y the end time):

t x y
a 10:00 11:00
a 11:00 12:00
a 14:00 16:00
b 12:00 14:00

and we would like an result like this:

t x y
a 10:00 12:00
a 14:00 16:00
b 12:00 14:00

This means we want to collapse overlapping rows (which are value equivalent aside from start and end) into single rows. To get this result we need to fulfill two conditions:

  • The results are not allowed to have gaps.
  • The results need to have maximal coverage.

We will consider the second condition first. To get the start time of one interval and the end time of another interval into one result row we choose a Cartesian join:

select f.t, f.x, l.y from intervals f, intervals l where f.t = l.t and f.x < l.y;

The query already makes sure that we only get intervals where the start time is less than the end time and the types are the same:

t x y
a 10 11
a 10 12
a 10 16
a 11 12
a 11 16
a 14 16
b 12 14

Maximal coverage means our first interval can't have any interval left of it with the same type and an end time greater than or equal to our first intervals start time. Same applies for the last interval; there can't be any interval with the same type and a start time less than or equal to the last interval's end time. This can be achieved by adding a not exists subquery to our statement:

select f.t, f.x, l.y from intervals f, intervals l where f.t = l.t and f.x < l.y
    and not exists (
        select * from intervals where t=f.t and
            ((x<f.x and y>=f.x) or (y>l.y and x<=l.y))
    )

The result looks promising:

t x y
a 10 12
a 10 16
a 14 16
b 12 14

The start times and end times are already minimal/maximal but they still do have gaps in them. Eg the second row isn't valid since it goes to far and 12 - 14 isn't covered by type a intervals. So let's add our second condition to the mix:

select distinct f.t, f.x, min(l.y) y from intervals f, intervals l where f.t = l.t and f.x < l.y
    and not exists (
        select * from intervals where t=f.t and
            ((x<f.x and y>=f.x) or (y>l.y and x<=l.y))
    ) group by f.t, f.x

The idea behind that query is as follows: Although we want maximal coverage for our intervals we need to choose those where the end time is minimal. This works because the not exists subquery makes sure that the selected end times are not followed by other intervals. Since the end times are no longer followed by intervals they have to be followed by gaps and as such the earliest end time is the longest interval we can get. We also apply distinct on the query since we could get duplicate rows if we had rows with same end times or start times. This gives us the final result of:

t x y
a 14 16
b 12 14
a 10 12

If someone knows a better way to tackle this problem I'd be more than happy to hear about it.

0 comments: