Quantcast
Viewing latest article 29
Browse Latest Browse All 30

Using IGNORE NULLS With SQL Window Functions to Fill Gaps

I found a very interesting SQL question on Twitter recently:

Rephrasing the question: We have a set of sparse data points:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 |
| 2019-01-05 |   125 |
| 2019-01-06 |   128 |
| 2019-01-10 |   130 |
+------------+-------+

Since dates can be listed as discrete, continuous data points, why not fill in the gaps between 2019-01-02 and 2019-01-05 or 2019-01-06 and 2019-01-10? The desired output would be:

+------------+-------+
| VALUE_DATE | VALUE |
+------------+-------+
| 2019-01-01 |   100 |
| 2019-01-02 |   120 | <-+
| 2019-01-03 |   120 |   | -- Generated
| 2019-01-04 |   120 |   | -- Generated
| 2019-01-05 |   125 |
| 2019-01-06 |   128 | <-+
| 2019-01-07 |   128 |   | -- Generated
| 2019-01-08 |   128 |   | -- Generated
| 2019-01-09 |   128 |   | -- Generated
| 2019-01-10 |   130 |
+------------+-------+

In the generated columns, we’ll just repeat the most recent value.

How to do this with SQL?

For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:

  1. Generate all the dates between the first and the last data points
  2. For each date, find either the current data point, or the most recent one

But first, let’s create the data:

create table t (value_date, value) as
  select date '2019-01-01', 100 from dual union all
  select date '2019-01-02', 120 from dual union all
  select date '2019-01-05', 125 from dual union all
  select date '2019-01-06', 128 from dual union all
  select date '2019-01-10', 130 from dual;

1. Generating all the dates

In Oracle, we can use the convenient CONNECT BY syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH, or some PIPELINED function, but I like CONNECT BY for this purpose.

We’ll write:

select (
  select min(t.value_date) 
  from t
) + level - 1 as value_date
from dual
connect by level <= (
  select max(t.value_date) - min(t.value_date) + 1
  from t
)

This produces:

VALUE_DATE|
----------|
2019-01-01|
2019-01-02|
2019-01-03|
2019-01-04|
2019-01-05|
2019-01-06|
2019-01-07|
2019-01-08|
2019-01-09|
2019-01-10|

Now we wrap the above query in a derived table and left join the actual data set:

select 
  d.value_date,
  t.value
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

The date gaps are now filled, but our values column is still sparse:

VALUE_DATE|VALUE|
----------|-----|
2019-01-01|  100|
2019-01-02|  120|
2019-01-03|     |
2019-01-04|     |
2019-01-05|  125|
2019-01-06|  128|
2019-01-07|     |
2019-01-08|     |
2019-01-09|     |
2019-01-10|  130|

2. Fill the value gaps

On each row, the VALUE column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I specifically wrote this requirement using specific English language. We can now translate that sentence directly to SQL:

last_value (t.value) ignore nulls over (order by d.value_date)

Since we have added an ORDER BY clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment)

Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.

This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS. Among the ones supported by jOOQ, currently these ones support the syntax:

  • DB2
  • H2
  • Informix
  • Oracle
  • Redshift
  • Sybase SQL Anywhere
  • Teradata

Sometimes, not the exact standard syntax is supported, but the standard feature. Use https://www.jooq.org/translate to see different syntax variants.

The full query now reads:

select 
  d.value_date,
  last_value (t.value) ignore nulls over (order by d.value_date)
from (
  select (
    select min(t.value_date) 
    from t
  ) + level - 1 as value_date
  from dual
  connect by level <= (
    select max(t.value_date) - min(t.value_date) + 1
    from t
  )
) d
left join t
on d.value_date = t.value_date
order by d.value_date;

… and it yields the desired result:

VALUE_DATE         |VALUE|
-------------------|-----|
2019-01-01 00:00:00|  100|
2019-01-02 00:00:00|  120|
2019-01-03 00:00:00|  120|
2019-01-04 00:00:00|  120|
2019-01-05 00:00:00|  125|
2019-01-06 00:00:00|  128|
2019-01-07 00:00:00|  128|
2019-01-08 00:00:00|  128|
2019-01-09 00:00:00|  128|
2019-01-10 00:00:00|  130|

Other RDBMS

This solution made use of some Oracle specific features such as CONNECT BY. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.


Viewing latest article 29
Browse Latest Browse All 30

Trending Articles