I found a very interesting SQL question on Twitter recently:
Rephrasing the question: We have a set of sparse data points:
+------------+-------+ | VALUE_DATE | VALUE | +------------+-------+ | 2019-01-01 | 100 | | 2019-01-02 | 120 | | 2019-01-05 | 125 | | 2019-01-06 | 128 | | 2019-01-10 | 130 | +------------+-------+
Since dates can be listed as discrete, continuous data points, why not fill in the gaps between 2019-01-02 and 2019-01-05 or 2019-01-06 and 2019-01-10? The desired output would be:
+------------+-------+ | VALUE_DATE | VALUE | +------------+-------+ | 2019-01-01 | 100 | | 2019-01-02 | 120 | <-+ | 2019-01-03 | 120 | | -- Generated | 2019-01-04 | 120 | | -- Generated | 2019-01-05 | 125 | | 2019-01-06 | 128 | <-+ | 2019-01-07 | 128 | | -- Generated | 2019-01-08 | 128 | | -- Generated | 2019-01-09 | 128 | | -- Generated | 2019-01-10 | 130 | +------------+-------+
In the generated columns, we’ll just repeat the most recent value.
How to do this with SQL?
For the sake of this example, I’m using Oracle SQL, as the OP was expecting to do this with Oracle. The idea is to do this in two steps:
- Generate all the dates between the first and the last data points
- For each date, find either the current data point, or the most recent one
But first, let’s create the data:
create table t (value_date, value) as select date '2019-01-01', 100 from dual union all select date '2019-01-02', 120 from dual union all select date '2019-01-05', 125 from dual union all select date '2019-01-06', 128 from dual union all select date '2019-01-10', 130 from dual;
1. Generating all the dates
In Oracle, we can use the convenient CONNECT BY
syntax for this. We could also use some other tool to generate dates to fill the gaps, including SQL standard recursion using WITH
, or some PIPELINED
function, but I like CONNECT BY
for this purpose.
We’ll write:
select ( select min(t.value_date) from t ) + level - 1 as value_date from dual connect by level <= ( select max(t.value_date) - min(t.value_date) + 1 from t )
This produces:
VALUE_DATE| ----------| 2019-01-01| 2019-01-02| 2019-01-03| 2019-01-04| 2019-01-05| 2019-01-06| 2019-01-07| 2019-01-08| 2019-01-09| 2019-01-10|
Now we wrap the above query in a derived table and left join the actual data set:
select d.value_date, t.value from ( select ( select min(t.value_date) from t ) + level - 1 as value_date from dual connect by level <= ( select max(t.value_date) - min(t.value_date) + 1 from t ) ) d left join t on d.value_date = t.value_date order by d.value_date;
The date gaps are now filled, but our values column is still sparse:
VALUE_DATE|VALUE| ----------|-----| 2019-01-01| 100| 2019-01-02| 120| 2019-01-03| | 2019-01-04| | 2019-01-05| 125| 2019-01-06| 128| 2019-01-07| | 2019-01-08| | 2019-01-09| | 2019-01-10| 130|
2. Fill the value gaps
On each row, the VALUE
column should either contain the actual value, or the “last_value” preceding the current row, ignoring all the nulls. Note that I specifically wrote this requirement using specific English language. We can now translate that sentence directly to SQL:
last_value (t.value) ignore nulls over (order by d.value_date)
Since we have added an ORDER BY
clause to the window function, the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
applies, which colloquially means “all the preceding rows”. (Technically, that’s not accurate. It means all rows with values less than or equal to the value of the current row – see Kim Berg Hansen’s comment)
Convenient! We’re trying to find the last value in the window of all the preceding rows, ignoring the nulls.
This is standard SQL, but unfortunately not all RDBMS support IGNORE NULLS
. Among the ones supported by jOOQ, currently these ones support the syntax:
- DB2
- H2
- Informix
- Oracle
- Redshift
- Sybase SQL Anywhere
- Teradata
Sometimes, not the exact standard syntax is supported, but the standard feature. Use https://www.jooq.org/translate to see different syntax variants.
The full query now reads:
select d.value_date, last_value (t.value) ignore nulls over (order by d.value_date) from ( select ( select min(t.value_date) from t ) + level - 1 as value_date from dual connect by level <= ( select max(t.value_date) - min(t.value_date) + 1 from t ) ) d left join t on d.value_date = t.value_date order by d.value_date;
… and it yields the desired result:
VALUE_DATE |VALUE| -------------------|-----| 2019-01-01 00:00:00| 100| 2019-01-02 00:00:00| 120| 2019-01-03 00:00:00| 120| 2019-01-04 00:00:00| 120| 2019-01-05 00:00:00| 125| 2019-01-06 00:00:00| 128| 2019-01-07 00:00:00| 128| 2019-01-08 00:00:00| 128| 2019-01-09 00:00:00| 128| 2019-01-10 00:00:00| 130|
Other RDBMS
This solution made use of some Oracle specific features such as CONNECT BY
. In other RDBMS, the same idea can be implemented by using a different way of generating data. This article focuses only on using IGNORE NULLS
. If you’re interested, feel free to post an alternative solution in the comments for your RDBMS.