Temporal Data Update Methodologies

Over the past two decades, several dozen temporal data models have been proposed, all of which include timestamps as integral components. Few commercial temporal databases have yet reached the market, perhaps because of the complexity of temporal data. This article presents a technical outline of how to update conventional commercial databases with temporal data, with the goals of maintaining data consistency and preserving load and query performance. Updating data warehouses with temporal data is a mechanism for storing the lineage of data in the tables: it captures all changes made to a data row over time (transaction lineage). Deleted rows are not physically deleted; instead, they are flagged as expired. Updated rows are handled by expiring the existing rows and inserting the new versions. Because every version of a row shares the same key, both current and historical time slices are available to any user by manipulating view filter criteria with less-than-or-equal-to (<=) predicates. Temporal data must therefore be evaluated by joins with inequality predicates rather than the equality predicates used in conventional data evaluation techniques.
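As a concrete illustration, the time-slice lookup described above can be sketched in Python. The row layout, key values, and the `eff <= ts < exp` predicate pair are illustrative assumptions, not the article's actual view definitions:

```python
from datetime import datetime

# Sentinel expired timestamp that marks the active (current) row.
HIGH_DATE = datetime(9999, 12, 31, 12, 0, 0)

# Two versions of the same record, sharing one key (illustrative data).
rows = [
    {"key": 1001, "amount": 50,
     "eff": datetime(2024, 1, 1), "exp": datetime(2024, 3, 1)},
    {"key": 1001, "amount": 75,
     "eff": datetime(2024, 3, 1), "exp": HIGH_DATE},
]

def as_of(rows, key, ts):
    """Return the version of `key` in force at time `ts`, using the
    inequality predicates eff <= ts < exp rather than an equality join."""
    for r in rows:
        if r["key"] == key and r["eff"] <= ts < r["exp"]:
            return r
    return None

current = as_of(rows, 1001, datetime(2024, 6, 1))   # active version
historic = as_of(rows, 1001, datetime(2024, 2, 1))  # earlier time slice
```

Because only the filter timestamp changes between the two calls, the same table (or view) serves both current and historical queries.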

A temporal data warehouse provides a consistent view of data for customer queries even while data is being loaded into the table being queried. It provides transaction lineage of data and mechanisms to harmonize (aggregate and synchronize) data based on time slicing. Business needs for many kinds of time-varying data can be met using a temporal data warehouse.

Figure 1 presents a temporal data warehousing environment. The data comes from operational databases, and the source systems provide a change timestamp for temporal data. In the data warehouse, source data initially lands in staging subject areas and is then moved to downstream layers such as layer 2 (L2) and layer 3 (L3). L2 subject areas are shared subject areas used by any application that needs the data, while an L3 subject area is dedicated to a specific application. Each L3 subject area provides an analytical environment used for reporting and business intelligence (BI) purposes.

Figure 1: A Typical Temporal Data Warehousing Environment.

To implement the temporal data update methodology, the data model gains four additional columns: 'row effective date', 'row effective time', 'row expired date', and 'row expired time', which mark the effective and expired date/time of each row in the table. To make each row unique, the row effective date and time columns must be part of the primary key. Data from the operational databases arrives periodically via flat files; cycle refresh intervals can be 30 minutes, one, two, three, or four hours, and so on, depending on the needs of the business organization. The data manipulation (DML) code (I/U/D) and the row change timestamp are provided by the source system in the data files. The row change timestamp is used as the row effective date and time, and '9999-12-31 12:00:00' is used as the row expired timestamp; the presence of this high value indicates an 'active' (current) row. Time stamping tags each record with the moment it was created or passed from one environment to another.
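The row layout above can be sketched as follows; the column names (`row_eff_dt`, `row_exp_dt`, and so on) and key fields are illustrative assumptions, not the article's actual DDL:

```python
# High-value expired date/time marking a row as active (not yet expired).
HIGH_DT, HIGH_TM = "9999-12-31", "12:00:00"

def make_row(business_key, payload, change_ts):
    """Build a warehouse row: the source change timestamp becomes the
    row effective date/time, and the high value marks the row active.
    Column names are illustrative, not the article's actual schema."""
    eff_dt, eff_tm = change_ts.split(" ")
    return {**business_key, **payload,
            "row_eff_dt": eff_dt, "row_eff_tm": eff_tm,
            "row_exp_dt": HIGH_DT, "row_exp_tm": HIGH_TM}

row = make_row({"fin_doc_nbr": "D1", "fscl_yr_nbr": 2024},
               {"amount": 100}, "2024-05-01 08:30:00")

# The primary key includes the effective date/time, keeping each version unique.
pk = (row["fin_doc_nbr"], row["fscl_yr_nbr"],
      row["row_eff_dt"], row["row_eff_tm"])
```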

Immediately after the staging tables are loaded from the source system, data warehouse SQL is used to expire rows when multiple versions have arrived in a single file. During a cycle refresh, a data file is likely to contain multiple versions of rows for a particular record, all with the same primary key. In that case, each previous version is expired with the row effective timestamp of the immediately following version, and only the current version keeps the expired timestamp value '9999-12-31 12:00:00'. For example, if the same key arrives with DML code 'I' (insert) followed by 'U' (update), only the 'U' row is treated as the active row. This ensures rows are expired in the staging table when multiple versions arrive via a source data file in a given cycle. Next, the rows with DML codes 'U' and 'I' are inserted into the final target table. After that insert, all rows with DML code 'D' (delete) in the staging table are used to expire the corresponding rows in the target table; 'D' rows are never physically deleted from the target table. Similarly, for each incoming 'U' row, the existing active row in the target table is expired and the new version inserted. Together these steps perform incremental refreshes (Rahman 2010) with temporal data. For a full refresh, such as an initial load, only the current version of each row is loaded: the change date of the current row is used as the row effective timestamp and '9999-12-31 12:00:00' as the expired timestamp.
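The two-phase incremental refresh described above can be sketched in Python. This is a minimal in-memory simulation of the staging-then-target logic, assuming rows are dictionaries with `key`, `dml`, `eff`, and `exp` fields; it is not the article's actual warehouse SQL:

```python
from collections import defaultdict

HIGH_TS = "9999-12-31 12:00:00"  # sentinel expired timestamp for active rows

def collapse_staged_versions(staged):
    """Phase 1 (staging): if one cycle's file carries several versions of
    the same key, expire each earlier version with the effective timestamp
    of the next version; only the latest keeps the high expired value."""
    by_key = defaultdict(list)
    for row in staged:
        by_key[row["key"]].append(row)
    out = []
    for versions in by_key.values():
        versions.sort(key=lambda r: r["eff"])
        for cur, nxt in zip(versions, versions[1:]):
            cur["exp"] = nxt["eff"]
        versions[-1]["exp"] = HIGH_TS
        out.extend(versions)
    return out

def apply_to_target(target, staged):
    """Phase 2 (target): expire the active target row for each incoming
    'U' or 'D' key, then insert the staged 'I'/'U' rows. 'D' rows are
    only flagged as expired, never physically deleted."""
    for s in staged:
        if s["dml"] in ("U", "D"):
            for t in target:
                if t["key"] == s["key"] and t["exp"] == HIGH_TS:
                    t["exp"] = s["eff"]  # expire the previously active version
    target.extend(r for r in staged if r["dml"] in ("I", "U"))
    return target
```

Running a cycle is then `apply_to_target(target, collapse_staged_versions(staged))`; after each cycle, exactly one version per key carries the high-value expired timestamp.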

Table 1 shows how the data looks in the final target table in the data warehouse:

Table 1: Current and historical time slices of data for the same record.

Among the highlighted rows in Table 1 (which share the same key, fin_doc_nbr and fscl_yr_nbr), the last row is the current, active row, with row expired date and time '9999-12-31 12:00:00'.