Viewing Consistent Data in Reporting Environment

The biggest challenge in creating an integrated data model in a data warehouse is providing a consistent view (Radeschutz et al. 2013) of the data across subject areas. A data warehouse with temporal data must be able to manage data in a shared, reliable, time-sliced and efficient way (Hartmann et al. 2012). Aggregation and synchronization of data across subject areas present unique challenges. View maintenance in a temporal data warehouse is complicated (Amo and Alves 2000) because it must deal with multiple versions of data in a table. We propose separate application-specific views that give each application a consistent view of the data, with separate time filters as needed by its users.

Several different business application needs can be met by providing separate sets of views with different timestamp filters. Business users do not want data showing up in their reports while it is being appended to or changed in a given table. They also want all updates to related tables in a subject area to be completed before a report shows any of that new cycle's data; in other words, they want to maintain a controlled "data latency". Report users want to see data based on the most recent cycle refresh that completed across all tables in the subject area(s).

The application-specific views with timestamp filters are defined and dynamically recompiled right after each individual application's upstream subject area refreshes are completed per its service level agreement (SLA). The view filters are based on 'row effective date', 'row effective time', 'row expired date', and 'row expired time'. The row effective and expired timestamps associated with each subject area refresh's begin and end timestamps are captured in the data warehouse metadata model (Kim et al. 2000) during the cycle refresh and later used for view recompilation.
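As a minimal sketch of this recompilation step (using SQLite from Python as a stand-in for the warehouse DBMS; the doc and refresh_metadata tables and all column and view names are illustrative assumptions, not from the paper), the recompile reads the latest cycle-refresh timestamp from the metadata table and recreates the view with that filter baked into the view text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Illustrative target table and refresh-metadata table (names are assumptions).
conn.execute("CREATE TABLE doc (doc_nbr INTEGER, amount REAL, row_eff_ts TEXT, row_exp_ts TEXT)")
conn.execute("CREATE TABLE refresh_metadata (subject_area TEXT, cycle_end_ts TEXT)")

conn.executemany("INSERT INTO doc VALUES (?, ?, ?, ?)", [
    (1, 100.0, "2023-01-01 02:00:00", "9999-12-31 00:00:00"),
    (2, 200.0, "2023-01-01 06:05:00", "9999-12-31 00:00:00"),  # loaded by a later cycle
])
conn.execute("INSERT INTO refresh_metadata VALUES ('finance', '2023-01-01 02:30:00')")

def recompile_view(conn, subject_area):
    """Drop and recreate the view, freezing the latest refresh timestamp into its filter."""
    (last_refresh_ts,) = conn.execute(
        "SELECT MAX(cycle_end_ts) FROM refresh_metadata WHERE subject_area = ?",
        (subject_area,)).fetchone()
    conn.execute("DROP VIEW IF EXISTS doc_v")
    conn.execute(
        f"CREATE VIEW doc_v AS SELECT * FROM doc WHERE row_eff_ts <= '{last_refresh_ts}'")

recompile_view(conn, "finance")
rows = conn.execute("SELECT doc_nbr FROM doc_v").fetchall()
print(rows)  # only doc 1, which was loaded before the captured refresh timestamp
```

Rows loaded after the captured cycle-end timestamp exist in the table but remain invisible through the view until the next recompilation.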


Base Views

The base views that point to the target table are defined with the 'lock for access' locking mechanism. This allows row-level access to the table even while the table is being updated. The views are defined with the timestamp filter row_eff_ts <= last_refresh_ts and are recompiled at the end of each cycle refresh. Figure 5 shows a complete set of data after a cycle refresh followed by the view swap. The Base-View for Table A shows all change history for a particular key.

Figure 5: Table showing data after view re-point to active set of data.


Source Application Views

Source application business views are defined on top of the base views (which allow dirty reads). These views provide data with filters based on application needs, for example row_eff_ts <= application_reqd_asof_ts and row_exp_ts > application_reqd_asof_ts. These views are recompiled at the end of an application's last cycle refresh. Row uniqueness is maintained within these business views. Figure 5 shows the Business-View for Table A; the view shows the most current version of the rows for each key (e.g., doc_nbr) value.
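The as-of filter pair can be sketched as follows (again using SQLite in Python as a stand-in; the table contents and the hard-coded as-of timestamp are illustrative — in the paper's design, application_reqd_asof_ts would come from the metadata model at recompile time). Because each logical key has exactly one version whose effective/expired interval covers the as-of instant, the view returns one row per key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (doc_nbr INTEGER, amount REAL, row_eff_ts TEXT, row_exp_ts TEXT)")
conn.executemany("INSERT INTO doc VALUES (?, ?, ?, ?)", [
    # Two versions of doc 1: the first was expired when the second arrived.
    (1, 100.0, "2023-01-01 02:00:00", "2023-01-02 02:00:00"),
    (1, 150.0, "2023-01-02 02:00:00", "9999-12-31 00:00:00"),
    (2, 200.0, "2023-01-01 02:00:00", "9999-12-31 00:00:00"),
])

# Hypothetical as-of timestamp; in practice frozen into the view at recompilation.
application_reqd_asof_ts = "2023-01-02 12:00:00"
conn.execute(f"""
    CREATE VIEW doc_app_v AS
    SELECT doc_nbr, amount FROM doc
    WHERE row_eff_ts <= '{application_reqd_asof_ts}'
      AND row_exp_ts >  '{application_reqd_asof_ts}'
""")
rows = conn.execute("SELECT doc_nbr, amount FROM doc_app_v ORDER BY doc_nbr").fetchall()
print(rows)  # one current version per doc_nbr: [(1, 150.0), (2, 200.0)]
```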

The data warehouse is a shared environment. Some subject areas are application specific while others are shared by more than one application. Each application has its own SLA for data freshness, and there are dependencies between subject area refreshes. All these factors lead applications to use different time slices of data. For example, finance-related subject areas may refresh six times a day, such as 2:00am, 6:00am, 10:00am, 2:00pm, 6:00pm, and 10:00pm, while capital-related subject areas may refresh three times a day, such as 3:00am, 11:00am and 7:00pm. The two applications share some common subject areas and also use some application-specific ones. This requires data to be "frozen" via different sets of views on the proper time slice, so that data is consistent and available per each application's specific business needs and SLA.

For example, a finance application might want to display finance data right after the finance subject areas load (e.g., 10:00am), while the capital application would not want to display just-refreshed data at that moment, because capital analysis is based on the previous load or cannot begin until the other related subject areas have completed loading. In that case, a capital analyst uses data through a separate set of views with the previous refresh timestamp filter specified in the view definition, while the finance analyst sees finance data up to the latest refresh via a different set of views. This way, data as of a point in time can be obtained across multiple tables or multiple subject areas, resolving consistency and synchronization issues, and both applications receive data freshness per their individual SLAs.
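The per-application "freeze" over a shared table can be sketched as two views with different frozen timestamps (a SQLite/Python illustration; the view names, freeze times, and table are hypothetical stand-ins for the finance and capital applications in the scenario above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shared_fact (id INTEGER, row_eff_ts TEXT)")
conn.executemany("INSERT INTO shared_fact VALUES (?, ?)", [
    (1, "2023-01-01 03:00:00"),   # loaded in the early-morning cycle
    (2, "2023-01-01 10:00:00"),   # loaded in the 10:00am finance cycle
])

# Hypothetical per-application freeze timestamps, taken from each application's SLA.
freezes = {"finance_v": "2023-01-01 10:00:00",   # frozen at the latest finance refresh
           "capital_v": "2023-01-01 03:00:00"}   # frozen at the previous capital refresh
for view_name, freeze_ts in freezes.items():
    conn.execute(
        f"CREATE VIEW {view_name} AS SELECT id FROM shared_fact WHERE row_eff_ts <= '{freeze_ts}'")

finance_rows = conn.execute("SELECT id FROM finance_v").fetchall()
capital_rows = conn.execute("SELECT id FROM capital_v").fetchall()
print(finance_rows)  # [(1,), (2,)] — sees the 10:00am load
print(capital_rows)  # [(1,)] — still frozen on the previous load
```

Both views read the same physical table; only the frozen filter differs, which is what keeps each application's time slice internally consistent.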

Report users normally want to see data based on the most recent cycle refresh that completed across all tables in the subject area(s). For that particular time slice they want the data demographics to remain as-is for analysis purposes. They can therefore be provided with business views on each table that show data only up to a given point in time. Reports will not see any new data being loaded as part of the current cycle refresh until the cycle is finished. Report users run queries against a business view with the timestamp filters shown in Figures 6 and 7:

Figure 6: Filters to pull rows up to a particular cycle refresh.



Figure 7: Filters to pull rows based on a particular time slice.


Query Performance

When acquiring data warehouse database systems, customers treat cost control and performance optimization as critical evaluation criteria (Feinberg and Beyer 2010). Performance refers to a product's primary operating characteristics (Gavin 1987). To improve database query performance, commercial databases provide several efficient index types. The row effective date columns may have a partitioned primary index (PPI) defined on them, which makes queries faster because the partitioned primary index retrieves rows by partition number instead of requiring a full table scan. When a query runs with filters on PPI columns, the DBMS pulls data directly from the relevant partition(s) instead of scanning the whole table.
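Partition elimination itself is a DBMS feature (Teradata's PPI in this paper), but the idea can be sketched in plain Python: rows are bucketed by the partitioning column (row effective date), and a query that filters on that column touches only the matching bucket rather than examining every row. This is a conceptual illustration with made-up data, not warehouse code:

```python
from collections import defaultdict

rows = [
    {"doc_nbr": 1, "row_eff_date": "2023-01-01"},
    {"doc_nbr": 2, "row_eff_date": "2023-01-01"},
    {"doc_nbr": 3, "row_eff_date": "2023-01-02"},
]

# Bucket rows by the partitioning column, as a PPI buckets rows by partition number.
partitions = defaultdict(list)
for row in rows:
    partitions[row["row_eff_date"]].append(row)

def ppi_query(eff_date):
    """A filter on the partitioning column reads only one bucket (partition elimination)."""
    return partitions.get(eff_date, [])

def full_scan(eff_date):
    """Without partitioning, every row must be examined."""
    return [r for r in rows if r["row_eff_date"] == eff_date]

matched = ppi_query("2023-01-02")
assert matched == full_scan("2023-01-02")  # same answer, far fewer rows touched
print(len(matched))  # 1
```

The savings the next paragraph reports come from exactly this effect: the restrictive filter lets the DBMS skip all non-matching partitions.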

Based on a SQL scorecard run against both PPI and non-PPI tables, it was found that the SQL uses only about 33% of the resources to pull rows from a PPI table relative to a non-PPI table, with run time reduced in a similar proportion. The potential gain from partitioning a table is the ability to read a small subset of the table instead of the entire table: queries that specify a restrictive condition on the partitioning column avoid full table scans. With a PPI defined on 'row effective date', report query performance was found to be at least four times faster, with CPU consumption reduced to roughly a third.

Figure 8: Resource Usage: PPI vs. No PPI tables.

Figure 8 compares query response time and computational resource usage between PPI and no-PPI queries. The first query pulled 53K rows. With no PPI defined, the response time was eight seconds and CPU consumption was 29 seconds (row one). The same query run against the same table with a PPI defined on row effective date had a response time of one second and CPU consumption of two seconds (row two). A second query pulled 424K rows. With no PPI defined, the response time was 25 seconds and CPU consumption was 101 seconds (row three); with the PPI defined, the response time was four seconds and CPU consumption was 33 seconds (row four).

There are many techniques to improve the performance (Rahman 2013) of data warehouse queries, ranging from commercial database indexes to query optimization. A number of indexing strategies have been proposed for data warehouses in the literature and are heavily used in practice.