Data-Oriented Design

This text uses the Martin (1990) version of Information Engineering to illustrate data-oriented design. The result of data-oriented analysis – entity-relationship diagrams, data flow diagrams, CRUD matrices, and so on – is translated into screen designs, production database designs, action diagrams, procedural structures, and security plans. Compared to other approaches, data-oriented design strongly emphasizes security, recovery, and audit controls, relating each to data and processes in the application.

In this chapter, you will learn about the concepts and terminologies for data-oriented design, analyzing data and defining system controls, and the action diagram. The action diagram shows the processing details for an application in a structured format, which can be translated into programs and modules. You will also learn about menu structure, dialogue flow, and hardware and software installation and testing.

Analyze Data Use and Distribution

Guidelines for Data Use and Distribution Analysis 

The two activities in this section precede physical database design which is assumed to be performed by a DBA. First, data usage analysis is performed to confirm the logical database design. Then the potential for distributing data throughout the organization is analyzed. The result is a strategy for data and software location that best fits user needs.


FIGURE 10-11 Action Diagram with Entities

The entity/process (CRUD) matrix from IE analysis is reanalyzed and mapped to the completed action diagram. Each process is identified on the action diagram with its associated data items and the related entity. Recall that the clustering of entities and processes on the matrix is primarily based on which processes have created responsibility for the data. The entities and processes are arranged into a new entity/process matrix which is compared to the one developed during analysis. If the definition of subject area databases does not change, the distribution analysis can begin. If the definition of subject area databases does change, the logical definition of the databases is redone as discussed in Chapter 9. 

The second step to data analysis is to determine the potential for data distribution. Distribution analysis uses three matrices as the objective basis for determining whether data should be distributed. 


FIGURE 10-12 Action Diagram with Data Detail

First, a location/process matrix is developed to identify major and minor performance of processes in the application (see Figure 10-14). This location/process matrix determines which software is needed at each location to support the functions. The information needed to complete the matrix is provided by the users.

Next, a data distribution by location matrix is developed to show creation and retrieval needs by location (see Figure 10-15). This data location matrix is used to determine the potential age of data required by each location. For instance, retrieval data might be down-loaded from a centralized location each day at the close of business, rather than maintained at the remote sites. Created data must be available for creation, and therefore, up-to-date at the creating sites. The information needed to complete the matrix is provided partly from the entity/ process matrix from the first data analysis, and partly by the users.


FIGURE 10-13 Procedure Template for Error Message Processing

The next matrix shows data usage by location (see Figure 10-16). Recall from above that data can be centralized, vertically or horizontally partitioned, or federated. For instance, a bank branch might create data about customers, but it only accesses information about its own customers on a regular basis. So, for most processing, a vertical partition of the customer database, the branch's customers, could be accessible locally in the branch to speed processing. 


Function  Location A   Location B   Location C   Location D   Location E  
Purchasing 
Marketing
Customer Service
Sales
Product Development
Research & Dev
Manufacturing
\
X

X
X


X
X
X
X
X


\
\
\
\
X




\
X
\





\
X
Legend:
X – Major Involvement
\ – Minor Involvement

FIGURE 10-14 Process by Location Matrix Example


Subject Data Location A Location B  Location C  Location D  Location E 
Prospects  AII-UR  AII-UR 
Customer  AII-UR  AII-UR 
Customer Orders  AII-UR  Subset-Own
Products-UR 
AII-R  AII-R 
Customer Order History AII-R  AII-R  AII-R  AII-R 
Manufacturing Plans Subset–
own products–R
Subset–
own products–R
Subset–
own site–UR
AII-UR 
Manufacturing Goods Process Subset–
own products–R
Subset–
own products–R
Subset–
own site–UR
AII-UR 
Manufacturing Inventory Subset–
own products–R
Subset–
own products–R
AII-R  Subset–
own site–UR
AII-UR 
U = Update, R = Retrieve 


FIGURE 10-15 Data Usage by Location Matrix Example

The last objective matrix summarizes transaction volume by process by location (from the process/location table) against each subject database from the data analysis. Two daily transaction volume estimates for each process and location are developed (see Figure 10-17). The first estimate is for transactions that create or update the database. The second estimate is for read-only retrieval processing. Also notice that if no database access is performed by a process, no entry is made. This increases the readability of each matrix.

The analysis of this data is to first identify the location with the highest total transaction count for each database. The example shows a thick box around each such location (see Figure 10-18). If the application were distributed, with centralization of subject databases in one location, the boxes would identify the most likely location for each database. All other transactions, outside the boxes, represent transmission traffic. When the transmission traffic is a high percentage of the total traffic, say over 40%, different types of replication, federation, and partitioning are tried. To analyze the data, first box the transaction numbers for the site(s) representing 50% or more of the total processing. If there is one site boxed in a column, that identifies a centralized database at the location corresponding to the box. We have two of these in the example (Figure 10-18) – the Work in Process and Inventory databases at location E. The initial recommendation would be to centralize this data at E. Even though D's volume is significantly less than E's, the data usage table shows that each site accesses only its own data, so the option to vertically partition data and provide 'home ownership' could be used to support the business needs.

The other databases all have access competition from two sites (Figure 10-18). Two locations, A and B, have fairly even usage of the Prospect and Customer, Customer Order, and Customer History data. The options from the Data Usage table show that Replication would be the distributed recommendation since the sites both access all data. Customer History processing differs from the other databases in that it is all read-only and it has a much lower volume than the others. Therefore, it could be centralized at either site with an access delay at the other site for retrievals. This option might be chosen if there are hardware configuration differences that favor centralization.

Locations Band E compete for the Manufacturing Plan data (Figure 10-18). Location B only retrieves the data, while the location E volume of updates is low. The database could either be centralized at B to provide fast query access, with delayed access by E, or, if politics are involved, the data could be centralized at site E, the owner, with delayed retrieval by B.

Subject Data Location A Location B 
Location C                   
Location D  Location E 
Prospects  Replicate-Central Copy Replicate 
Customer  Replicate-Central Copy  Replicate 
Customer Orders  Central Copy  Horizontal Partition by Product  Access central copy with delay  Access central copy with delay 
Customer Order History Replicate Central Copy  Replicate or access central copy with delay    Access central copy with delay 
Manufacturing Plans Replicate or access central copies with delay Replicate or access central copies with delay Subset–
own site
Central Copy or Subset – own site with delayed access to D
Manufacturing Goods Process Access D and E Databases Access D and E Databases Subset–
own site
 Subset – own site with delayed access to D
Manufacturing Inventory Access D and E Databases Access D and E Databases
AII-R  Subset–
own site
Subset–
own site with delayed access to D 
 


FIGURE 10-16 Data Distribution by Location Matrix

The second part of the analysis is to compute the ratio of data retrieval transactions (DR) to data update transactions (Du). If the ratio is greater than one less than the number of locations (L) (or nodes in the network), distribution should be considered (see Table 10-2). In the example, the ratio clearly favors centralization of data (Table 10-2). Keep in mind that centralization here means that each database is stored at one location. It does not mean that the databases are all at the same location.

If a delay can be introduced for retrieval processing, then the ratio changes. It becomes much easier to argue for distribution. Distribution should be considered when retrieval volume is less than the ratio of locations to the delay (D). The delay is for update transactions which are now transmitted in bulk once per period to each other location. In the example, with even a 15-minute delay, the numbers overwhelmingly favor distribution. The rationale for these ratios is given in Table 10-3.

This discussion about distribution is important because it highlights an ethical problem in software engineering. The numbers can be made to argue for distribution regardless of transaction activity. If the transaction ratio of retrievals to updates is large, then the no-delay argument is more likely to favor distribution. If the retrieval to update ratio is less than one, the delay argument is likely to favor centralization.

As an ethical person, you are bound to tell the client about all computations and how the formulae can make either argument. 

Last, a subjective list of reasons for and against centralization and distribution is developed for the organization. The exact topic headings for this list are tailored to the company and application environment.

Critical data should be managed centrally 

Data is/is not critical to corporation/business unit. 

Most data can/cannot be stored locally/ centrally

Needs/does not need specific DBMS 

Requires/does not require larger machine than local sites have 

Data ownership is/is not an issue 

Data replication needed in one/many locations 

Unique data/application in one location 

Data affects/does not affect central corporate management 

Fast response time important/not important 

High availability important/not important 

Local staff skilled/unskilled with computers 

Application/data security is/is not vital to organization/business unit 

Centralized operations is/is not at capacity

Down-loading of yesterday's data would/would not work in local sites 

Updates with delay would/would not work in this application environment 

Partitioning of data would/would not work in supporting this application 

Replication of data would/would not work in supporting this application 

Data integrity is/is not paramount to the application 

Disaster recovery protection is/is not vital to the application 

Operators are/are not at remote sites

Subject Database 
Location/Function  Prospect  Customer  Customer Order  Customer History Mftg. Plan  Mftg. WIP  Mftg. Inven. 
Customer Service    100 R 
20 U
250 R
400 U 
5 R  2 R 2 R 2 R
Sales  50 R
20 U 
50 R
30 U 
150 R
50 U 
50 R  2 R 2 R 15 R
Marketing  15 R  5 R  10 R 50 R 2 R 1 R
B
Customer Service  250 R
50 U
250 R
400 U
50 R 250 R  250 R  250 R 
Sales  25 R
20 U 
25 R
5U 
10 R
100 U 
70 R  2R 2R 15 R
Marketing  20 R  10 R  10 R  50 R  2 R 5 R
Manufacturing  50 R
5 U 
50 R
250 U 
500 R
2,000 U 
Manufacturing  100 R
15 U 
200 R
2,500 U 
500 R
25,000 U 
Legend: U = Create, Update or Delete; R = Retrieve 

FIGURE 10-17 Summary Transaction Volume Matrix



Subject Database 
Location/Function  Prospect  Customer  Customer Order  Customer History Mftg. Plan  Mftg. WIP  Mftg. Inven. 
Customer Service    100 R 
20 U
250 R
400 U 
5 R  2 R 2 R 2 R
50 R
20 U 
50 R
30 U 
150 R
50 U 
50 R  2 R 2 R 15 R
Marketing  15 R  5 R  10 R 50 R 2 R 1 R
B
Customer Service  250 R
50 U
250 R
400 U
50 R 250 R  250 R  250 R 
Sales  25 R
20 U 
25 R
5U 
10 R
100 U 
70 R  2R 2R 15 R
Marketing  20 R  10 R  10 R  50 R  2 R 5 R
Manufacturing  50 R
5 U 
50 R
250 U 
500 R
2,000 U 
Manufacturing  100 R
15 U 
200 R
2,500 U 
500 R
25,000 U 
Legend: U = Create, Update or Delete;
R = Retrieve 


FIGURE 10-18 Analysis of Summary Transaction Volume Matrix


Each reason is rated as weak or strong justification of its position. The purpose of list creation is to surface and attempt to objectify objections and arguments from each stakeholder viewpoint regarding distribution of data in the application. An easy analysis is to count the capital and small letters of each type, and compare them. A more elaborate analysis might entail giving a weight to each item and developing a weighted ranking of the central/distributed positions. If the results of this analysis support the objective measures and results, a compelling justification for the result can be developed and presented to user management for approval. If the subjective analysis contradicts the objective measures, the user manager/champion might have to do some political maneuvering to obtain the desired result. Of course, if the champion is against the recommendation, the numbers in the traffic table still are useful in determining the size and speed of the machine and telecommunications lines required to service the application's data needs.

TABLE 10-2 Distribution Ratio Formulae

The breakeven point for distribution occurs when

DR/Du > N -1.

If the transaction ratio is greater than N - 1, distribute data.

An alternative is to allow a time delay for update transactions with all data replicated at all locations in a network. Then only updates generate network traffic. The breakeven point for distribution occurs with this scenario when

Du < N/TimeDelay or Du * TimeDelay < N

If the number of changes is less than the number of nodes divided by the time delay, distribution is favored.

Legend:

DR = Number of data retrieval transactions

Du = Number of data update transactions

N = Number of network nodes

D = Total number of data transactions (DR + Du)


ABC Video Example Data Use Distribution and Analysis 

ABC's one location simplifies the choices for this analysis. Centralization of data and processes is the only possible choice. For the record, a table of transaction volumes is presented in Figure 10-19. A secondary issue, if not already decided, is hardware selection. ABC could use a multiuser minicomputer or a LAN. This analysis, too, is simple because ABC is a small company without a high volume of processing. A LAN is cheaper, more easily maintained, more easily staffed, and less costly {or incremental upgrades. Therefore, a LAN is the choice. Most multiuser mini-computers allow eight units without major expenditures for an additional I/O controller board. Mini-computers tend to have proprietary operating systems and use packages that tie the user to a given vendor. The strength of multiuser minis is in their added horsepower that allows them to support applications with a high volume of transactions (in the millions per day). A multiuser mini is not recommended here because, for the money, it would be analogous to buying a new Porsche 911 Targa when a used Hyundai would do just fine. To discuss configuration of the LAN, we move to the next section on hardware and software installation.

TABLE 10-3 Rationale for Distribution Ratios

If T is the number of traffic units per hour (i.e., transactions), and if all data is centralized at one location (not necessarily the same), then the total traffic units per hours is

If T is the number of traffic units per hour (i.e., transactions), and if all data is centralized at one location (not necessarily the same), then the total traffic units per hours is

Tcentralized = (DR + Du) * (N - l)/N

Then, if all data is decentralized (i.e., fully replicated at all user locations), only update transactions generate network traffic, and

T distributed = Du * (N - 1)

Fully replicated, decentralized data generates less traffic than centralization if

T centralized > T distributed, or

(DR + Du) * (N -1)/N > DU * (N -1)

This reduces to DR I Du > N -1. This formula means that when the ratio of retrievals to changes (DR I Du = N - 1) is greater than N - 1, favor distribution. When the ratio is equal to N - 1, either choice is acceptable from a network point of view. When the ratio is less than N - 1, favor centralization.

If changes can be applied with a delay, the equations change. Then the breakeven point occurs when

DR < N/TimeDelay

The greater the delay, the more desirable a distributed strategy can be made to appear.

Legend: 

DR = Number of data retrieval transactions 

Du = Number of data update transactions 

N = Number of network nodes 

D = Total number of data transactions (DR + Du)


Subject Database 
Location/Function  Customer  Video  Item  Customer History Video History EOD  Archive 
Dunwoody Village Rent/Return  500 R
15 U
500 R
5 U
250 R
400U
500 R
500 U
500 R
500 U
Video Maintenance
20 R
5 U
150 R
50 U


Customer Maintenance 5 R
5 U


Other  15,000 U/ 
Once/Mo
1,000 U 15,000 U/ 
Once/Mo

FIGURE 10-19 ABC Transaction Volume Matrix