BUS611 Study Guide

Site: Saylor Academy
Course: BUS611: Data Management
Book: BUS611 Study Guide

Navigating this Study Guide

Study Guide Structure

In this study guide, the sections in each unit (1a., 1b., etc.) are the learning outcomes of that unit. 

Beneath each learning outcome are:

  • questions for you to answer independently;
  • a brief summary of the learning outcome topic; and
  • resources related to the learning outcome. 

At the end of each unit, there is also a list of suggested vocabulary words.

 

How to Use this Study Guide

  1. Review the entire course by reading the learning outcome summaries and suggested resources.
  2. Test your understanding of the course information by answering questions related to each unit learning outcome and defining and memorizing the vocabulary words at the end of each unit.

By clicking on the gear button on the top right of the screen, you can print the study guide. Then you can make notes, highlight, and underline as you work.

Through reviewing and completing the study guide, you should gain a deeper understanding of each learning outcome in the course and be better prepared for the final exam!

Unit 1: Introduction to Data Management

1a. Identify the value and relative importance of data management to the success of a project

  • What is data, and why is the management of data so critical for organizations?
  • How is data processed into information?
Data consist of facts, figures, and numbers in their raw format. Raw data is usually devoid of any inherent meaning. For example, an order from an online ordering system is a piece of raw data. Data can take the form of quantitative data or qualitative data. Quantitative data will usually have a numeric format. It can come from a count, measurement of a physical quantity, or some calculation. Qualitative data is often descriptive. "Green", as the description of a product's color, is qualitative data. Note numeric data may also be qualitative: if someone's favorite number is 4, that data would be considered qualitative because of its descriptive nature. It is not the result of any measurement.
 
Big data is data that comes in great quantity, is being collected very rapidly, or is so complex that it becomes difficult to process. It may even be impossible to process. Algorithms, AI, or other computer methods often analyze big data to reveal patterns, relationships, and history. Big data is very common in studying human behavior. It typically contains both quantitative and qualitative data.
 
In many ways, data represents the raw material of the information age economy. Organizations collect and store vast amounts of data as they operate their business processes. In most organizations, operational processes are automated. An operational process is a key activity or cluster of activities that must be performed to allow an organization to operate, achieve its mission, and remain competitive. Examples might be a marketing process, a customer ordering process, or an accounts payable process. Data from many external sources is also collected and stored in what we will call data warehouses.
 
This data serves two purposes. It helps organizations run their operational processes more efficiently, and it helps organizations gain insights. These insights are called information and result from processing data into a meaningful and useful context. Insights gained from the processing of data can be used to support decision-making. We call decisions made using such insights data-driven decision-making.
 
Data-driven decision-making involves collecting data, extracting patterns and facts from that data, and using those facts to guide decisions. Utilizing data in decision-making is superior to relying on a person's, group's, or organization's intuition. It can result in better decisions, new business insights, and the identification of new business opportunities.
 
Information is data that has been processed or manipulated to provide a meaningful context. For example, sales data can be analyzed to reveal trends and to make forecasts. It can also be segregated, consolidated, aggregated, and so forth. This processing can reveal insights that managers can use to make decisions. Data processing systems are the computer applications that perform this kind of processing.
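As a small illustration of processing data into information, the query below aggregates individual sales records into a regional summary a manager could act on. The Sales table and its columns are assumptions for illustration, not part of the course materials.

-- Summarize raw sales records (data) into per-region totals (information)
SELECT Region, SUM(SaleAmount) AS TotalSales
FROM Sales
GROUP BY Region
ORDER BY TotalSales DESC;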
 
Knowledge is information that has been further processed and aggregated. Knowledge typically involves the sorting, classification, and analysis of information. If we further process this information, we can extract and organize knowledge, and this knowledge can lead human decision-makers to wisdom.
 
Wisdom is the final and highest level of organization. It is difficult to define, and we typically only associate it with a small number of humans.



This figure represents these relationships.
 
To review, see Data and Databases.


1b. Explain the need for managing/sharing data and identify relevant public policies

  • What is data management?
  • What are some of the management roles in managing data?
  • How is data obtained?
The vast majority of today's data was created in just the past few years. The challenge is to extract value from it and put it to work for organizations and individuals. The vast amount of personal data produced by citizens can be of value to the public and private sectors.
 
Data management
describes the process of collecting, storing, and analyzing data. Organizations use data management to process business transactions, measure day-to-day operations, and for future decision-making. As a result, decision-makers can rely on data to make choices and take actions that benefit the organization.
 
Data management plans (DMP) are written documents describing the data an organization expects to acquire during research or another project. The DMP provides the framework for managing, analyzing, and storing the data used in the project. It also includes the organization's mechanisms to share and store this data. Because the process of doing research may require adjustments, the DMP is a living document. You may alter the plan as needed as the research changes. Remember, anytime a research plan changes, you must review the DMP to ensure it still meets the needs of the research.
 
Data is obtained through either the ongoing operation of the business processes or through deliberate efforts to gather it. Data may come from either within the organization or external to the organization. Most organizations are excellent at collecting data. However, there is a need for skilled professionals who can manage, analyze, and reveal insights from big data. Organizational leaders seek personnel who can provide reliable and trustworthy data insights through data management. Because of technological advancements, organizations can collect and store more data faster than ever. Leaders are ready to move past collecting raw data. It is time to leverage this data to improve best practices, profits, and efficiency.
 
To review, see Data Management Plans.


1c. Use the lifecycle continuum to manage and preserve data

  • What is the data lifecycle management process?
  • How do we use the systems development life cycle to develop data management systems?
Data lifecycle management (DLM) is a part of the data management plan (DMP). Therefore, it is essential to maintain DLM standards since data is considered a valuable resource to organizations. A standard is a repeatable, harmonized, agreed upon, and documented way of doing something. Standard practices ensure that reasonable uniformity is present in complex systems. Many standard practices in database management systems guide how work is performed.
 
The System Development Life Cycle (SDLC) is a process for developing computerized systems. It defines a standard set of actions, policies, and procedures developers would follow to develop or modify systems. The SDLC will typically consist of the following steps: systems analysis/planning, systems design, building of the system, implementation, and testing.
 
Even with a process and plan in place, an organization's ability to govern data will ensure value and integrity within stored data. Data governance is a set of regulations, policies, processes, and human responsibilities that govern how an organization will use data to achieve the organizational mission and strategy and comply with laws and regulations. Data governance defines who can take what action, upon what data, in what situations, and using what methods. Data governance frameworks and maturity models have been developed to aid the organization in ensuring that its governance policies and processes serve the organization's needs in the most effective way. A policy is a deliberate system of guidelines to guide decisions and achieve rational outcomes. A policy is a statement of intent and is implemented as a procedure or protocol. In data management, we must be aware of policies that define and guide the operational processes, data management, and governance, and those derived from external laws and regulations.
 
To review, see Data Management Planning.


1d. Explain what research data is and how it is collected and stored

  • What is research data, and how does it differ from operational data?
Research data is collected and stored daily in various forms. Because data is a valuable resource, it requires proper management and sharing between organizations. Remember, data management ensures reliable information and protects the integrity of data within your organization. Be sure you have a detailed plan on how to manage research data.
 
To review, see Basics of Research Data Management.
 

Unit 1 Vocabulary

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.
 
  • big data
  • data
  • data-driven decision-making
  • data governance
  • data lifecycle management
  • data management
  • data management plan
  • data processing system
  • data warehouse
  • information
  • knowledge
  • operational process
  • policy
  • qualitative data
  • quantitative data
  • research data
  • standard
  • System Development Life Cycle (SDLC)
  • wisdom

Unit 2: Understanding Databases and DBMSes

2a. Explain the fundamental properties of a database and a database management system 

  • What is meant by the term database?
  • How does a database management system allow for the creation and maintenance of databases?
  • What kinds of software are available to create and manage databases?

A database is a collection of data that is formatted in a formal manner to achieve a particular purpose for an organization. The database is both a historical repository that can support decision-making and a collection of data that can support ongoing operational processes. Multiple users access the database to support multiple roles.
 
A database management system (DBMS) is an integrated set of computer processes and algorithms that allow for a database's creation, maintenance, and operation. The DBMS will also control security and access to the underlying data. The primary objective of a DBMS is to provide an integrated and systematic computer environment. The DBMS should be easy to use, have a high degree of structural and data integrity, efficiently use computing resources, and facilitate users' needs as simply as possible.
 
DBMS Software is a set of designated programs that manages and regulates the database. DBMS Hardware consists of the physical components (computer, hard drive/disk) needed before any data can be successfully stored. DBMS Data is the primary reason for the DBMS design. Remember, the DBMS was created to store and analyze data to support organizational operations and planning. DBMS Procedures are the organization's general instructions on using a DBMS to support a specific organizational process. This includes setup, installation, login, logout, management process, and report generation. Database Access Language is a simple language to write commands to either access, insert, update, or delete data stored in a database. Structured Query Language (SQL) is the most commonly used access language.
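As a brief sketch of a database access language at work, the statements below read, insert, update, and delete rows. The Products table and its columns are hypothetical, used only for illustration.

-- Read rows that satisfy a condition
SELECT ProductID, ProductName, Price FROM Products WHERE Price > 10;
-- Add a new row
INSERT INTO Products (ProductID, ProductName, Price) VALUES (101, 'Widget', 9.95);
-- Change an existing row
UPDATE Products SET Price = 10.95 WHERE ProductID = 101;
-- Remove a row
DELETE FROM Products WHERE ProductID = 101;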
 
Several characteristics distinguish the database approach from the file-based system or approach. Among these are:

  1. Self-describing: the database contains not only the database itself but also metadata that defines and describes the data and relationships between tables in the database
  2. Views: a subset of the database, which is defined and dedicated for particular users of the system
  3. Multi-user: allowing many users to access the same database at the same time
  4. Integrity constraints: restrictions or rules that dictate what can be entered or edited in a table
  5. Data independence: The system data descriptions or metadata are separated from the application programs.

To review, see Database Systems Concepts.

 

2b. Explain the objectives of data and information management and outline the database development process and software tools used to support development

  • How would you distinguish a file-based system from a database?
  • Why is data independence important, and what are the key features of this independence?

A fundamental goal of the management of information is that it be cared for in a way that allows the organization to achieve its mission and operate its systems. The use of a DBMS facilitates this management function. Because the DBMS is integrated, efficient, and robust, data becomes available to any organizational stakeholder needing access. This, in turn, supports achieving the organization's strategic objectives.
 
A database in a database management system represents a centralized repository for all of the data that is to be stored by an organization. In the past, the data associated with a particular IT application was stored with that application. We call that program-data dependence. With a database, we can eliminate that dependence. Each application can be developed independently and use the standardized structure of the database management system. Similarly, users can interact directly with the database management system. This allows for changes to be made to the operational systems without needing to change the database each time a change is made to the operational system.
 
As mentioned previously, several characteristics distinguish the database approach from the file-based system or approach. Perhaps the most significant of these is the elimination of program-data dependence. Users and applications both interact with the database; users do not have to go through application programs to reach the data. This figure illustrates this idea graphically.


Notice the way that the database is represented with the large canister. This is a standard symbolic representation of a database. Notice how the various applications can represent only portions of the database. Each application will have its associated data stored within the central database. Notice also how the data associated with applications might overlap. This is because some data elements might be used in several different business processes and application programs. An example of this might be data relating to a customer. Such data might be used for product ordering, billing, and other applications.
 
We call this feature of a database management system data independence. One way to think about this is that the data and the way the data is defined are independent of the programs developed to support particular processes. The DBMS makes this possible by storing the data in an integrated database engine and requiring all access to the data from application programs to be routed through this engine.
 
To review, see Introduction to Database Management System (DBMS).

 

2c. Outline the database development process and software tools used to support development

  • What are the critical elements of a database design process?

Database design is a process to facilitate the design, development, implementation, and maintenance of an enterprise DBMS. Enterprise systems are a combination of hardware and software for operational use. Organizational operations typically consist of functions and tasks necessary to implement a business process. Therefore, an enterprise DBMS manages reporting and supports more than one function within an organization.
 
The database design and development process will be a multidisciplinary and multi-user process. The ultimate goal is to create a database design that conforms to good design practices and meets the users' needs. The business process owners contribute a deep understanding of the needs of their process and the data elements required to support those needs. Staff from the larger organizational information systems department contribute a perspective on the organization's hardware, software, and network standards. Finally, the database administrator provides an understanding of the type of modeling processes available and how those modeling processes and models will lead to a good design.
 
Most database design efforts follow some standardized methodology. This methodology is often either some variant of the Systems Development Life Cycle (SDLC) or one of the more rapid iterative methodologies like Rapid Application Development (RAD).
 
The development process will also be guided by the standards developed by that organization in data governance. Different data governance policies will influence the design of database systems. For example, if the data governance standard calls for the maintenance of data for some specified period, then the development process relating to the management of data archives and backups must be adjusted to be in conformance with this governance standard.
 
To review, see Combining Data Management with Organizational Change.

 

2d. Describe DBMS components and administration 

  • What are the main components of a database management system (DBMS)?
  • What are the properties of a DBMS?

In addition to the physical components, such as hardware, a database will have several primary components. These would include the data entry system, the storage system, the security system, and the backup and recovery system. Each of these systems will have unique administrative requirements and require specialized administrators, policies, and procedures. For example, the archiving, backup, and recovery system will require administrative policies describing how frequently backups are to be conducted, where backups are to be stored, what level of security is required on backed-up media, and so forth. There would also likely be an administrator for the backup process with specialized knowledge, skills, and abilities.
 
The various people and roles that have access to the database will have specific privileges. These privileges define what data a particular person will be granted access to. For example, in an HR system, some users might have read-only access, some might have the ability to modify certain data elements, and so on. There may also be multiple levels of approval before a data element, such as an employee's salary, is changed. These privileges will depend on the user's role within the operational or database process. This assignment of specific privileges is also why the DBMS needs to provide robust and granular security capabilities.
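As a hedged sketch of how such privileges might be granted in SQL, the statements below use hypothetical role and table names; exact role syntax and column-level grants vary by product (PostgreSQL, for example, supports the column lists shown).

-- Read-only access to the employee list for a reporting role
GRANT SELECT ON Employees TO hr_reporting;
-- Allow an HR specialist role to update contact details only
GRANT UPDATE (Phone, Address) ON Employees TO hr_specialist;
-- Withdraw a privilege that is no longer needed
REVOKE UPDATE (Phone, Address) ON Employees FROM hr_specialist;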
 
Before the advent of relational databases, most transaction processing systems were characterized by application-specific data structures. Applications were not integrated, and thus there was no way to share data between applications. With the advent of database processing systems, it was possible to integrate and centralize the data relating to many operational processes in a single location. The management and administration of this centralized database system is the database administrator's responsibility.
 
To review, see Combining Data Management with Organizational Change.
 

2e. Design the logical and physical structure of a relational database for efficient data storage

  • What is the purpose of a data model?
  • What is a data dictionary, and how is it used?

There are many different ways that databases could be organized. These different ways of data organization are referred to as data models. A data model is a logical structure of data items and the relationships between them. Numerous models have been developed over time, but the predominant one since about the 1980s is the relational model. Relational databases are now found in all types of organizations and are the most commonly used organizational scheme for databases.
 
In a relational database, the form of data organization consists of tables (or relations). A table will consist of fields. The format of the fields defines the structure of the data stored in the table. One instance or occurrence of a group of fields is called a record.
 
The following figure illustrates this graphically. The tables (relations) are shown along with the key field (attribute) and the non-key attributes. For example, the STUDENT table (STUDENT relation) indicates that StudentID is the key field. Similarly, the COURSE table is shown with CourseID as the key field. The relationships between the tables are represented by the inclusion of the key field from one (or more) relations occurring as a foreign key in another relation. Notice how the ClassroomID field (attribute), which is the key field in the CLASSROOM relation, is included in the COURSE relation as a non-key field. We call such a field a foreign key; it establishes the relationship between the two tables and its cardinality.
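A minimal sketch of how these tables might be declared in SQL follows; the non-key columns are assumptions added for illustration.

CREATE TABLE CLASSROOM (
    ClassroomID  INT PRIMARY KEY,
    Building     VARCHAR(50)
);

CREATE TABLE STUDENT (
    StudentID    INT PRIMARY KEY,
    LastName     VARCHAR(50)
);

CREATE TABLE COURSE (
    CourseID     INT PRIMARY KEY,
    Title        VARCHAR(100),
    ClassroomID  INT,
    -- ClassroomID is a non-key field here but the key of CLASSROOM: a foreign key
    FOREIGN KEY (ClassroomID) REFERENCES CLASSROOM (ClassroomID)
);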


When we define a collection of data in a database model, we also describe how the data is organized. One convenient structure to create is a data dictionary, which gathers the definitions and formats of all the data elements in one place. Recording why each element is collected, how it figures into the structure of the database, and how applications use it gives designers guidance as they create and modify the database over time.
 
When designing the physical structure of the database on disk for efficient storage and read and write access times, we have to consider the organization of the disk and how the read and write heads of the disk drive travel over its surface. Partitioning allows different sections of the database to be stored in unique areas on a disk drive. This partitioning process improves disk performance and supports faster read and write times for the database.
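Partitioning syntax differs from product to product; as one hedged example, PostgreSQL-style declarative range partitioning of a hypothetical Orders table looks roughly like this:

-- The parent table is split into partitions by order date
CREATE TABLE Orders (
    OrderID    INT,
    OrderDate  DATE,
    Amount     NUMERIC(10, 2)
) PARTITION BY RANGE (OrderDate);

-- Each partition can be stored in its own area of the disk (tablespace)
CREATE TABLE Orders_2023 PARTITION OF Orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE Orders_2024 PARTITION OF Orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');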
 
To review, see Combining Data Management with Organizational Change.
 

Unit 2 Vocabulary

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • data dictionary
  • data independence
  • data model
  • database
  • database access language
  • database management system (DBMS)
  • enterprise system
  • field
  • key field
  • metadata
  • multi-user
  • partitioning
  • privileges
  • record
  • relation
  • relational database
  • self-describing
  • Structured Query Language (SQL)
  • table
  • views

Unit 3: Data Models

3a. Describe the various data models, including logical models like the ERD model and physical models like the relational model

  • What are data models, and how are they used?
  • What are the different types of data models, and how do they differ?
Data models provide a basis for designing a computerized database system and are also beneficial for communicating with non-technical business users. Business process owners work with data modelers to ensure that the developed database system will meet the organization's needs. The data model provides a common point of reference in the development process. As a data professional, you must create and structure database tables to support business operations and processes.
 
Conceptual, logical, and physical data models are methods used to model data in a domain. While each of these is considered a data model, they are created for different purposes.
 
A conceptual model focuses on identifying data used within a business. It supports business development and events and tracks performance measures. However, it does not focus on process flow or other physical characteristics.
 
Logical data models are a way to translate business concepts for data purpose/use into Entity-Relationship diagrams. Logical data models describe data in as much detail as possible. This model is more complex than conceptual models. Logical data models establish column (data) types to include all entities and relationships among them. Remember, logical data models have nothing to do with creating a database. We need to identify the data elements we want to translate and how they are related through key fields.
 
Physical data models define all required logical database components and services to build a database. Physical data models can also be a layout of an existing database. Typically, physical data models contain the table structure, column names, primary keys, and other relationships among tables. The most commonly used physical model is the Relational Model. The relational model was initially developed by Edgar F. Codd in 1969 and has become the international standard. Today most Database Management Systems are Relational Database Management Systems (RDBMS). Gaining an understanding of the relational model and relational databases is thus an essential skill for anyone seeking to work in the data management field.
 
This figure illustrates the relationship between the user's view of the data and the different types of models that are used to design databases.


The ultimate product of a logical and physical data modeling process is the design of the structure of the database tables, also called relations. These tables must be appropriately designed to show things like cardinality, the key structure that preserves data integrity, and the relationships between the tables. These relational designs are created using relational notation and are then implemented in the DBMS using languages like SQL.
 
To review, see Types of Data Models.

 

3b. Distinguish between the disadvantages and advantages of different data models

  • What are the advantages and disadvantages of the E-R model?
  • What are the advantages and disadvantages of the relational model?
  • How can E-R and relational models be used to develop a database?
Conceptual data models allow us to represent the business process through a typically graphical representation. The most common type of model is the entity-relationship (E-R) model. The E-R model starts by allowing us to identify physical entities in our business process and identify the attributes of these entities and the relationships among them. An entity is anything we find in the real world. It can be physical, such as a customer, or conceptual, such as a project. The entity is composed of attributes representing properties like a customer's name, address, and age. A relationship is an association among entities; for example, a customer might buy many different products. A relationship then exists between the customer and the product.
 
A primary advantage of the E-R model is that it makes intuitive sense to non-technical business process owners. Other advantages include its use of non-technical, non-mathematical terminology. The main disadvantage of the E-R model is that such models cannot be directly converted to relational database implementations. A relational model must first be derived from the E-R model, and then that relational model is used to implement the physical structure of the database itself.
 
Older logical models might represent how the data is stored in the computer disk drives. The two most common of these older models are network data models and hierarchical data models. The two older models, hierarchical and network models, were developed before the widespread availability of commercial relational database management software and allowed for the development of databases that could run on older, non-relational software. Over the last several decades, the relational model and relational database software have come to predominate the marketplace. As a result, almost all new development will be done using the relational model. However, the older two models are still encountered in practice, especially on projects that are updating older systems.
 
The primary advantage of working directly in the relational model is that a relational model can be easily converted into a relational database. There are many automated design tools available to do this. The main disadvantage of the relational model is that it requires a certain level of understanding of relational design and uses somewhat arcane terminology that can make it difficult for business process owners to grasp intuitively.
 
To review, see Data Modeling and Metadata Management.
 

3c. Describe why each data model, ERD and Relational, are used, and the benefits and drawbacks of each

  • What are ERD and Relational modeling?
  • How do ERD models flow from logical models and business rules?
Logical data models are a way to translate business concepts for data purposes/uses into Entity-Relationship diagrams. Logical data models describe data in as much detail as possible. This model is more complex than conceptual models. Logical data models establish column (data) types to include all entities and relationships among them. Remember, logical data models have nothing to do with creating a database. We want to translate the data elements and identify how they are related through key fields.
 
Businesses define how entities or tables are related in the business process. A typical relationship characteristic is called cardinality. Cardinality expresses the minimum and the maximum number of entity occurrences associated with one occurrence of another entity.
 
In the following figure, from a database used by an educational institution to relate teachers to classes, the cardinality is indicated with the "crow's feet": the business rule states that one teacher can teach many classes, but one class can have only one teacher, as indicated by (1:N). Notice also that the O symbol and the double vertical line symbol indicate whether participation in the relationship is optional or required. In other words, each class must have a teacher, but each teacher does not have to teach any particular number of classes.



The entity-relationship (ER) data model has existed for over 35 years. The model is very useful for data modeling for use with databases. The model is abstract enough to represent complex situations, yet it is fairly intuitive and easy to convey to the business user. In addition, analysts can easily translate ER models into relational models. We call the graphical representation of an ER model an ER diagram.
 
ER modeling depends on two fundamental concepts:
 
  1. The entities themselves, with data defined as tables that hold specific information
  2. The relationships between the entities that define their relationships and interactions.

To review, see Physical Data Models.

 

3d. Apply E-R modeling to several practical examples

  • How is E-R modeling applied in practice?
  • What is the role of the database administrator in E-R modeling?
The database administration process and the database administrator (DBA) are responsible for the design and administration of data models and the data integrity constraints included in those models. Missing data elements are likely caused by poor data integrity controls and would thus result from poor administration. Database administration is the function of managing and maintaining database management systems (DBMS) software. As a part of this, database administrators are responsible for the data modeling and design process and ensure that operational databases are designed to high professional standards.
 
This figure provides an example in a practical case. In this case, we are storing information about students in a database being used by an educational institution. Notice the structure of the table. The fields are identified, the key field is identified, and the non-key fields are identified.



Notice that this type of table structure can be used both to facilitate communication between the business and technical staff on the development team and to support the later construction of the relational model and the design of the physical structure of the relational database.
 
During the process of conceptual modeling, the users and business process owners have identified the entities involved in their business process. Consider an academic institution. Two entities of interest are the teachers and the classes that those teachers are assigned to. During the modeling process, the business rules were determined, and one of those rules is that teachers may teach many different classes but that each class is only taught by a single teacher. Notice that this would be reflected in the relational model figure using "crow's feet" that identify the cardinality 1:N. One teacher can teach a number (N) of courses. Since a given teacher may not be teaching any courses at a point in time, this is indicated by the 0 on the relationship. However, each class has a teacher. This is indicated by the two vertical hash marks on the relationship.
 
Notice also that the key attributes have been identified for each entity and are underlined. Non-key attributes are not underlined.
 
By identifying all of the entities and the relationships between those entities, we have constructed a model of the business rules. This E-R model can then be used to implement the relational database to support the business process.
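A hedged sketch of how this teacher-class business rule could be carried into the physical design follows; the table and column names are illustrative rather than taken from the course materials.

CREATE TABLE TEACHER (
    TeacherID  INT PRIMARY KEY,
    LastName   VARCHAR(50)
);

CREATE TABLE CLASS (
    ClassID    INT PRIMARY KEY,
    Title      VARCHAR(100),
    TeacherID  INT NOT NULL,  -- every class must have exactly one teacher
    FOREIGN KEY (TeacherID) REFERENCES TEACHER (TeacherID)
);

-- A teacher may appear in zero, one, or many CLASS rows (the N side of 1:N),
-- while NOT NULL on TeacherID enforces the mandatory side of the relationship.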
 
To review, see Lessons in Data Modeling.

 

3e. Analyze a database system for security flaws

  • What are some of the ways that we can analyze a database to identify security flaws?
Like any system, the security of a database processing system is both necessary and critical. Because organizations store large amounts of data, much of it proprietary, confidential, or subject to legal requirements for security, it is essential that database security is "designed in" and that any flaws be identified and corrected.
 
There are several dimensions to database security. Because most databases are used in a distributed, multiuser environment, access and privilege management is critical. Users and administrative staff should only be given access to those data elements required to perform their job or execute the allowed function. Most modern database management systems allow for much granularity and specificity concerning access; a common flaw is the overbroad granting of access. Great care should be taken to rigorously examine users' actual needs and grant them only the minimum required privileges.
 
In addition to the determination and granting of privileges to users authorized to access the system, we are also concerned with others who have no legitimate need to access the system gaining entry. We need to rely on the functionality built into the database system and the network operating environment for these situations. A somewhat trivial though prevalent flaw is the management of access passwords. The administrator should ensure that users are required to select strong passwords and that procedural and training processes are in place to avoid users sharing those passwords. Strategies like two-factor authentication, requiring frequent changing of passwords, and prohibiting the sharing of passwords are examples.
 
Other, more technical flaws may also exist in database systems. Because SQL is used extensively with relational database systems, security flaws can be introduced into the system through improperly developed SQL code. For example, SQL injection attacks are possible because of sloppy coding practices.
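As a hedged illustration of the risk (the Users table and the login query are hypothetical), consider an application that builds its SQL by pasting user input directly into the query text:

-- Intended query, with the user-supplied values substituted into the string:
SELECT * FROM Users WHERE UserName = 'alice' AND Password = 'secret';

-- If an attacker enters  ' OR '1'='1  as the password, the concatenated text becomes:
SELECT * FROM Users WHERE UserName = 'alice' AND Password = '' OR '1'='1';
-- The OR condition is always true, so the login check is bypassed.
-- Mitigation: never concatenate input into SQL; pass input as parameters through
-- prepared statements so it is treated as data rather than as SQL text.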
 
Data lineage includes the data origin, what happens to it, and where it moves over time (essentially the full journey of a piece of data). Data lineage is useful for tracing errors back to their root cause in the data process and is a way of debugging big data pipelines, but the process is not simple. There are many challenges, such as scalability, fault tolerance, and anomaly detection.
 
To review, see Database Security.

 

3f. Design mitigating systems and procedures to address security flaws

  • What are some of the key strategies for addressing database security flaws?
  • Why is administration as important as technical processes in addressing security flaws?
As security policies are developed to support business operations, good security practice ensures that data is only made accessible to those staff with a documented business need to access the data. Security policies should be designed so that only those users who have a legitimate need to access particular data are given access to that data. This is particularly important in the case of sensitive information like customer data.
 
A flexible privileges policy should allow different people to have different levels of privilege. The privilege level assigned to a given user would depend on the roles and responsibilities of that individual. Careful attention should be paid to this. Good security practice is ensuring that data is only made accessible to those staff with a documented business need to access the data.
 
The administrative team should practice good password management. The administrator should ensure that users are required to select strong passwords and that procedural and training processes are in place to avoid users sharing those passwords. Strategies like two-factor authentication, requiring frequent changing of passwords, and prohibiting the sharing of passwords are examples.
 
Careful attention to developing, testing, and constructing SQL code can mitigate against SQL injection attacks.
 
To review, see Database Security.

 

Unit 3 Vocabulary

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.
 
  • attribute
  • cardinality
  • database administrator (DBA)
  • entity
  • entity-relationship (E-R) model
  • data lineage
  • password management
  • relational model
  • SQL injection

Unit 4: Big Data Processing and Cloud Computing

4a. Describe big data and cloud computing

  • What are some of the issues associated with cloud data storage?
  • What is meant by big data, and how is it related to cloud storage?

In many cases, the data we are storing in our DBMS is far too large to be stored in a single location. There are also significant risks associated with storing all of our strategic data in a single location. For this reason, we make use of cloud data storage systems.
 
A cloud system consists of various types of IT hardware, software, and physical infrastructure that allow cloud service providers to deliver various services. The most common services are (1) SaaS (software as a service), (2) PaaS (platform as a service), and (3) IaaS (infrastructure as a service). These services would be delivered over a network, typically the public internet. Cloud systems must be highly flexible and allow for various technologies and systems of all vintages and standards. Cloud systems and the vendors and service providers who support them must be able to integrate many different types of technology and systems of different vintages and vendors. New technology and systems are constantly being developed, and cloud systems must allow these new technologies to be integrated into the older technologies already in use.
 
Cloud services vendors must be able to provide non-proprietary network management solutions to allow for the wide range of technologies that must be integrated into the cloud system.
 
This figure is a conceptualization of a cloud storage system.



To review, see Introduction to Big Data.
 

4b. Identify the advantages and disadvantages of cloud computing

  • What are some of the challenges of managing cloud services?

A distributed database represents multiple interconnected databases spread across several sites connected by a network. Since the databases are all connected, they appear as a single database to the users. The distributed sites are either managed by the host organization or contracted to a cloud services provider. One of the major advantages of the distributed model is the ability to run the database on a variety of different servers and equipment.
 
Cloud-based organizations use cloud computing to deliver computing services. This includes servers, databases, networking, software, analytics, and intelligence over the cloud (internet). The cloud offers faster innovation, economic scalability, and flexibility of organizational resources. Organizations rely on cloud computing to perform virtually every function, including analytics. Cloud services providers take great care in siting and designing data centers for optimal performance. A data center is simply a building, a dedicated space within a building, or a group of buildings used to house computer systems and associated components, such as telecommunications and physical storage systems.
 
The primary reasons for using such an arrangement are efficiency, redundancy, and cost savings. By taking advantage of the scale and scope of a cloud services provider, we can get just what we need at a price that matches our needs. We also gain the advantages of extensive redundancy, reliability, and security that such a service provides.
 
Like anything, cloud services are not without their negative points, though. Disaster recovery is something that any manager of a system must consider. In cloud computing systems, disaster recovery can be more of a challenge than in a single-site system. This is because Cloud Service Providers (CSPs) must provide services to their customers at all times. Because they operate many data centers, this can make disaster recovery more complicated. Disasters can lead to expensive service disruptions. Two disaster recovery models can be used to prevent failure: Traditional and cloud-based service models.
 
Another strategy to address disaster recovery is the use of fault-tolerant architectures. A Fault-tolerant system can continue to perform its function and operate even in the presence of failures in some of its parts. The concept can apply to a single computer system, a cluster of data centers managed by a cloud services provider, a link in a network, or any other component that might fail. For example, multiple instances increase redundancy, resulting in a more fault-tolerant configuration.
 
To review, see Cloud Computing and Business Expectations.
 

4c. Describe the challenges involved with managing big data

  • What are some of the characteristics of big data?
  • What are some of the challenges in managing and processing big data?

Big data typically describes data sets so large or complex that traditional data-processing techniques often prove inadequate.
 
The structure of big data is described by:

  • Volume: amount measured in gigabytes or terabytes
  • Velocity: the speed and frequency at which data arrives, from one-time snapshots to continuous streams
  • Variety: structured, numeric, alpha, unstructured, text, sound, image or video, genomics
  • Veracity: validation, noise level, deception, detection, relevance, ranking
  • Value: the usefulness of the data in supporting decisions that add economic value

 
This figure illustrates these characteristics.


The administration of big data presents many challenges. These challenges revolve around the sheer size of the database, the networks required to move data, the disk resources required to store the data, and the computer processing power required to query or process the data. For these reasons, distributed networks and solutions are often the only options. Even supercomputers are not large enough to do the job on their own, and the amount of disk space required necessitates that resources be spread geographically among several data centers. This is why organizations that use big data resources almost always engage massively scaled cloud service providers.
 
It is also worth noting that the type of data that is increasing most rapidly is unstructured data. This data type is characterized by "human information" such as high-definition videos, movies, photos, scientific simulations, financial transactions, phone records, genomic datasets, seismic images, geospatial maps, e-mail, tweets, Facebook data, call-center conversations, mobile phone calls, website clicks, documents, sensor data, telemetry, medical records and images, climatology and weather records, log files, and text. This type of data is not only voluminous, but the diversity of different formats makes it very difficult to manage and store in traditional relational database structures.
 
To review, see Big Data Opportunities and Challenges.
 

4d. Analyze the relationships between cloud computing and big data

  • In what ways are the administration of big data and cloud services related?
  • What role does the network play in provisioning cloud services for big data?

Cloud computing and big data are logically interrelated. Big data, typically stored in a data warehouse, may have come from internal and external sources in the organization. Because of the size and diversity of big data, it is necessary to apply more comprehensive processes to manage it. Cloud systems represent a very attractive option for doing this. The management and administration of these cloud database systems is the database administrator's responsibility.
 
One issue that arises in the development of a cloud services relationship to support big data is the issue of provisioning. Provisioning involves the specifications and range of services the user will require from the cloud services provider to meet the organization's needs. Integrating big data administration with the cloud infrastructure must be a joint effort between the provider of those services and the contract user. It should also be noted that the requirements will change over time, and therefore provision needs to be made for the evolution of the services as needs change.
 
Another dimension of provisioning for cloud services involves the network. Local corporate data centers tend to be located where the company is located and thus might not reside in locations with the optimum characteristics for data centers; therefore, the data centers provided by the cloud services provider might be at some distance. Thus, network issues need to be addressed.
 
It is also worth noting that large cloud providers that implement geographically dispersed sites worldwide can typically achieve reliability rates sufficient for large-scale operations. These same levels of network reliability might not be achievable by the organization on its own.
 
To review, see Introduction to Big Data.
 

4e. Apply virtualization techniques to provide resources for cloud computing solutions

  • What is virtualization, and how is it accomplished?
  • What are some of the challenges of virtualization?

Network virtualization combines various hardware and software components that deliver network services into a single entity, managed by software, called a virtual network. Network virtualization could include the platform itself or another network resource. We often achieve this virtualization for large database systems by contracting with cloud service providers.
 
In a virtualized network with local autonomy, the individual local nodes are responsible for the administrative functions related to those nodes. This would include local node security, backup and recovery, and concurrency control.
 
Because virtualized systems are not confined geographically, they will generally result in resources and data being housed in data centers in a number of different countries. This results in an inconsistent set of laws and regulations that govern the storage and administration of the data.
 
To review, see Compliance and Risk in a Cloud Environment.
 

Unit 4 Vocabulary 

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • data center
  • distributed database
  • fault-tolerant
  • value
  • variety
  • velocity
  • veracity
  • virtualization
  • volume

Unit 5: Introduction to SQL

5a. Describe SQL and summarize its basic operators

  • What is SQL, and what are some of its most common uses in managing databases?
  • What are some standard operators of the SQL SELECT command, and how are they used?
A common way users and designers interact with relational databases is to use Structured Query Language, SQL. SQL is designed to be an easy-to-use language with a natural language feel. This makes it well-suited to work with relational databases.
 
SQL is a programming language for communicating with databases that has been around for decades. It is the standard language for relational database management systems. Organizations cannot have an effective data management program without SQL. SQL is a universal and standard language that interfaces with relational databases. SQL is used to communicate with, analyze, and provide results from the data fields stored in a relational database.
 
A query is a question or inquiry to a data set within a database, commonly referred to as a database query. SQL queries require a SELECT statement informing the system to look for specific data. Therefore, SQL query statements retrieve the requested records from a database table. The SELECT statement is probably the most commonly used SQL statement employed by end users of database systems and is worthy of your careful attention. The SELECT statement will include operators. An operator manipulates individual data items and returns a result. SQL is an industry-standard language that consists of user-defined and system-defined functions and queries.
 
In a DBMS, common operations like creating tables and their structures can be implemented with SQL. The language can also add, modify, and delete data from the database structure. SQL can also be used to run extractions and queries to facilitate the transformation of raw data into formatted information for use in decision support applications.
 
The American National Standards Institute (ANSI) adopted the official SQL standard in 1986. Then, the International Organization for Standardization (ISO) adopted SQL in 1987. SQL is a standardized programming language used to perform various operations that manage relational database management systems (RDBMS). For example, if you are asked to identify and count the customers whose last name is "Jones" in a table of customers, SQL is the programming language used to communicate this command or query to the database.
 
There are many advantages to learning SQL. SQL is a universal language and intersects almost every industry. Many SQL database systems are open-source, and SQL is easier to learn than many other programming languages. Open-source code is made freely available for possible modification and redistribution. You can manage millions of rows of data using SQL. Remember, RDBMS are designed to store millions of rows of data, and SQL is the language that lets you manage enormous amounts of data.
 
This is the syntax of a typical SQL command:
 
CREATE TABLE <tablename>
(
ColumnName Datatype OptionalColumnConstraint,
ColumnName Datatype OptionalColumnConstraint,
OptionalTableConstraints
);
 
SELECT FirstName, LastName, phone
FROM Employees
ORDER BY LastName;
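Building on the example above, a SELECT statement typically combines comparison and logical operators in a WHERE clause. The Department and HireDate columns below are assumptions added for illustration.

-- Comparison (=, >=), pattern-matching (LIKE), and logical (AND) operators
SELECT FirstName, LastName, phone
FROM Employees
WHERE Department = 'Sales'
  AND HireDate >= '2020-01-01'
  AND LastName LIKE 'J%'
ORDER BY LastName;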

To review, see Data Manipulation Language.
 

5b. Use relational views to simplify database processing

  • What are database views, and how can they be used?
  • What is the difference between a dynamic view and a static view?
A database can also be divided into subsets called views. These views can then be used to run specialized queries that affect the data in the view. This makes for more efficient execution of queries. Views can also be saved for later use. A database view is a searchable object in a database that is defined by a query. Though a view doesn't store data, you can query a view like a table. Views allow queries to be simplified for developers. Views are also useful for security. In larger organizations, where many developers may be working on a project, views allow developers to access only the data they need. What they don't need, even if it is in the same table, is hidden from them, safe from being seen or manipulated.
 
Database views can be either static or dynamic. A dynamic view draws data from one or more tables and would include all the columns in those tables. Dynamic views are updated automatically if related or extended objects are created or changed. A static view draws data from multiple tables using the SELECT command and WHERE clauses. The WHERE clause filters records based on specified criteria. Static views must be manually updated when related or extended objects are created or changed.
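A minimal sketch of defining and then querying a view follows; the Orders table, its columns, and the view name are assumptions for illustration.

-- Define a view restricted to the rows and columns a developer needs
CREATE VIEW ActiveCustomerOrders AS
SELECT CustomerID, OrderID, OrderDate, TotalAmount
FROM Orders
WHERE Status = 'Active';

-- The view can then be queried like a table
SELECT CustomerID, COUNT(*) AS OpenOrders
FROM ActiveCustomerOrders
GROUP BY CustomerID;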
 

 

5c. Use data definition language (DDL) commands to create tables

  • How is SQL used to create the database and tables?
  • How is SQL used as a data definition language?
SQL can be used to create the database and table structures, using SQL as a data definition language (DDL). A DDL is the set of commands we use to create and modify the structure of a database, as distinct from the data manipulation commands that insert, update, and query the data itself. Most commercial relational database management systems allow SQL to be used as the data definition language. Common DDL statements include:
 
  • CREATE (generates a new table)
  • ALTER (changes a table)
  • DROP (removes a table from the database)
An example of using SQL as a data definition language would be creating a new database named NDB. The SQL statement would be CREATE DATABASE NDB. After creating the database in this manner, we can use SQL to create the database tables. We can use the ALTER TABLE statement to add or drop any required constraints. The SQL command DROP TABLE can be used as needed to remove tables.
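Putting these statements together, a hedged sketch of that DDL sequence follows; everything beyond the database name NDB (the Department table and its constraint) is illustrative.

CREATE DATABASE NDB;

CREATE TABLE Department (
    DeptID    INT PRIMARY KEY,
    DeptName  VARCHAR(50)
);

-- ALTER TABLE adds or drops constraints and columns as requirements change
ALTER TABLE Department ADD CONSTRAINT uq_deptname UNIQUE (DeptName);

-- DROP TABLE removes a table (and its data) from the database
DROP TABLE Department;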
 
To review, see More on SQL.

 

Unit 5 Vocabulary

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.
 
  • ALTER
  • CREATE
  • database query
  • DROP
  • data definition language (DDL)
  • dynamic view
  • open-source
  • operator
  • SELECT
  • static view
  • WHERE

Unit 6: Data on the Internet

6a. Describe the importance of emerging technologies, such as APIs and web crawlers, that allow businesses to operate internet-enabled data management systems and carry them forward in a constantly evolving environment

  • How is data stored and accessed over the internet?
  • How does a web crawler extract information from websites?

The growth of the internet has led to an increase in e-business and e-commerce. An e-business is any organization that conducts business over the internet. E-commerce is the transmission of funds or money over an electronic network, primarily the internet. Both e-business and e-commerce may occur as business-to-business (B2B), business-to-consumer (B2C), consumer-to-consumer (C2C), and consumer-to-business (C2B).
 
The internet gives us instant access to millions of IP addresses. An Internet Protocol address (IP address) is a numerical label, such as 188.6.7.4, assigned to a device connected to a network that uses the Internet Protocol for communication. It digitally connects us to numerous networks with the click of a key or touch of a screen. Advancements have increased internet use for business, and data is now easily collected and used for business growth.
 
Websites collect and store vast amounts of data on each consumer. Organizations determine what is relevant and irrelevant to the consumer. Data abstraction is a process that delivers only necessary information while concealing background details. So far, you have learned that database systems (DBMS) are made of complex data structures. To improve user experience and ease user interaction on the internet, developers hide irrelevant internal details from the user. This is the definition of data abstraction. This data is used to conduct marketing and increase growth for B2B and B2C sales.
 
APIs are used to gather information from a wide variety of web sources and can help e-business and e-commerce operations collect the data they need.
 
A web crawler is used to browse web pages for the content of interest and to copy web pages for offline viewing and indexing. This content is often also used to populate a data warehouse.
 
To review, see What is Web Crawler?.


6b. Identify the different constructs of the Internet, such as web-crawling and web-enabled databases, and how they work together

  • What is an API, and how is it used?
  • What is a web crawler, and how is it used?

There are various approaches for extracting web data. Web data extraction is also known as web harvesting, web scraping, and screen scraping. It is commonly done through an Application Programming Interface (API), a set of definitions and protocols for building and integrating application software. APIs let a product or service communicate with other products and services without knowing how those other products have been implemented. This can simplify app development, saving time and money.
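As a rough sketch of the idea, the snippet below calls a hypothetical REST API with Python's standard library and reads its JSON response; the URL and the field names (products, name, price) are assumptions invented for this example, not a real service.

    # Hypothetical example: requesting data from a web API and parsing the JSON reply.
    import json
    import urllib.request

    url = "https://api.example.com/v1/products?category=green"   # hypothetical endpoint

    with urllib.request.urlopen(url) as response:        # send the HTTP GET request
        payload = json.loads(response.read().decode())   # parse the JSON body

    for product in payload.get("products", []):          # assumed response structure
        print(product.get("name"), product.get("price"))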
 
A web crawler is an automated script or program that browses the internet in a systematic and automated way. Web crawlers are also known as web spiders or web robots. Many internet websites, including search engines, use crawling to provide up-to-date data. Web crawlers copy web pages for processing later through a search engine. This allows users to find web pages quickly when using a search engine. Sometimes, web crawlers are used to extract other forms of information or data from websites.
 
The reviewed resource includes a figure illustrating the basic conceptual design of a web crawler.
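A minimal sketch of that design, using only Python's standard library, appears below. It fetches a page, extracts its links, resolves them, and queues them for further crawling while keeping a copy of each page for indexing. Real crawlers also respect robots.txt, throttle requests, and handle errors; those details are omitted here, and the seed URL is only a placeholder.

    # Hypothetical example: a tiny web crawler that copies pages and follows links.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":                        # collect the href of every anchor tag
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        queue, seen, pages = deque([seed_url]), set(), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = urlopen(url).read().decode("utf-8", errors="ignore")
            pages[url] = html                     # store the copied page for indexing
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))  # resolve relative links and queue them
        return pages

    # Example with a placeholder seed URL:
    # pages = crawl("https://www.example.com")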


A commonly used language for formatting files that store data on the internet is Extensible Markup Language (XML). This language includes rules for encoding documents, and the format is designed to be both human-readable and machine-processable.
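The short sketch below parses a small XML document with Python's standard xml.etree.ElementTree module; the order structure and its fields are invented for this illustration.

    # Hypothetical example: reading values out of a small XML document.
    import xml.etree.ElementTree as ET

    xml_data = """
    <order id="1001">
        <customer>Acme Corp</customer>
        <item sku="A-17" quantity="3">Green widget</item>
    </order>
    """

    root = ET.fromstring(xml_data)              # parse the XML text into an element tree
    print(root.attrib["id"])                    # -> 1001
    print(root.find("customer").text)           # -> Acme Corp
    for item in root.findall("item"):
        print(item.get("sku"), item.get("quantity"), item.text)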
 
To review, see Getting Data from the Web.


6c. Assess how clients are used to execute remote applications and access data

  • How are clients used to spread the computing load of large databases?
  • What are JSON and XML, and how are they used?

Clients are portions of programs that can be run remotely, on different computers from the main program. Typically, the main program, such as a distributed relational database, runs on a server. The client part of the program might run on a smaller machine, such as a customer's laptop computer, or even as a mobile application.
 
Websites are probably the most widely used and primary "big data" source. Organizations across every industry use technology to collect, store, and integrate consumer data from websites in database management systems (DBMS). This makes it easy to store web data in a structured format using rows and columns. JSON and XML are two formats that can be used in this process. JSON is lighter-weight than XML, with fewer markup requirements, but XML is generally more forgiving of poor formatting.
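To make the comparison concrete, the sketch below stores a record as JSON using Python's standard json module; the field names are invented for this example. Representing the same record in XML would require an opening and closing tag around every value, which is why JSON is described as the lighter format.

    # Hypothetical example: serializing and parsing a record as JSON.
    import json

    record = {
        "order_id": 1001,
        "customer": "Acme Corp",
        "items": [{"sku": "A-17", "quantity": 3}],
    }

    text = json.dumps(record, indent=2)   # serialize the Python dict to a JSON string
    restored = json.loads(text)           # parse the JSON string back into a dict
    print(restored["customer"])           # -> Acme Corp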
 
To review, see Getting Data from the Web.


Unit 6 Vocabulary 

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • application programming interface (API)
  • client
  • data abstraction
  • e-business
  • e-commerce
  • IP Address
  • JSON
  • web crawler
  • XML

Unit 7: Data Sharing

7a. Explain the advantages and disadvantages of sharing data

  • How is data managed in a shared environment?
  • What are some of the advantages and disadvantages of data sharing?

Data sharing is the process of making data available to others. This exchange process allows any organization or individual to use data or metadata. Over time, data sharing has become one of the most essential methods to encourage scientific and organizational progress. Today, organizations and institutions encourage a culture of openness, accountability, and secondary analysis of data. Data sharing is important because it promotes the cross-flow of information and builds partnerships between researchers and organizations. Data sharing allows others to further investigate previous research or reveal new insights about an event from previously collected data.
 
Data sharing is also common in commercial applications. Organizations share data for a wide variety of reasons. For example, an organization may outsource its payroll processing and then share data with the payroll processor, typically through an API. The degree of coordination between the organizations will vary. The simplest way of sharing data is through an API, with each organization maintaining its own systems and data structures. In other cases, the organizations may jointly design and operate a database management system to serve several organizations in the same industry.
 
When managing data in a shared environment, many users will have the ability to make changes to the data over time. Therefore, managing data integrity (the overall accuracy, completeness, and consistency of data) is a significant challenge in a shared-data environment. To address the degradation of data integrity over time, the organization managing the data must implement strong data governance and integrity policies and procedures. There will also need to be a regular review of the policies and agreements that document how data will be shared between organizations. These agreements can become even more complex when cloud service providers are also part of the shared data environment.
 
Data sharing among academic and research users is the ability to distribute exact information through multiple applications. Distributing and sharing data improves our ability to learn more about a topic. Organizations distribute data by publishing it on different platforms, which makes the data easier for other researchers and industries to locate; this is known as discoverability. However, it is essential to weigh the pros and cons of sharing data before distributing it through publication.
 
Challenges that arise from data sharing include the ownership of intellectual property, coordination of security systems to avoid data breaches, and the technical complexities of tight interconnectivity between heterogeneous systems. Sharing can also become challenging when the participating organizations are located in different countries or other legal jurisdictions, and the laws among the various jurisdictions are not harmonized. In this case, it may be necessary to comply with the laws of the most stringent jurisdiction, which may impose unnecessary costs on sharing participants in less stringent jurisdictions.
 
To review, see What is Data Sharing?.

 

7b. Identify data reuse, sharing, and access policies from funding agencies, institutions, and publishers

  • Why is data sharing and reuse important in research activities?
  • What are some of the considerations that funding agencies and publishers must consider?

Data sharing has significant advantages for research in government, academic institutions, and industry. First, researchers can further investigate or develop new concepts based on the foundation of previous research when research data is shared. Second, shared data can be more reliable when collected by other researchers. Third, data sharing reduces costs associated with collecting new data. However, data sharing also has disadvantages. Organizations and institutions must work together to agree on policies, guidelines, and data-sharing standards to counter data-sharing disadvantages. This is especially true when data is being shared among industry participants, who may want to keep results proprietary, and academic or government participants who desire to make data and research results freely available through publication to the general population.
 
Advocates for sharing data have existed for decades. The idea of using data collected for other purposes is known as secondary data analysis. An example might be pharmaceutical companies using data collected by public health authorities to detect disease patterns and facilitate the development of new treatments. Although this type of data sharing is widely accepted in science, barriers exist when using data from other sources in a more commercial setting. Some issues and barriers include concerns about data manipulation errors, data possessiveness, data documentation, and data management. When data is used and manipulated by several different entities, procedures must be put in place so that each user of the data can be confident that the data was properly handled by prior users. Also, many organizations view their data as an asset and are often unwilling to share it without receiving significant compensation.
 
International barriers and obstacles to sharing data also exist. These include language differences, legal differences, differences in technological advancement, and differences in data documentation and standards. There can also be cultural and other societal differences around how data is copied, how results are attributed, intellectual property law, and so on.
 
Laws, frameworks, and expectations about the rights of data providers, or those about whom data is being collected, are constantly evolving. These frameworks define certain rights that data subjects possess. Among them is that the customer or other data provider should always have the right to be informed about how information is to be collected and used. For example, the General Data Protection Regulation (GDPR) is a set of rules and legislation created in the European Union (EU). The legislation contains extensive regulations about handling citizens' personal information. These rules apply to any organization that handles the data of EU citizens, regardless of where the organization is located. The principles of the GDPR are being adopted worldwide, and every organization should consider how they will implement these principles in their data handling practices. In particular, there is an increasing focus on protecting personally identifiable information (PII), any data that could potentially identify a specific individual.
 
To review, see Better Data Sharing Rules.

 

7c. Assess issues/obstacles related to reusing and sharing of data, such as different legal, governance, and ethical systems

  • What are some differences in international law that relate to the management of data?
  • Why is security so important in the international sharing of data?

Local and international issues and barriers are associated with data sharing. Institutions and publishers continually revisit publishing requirements to address these concerns; revisiting these requirements is an ongoing process-improvement strategy that incorporates agreements on data sharing between organizations. In Europe, for example, the General Data Protection Regulation (GDPR) serves this role. Such regulations are often designed to protect data privacy. Data privacy is the branch of data management that deals with handling personal data in compliance with data protection laws, regulations, and general privacy best practices. Notice that while laws and regulations are usually specified very precisely, the notion of best practices can vary.
 
Companies need to better understand where data exists within their infrastructure, and not only data about people who live in the EU. They need to know where that data lives, who has access to it, how it is processed, who else it might be transmitted to, how to provide it to a data subject on request, and how to delete it on request. One of the most approachable ways to do this is by building diagrams. This exercise helps visualize how data flows into an organization, where it ends up, how it is used, who knows it is there, and where it is most vulnerable. It also helps organizations accomplish other important things, like designing disaster recovery tactics, incident response plans, and overall resilience. The effort involved in building a better understanding of how an organization works, and where it is most vulnerable, pays for itself in a crisis when unplanned events compromise productivity, reputations, and bottom lines.
 
Notice that database security is of even greater importance in a shared environment than in a closed one. The diversity of systems, networks, users, and administrators presents additional vulnerabilities, so the security mechanisms that prevent misuse must be even more robust. If the data is transmitted internationally, the number of interception points increases significantly. For this reason, special care must be taken, especially if the data travels over openly accessible telecommunications systems in countries that are not known for strong data security. There have even been examples of state espionage, where state actors and security services steal data to gain a competitive advantage for themselves or for companies closely aligned with the national government. Security techniques like encryption (converting information or data into a code, especially to prevent unauthorized access) of transmitted data can reduce but not eliminate this threat.
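As a loose illustration of encrypting data before transmission, the sketch below uses the third-party cryptography package (an assumption; it must be installed separately, and the key distribution between sharing partners is not shown).

    # Hypothetical example: symmetric encryption of data before it is transmitted.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()      # secret key shared by sender and receiver
    cipher = Fernet(key)

    token = cipher.encrypt(b"customer_id=1001,balance=250.00")   # unreadable in transit
    original = cipher.decrypt(token)                             # only key holders can recover it
    print(original.decode())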
 
To review, see Equitable Design.

 

7d. Explain how open access, open science, and open data can lead to process improvements and higher levels of cooperation

  • How does open data lead to improved processes and technologies?
  • What legal structures are in place to balance the needs of developers and users of new technology?

Open access to data and open scientific inquiry can lead to several significant benefits. Indeed, the pace of scientific advancement can be increased if researchers are more willing to collaborate. In addition, if the research results, such as new technologies, are made available to others, the whole economy and society can benefit. Having access to other innovators' data and research results can also often take innovations in new and unanticipated directions. For example, the developer of a new compression algorithm may not have considered how such a technology might improve the process of video distribution in the entertainment industry.
 
The challenge can be constructing legal systems that motivate people to cooperate and share but that do not inhibit the motivation to conduct research in the first place. While many researchers, especially in academic institutions, may be motivated to create new knowledge for the sake of knowledge alone, many are doing so with an expectation that they will retain the financial benefits of new processes, technology, and so forth. This is especially true in corporations, where process and technology improvements are significant ways to increase competitive advantage and profitability.
 
Most nations have created carefully designed legal systems, such as patents, to balance these competing motivations. A patent is a formal grant of rights to an inventor by a government entity. In the United States, for example, the original creator of a new technology or process can make that innovation available to the larger society while retaining a significant financial interest through the patent and trademark system. This typically takes the form of a license, where the inventor agrees to share the technology with others in exchange for a fee. Users of the technology pay this fee to access and use it, and thereby avoid the infringement lawsuit that could result from using it without a license. To balance the needs of society as a whole, this ownership and licensing structure has a time limit. After the time limit has expired, the technology enters the public domain and may be used by anyone.

To review, see Interoperability and Data Sharing.

 

Unit 7 Vocabulary

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • data integrity
  • data privacy
  • data sharing
  • encryption
  • General Data Protection Regulation (GDPR)
  • license
  • personally identifiable information (PII)
  • patent
  • state espionage

Unit 8: Data-Driven Uses and Misuses

8a. Describe how data warehouses bring related information from disparate databases so that it can be analyzed

  • Why do we use data warehouses?
  • How is data brought into a data warehouse from internal and external sources?

The fundamental purpose of a data warehouse is to store data extracted from internal transaction processing systems and external sources. The data is reformatted to meet the needs of business intelligence and the other systems that will use it. The data warehouse is not integrated with the transaction processing systems and does not contain their live operational data. A fundamental reason organizations use data warehouses is that the warehouse does not interfere with the operations of the transaction processing systems. In addition, the data warehouse may or may not be segmented into specialized data marts.
 
Data mining is a technique that allows for the modeling and discovery of patterns and information within data. Along with data warehouses, data mining arose primarily to address some limitations of online transaction processing (OLTP) systems.
 
Data mining is often implemented to populate a data warehouse. Data mining evolved as a mechanism to address the limitations of transaction processing systems in dealing with massive data sets with high dimensionality, new data types, and multiple heterogeneous data sources. Data mining systems need to be able to perform data extraction from the wide diversity of systems and data formats likely to be found in any organizational environment. For example, a data mining system must be able not only to extract data from a well-designed relational database, but also to extract data such as audio, video, and free-form text comments on social media from sources that may be organized in a completely different way or not organized at all.
 
To review, see Data Warehousing and Data Mining.

 

8b. Explain what data-driven systems can achieve in an organization

  • How do data warehouses support data-driven decision-making?
  • How do decision support and other systems use data to improve decision-making?
  • What role do Key Performance Indicators (KPIs) play in making data-driven decisions?

Once an organization has established a data warehouse, it can use the stored data to make better-informed managerial and operating decisions. Decisions can either be automated or made by managers using business intelligence or decision support systems. For example, a system that identifies products a customer might be interested in, based on what they have already bought, and sends them an email is an example of an automated decision. Such a decision does not require input from any human and thus can be executed very rapidly, often in real time. Managerial decision-making, on the other hand, requires that a human make the decision. To improve the quality of managerial decision-making, the data warehouse feeds other types of tools, like decision support systems, that are explicitly designed to support humans. There is extensive evidence that humans make better decisions when they have access to better data and information and use tools to assist them in processing, modeling, and analyzing that data.
 
To achieve objectives, managers need to track how progress is being made against goals. However, these goals need to be relevant to the business setting. If the manager is achieving goals, but they are the wrong goals, that can be worse than not having any goals. For this reason, a common component of organizational strategic planning is identifying critical success factors. The idea of a critical success factor is that these are the small number of strategic factors that will determine the enterprise's success. Things like profits, sales and revenue, customer and employee satisfaction, time to develop new products, and so forth are all commonly encountered success factors.
 
Critical success factors are best accomplished when the success factors can be measured. Measurement allows for the establishment of targets and benchmarks and allows for a higher degree of precision. Setting business Key Performance Indicators (KPIs) is one of the most effective ways to do this. For KPIs to be effective, it's essential that managers consider what information is critical and that the KPIs have certain characteristics.
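As a simple sketch of the idea, the snippet below compares a couple of KPIs against targets; the KPI names, values, and targets are assumptions made up for this example, not figures from the course.

    # Hypothetical example: checking KPIs against their targets.
    monthly_sales = [118_000, 94_500, 131_200]       # revenue for the last three months
    kpis = {
        "average_monthly_sales": sum(monthly_sales) / len(monthly_sales),
        "customer_satisfaction": 4.2,                # average survey score out of 5
    }
    targets = {"average_monthly_sales": 120_000, "customer_satisfaction": 4.0}

    for name, value in kpis.items():
        status = "on target" if value >= targets[name] else "below target"
        print(f"{name}: {value:,.1f} ({status})")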


To review, see Data-Driven Decisions.

 

8c. Identify the most widely-used data warehousing architectures and schemas and how they are applied in practice

  • What are some of the most common data warehouse architectures?
  • How do data warehouses manage the vast diversity of data types they must store?

The reviewed resource includes a figure that provides a high-level, conceptual representation of the data warehousing and management architecture and process. Data is fed to the data warehouse from various sources internal and external to the organization. To bring in data from sources external to the organization, we rely on techniques like APIs and web crawlers. Collecting data from external sources also means that the data will arrive in many different formats and must be adapted for inclusion in the warehouse; this adaptation happens in the staging phase.
 
In the staging phase, we take data in different formats and manipulate it so that it can be included in the data warehouse. In addition to data in relational form that may have come from internal transaction processing databases, we may need to include external data in the form of audio files, images, video files, free-form text, and so forth. For example, if we store feedback that our customers have left about us on social media platforms, we need a way to organize and sort it. We may use things like keywords (words or phrases that users type into search engines to find relevant content for their queries), tags (terms assigned to a piece of information), and so on to help us, but in general this data will not be well formatted.
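The sketch below illustrates one small staging task of this kind: assigning tags to free-form customer feedback by simple keyword matching before the rows are loaded into the warehouse. The feedback text and keyword lists are invented for this example, and real systems would use far more sophisticated text processing.

    # Hypothetical example: tagging unstructured feedback during the staging phase.
    feedback = [
        "Shipping was slow but the green widget works great",
        "Support never answered my email about a refund",
    ]
    tag_keywords = {
        "delivery": ["shipping", "delivery", "slow"],
        "support": ["support", "email", "refund"],
        "product": ["widget", "works", "broken"],
    }

    staged_rows = []
    for text in feedback:
        lowered = text.lower()
        tags = [tag for tag, keys in tag_keywords.items() if any(k in lowered for k in keys)]
        staged_rows.append({"comment": text, "tags": tags})   # row ready for loading

    for row in staged_rows:
        print(row["tags"], "->", row["comment"])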
 
The warehouse itself may be subdivided into smaller sections called data marts. The structure of these data marts would depend on the user's needs. By using smaller data marts for specific types of problems – for example, one for finance, one for marketing, one for customer service, and so on – we can reduce the amount of time and system resources that any given user might need to find a particular piece of information relevant to a problem or decision that they are working on.
 
Notice that the data warehouse itself will store different kinds of data. Examples are raw data, which may be quite unstructured; summary data, which is raw data that has been aggregated or otherwise summarized; and metadata, or data about the data. These different types of data each require different techniques for management and storage, and the data warehouse has to be able to accommodate these different techniques.

The bottom tier of a three-tier data warehouse architecture is a warehouse database server, typically a relational database. Various tools and utilities, called back-end tools, feed records into this bottom tier. Operational databases are often the source of this data, but records can also come from other sources, often external to the organization. The middle tier is an online analytic processing (OLAP) server, implemented using either a relational or a multidimensional OLAP model. This middle tier serves as the data management subsystem of a decision support system. The front-end, or client, layer that interfaces with the user is called the top tier. The top tier usually includes query and reporting, analysis, and data mining tools.
 
Another type of tool that is sometimes included in the definition of a data warehouse at the top tier is an analytic dashboard. A significant challenge in supporting human decision-makers is assisting the decision-maker in forming a high-level mental model of the system they are making decisions about. A dashboard can aid in visualization to allow decision-makers to gain this kind of intuitive understanding of the real-world processes their decisions will influence.
 
To review, see Data Warehouse Architecture and Schemas.

 

8d. Use data warehouse and mining techniques to support decision-making based on business objectives

  • How is data mining used to populate a data warehouse?
  • How are data mining and data warehouses used to support managerial decision-making?

We store big data in the data warehouse and then use data mining techniques to extract data for use by business intelligence systems to support decision-making. Data mining systems are designed to find patterns and correlations in data from data warehouses and generally prepare data for use in the decision support systems used by decision-makers.
 
Along with data warehouses, data mining arose primarily to address some limitations of online transaction processing (OLTP) systems, which are designed primarily to automate business processes, such as customer order entry. OLTP systems were not designed specifically to support decision-making. While it is true that we can run SQL queries against an OLTP system, doing so presents several problems. The first is that SQL queries can slow down the operations of the database, which is the last thing we need in an OLTP system.
 
The second problem is that data in an OLTP system is often much more detailed and granular than is needed for decision-making. In making decisions, we might be interested in monthly sales, for example, while the OLTP system may process thousands of transactions a second. What we want is summary data. Finally, an OLTP system will not contain data from outside the organization, yet this data is required for decision-making. Hence the need for a data warehouse.
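A brief sketch of producing summary data appears below: individual transactions are rolled up into monthly sales totals with an in-memory SQLite database. The table, columns, and figures are invented for this example.

    # Hypothetical example: summarizing detailed transactions into monthly sales.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("2024-01-15", 120.0), ("2024-01-28", 75.5), ("2024-02-03", 210.0)],
    )

    # Roll detailed transactions up to the monthly level needed for decision-making.
    rows = conn.execute(
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS monthly_sales "
        "FROM orders GROUP BY month ORDER BY month"
    ).fetchall()

    for month, total in rows:
        print(month, total)   # -> 2024-01 195.5 and 2024-02 210.0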
 
To begin making data-driven decisions, the organization must start with a clear objective for what it is trying to accomplish. It could be increased sales, reduced manufacturing costs, improved process efficiency, or any other measurable outcome. Once the objectives have been determined, the organization gathers and analyzes the available data to make decisions. After the decision is implemented, it is essential to determine whether the results validate the analysis.
 
Business intelligence (BI) combines analytics, data warehousing and mining, visualization, and data infrastructure to help organizations make effective data-driven decisions. This is a different focus from information management and data processing. Information management generally includes the management and reporting of transactions and the operational systems necessary to operate the business, which differs from BI's focus on support for decision-making. Data processing, as generally defined, involves using relational database systems and transaction processing, along with some use of SQL for queries and reports.
 
Since the advent of data-driven decision-making in the 1950s, business intelligence has striven to collect and analyze as much relevant data as possible. BI focuses on the data needed to make decisions, which is not necessarily only the most current data; historical data is often needed to make forecasts and support decision-making. The focus of a BI system is not on the sheer amount and variety of data but rather on the data most relevant to the decision-maker's needs.
 
To support the needs of business intelligence and decision-making systems, the DBA must ensure that the data stored in operational and transaction processing systems can be extracted and moved to the BI system's data warehouse. This extraction process must also allow for the conversion of the operational data into whatever format meets the needs of the warehouse and the BI system.
 
To review, see Data Mining Techniques in Analyzing Process.

 

Unit 8 Vocabulary

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • automated decision
  • business intelligence
  • critical success factor
  • data extraction
  • data mart
  • data mining
  • decision support system
  • key performance indicator (KPI)
  • keyword
  • information management
  • managerial decision-making
  • online analytic processing (OLAP)
  • online transaction processing (OLTP)
  • operational database
  • raw data
  • summary data
  • tag
  • three-tier data warehouse architecture