CS403 Study Guide

Site: Saylor Academy
Course: CS403: Introduction to Modern Database Systems
Book: CS403 Study Guide
Printed by: Guest user
Date: Thursday, April 25, 2024, 5:36 AM

Navigating the Study Guide

Study Guide Structure

In this study guide, the sections in each unit (1a., 1b., etc.) are the learning outcomes of that unit. 

Beneath each learning outcome are:

  • questions for you to answer independently
  • and a brief summary of the learning outcome topic with linked resources.

At the end of each unit, there is also a list of suggested vocabulary words.

 

How to Use the Study Guide

  1. Review the entire course by reading the learning outcome summaries and the linked resources.
  2. Test your understanding of the course information by answering questions related to each unit learning outcome and defining and memorizing the vocabulary words at the end of each unit.

By clicking on the gear button on the top right of the screen, you can print the study guide. Then you can make notes, highlight, and underline as you work.

Through reviewing and completing the study guide, you should gain a deeper understanding of each learning outcome in the course and be better prepared for the final exam!

Unit 1: Introduction to Modern Database Systems

1a. explain the difference between data and information and give examples of each

From your study of the prerequisites for this course – Introduction to Computer Science and Elementary Data Structures – you were introduced to data types in the context of programming, and you have worked with numerous examples of them. 

A data type is a definition of properties for a group of values that enables program computations to be performed on those values. The values are referred to as program data. Some programs perform computations on large amounts of data. If we were to extract that data and store it apart from its program, we would have a simple database. In this course, we work with large collections of data, but the context is no longer a specific program. What is the context for our collections of data? It's the corporation, the organization, the enterprise, an industry, a domain of application - however, large.

  • What is the relationship between data and information?
  • Is there a simple answer to this question? 
  • Is this just a question of terminology, or is there a greater distinction that will affect all the other units of this course?

To address these questions, first, define data, and then define information. What are data and information? To review, read Chapter 1 in Database Design, which also identifies problems when an application is dependent on large amounts of data. How do databases help address those problems? Also, be sure to read the definitions of data and information on page 1-3 of Database Systems for Management.

The difference between data and information is not simple. It is not just a question of terminology. It involves important concepts of computer science, including programming language syntax, semantics, application domain modeling, and computational problem-solving. Information is data plus context. Context includes how data is represented (its data type or syntax), how it relates to other data, objects, concepts, or actions (its semantics), and the computations that data enables to solve problems (to use the data). Often the distinction between data and information is relative because it depends on the use of the data.

Think of a variety of different data types and values. Which of the examples are information, which are data, and why? The semantics associated with the data determines whether or not data is useful, and, therefore, whether or not it is information. Note that 'useful' is relative: what is useful to you may not be useful to me. To review, read this chapter.

 

1b. contrast file processing systems and database systems, relating problems of the former that are addressed by features of the latter

  • Why are database systems more useful than file processing systems? 
  • What are some of the problems with file processing? Are they solved by database systems? 

Early applications stored large amounts of data in files, which were used by one or more related programs to perform computations. This required semantic dependencies among the files and programs, resulting in significant effort when a change was made. Fewer dependencies result in easier changes. However, larger and more complicated applications introduce many dependencies. How would you remove dependencies so that many programs could use easily use the data in the files? If the semantic dependencies were removed from the programs and stored with the data files, then changes to the data would not require changes to the programs and more programs could more easily use the data. How would you do that? You could add semantics to the data to make it usable (that is, to make it information). How could you turn data into information? Simply by adding relationships among the data and descriptions of the data – that is, properties, restrictions, and constraints on the data. These are called metadata. To review, study the diagram in Chapter 2 of Database Design.

In addition to high maintenance costs, there are other limitations of file-based systems that prompted the development of databases. What are some of these limitations? Review Disadvantages of the File-Based Approach on page 1. Disadvantages of file-based applications are balanced by advantages of database systems, as described on pages 109-115 under "Objectives of Database Systems" and "User's View".

 

1c. describe what a database management system is and demonstrate how it functions

  • Why does one need a DBMS?
  • What are the functions of a database as compared to the functions of a DBMS?

A database helps address the challenges we encounter when we use data, particularly large collections of data. Solutions to those challenges involve software that is not part of the database itself and is implemented as a logically separate and related subsystem called a database management system. Review these in the second diagram in Chapter 2.

When we need to perform a task, we first plan the task, determine roles, assign responsibilities, perform the tasks, control the tasks, and overview the performance of the tasks. These are generic management functions. Specific management functions for using a database are performed by a database management system. What are some database specific management functions? Look at the description of a database on page 6, under "Database Properties", which lists properties of a database. Do the second and third bullets describe what is in a database? What functions of a DBMS correspond to those two bullet points?

The functions of a DBMS derive from the requirements for a database and are often listed as necessary properties of a database. Those properties are the positive counterparts to the challenges (negative counterparts) we encounter when we use large collections of data. What are some functions of a DBMS that derive from the necessary (positive) properties of a database? Review pages 1-8 to 1-14 about the Objectives of a Database System, which can serve as an overview of our requirements of a DBMS. Some of the requirements and functions of a DBMS, like concurrency, are supported by the operating system and other related systems. Those other systems are applications that utilize the database, as seen in Database Systems and Other Organizational Information Systems on page 1-5. Be sure to review Figure 1-3 on page 22. Where would you insert 'Operating System' into that diagram?

 

1d. compare the various database models

A database is a shared collection of common data plus context. Phrased more formally, a database is a representation of a model (actually, several models). If we wanted to rephrase to use a hierarchical software engineering view of the development activities of a software system, we'd identify requirements analysis, design, implementation, and operation. These each have a corresponding hierarchy of data models. The corresponding data model hierarchy consists of an application domain model, conceptual model, logical model, and physical model. Unit 1 introduces us to several of these models. The terminology for common data models varies and is more specific than the generic software engineering names. 

  • What benefit do data models provide us?
  • What are some types of data models?
  • What are some common database models?

The software engineering and database levels are levels of abstraction that we use to understand an operational database and its development. It is primarily a top-down approach that builds understanding, provides a framework, and guides efficient and effective database development. A database can be viewed from user or external, or internal perspectives, each of which has a corresponding type of model. Database development proceeds from a user view (application domain data model), to design views (conceptual and logical database models), to an internal view (physical database model). Thus, database models provide us abstraction levels that we use to understand databases and to develop databases.

To review, see the second paragraph on page 2-7 and 2-8 on "Database Design Techniques". Also review "Degrees of Abstraction" on pages 15-18. Note Figure 5-1 on page 17. 'Internal view' is a term often used for physical view. We have terms for software engineering database development phases, for views, for data model types, and for specific database models. Can you relate all those terms? Where do the terms schema and subschema fit in the hierarchies? What are some specific common database models? Here is a table that summarizes these ideas:

 

Unit 1 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Think of the list of terms as a data dictionary, of pointers to key topics in the course content. 

Try to think of the reason why each term is included. 

  • File processing system or file-base system
  • Data
  • Abstraction
    • Redundancy
    • Integrity
  • Levels
  • Semantics 
  • Information
  • Database
  • Database management system (DBMS)
  • Data model
  • Database model
  • Database properties
    • Consistency
    • Isolation
  • Schema
  • Subschema
  • Metadata
  • View
  • Conceptual model
  • Entity-Relationship model
  • Hierarchical database model
  • Network database model
  • Relational database model

Unit 2: Database Architecture and Data Languages

2a. assess the three-schema database architecture, comparing levels relative to roles, data independence, and abstraction

Two important principles that we encounter throughout computer science and software engineering are abstraction and hierarchy. Here we see them in the form of the ANSI-SPARC (American National Standards Institute/Standards Planning and Requirements Committee) three-level architecture for database systems.

  • Using abstraction and hierarchy, explain the three views of the ANSI-SPARC standard architecture for database systems?
  • What benefits does that architecture provide? 

The table in the study guide for Unit 1 presents a picture of a three-level hierarchy of software engineering development phases, views, and data models. The middle and right columns of that table show the 3 levels of the ANSI-SPARC database architecture. What is the software engineering name for each of the 3 levels? What is the ANSI-SPARC view name for each level? What is the model name for each level?

A slightly different (but compatible) hierarchy diagram is shown in this article. How does this diagram differ from the hierarchy table in the study guide for Unit 1? What is the relationship between adjacent levels of each hierarchy (software engineering, view, and model)? For example, what is the relationship of requirements to designs, 1:1 or 1:n? What is the relationship of designs to implementations, 1:1 or 1:n? The same question can be asked of the view hierarchy and the model hierarchy; the answers are the same for each hierarchy.

Think about this relative to both software engineering and to the view and model hierarchies. An application has numerous requirements. There is no unique design that satisfies a given requirement, and there is no unique implementation for a given design. If we define a design for each requirement and then integrate them, we have one overall design. If we define an implementation for each design and then integrate them, we have one implementation. 

As we move down the hierarchy, from level to level, detail is added while maintaining the semantics of the higher level (its data types, data relationships, and constraints). A higher level is more abstract than lower levels. Abstractions provide a guided framework that helps us move from the application or problem domain to a desired technology solution domain. Each abstraction level has common types of models that are applicable to each level. For example, an Entity-Relationship model is applicable to application domains. What are some specific models for Detailed Design / Logical Database Design / Logical Model? To review, see the table from the study guide for Unit 1, this diagram, and pages 15-17 of Database Design.

Another benefit of abstraction pertains to maintenance. What is that benefit? To review, read page 18 of Database Design and page 1-11 of Database Systems for Management. Note that 'architecture' as used here means hierarchy, and embodies several additional concepts.

Roles are also included in this unit. In the real world, we frequently use goals to help us focus on results. In addition, we utilize processes to help us determine what we need to achieve the results. Further, we utilize procedures to help us determine how we could achieve the results. In the real world, this general approach is called problem-solving; in business, it has been called business process; in software engineering, it is called software process.

The definition of a process includes: 

  1. purpose of the process
  2. roles, responsibilities, and resources
  3. a list of the work products used and produced by the process
  4. ways to monitor, review, measure, and check the performance of the process

The software engineering terminology for these parts of a process definition are:

  1. requirements
  2. project planning
  3. software work products (also called artifacts or elements)
  4. verification and validation

Roles are included in step 2. What are some roles in the development and use of a database? To review, read page 1-5 of Database Systems for Management. Database plans are discussed on page 2-3, and roles on pages 2-11 to 2-15. The other aspects of a process are discussed throughout Chapter 2. As you read, also think about the levels at which the user, management (top and systems), database administrator, information analyst, database analyst, and programmer roles occur.

 

2b. use examples to describe and differentiate between tables, views, and indexes

  • What are the elements of a database?
  • What are some database specific models?
  • What are: tables, views, and indexes?

A database is composed of data elements: fields, records, and tables. A data item of an application domain is represented using a data type, which is stored in digital memory, called a field. A combination of related fields is called a record. A collection of records having a defined relationship is called a table. A collection of tables is called a database. In the relational model, a table is also called a 'relation'. To review, read pages 13-14 of Database Design.

A very thorough discussion of three specific database models is presented in "Database Systems for Management". In the early history of databases in the early 1970s, there were competing implementations and designs for database systems. The American National Standards Institute 

Formed the software planning and requirements committee (SPARC) to specify a set of requirements that expressed the needs of organizations. To get started, SPARC formed a study group on database management systems to examine features of competing systems. The study group identified goals for database management systems, such as ease of use, data security, data integrity, control of redundancy, data independence. Motivated by the realization that the evolution of database technology was progressing, the study group proposed a database design framework that supported the database goals and provided a high degree of independence to facilitate continuous improvement from evolutionary changes in the technology. The ANSI/SPARC framework has 3 levels, called views, where each view is independent of changes in the other views. 

The ANSI/SPARC views are summarized in Database Systems for Management, pages 4-1 to 4-3. Three early database models for the conceptual view are described on pages 4-4 to 4-7.

A view is a perspective that focuses on a subset of the functions or records of a database. A view is sometimes called a schema when used in the context of a data or database model. A schema is a data model (such as the E-R model) or a database model (such as hierarchical, network, or relational model) description for a level (user/requirements, design, or physical/implementation level). A subschema is a model description for a subset of the data of a database. A subschema definition is contained in the definition of a schema. You can review schemas and subschemas for each level of the ANSI/SPARC 3 level architecture and the roles that use them, as well as specific database model schema/subschema descriptions and examples of user view schemas, on pages 4-2 to 4-11 of Database Systems for Management.

Pages 10-1 to 10-13 of Database Systems for Management present a thorough discussion of topics relevant to the physical or internal view. Data structure implementation decisions directly affect the efficiency of database performance. Does database design affect efficiency? How? An index (also called a key) is a field (a column or an attribute in the relational model) that is used to efficiently locate records having a specific value for that field. You can review indices in physical models on pages 15-18 of Database Design. Data Definition and Data Manipulation commands, covered in later units of this course, define keys and utilize them to find information.

 

Unit 2 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • Schema
  • Database architecture
  • Roles
  • Data independence
  • Abstraction
  • Table
  • View
  • Index
  • Database administrator
  • Database analyst
  • Top-level management
  • Project management
  • Database plan
  • Users
  • Subschema
  • Effectiveness
  • Efficiency

Unit 3: Database History

3a. discuss the evolution of database technology and relate it to the evolution of computing

  • What are some major milestones in the history of databases?
  • What are specific drivers or causes that led to those milestones?
  • Outline the evolution of data models from about 1960 to the present. 

Databases as we know them arose around the year 1900, when early computing technology in the form of mechanical tabulating equipment was used for calculating data on punched cards. Punched paper tape soon followed, and then magnetic tape and disks. Around 1950, large data processing computers began to become available. Computing capability increased significantly around this time, as large data processing computers were quickly accepted by business and finance to automate record and file systems. Manufacturing, engineering, and scientific applications arose shortly afterwards, and required manipulating ever-increasing amounts of data.

Early computers had a few registers, just thousands of words of expensive main memory (a typical large computer of the 1960s had 64k or 128k of memory), and relied on slow external media for large amounts of application data. This need for more, cheaper, and faster memory led to improvements in mechanical and magnetic storage media, optical and laser devices, and semiconductor memory. These advances in media technology found rapid application for larger and more complex systems, which revealed the limitations of the file-based paradigm and the need for new models to overcome those limitations. What are some of those 'new' models and the limitations they overcame? To review, read A Brief History of Databases.

Technology milestones usually involve changes in data models or in the technology that implements them. These changes are driven by the need for larger scope, speed, better accessibility, or a higher level of abstraction – which has an associated reduction in effort. The period from 1960 to the 1980s saw the development of database models to satisfy these needs for the applications of that period. What models were developed in that time period? What needs did they address? Even with the evolution of database technology, those needs still persist, because the problems have gotten larger and more complex and the computing technology has changed.

In the 1980s a new programming paradigm, called object-oriented programming, became popular and, soon after, a new conceptual model – the object model, which provided a database model that satisfied the need for compatibility with object-oriented systems. Object database models are not as widely used as other models, in particular the relational model, because of their dependence on object programming languages or APIs (application program interfaces). All models, except for the relational model, provided solutions for the representation and storage of data at the expense of complicating queries. Which query languages are used for relational databases?

In the first decade of the 2000s, the internet boom brought exponential growth in the amount of all types of data and in the number of users for OLTP (online transaction processing), which in turn motivated the need for new data models. This period did not see the end of relational and SQL database systems, but there candidate new models did appear. The 2000s also saw the rise of distributed databases, data warehouses, and OLAP (online analytical processing). Relational databases and SQL were considered too slow for the scale and complexity of these applications. NoSQL databases emerged. NoSQL DBMSes improved performance at the expense of complete ACID features (Atomicity, Consistency, Isolation, and Durability). These features were not considered required, but just desirable. NewSQL data models emerged in the 2010s. What needs did they satisfy? What is a Hybrid DBMS and what desires does it address?

The history of databases has been a recurring loop, going as follows:

  1. data needs
  2. database model solutions for current applications
  3. advances in computing
  4. new applications
  5. go to step 2 for the new applications, but keeping 1 unchanged if possible.

Sometimes the requirements are relaxed to make some gains. The needs for database systems don't often change. They are generic and are often summarized by the acronym ACID: Atomic, Consistent, Isolated, and Durable. Some newer database systems relaxed some of these features to achieve acceptable performance for new applications. This 'relaxation' is summarized as BASE, meaning 'BAsically available', 'Soft state' (not always consistent), and 'Eventually consistent'. We'll go into these features in more detail in later units, but start to think about them now.

To review, watch History of Databases.

 

3b. evaluate the four common data models relative to significant criteria for a selected application

  • What are some common data models? What are the four historical most common?
  • What level of the three levels of data architecture do they correspond to?
  • What are some significant features that represent common data requirements and needs?
  • What features are satisfied by the four historical common models? What features are not satisfied by them?

The most common four data models are hierarchical, network, relational, and object models. Each is capable for organizing data, storing and retrieving it, and satisfying generic database goals for some applications. Each of these models is primarily a model for database design. However, some also correspond to the internal/physical, or application dependencies. Hierarchical and network are best suited to problems that utilize tree/graph data structures. Object models have a dependency on object languages and APIs (application program interfaces). The relational model supports user views, conceptual and logical design, and adheres to the independence of the ANSI/SPARC framework. Moreover, it has a formal, mathematical foundation that enables SQL for data definition and manipulation. Can you describe the four common data models?

Unlike the file-base model, all four data structures (tree, network, relational, object) are independent of the application. Object models have some language dependencies. A hierarchical, network, or relational database for a given domain can be used by many applications. An object database, however, should be used by an object-based application for consistency of the domain model with the design model and access to proprietary program language libraries. 

A hierarchical data model uses a tree structure to organize data, which is a natural representation for many real-world applications. The hierarchy order is used as the physical order of the data on the storage media. The hierarchy is accessed by traversing the paths of the tree. Thus, there is some dependency of the physical structure on the design structure.

A network model extends a hierarchical model by allowing a node to have more than one parent, which enables redundant data to be stored efficiently in a network model (a data item on more than one path can be represented once and referenced whenever it is on a path). Thus, the data is organized as a graph structure. The paths in the graph structure specify the order in which the data nodes are accessed in physical storage. Thus, the physical implementation of a network database is dependent on the network design model.

The elements of a relational model are tables, rows, columns, and table relationships. The implementation of those elements is not specified by the model, and therefore the physical model is independent of the relational design model.

An object design model does not specify the organization or order of the data in the physical model. For operational efficiency, the data design model and the application object model should be compatible, and their physical models should also be compatible with each other. 

Redundancy, data dependency, application dependency, and natural or efficient representation of an application domain can be used to evaluate a data model. Can you think of some other criteria that might be used? Features or criteria are just general requirements, and are often called quality requirements. The development of a database system starts with requirements, and the design model selected is the one that best satisfies the requirements. Database requirements originate with sponsors and stakeholders and include functions, performance, quality requirements. Functions depend on the type of systems that will use the database – for example, information systems, transaction processing, or decision support. Performance (that is, the time to process a transaction or search, or the storage space) depends on how the data is organized, the size of the database, the complexity of the relationships, and how the database is used (such as for storage and retrieval, transaction process, searches, discovery of relationships, or analysis). What sets of requirements might be satisfied by a hierarchical model? A network model? A relational model? An object model?

To review, read pages 113-116 of Database Design and consider the example for student, enrollment, and course entities. How might you outline a hierarchical design for those relationships? A network design for those relationships? An object design for those relationships? We will be studying relational models in great detail in later units.

To review, also see Database Models, Hierarchical Model, Network Model, Relational Model, and Object Model.

 

Unit 3 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • Punched card
  • Punched paper tape
  • Magnetic tape
  • Data Drum
  • File System
  • Hierarchical Model
  • Network Model
  • Relational Model
  • Object Model
  • IMS
  • DB2
  • Oracle
  • SQL
  • NoSQL
  • NewSQL
  • Criteria
  • Database features
  • Database requirements
  • ACID
  • BASE
  • OLAP
  • OLTP

Unit 4: The Entity-Relationship Model

4a. discuss the need for an entity-relationship model

  • What are the purpose and the benefit of an E-R model?
  • What are the symbols for entities and relationships, including their syntax and semantics?
  • What is an example of a high level E-R model?

An E-R model stands for Entity-Relationship model. It views the world as a collection of entities (things or objects) and the relationships among them. An entity can be a physical object in the real world or a concept, and is represented by a noun. Relationships or relations are associations between entities. The term 'relation' refers to the manner in which the entities are associated; the term 'relationship' refers to the association itself. Relations are represented by verbs. Thus, an E-R model focuses on the whats of the world, and is well-suited to model data. Actions of the real world are represented by relationships.

The context for discussing E-R models is the database development process. This process is an adaptation of the software development process: database requirements analysis that produces the requirements and conceptual model and the user schemas; database design that produces the database design or logical model and design schemas; database implementation that produces the internal or physical model and the database code. Which database development work product produces an E-R diagram and which stakeholders read it? Can you give a small example of an E-R model? What is a single characteristic that distinguishes an entity?

To review, watch this video on Conceptual Design and read pages 113-116 of Database Design.

 

4b. explain the entity-relationship model and all of its associated symbols

  • What additional information in the domain of application should be represented in an E-R model?
  • Describe additional symbols used in E-R modeling to capture that information?

In addition to the main symbols for an entity and a relation, additional symbols are needed to represent information in real-world domains of application. Entities have names that are nouns. In our daily speaking and writing, we use qualifiers to add information to constrain nouns. Qualifiers, which we call constraints, are represented in an E-R model by attribute types, entity types, entity sets, derived attributes, and key attributes. A key attribute uniquely identifies one and only entity; each entity is uniquely identified by a key attribute value. An entity can be a weak or strong entity. The values of an entity and of attributes are constrained by a domain of values. How large, relatively, is a domain of a set of attributes compared to the domain of a single-valued attribute? It is multiplicatively larger. If an entity has 4 attributes, A1, A2, A3, A4, the domain for the values of the set is the cartesian product of the value sets of the 4 attributes; and the size of the value set is the product of the sizes of the 4 value sets. In practice, large databases can have hundreds of attributes. 

Relationships constrained by relationship type, degree of a relationship, attributes. Attributes constrain an entity; in the same manner, attributes can constrain a relationship. Note that as we study several models, the boundary between entity and relationship can get blurred. Indeed, in the relational model an entity, which is the model calls a table, is a relationship. Also, in some models, relationships are represented as attributes, such as in function modeling. In object modeling, relationships are attributes of an object, which are a reference to another object. In relational modeling, this is called a foreign key attribute or reference.

 

4c. describe relationship constraints

  • What are some other ways of constraining a relationship? Give an example of each.

Typically, an entity represents a group or class of individual instances and is similar to a type. Relationships can also be constrained by attributes. Cardinality and participation are two other ways of constraining a relationship. When two entities are related (say A is related to B), two obvious questions are "how many instances of B are related to a given instance of A?" (this is the perspective from A to B) and "how many instances of A are related to a given instance of B?" (this is the perspective from B to A). If the answer is 1, we say the cardinality is 1:1. If the answer is n, we say the cardinality is 1:n. If the answer to the first question cannot be 0, we say that the existence of an instance of A depends on the existence of an instance of the relationship, or that A totally participates in the relationship. If the answer to the second question cannot be 0, we say that the existence of an instance of B depends on the existence of an instance of the relationship, or that B totally participates in the relationship.

How are cardinality and participation constraints represented in an E-R diagram? What is an identifying relationship; what cardinality characterizes an identifying relationship? To review, watch this video from 43:00 to 51:00.

The above video gives a good overall review of the E-R model. In addition, this unit has several good written resources that you should review. The other resources for Unit 4 provide helpful suggestions and complementary explanations. When you are studying a unit, you should create a small directory of the resources you find helpful, annotate them with a few keywords on important points, and jot down a few notes of what is important to you. They will be very valuable when you review for the final exam. 

An E-R model is a representation of an application domain, in terms of entities and relationships among the entities. Qualifications, properties, restrictions that add detail to the entities and relationships are called constraints. There are entity constraints and relationship constraints. Constraints are E-R representations of some of the semantics of the application domain.

Entity constraints include domain integrity or attribute value constraints, entity integrity or primary key constraint, and referential integrity or foreign key constraints. Relationship constraints include business rules, cardinality and participation constraints, and relationship types. Entity and relationship are types or collections of instances, called entity set and relationship set, respectively. Sometimes a lack of precision blurs the distinction between them.

When we study, it is easy to get lost in details. Here are some important concepts and principles to keep in mind

  • E-R models address:
    1. structure (produce normalized schema)
    2. integrity (focus on data relationships vs. business relationships)
    3. manipulation (OLTP manipulation rather than OLAP)

E-R models, and other formal models, are languages that represent information. E-R symbols can be thought of as pairing to nouns, verbs, active voice, or passive voice in natural languages.

To become familiar with E-R modeling in preparation for the final exam, you should study example E-R models in the resources and practice creating E-R diagrams. Use the examples, exercises, and problems in this resource and Advanced Topics in ER Modeling.

 

Unit 4 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included. 

  • Entity
  • Relation
  • Nouns and verbs
  • Symbols/syntax and semantics
  • Database development process
  • Conceptual design
  • Schema
  • Stakeholders/nontechnical
  • Independent existence
  • Qualifier
  • Types
  • Entity type
  • Attribute type
  • Derived attribute
  • Key attribute
  • Entity set
  • Relationship set
  • Weak entity
  • Strong entity
  • Cartesian product (D1 x D2)
  • Relationship
  • Relationship type
  • Degree of a relationship
  • Attribute vs. relationship
  • Cardinality of a relationship
  • Participation of a relationship
  • Identifying relationship
  • Entity integrity
  • Domain integrity
  • Referential integrity
  • Additional symbols, such as minimum, maximum, for optional and mandatory participation
  • Grammar analogy
  • Business rules
  • 'Existence'
  • 'Connectivity'
  • Disjoint entity types
  • Overlapping entity types
  • Enhanced ER, such as specialization, inheritance, or hierarchy
  • ER to SQL DDL

Unit 5: The Relational Database Model

5a. describe the relational database model

  • What are the elements of a relational model?
  • Which ER elements do they correspond to?
  • What is the mathematical foundation of relational modeling?
  • What is the benefit of a mathematical foundation?

We can think of the relational database model as a 'transition' model, for several reasons:

  1. An ER model can be easily converted to a relational model.
  2. A relational model can be used to communicate with both technical and non-technical stakeholders.
  3. Relational models are logical models which 'connect' conceptual design to database high level and detailed design. 
  4. Relational modeling has a formal mathematical foundation, which enables formal methods, techniques, and tools to support the transition from requirements to database high level design, detailed design, implementation, and operation.
  5. Relational models have an associated database language for creating and using a relational database, SQL (Structured Query Language). 

As with ER modeling, relational modeling begins with a real world domain of application. A primary element of an ER model is 'entity', which represents a collection of specific instances from a domain. For example, 'Student' is an entity because it has a separate existence. Given a set of constraints, 'Student', represents a collection of specific students that satisfy the constraints. Constraints include attributes and various other properties (such as cardinality, participation, or business rules). Another primary element of ER modeling is the relationship. On the other hand, relational modeling has just one primary element, the relation, from which all the other elements are derived. What are some other elements of a relational model? Which correspond to ER relationships, ER attributes, and ER constraints? To review, see pages 24-26 of Database Design. This chapter does not explicitly outline the correspondence between ER model elements and relational model elements, but the names of the elements should make that clear to you.

Formally, a relation is a subset of the universe of discourse, which is a subset of D1 x D2 x D3 x ... x Dm, the cartesian product of domains of application. A relation is represented by a table of n rows and m columns, written as an n x m table. A cell of a table (the intersection of a row and a column) represents an atomic value from a domain of application. Rows and columns are also just tables themselves. A row is a 1 x m table (using the first component to represent the number of rows and the second to represent the number of columns), and a column is an n x 1 table. Operations on tables combine tables to produce a result table. Thus, the name for the mathematical description of relations is relational algebra.

Each fundamental concept of an ER model corresponds to a formal relational model concept. For example, an attribute corresponds to a column. An entity in an application domain (or in an ER model) corresponds to a table in relational algebra. A relationship in a domain of application (or in an ER model) also corresponds to a table in relational algebra! To think about this, consider a university application where 'Student' is an entity. Constraints restrict the entity to a particular class of student, such as by year, by major, or by advisor - (attributes). The addition of a student ID number identifies a specific student in the particular class. Therefore, 'Student' model consists of a related set of attributes. But, an attribute is just a n x 1 table; and, thus, 'Student' (in the E-R model) is an n x m table (in the relational model). Now, think about a relationship in the E-R model, for example, 'Enrolls', which associates two entities, 'Student' and 'Course'. We have seen that an Entity corresponds (called mapping discussed in the next learning objective, below) to a table in the relational model. Since an E-R relationship is an association of entities, it corresponds to an association of tables in the relational model. Since an association of tables is just another table, an E-R relationship corresponds to a table in the relational model!

To review, read page 26 of Database Design and pages 6-1 to 6-4 of Database Systems for Management.

 

5b. discuss how to map an entity-relationship diagram to a relational model

  • What relational model elements correspond to the ER model elements?
  • How are constraints in an ER model mapped to a relational model?

The mapping from ER to relational is defined in stages: single entity and single-valued attributes, composite attribute, 1:1 relationship, 1:n relationship, m:n relationship, a multi-valued attribute, and n-ary relationship. A single entity with attributes is mapped to a table whose columns are the ER attributes and whose primary key is the entity primary key.

The mapping for a composite key is the same as for a single entity, except the name of the composite attribute is replaced by the names of the composites. The mapping for a 1:1 binary relationship has several cases, depending on which entity is a totally participating entity. If an ER binary relationship is an association of two ER entities where one has total participation in the other, the association naturally maps to two tables that are associated by adding a foreign key to one of them. It is represented by adding the primary key of the strong entity as a foreign key to the weak entity. If both of the entities are totally participating, a new entity is created by merging the two original entities. In each case, the ER relationship is mapped to a table that uses foreign keys to represent the relationships.

The mapping for a 1:n relationship to a relational model is similar to the mapping of a 1:1 relationship, except that the primary key of the 1-side of the relationship is added as a foreign key to the n-side. An m:n relationship is mapped to a new relational table that includes two attributes, the primary key from one of the original entities and the primary key for the other. The primary key of the new relational table is a composite key of those two primary keys. The portions of the composite key serve as foreign keys to the respective original tables. Any ER relationship attributes become attributes of the new relational table.

The next case is the mapping of an ER multi-valued attribute. It is mapped to a new table, having these attributes: the primary key of the original ER entity and the name of the multi-valued ER attribute. Both become the composite primary key of the new table. The primary key portion of this composite key is a foreign key to the original entity table. 

An n-ary relationship is mapped to a new relational table whose attributes are the primary keys of the entity tables in the n-ary relationship.

To review E-R to relational mapping, watch this video, which covers the above cases in detail.

 

5c. explain the various types of integrity constraints

  • What is integrity? 
  • What are some integrity constraints?

For this objective, we examine constraints that are used by database analysts, administrators, and designers to specify semantics for a database, and that are used by a database to check that the semantics are satisfied. 

Integrity for a database means that the information in the database is correct, satisfies the requirements (valid), and is consistent. Constraints and rules are used to assure integrity when data is entered into a database and when data is modified. 

Many integrity constraints are defined during database requirements analysis or design. They can be described ER modeling or relational modeling terminology. If ER modeling is performed, the ER to relational mapping is used to identify corresponding relational model constraints. Here, constraints are defined using both either ER terminology (if they come from the application problem domain) and relational terminology (if they come from the design solution domain).

In the context of a domain of application, familiar to users and database administrators, we almost always encounter a rich language that spans from the very general to the very detailed. For example, the word 'car' includes all models, colors, engine types, and so on. If we are interested in a specific model or a specific car, then we have to add detail that restricts or constrains the scope of 'car' to clearly identify the car we're interested in. These details are entity constraints and relationship constraints. 

In the context of conceptual modeling for high-level design and logical modeling for detailed design, both used by database administrators and designers, the 'language' we use to describe constraints includes several E-R and relational terms: domain value constraints for table columns; uniqueness constraints via primary and foreign keys that correspond to the ER term, relationship; column constraints; strong and weak types of inter-table relationships via foreign key references from one table to another; minimum and maximum association restrictions; and 'relationship' tables where related foreign keys are the column names. This video explains integrity constraints for a relational model, from the perspective of a mapping from ER model integrity constraints. 

The integrity constraints summarized above are primarily constraints on data. Chapter 9 of Database Design also describes information constraints, including enterprise rules and business rules, for example, 'an instructor cannot teach more than 3 courses'. In this resource, the constraints mentioned above, in the preceding paragraph, are called domain integrity constraints, entity constraints, referential integrity constraints, primary key rules, foreign key rules. Enterprise and business rules are primarily addressed by cardinality and participation constraints, and relationship types (optional or mandatory). 

Database requirements, in addition to effectiveness also address efficiency. Data and information integrity fall under effectiveness. Performance (speed and amount of storage) and cost of resources fall under efficiency. Often, effectiveness and efficiency are a tradeoff, and one must be 'reduced' to achieve the other, such as by giving up data consistency for 'eventual' data consistency to improve performance.

 

Unit 5 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included. 

  • Relational database
  • Map (model to model; language to language)
  • Syntax
  • Relational model
  • E-R model
  • Integrity
  • Semantics
  • Integrity constraint
  • Relational algebra
  • Entity
  • Relation
  • Relationship
  • Attribute
  • Cartesian product
  • Table
  • Row
  • Column
  • Key
  • Relationship types
  • 1:1 relationship
  • 1:n relationship
  • m:n relationship
  • Degree of a relationship
  • N-ary relationship
  • Data correctness
  • Data validity
  • Data consistency
  • Data constraints
  • Entity constraints
  • Attribute constraints
  • Relationship constraints
  • Information constraints
  • Enterprise rules
  • Business rules
  • Effectiveness
  • Efficiency
  • Performance
  • Tradeoffs of effectiveness and efficiency

Unit 6: Relational Algebra

6a. explain the basic relational algebra operations

  • What is an algebra? 
  • What is the relational algebra?
  • What are the elements of the relational algebra?
  • What are the operations of the relational algebra?
  • What are some theorems of the relational algebra?
  • What is the benefit of relational algebra for databases?
  • Explain relational algebra calculations.

Algebra is a field of mathematics that studies a defined set together with defined operations closed on that set (closed means that the result of an operation is a member of the set). Given the definitions of the set and its operations, theorems pertaining to calculation and equality of expressions, are discovered and proved. An expression is a combination of set elements and operations that satisfy the definitions. For example, commutative and associative properties of an operation, say '*', may be proved. Commutative means that, given a and b in the set, a * b = b * a; associative means that, given a, b, and c in the set, a * (b * c) = (a * b) * c.

Relational algebra is a study of a set of tuples (a tuple is a list of elements, such as, '(a, b, c, d)', where a, b, c, and d are members of defined sets) and a collection of tuple operations. Moreover, a specific relational algebra models a portion of an application domain, and serves as a logical design for database language operations.

For generality, elements of a relational algebra are usually defined as tables, which are sets of tuples. A general table has n rows and m columns of tuples, which we denote, as an 'n x m ' table. As we have seen in earlier units, rows and columns are also called records and attribute values, respectively. Examples of tables, include; 1 x n or row tuple, n x 1 or column tuple, n x m or table, and 1 x 1 or atomic value. The list of attributes is denoted, 'name' (A1, A2, A3, ...., An) where 'name' is taken from the domain of application. The attributes are names for the columns, A1, A2, ... , An. Be careful to clearly distinguish records from record values; attributes, from attribute values. Values are members of the set, called the set of domain values, used to form the components of a tuple. 

A relational algebra operation may be unary (operates on 1 table) or binary (operates on 2 tables), and results in a table that is in the defined set of tuples (that is, the operations are closed). Relational algebra operations of interest include: projection, selection, set union, set intersection, set cross product, set difference, and join. An operation may be primitive or derived; a derived operation is one that can be written as an expression using primitive operations, for example, join. If an expression occurs frequently, it is given a name and, thereby, is a derived operation.

It is surprising to learn that numerous theorems can be discovered, given only a few definitions. For example, review Relational Algebra, which gives examples of equivalent expressions, one of which may be more efficient or effective than the other.

It is not necessary that we know a lot of theorems or how to prove them, for several reasons. First, keep in mind that our goals for a database are achievement of requirements at reasonable cost, which we have called effectiveness and efficiency, respectively. The relational calculus is an effective model, as are other models. For example, we can write relational algebraic expressions that specify the operational steps that result in the retrieval of desired information from a database. A main benefit of relation algebra is its formality, or algebraic mathematical foundation. If the attributes and the domain values constitute an 'appropriate set' of tables, then the mathematical foundation enables us to write relational expressions in a database language (SQL) and utilize software tools to compile them into optimized implementation steps that efficiently retrieve desired information from a database. Software tools help determine the 'appropriate set' of tables for a relational algebra, by a process called normalization, described in a later Unit in this course. The translation of SQL commands into optimal code by software tools supports execution efficiency and effective satisfaction of database requirements.

From the database development process, we know that effectiveness and efficiency are affected by key development activities of other development activities, in addition to implementation. Analysis, design, implementation, test, and maintenance activities impact general effectiveness and efficiency. Knowledge of relational rules and theorems helps database users, designers, analysts, and administrators understand the best formulation of a SQL relational expression to optimize usability, design, verification, validation, change, and performance. 

In summary, to explain relational calculus computations, let's take two perspectives, the design of a relational database and the use of the database for query transactions. When we design a relational database, we need to make decisions on the atomic elements, the base tables, and the table relationships. Those design decisions will affect the operation of the database. The decisions are made using knowledge of the application and business domain, supported by normalization processes and tools that help produce base tables that are optimal for effectiveness and efficiency of performance. When the database is used, understanding of relational operations and rules for transforming relational expressions, helps users formulate optimal queries, again supported by software translators that optimize the queries for efficient performance.

 

6b. discuss relational algebra set operations

The relational algebra is built on set theory and some of the relational algebra operations correspond to set theory operations. Those relational operations are called table operations.

  • Which relational operations are table operations?
  • What is union - compatibility? 

The relational operators Union, Intersection, and Difference are the table or set operations of relational algebra. They are operations that operate on two tables; and they result in the set union, intersection, or difference of sets of rows. For example, the Union, 'U', of the two tables, T1 and T2, is the table T3 whose rows are the set union of the rows of T1 and rows of T2. Thus, the definitions of the relational set operations are specified in terms of the mathematical set operations of the same names. 

The set union, intersection, or difference of sets of rows must result in a valid relational table. A condition, called union compatibility, is placed on the tables T1 and T2, to ensure that the combination of their rows is a valid relational table. Union compatibility means that the two tables have the same number of columns and the value domains of the corresponding columns are the same. For example, the values in the nth columns are of the same type.

 

6c. explain what a derived operation is

As with all algebras, operations can be written in several equivalent ways or forms. For the relational calculus, we want to write a query expression in a form that enables the most effective and efficient database development and performance. Usually, the simplest or atomic, i.e. 'smallest', form is the most desirable. Projection and selection are two atomic operations. 

Typically, a derived operation or expression is a query that is used so often, that it is given a name. Then a user can specify that query by name, instead of writing out a complicated expression involving several tables and multiple operations. It is not always obvious which form for a query will result in greater effectiveness and/or efficiency. Sometimes the smallest/shortest, or simplest form should be used. For other situations, the most desirable form depends on the organization and complexity of the database tables and the type of information needed. Consider this example: 

Let T be a table representing Students. Let its columns be represented by Student (Id, Name, Major, Year, Advisor), Suppose the domain of values for Year is {1,2,3,4}. Suppose a user query is 'Find all 2nd, 3rd, and 4th-year students?' Two equivalent query formulations are:

(first query) 𝛔Year=2(Student) U 𝛔Year=3(Student) U 𝛔Year=4(Student) and

(second, equivalent, query) Student - 𝛔Year=1(Student) 

Which is most effective and efficient? What if the value domain for Year was 1,2,3,4,5,6,7,8,9,10 (for college, master's degree, doctorate, or professional degree)? The answer is 'it depends' on the situation and the database and the database requirements. Clearly, the second is more beneficial for a user if the query is submitted many times by many users. The first may be more beneficial if computing resources are limited. Or the second may be more beneficial if computing resources are sufficiently powerful to produce the desired information in real time. Which form is better can be determined either by complexity calculations (counting the number of primitive steps, and the execution time for each step involved in retrieving the information), or by executing test cases to determine the user and computer resources required by each query formulation; and by knowledge of and experience with the database.

 

Unit 6 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • Algebra
  • Relational algebra
  • Relational operations
  • Tuple
  • Table
  • Column
  • Row
  • Unary operation
  • Binary operation
  • Equivalent expressions
  • Closed operation on a set
  • Primitive or base operation
  • Derived operation
  • Domain of values
  • Effectiveness
  • Efficiency
  • Normalization
  • Set or table operations
  • Union compatible
  • 'Best' query expression
  • Complexity of an expression

Unit 7: Introduction to Data Normalization

7a. describe normalization

  • What is normalization?
  • What is its purpose and benefits?
  • Are there disadvantages to normalization?
  • How is normalization performed?

Consistency increases as dependencies decrease and relationships are simplified. More dependencies enable more anomalies. Normalization is a procedure that increases consistency and performance by decomposing complex tables into simpler tables and minimizing functional dependencies. Normalization is performed during model detailed design; it improves design by helping to prevent design problems, including modification anomalies (deletion, insertion, update anomalies). Normalization is highly automated by software tools and therefore the procedure does not require much user effort. The goals of normalization support the goals of database modeling and design. Those goals identify the purpose and benefits of normalization.

The main disadvantage of normalization is 'composition' effort. Since normalization decomposes complex tables into simpler tables, the formation of complex queries takes more user effort and more execution time retrieving information from multiple simpler tables and composing it to obtain the information that answers the complex query. User effort also includes the effort to understand the table structure of the database to compose the needed tables. 

The normalization procedure transforms a table step by step from normal form 0 to normal form n, where n is usually 3 or 4. Normal form 4 is called BCNF (Boyce-Codd Normal Form). Review these normal form definitions and examples in this article.

In summary, normalization is a step by step process for transforming a table into simpler tables, by removing a certain type of dependency at each step. Normalization is usually performed when a table is being designed and when it is being updated. Normalization prevents data anomalies that cause data inconsistencies (update anomaly), unintended loss of data (deletion anomaly), or the inability to add data (insertion anomaly).

To review, read Chapter 9 of Database Systems for Management.

 

7b. discuss and contrast the different types of dependencies

  • What is a functional dependency?
  • What is a database functional dependency?
  • How does a functional dependency affect database integrity and consistency?

In general, a functional dependency refers to a function, denoted F: D ---- >R where D is a set that represents the domain values of the function and R is a set that represents the range of the function. F(d) = r, means that the value r in R is determined by d in D. We say that r is functionally dependent (via F) on d, and d is called the determinant. Given a d in D, there is only one r such that F(d) = r

It is a misconception to assume that a determined attribute value can not appear more than once in a column; and second, it is a misconception that dependency is reflexive – if d determines r, don't assume that r determines d.

The mathematical theory of functional dependencies provides us with rules that help us infer all the functional dependencies in a database. We apply the rules to each table in the database. Five important rules are reflexivity, augmentation, transitivity, union, and decomposition.

As mentioned previously, functional dependencies risk the integrity of a database by introducing risks of anomalies that could lead to data inconsistencies. To avoid those risks, we avoid dependencies when we are designing the tables; and we check for dependencies when we maintain the tables. 

Review Chapter 12 of Database Design and Chapter 5 of Database Systems for Management.

 

7c. construct a dependency diagram

  • What is a dependency diagram?
  • Why use a dependency diagram?

A dependency diagram is simply a diagram that depicts the dependencies among the values of the attributes of a relational table. There are various kinds of representations that can be used to represent any information. For example, there are pictures, graphs, matrices, text, and numbers. They are equivalent, in that they represent the same information. They differ in their suitability for a particular type of user or application. For example, a state diagram is a pictorial representation of a state machine that is suitable for human users and for depicting behavior. A matrix can also be used to represent that state machine and is suitable for calculations, especially, automated calculations, by a computer. Another familiar example is a graph of a curve versus an equation that represents that graph. What purpose does each serve? A dependency diagram is suitable for human users; it depicts the amount and complexity of the dependencies. A dependency is like a wire that connects signals between two devices – imagine the complexity of hundreds of wires interconnecting a relatively small number of devices!

A dependency diagram is a graphic aid that documents dependencies and helps us apply the dependency rules to find all the dependencies, which can then be avoided. Thus, dependency rules and diagrams are database design aids that help us produce normalized designs and databases that have integrity and consistency.

Review Chapter 11 of Database Design and Chapter 5 of Database Systems for Management.

 

7d. normalize a relation to at least third normal form

As with E-R modeling, relational modeling, and the relational algebra, practicing with examples is necessary to understand normalization and to prepare for the final exam. 

  • What is the process and procedure for normalization of a set of tables?
  • Give an example of a table that violates the first, second, third, and BC normal forms.

Normalization of a table starts by assuming that the table is normal form 0. Step 1 checks if it is normal form 1 by checking that each cell (i.e. intersection of a column and a row) has only one value (no repeating groups). If there is a repeating group, the table is not normal form 1. To make it normal form 1, the table is split into two tables, where the first table contains the non-repeating attributes and the second table contains the repeating group. The primary key for the second table is a composite key composed of the primary key of the first table and a key value from the repeating group. Repeat this step for each repeating group.

Step 2 checks if the tables from step 1 satisfy normal form 2 (they are already 1NF from step 1). If each primary key value consists of a single value, the tables are 2NF. If a primary key is a composite value, then each attribute that depends on that composite key must depend on it in its entirety; not on only part of the composite key. If a table is not 2NF, remove the information that is not fully dependent on the primary key, and put it in a separate table. Define a primary key for that second table.

Step 3 checks if the tables from step 2 satisfy normal form 3 (they are already 2NF from step 2).

If the tables from step 2 do not have any transitive dependencies and a non-key attribute is not fully functionally dependent on another non-key attribute, they are 3NF. Otherwise, if a table has a transitive dependency, remove it and put it in a separate table; if a table has a non-key attribute dependent on another non-key attribute, remove it and put it in a separate table. Repeat this step until all tables satisfy 3NF.

Normalization steps continue for each type of normal form. Step 4 checks for BCNF (Boyce-Codd Normal Form), which means every determinant is a candidate key. Steps higher than 3 are not always used. If a table has a determinant that is not a candidate key, split it off and put it in a new table. Note that dependencies often come from business rules. 

To review, read Chapter 12 of Database Design.

 

Unit 7 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • Data Integrity
  • Consistency
  • Functional dependency
  • Anomaly
  • Normalization
  • Dependency diagram
  • 0NF
  • 1NF
  • 2NF
  • 3NF
  • BCNF
  • Reflexive
  • Augmentation
  • Transitive
  • Union
  • Decomposition

Unit 8: Introduction to SQL

8a. compare DDL commands to DML commands

  • What are DDL and DML?
  • What is the general syntax of the commands?

Early versions (1992, 2003) of SQL grouped SQL commands into categories, three of which are widely used: DDL, DML, and DCL. DDL (data definition language) commands are used to create a database and to enter and organize data in tables; DML (data manipulation language) commands are used to add, remove, update, and select the data in the tables; and DCL (data control language) commands are used to manage the DBMS. In this Unit, we address DDL commands, which are used to define the tables in the database, and DML commands, which are used to maintain the data in the tables and to select data from the tables. DML commands correspond to the relational algebra operations. SQL commands or statements are constructed of components called clauses, expressions (which include functions), and predicates or conditions.

 

8b. discuss when to use the appropriate SQL commands

  • Who uses these commands?
  • When and why are they used?
  • How are they used?

SQL has a relatively small number of DDL commands: CREATE, ALTER, DROP for defining the database and its structure. There are a few more DML commands: INSERT, UPDATE, DELETE, SELECT for entering data, maintaining and selecting it.

DDL commands are used by database administrators to initially define and periodically maintain a database and its tables. DML commands are by application users on a daily basis to access the data to perform application-specific functions with the data, and by end-users on a daily basis to query, update, or report the data. Application users include people and application programs.

In the following learning objective discussions, you should use the SQL Sandbox (a learning environment for trying and experimenting with SQL commands). 

 

8c. create a set of database tables using DDL

  • What is the syntax of the CREATE command?
  • What is the semantics of the CREATE command?
  • After a table is created, can it be changed?

The CREATE command or statement is used to create a set of database tables. The syntax of the CREATE command is as follows:

CREATE <table name>
( <column name 1> datatype optional column constraint, 
<column name 2> datatype optional column constraint, 
.
.
.
<column name n>, datatype, optional column constraint Optional table constraints );

The meaning or semantics of the CREATE statement is the definition of a table with the name <table name> and columns, <column name 1>, <column name 2>, ... , and <column name n. The last clause defines any optional table constraints. Any 'name' within < > are application-specific names. Datatype is one of many (twenty or more) data types for the value domain of a column, for example, character or integer type. 

Optional column constraint may be NULL, NOT NULL, UNIQUE, PRIMARY KEY, and DEFAULT.

Optional table constraints may be IDENTITY, UNIQUE, FOREIGN KEY, CHECK, DEFAULT.

After a table is created, the ALTER command is used to remove columns and to add or remove constraints. Don't forget that commercial SQLs may adhere to the ANSI (American National Standards Institute) standards, but may also have extensions and variations depending on the DBMS they support, such as Microsoft SQL Server, MySQL, PostgreSQL, or Oracle.

The DDL DROP statement is used to remove a database or to remove a table from the database:

DROP DATABASE database name;
DROP TABLE table name;

Try the CREATE statement using the sandbox to define and create a database table called Student: use the schema Student(StudentId, StudentName, Department, AdmitDate).

 

8d. describe the various types of integrity constraints and how they are used

  • Describe the SQL column integrity constraints.
  • Describe the SQL table integrity constraints.
  • What SQL commands can include integrity constraints?
  • What is the syntax for a SQL integrity constraint?

Integrity constraints are SQL keywords that restrict the initial value for a field of a record (the row-column intersection of a table) or restricts the values for a column. Constraints are SQL features that represent the semantics of the application. SQL column integrity and table constraints are keywords used in the CREATE and ALTER statements. Constraints are used to check the validity of an action; if a constraint is violated during an action on a table, the constraint aborts the action. Depending on the dialect and version of SQL, more or less constraints may be available. 

Typical column integrity keywords are NULL, NOT NULL, UNIQUE, PRIMARY KEY, DEFAULT. The keyword PRIMARY KEY is followed by the primary key name; the keyword DEFAULT by the default value. Several of these column constraints, like UNIQUE and PRIMARY KEY, can also be specified as table constraints. 

Typical table integrity keywords are IDENTITY, UNIQUE, FOREIGN KEY, PRIMARY KEY, CHECK, and DEFAULT. A table constraint can be given a name using the following syntax: CONSTRAINT <constraint name> <constraint keyword>, with additional syntax depending on the keyword. Each table constraint key may have additional syntax to identify necessary data associated with the constraint.

Try the CREATE statement using the sandbox to define column and table constraints for the database table called Student that you created.

 

8e. describe the various types of Data Manipulation commands and how they are used

  • What are the standard SQL DML statements?
  • What is the generic syntax for the SQL DML statements?
  • What is their semantics?
  • Who uses the DML statements?
  • When do they use them?

After the database and tables are created, we want to populate the tables with actual values, to search the tables for information, and, eventually, to update the tables. We use the DML statements to perform those tasks. Whereas DDL only includes a few statements, CREATE, ALTER, DROP, which pertain to the organization of the data in tables, DML includes SELECT, INSERT, UPDATE, DELETE, and JOIN statements, which pertain to the values stored in the tables.

The syntax of the SELECT command has the form:

SELECT parameters
FROM <table name>
WHERE <condition>;

Optional parameters are written, SELECT * to signify all columns; or, SELECT {DISTINCT <column> AS <alias>,...}, where the braces indicate that a list of parameters, separated by columns can be specified. 'Alias' is an optional keyword used as the column heading for the attribute <column>. An expression, which computes a derived value from the field's values, can also be used in place of a column. DISTINCT is an optional keyword that signifies that only one column value for each record shall be retrieved. If there are duplicate values in a column, only one will be retrieved. The WHERE keyword specifies a condition that filters column values. 

SELECT is a retrieval command that finds data in a table using conditions specified in the parameters, keywords, or clauses.

INSERT INTO <table name> columns
VALUES ( .....); 

INSERT takes three parameters: the name of the table to receive values, the column names where the values will be stored, and the values. If the columns are not specified, the values must comprise one record for the table. If the columns are specified, the values must correspond in order to those columns. Thus, this statement adds a full or partial record to the table.

UPDATE <table name> SET column 1=value 1, column 2=value 2, ..., column n=value n
WHERE <condition>;

UPDATE changes values in a table. The parameters identify the values that update the columns and the WHERE condition identifies the rows that will be updated.

DELETE FROM <table name> 
WHERE condition;

DELETE has one parameter, the name of the table. The WHERE clause is used to specify the record to be deleted. If the WHERE clause is not used, all records will be deleted from the table.

DDL statements are used by database designers and analysts when the database is being developed, tested, and deployed. DML statements are used by database administrators, application users, and end-users when the database is operational.

The syntax of a DDL or DML statement has a meaningful name that suggests its semantics, and has parameters, keywords, and/or clauses that provide the information needed to carry out the behavior of semantics of the command. 

Try the commands with the table you have created in the SQL sandbox.

 

Unit 8 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • DDL
  • DML
  • Syntax
  • Semantics
  • Users
  • CREATE
  • ALTER
  • DROP
  • SELECT
  • INSERT
  • UPDATE
  • DELETE
  • JOIN

Unit 9: Basic Select Statements

9a. construct basic queries using SQL

  • What is a basic query?
  • What SQL statement is used for basic queries?
  • Describe, in general terms, common basic queries and their corresponding SQL parameters, keywords, and clauses used in the SELECT statement.

Queries are user requests for information from the database. A query is formed from the SELECT and/or the JOIN statements. Think of the database as a large electronic filing cabinet of related data. It is assumed we are familiar with the organization of the filing cabinet, so that we can request needed information. Thus, we must be able to identify the tables, columns, and records, and related data, that constitute what we need, transform it into parameters, clauses, and keywords for a SELECT statement. Unit 9 discusses the SELECT statement. 

Sometimes the information we need spans several tables, and retrieving the information involves multiple queries. It is usually more efficient to use a JOIN, which creates a virtual table that incorporates data from several tables and then use one or a few SELECT statements to retrieve the needed information. Unit 10, the next Unit, discusses the JOIN statement.

A basic query is a request for information from a single table. The SELECT statement is used for basic queries. A basic query asks for values for a subset of records from a table, where the subset is determined by the values of a specific column. For example, using the sample Student schema, Student(StudentId, StudentName, Department, AdmitDate), which we defined at the end of the discussion for 8c, a basic query could ask for dates in which students were admitted. The basic query is:

SELECT AdmitDate FROM Students; 

If 10 students were admitted on the same data, that date would be repeated 10 times.

SELECT DISTINCT AdmitDate FROM Students;

lists a date only once.

The WHERE clause selects records that satisfy a specified condition. Records can be grouped into categories using the GROUP BY clause, and sorted in a specified order using the ORDER BY clause.

Within the SELECT statement, complicated expressions can be used. Arithmetic and logical expressions can be used as a column heading and in the WHERE clause to filter the records. Logical operators, for example, comparison operators, result in 'true' or 'false' and can be used to specify conditions to filter desired records in a table. In addition to the typical comparison operators, there are set operators IN, BETWEEN, SOME, ALL, EXISTS, that can do additional filtering of records.

Information sought via queries is often large and results are output as a report. The retrieved records can be sorted or gathered into groups using the SORT BY or GROUP By clauses, respectively. Then, functions including arithmetic, aggregate (such as SUM), and character type functions (for example, UPPER that converts a character string to uppercase), can be used in expressions to further filter records and to summarize the values in a column – or count the number of total records filtered or get subtotals for subsets of the filtered records.

Given that SQL is a programming language for finding and retrieving information from large databases, it is no surprise that SELECT has numerous features, specified using parameters, keywords, clauses, operators, and functions, that support user queries. In addition, since information is often spread across tables, the SELECT statement is recursive (it can contain subqueries). An inner SELECT statement within an outer SELECT statement, which is called nesting of queries. A subquery can be inside a SELECT clause, or a FROM clause, or a WHERE clause. (Note: a SELECT clause is the set of parameters that a user specifies immediately after 'SELECT'.) Lastly, SELECT can be used in other commands also namely, INSERT, DELETE, UPDATE.

Do not forget the best preparation for the final exam includes trying out the SELECT statement features on the sample database you defined and updated in the SQL Sandbox.

 

Unit 9 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • Basic query
  • SELECT FROM
  • SELECT WHERE
  • Parameters
  • Keywords
  • Clauses
  • DISTINCT
  • GROUP BY
  • ORDER BY
  • Arithmetic operator
  • Logical operator
  • Arithmetic expression
  • Logical expression
  • Arithmetic function
  • Aggregate function
  • Character function
  • Recursion/subquery/nested queries
  • Inner and outer SELECT

Unit 10: The Join Statement

10a. explain the different methods for joining tables

  • Which relational algebra operation corresponds to the SQL JOIN statement?
  • Why is there a join statement if the same result can be obtained from subqueries?
  • Describe the various types of JOINs?
  • What is a virtual table?
  • What is a view?
  • Since a SQL JOIN statement corresponds to a derived virtual ( JOIN creates a virtual or temporary table composed of the base tables in the database) operation, why is the JOIN necessary? Couldn't we just use the operations that it is derived from?

A major benefit of a relational database is the capability to relate tables. For integrity, tables are designed to be simple and represent atomic elements of an application domain. However, interesting information is represented by relationships, often complex relationships among the tables. Multiple SELECT statements can be used to join and query multiple tables to retrieve that information. SQL, however, provides The JOIN statement, which is often more effective and efficient, to combine related tables into a virtual or temporary table that incorporates and, from which, it retrieves the information of interest. 

Recall from the relational algebra that one way to combine two tables is via the product operation: if A and B are two tables with columns (A1, A2, ..., An) and (B1, B2, ..., Bm) then the product is the table (A1, A2, ..., An, B1, B2, ..., Bm). However, the product contains all possible relations, n * m. For a given domain of application, only some of the attributes will be related and, thus, only comprise a subset of the product. The relational algebra operation, natural join, combines two relations if they have the same attribute. Suppose Ai is an attribute of A and Bj is the same attribute. Then, a tuple (a1, a2, ..., ai, ..., an, b1, ..., bj-1, bj+1, ..., bm) is a tuple in the natural join of A and B, if and only if ai = bj). The natural join is a reasonable relationship of two tables that has meaning for most application domains. The natural join operation corresponds to the SQL Equi JOIN statement.

A SQL join is faster than subqueries. Moreover, it is more efficient for a user, since the SQL translator checks that conditions are satisfied, the SQL translator optimizes the execution plan which includes selecting from several or more tables (vs. separate execution plans for each subquery select command - this assumes that the SQL translator is a traditional interpreter), and a join lets a user work at the level of relations, instead of the level of columns and conditions. 

There are several types of joins, including, natural, equi, inner, and left outer, right outer, and full outer join. These types are similar, in that, they select rows from two tables, A and B, and combine them to produce a subset of the rows of A x B. They differ in how the rows are selected from A and B, and how the selected rows are combined. Here is a table that summarizes how each type of join selects records from A and B:

 

What columns are checked?

What rows are selected?

How are the selected rows combined?

What is the result table of the statement?

Comment

Natural Join

Column, say Ai, in A and, say Bj, in B that has the same name

For each value in the Ai column, select every row in B where Ai = Bj

Concatenate the select rows but only keep only one of the common column values, ie. the value of Ai or the value of Bj

Project the selected columns from the concatenated rows

Natural Join does not use an ON clause to specify what is compared and what comparison operator is used. The common column values are assumed to be compared using equality.

Equi Join

The columns checked are specified in the ON clause

If a value of one of the specified columns is equal to a value of the other specified column, select their corresponding rows.

Concatenate the selected rows, keep both of the values of the specified columns. 

Same, namely, project the selected columns from the concatenated rows

The columns whose values are checked are specified in the On clause

Inner Join

Same as the Equi Join

Same as the Equi Join except the comparison can be another comparison operator; it is not restricted to the equality operator

Same as Equi Join

Same, namely, project the selected columns from the concatenated rows

See the picture in

Pictorial representation

Left Outer Join

The columns checked are specified in the ON clause

Select all the rows that the Inner Join selects plus all the rows of the left table, i.e. the table in the FROM clause, which do not satisfy the join condition in the ON clause

Concatenate rows of the left table with those of the right table which satisfy the join condition in the ON clause; concatenate rows of the left table that do not satisfy the join condition with null values for the columns of right table, i.e. the table in the join clause

Same, namely, project the selected columns from the concatenated rows

See the picture in

Pictorial representation

Right Outer Join

The columns checked are specified in the ON clause

Select all the rows that the Inner Join selects plus all the rows of the right table, i.e. the table in the join clause, which do not satisfy the join condition in the ON clause

Concatenate rows of the left table with those of the right table which satisfy the join condition in the ON clause; concatenate rows of the right table that do not satisfy the join condition with null values for the columns of left table, i.e. the table in the FROM clause

Same, namely, project the selected columns from the concatenated rows

See the picture in

Pictorial representation

Full Outer

 Join

The columns 

checked are specified in the ON clause

Select all the rows that the Inner Join selects plus all the rows the Left Outer Join selects plus all the rows the Right Outer Join selects

Concatenate rows of the left table with those of the right table which satisfy the join condition in the ON clause; concatenate rows of the left table that do not satisfy the join condition with null values for the columns of right table, i.e. the table in the join clause; concatenate rows of the right table that do not satisfy the join condition with null values for the columns of left table, i.e. the table in the FROM clause

Same, namely, project the selected columns from the concatenated rows

See the picture in

Pictorial representation

A few final comments:

  1. The Cross Join produces the cartesian product of two tables; the cartesian product of two tables is the set of rows where each row of one table is concatenated with each row of the other table, and has n * m rows, where one table has n rows and the other m rows. The cartesian product is the 'universe' set for the joins - i.e. each join is a specific subset of the cartesian product. 
  2. Joining can be performed using just the SELECT statement. In particular, the two 'equivalent' commands, the SELECT query vs the SELECT JOIN query. Recall that the SELECT JOIN has user and execution advantages.

 

10b. construct advanced queries of two or more tables using join operations

  • What are some advanced types of queries using SQL JOIN commands or operations?
  • Describe the ORDER BY clause?
  • Describe the GROUP BY clause?
  • Name the type of key associations used to join multiple tables.

The following clauses are advanced, in that, they involve more than one table or they add features that modify or extend the more commonly used commands:

  • ORDER BY used with the SELECT statement to sort records, after by a column(s), after they have been retrieved
  • GROUP BY used with the SELECT to organize retrieved rows (note: after they are retried) into groups conditioned on the value of a column(s) and on an optional HAVING expression, where the expression often involves an AGGREGATE or COUNT function
  • SQL UNION, which corresponds to the relational algebra union operator; SQL UNION is used to combine the retrieved records from several SELECT queries, assuming that the output tables are union compatible
  • CREATE VIEW is used to create a view, which is a virtual table: it is essentially a query and the result of the query is the content of the view. A view does not have any data content of its own; its data content is stored in other tables. A view can be treated as a regular database table (called a base table), except there are restrictions related to changing data via UPDATE or INSERT. 
  • CREATE INDEX is used to create an index for the columns of a table to immediately access a column, and to order the rows of the column to quickly search the rows of a column.

The WHERE clause filters records from tables, then the GROUP BY clause organizes the records into subgroups, then the HAVING clause filters the subgrouped records, and then the resulting subgrouped records can be sorted by the ORDER BY clause.

When the GROUP BY clause lists more than one column the number of groups is multiplicative – if the first column listed was 'Year' and the second column listed by 'Month', where there are 10 year groups, and there are 12 month groups, then the final organization would have 10 * 12 = 120 subgroups. At the end, when the 120 subgroups have been retrieved, they can be sorted using the ORDER BY clause to order the 12 month subgroups within each year subgroup, and to order the year subgroups into, like ascending order by year.

CREATE INDEX may not be part of a particular ANSI Standard SQL version.

In summary, SQL base commands and operations are sufficient for searching a relational database. Since a relational database is organized as a related collection of normalized tables, needed information is often contained in multiple tables. A user, therefore, needs to understand the table organization in terms of columns and rows, table access via primary and candidate keys, and the relationship of tables via foreign keys. In addition, to the base commands and operations, SQL provides derived operations that are more user convenient and, often, more efficient, to query for information that is spread across multiple tables. JOINS are powerful derived commands. 

The best preparation for the final exam includes trying out the SELECT statement features on the sample database you defined and updated in the SQL Sandbox.

 

Unit 10 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • Joining tables
  • SELECT
  • JOIN
  • Query
  • Subquery
  • Virtual table
  • View
  • Product of tables
  • Natural Join
  • Equi Join
  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Cross Product
  • ORDER BY
  • GROUP BY
  • UNION
  • VIEW
  • INDEX
  • HAVING clause