Wellesley College: Scott D. Anderson's "Relational Algebra"

The elements of relational algebra are 'relations', which are tables. Operations on a table or set of tables result in a table. A relational algebra expression is a combination of operations that specify the order in which the operations are performed. In SQL terminology, this is the order in which basic SQL commands are performed to execute the query. This page explains the basic relational algebra operations.

Relational Algebra

Relational algebra is the mathematical formalization of what happens in relational databases.

Why?

Prior to relational databases, there were various kinds of databases (hierarchical, among others), but there was little regularity among them. The formalization of relational databases led to a big step forward in systematizing database technology.

Why is it still useful? Because it gives us a standard internal representation of a query that allows for query optimization: ways to re-write a query that is:

  • equivalent, but
  • quicker to execute

For example, consider the relative costs of executing the following:

ab+ac
a(b+c)

The latter is equivalent, but a lot faster to execute. The point is that algebraic manipulations of an expression can yield another expression that has the same results, but executes faster. That's how query optimization works.

From Figure 5.1:
query processing pipelint

Defining Relational Algebra

Terminology

  • relations - a relation is a set of tuples
  • operators - take one or more relations as arguments and produce new relations.

The minimal relational algebra operators:

  • Select: returns a subset of the relation, by choosing some of the tuples
  • Project: returns a relation that has all the tuples, but only some of the columns. That is, tuple becomes a subtuple: each N-tuple becomes an M-tuple where M <= N.
  • Union: returns the union of all the relations. Relations must be union compatible. The following are:

    CREATE TABLE elevation ( name CHAR(10), height INT );
    CREATE TABLE population ( name CHAR(10), pop INT );

  • Set difference: Returns the tuples that are in one relation but not in the other
  • Cartesian product: AKA cross-product. The cross-product of relations A and B is a set of tuples where every tuple is the concatenation of a tuple from A and a tuple from B, for all possible pairs.

    Slightly different from mathematical cross-product, but not a concern. Major problem is the size of the result. Quadratic is bad enough!

Additional (derived) operators:

  • intersection
  • division
  • join

Notation

  • select: σcriteria(R)
  • project: πattributes(R)
  • union: R ∪ S
  • set difference: R - S
  • cross product: R × S
  • intersection: R ∩ S
  • join: R ⋈ S

You might try this exercise outside of class. We'll do some more exercises in class.

Write the relational algebra notation for our bi-coastal people. Once you've found your own solution, click on the link to see one possible solution. Assume we have the relations Contacts(Name,ID) and Addresses(ID,City,State)

Joins

A join is just a cross-product followed by a selection, but it's better to define it directly because that's (1) more efficient and (2) more useful.

General Join (AKA theta joins) just puts the selection condition in the join operator. Only conjunction is AND.

Equijoin is a special case where all the conjuncts are equality. This knits tables related by foreign keys together.

Natural Join is the special case that is also the most common. It's an equijoin with equality on all identically named attributes, projecting out the duplicates. Therefore, no subscripts necessary!

To facilitate this, it's a good idea for all attributes for a particular thing to have the same name across different tables. Not StudID here and ID there, but always StudID.

However, as natural as it is, there's no special support for it in standard SQL! You still have to write out all the equality conditions! Insane. Fortunately, MySQL (and other DBMSes) treat us better.

Outer Joins

We saw this before, but just to recap:

Define:

  • R′ = one copy of each tuple of R that didn't match
  • S′ = one copy of each tuple of S that didn't match

Columns from the other table are padded with null.

  • full outer join of R and S is inner join union R′ union S′
  • left outer join of R and S is inner join union R′
  • right outer join of R and S is inner join union S′

Q: Explain the following:

mysql> select count(*) from credit;
+----------+
| count(*) |
+----------+
|      473 |
+----------+
mysql> select count(*) from person natural join credit;
+----------+
| count(*) |
+----------+
|      473 |
+----------+
mysql> select count(*) from movie natural join credit;
+----------+
| count(*) |
+----------+
|      473 |
+----------+
mysql> select count(*) from person natural join credit natural join movie;
+----------+
| count(*) |
+----------+
|      236 |
+----------+
mysql> select count(*) from person,credit,movie where person.nm=credit.nm
and movie.tt=credit.tt;
+----------+
| count(*) |
+----------+
| 473 |
+----------+

Division

The "division" operator is diabolical, but fortunately rare. It also has no natural counterpart in SQL, but we will see how to do queries that demand division. The idea:

  • Relation A has "extra" attributes compared to relation B.
  • You want the tuples from A where the extra attributes cover all the possibilities in B.

Example: Relation R is R(professors,courses) where there is a tuple if professor P teaches course C.

  • R/(πprofessors(R)) finds courses taught by every professor.
  • R/(πcourses(R)) finds professors who have taught every course.

List students who have taken all the CS courses, where there are relations Student(Name,id)Transcript(StudID,crsCode) and Course(title,number,crsCode)

Student/πcrsCodecrsCode='CS%'(Course))

As a query, we list students where there isn't a CS course that they didn't take:

SELECT S.id
FROM Student S
WHERE NOT EXISTS (
      -- all CS courses
      (SELECT C.crsCode
      FROM Course C
      WHERE C.crsCode LIKE 'CS%')
      EXCEPT
      -- This student's courses
      (SELECT R.crsCode
      FROM Transcript R
      WHERE R.StudID = S.id) )

MySQL doesn't seem to have the EXCEPT operator, but we can accomplish the same result using EXISTS and NOT IN:

pub/mysql/division.sql

Query Optimization using Relational Algebra

The idea is to search a space of equivalent algebraic expressions, estimating the run time of each, and trying to find the minimum.

Some equivalences:

  • commutativity of select, which means we can select rows in either order of two conditions:

    σcond Acond B(R)) ≡ σcond Bcond A(R))

  • cascading of select, which means we can split up a conjunctive condition into multiple steps:

    σcond A ∧ B(R) ≡ σcond Acond B(R))

  • cascading of projections, which means we can project out unwanted columns early. Note that X ⊆ Y and both are a list of column names.

    πX(R) ≡ πXY(R))

  • cross products and joins are commutative.

    R × S ≡ S × R

  • cross products and joins are associative:

    R × (S × T) ≡ (S × R) × T

  • selection and projection can be commutative if the relevant attributes are retained:

    πXcond A(R)) ≡ σcond AX(R))

  • selection and cross-product (or join) can be commutative if the selection condition only applies to attributes of one relation.

    σcond A(R × S) ≡ σcond A(R) × S

  • projection is commutative with cross product and join:

    πX(R × S) ≡ πX(R) × πX(S)

  • And others.

Estimation of cost is done essentially with big-O notation, plus estimates of the number of tuples in different relations and in estimated results of sub-queries.

Example:

  • If you think a select will be very narrow and the other relation has an index, do the selection first and then look up the remaining tuples using the index.

    Names of directors of movies named Hamlet

  • If you think a select will be very broad, and both relations are sorted, do the join using the mergesort idea (see below) and ignore the index.

    Names of directors who are under 50 and directed movies released after 1990.

    Movies with a rating of at least 4 and with at least 5 listed cast members.

The Mergesort Idea

To find the intersection or union or difference of two sets, you need to make sure there are no duplicates. In general, we are merging the sets. You can, of course, merge them and remove duplicates afterwards, but it is often faster to remove them as we merge.

One of the most efficient ways to remove duplicates is to sort, so that identical elements end up next to each other. Sorting is O(N log N), and most any other algorithm will end up being O(N2), as you iterate through one set, looking in the other.

The mergesort algorithm has two sorted subsets, and merges them in O(N) time to yield a sorted merger. We can use that idea to produce the union of the two subset, the intersection, and so forth.


Source: http://cs.wellesley.edu/~webdb/lectures/16-Relational-Algebra/
Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 1.0 License.

Last modified: Wednesday, December 19, 2018, 5:40 PM