A History of Data Science

Site: Saylor Academy
Course: CS250: Python for Data Science
Book: A History of Data Science

Description

When learning any new field, context is everything; therefore, we will begin this course by introducing the history of data science. In this way, you will be able to understand how the field became an amalgamation of various areas of science dealing with data in many different forms. In this section, and as the course continues, pay close attention to the various ways data can be represented and analyzed.

A History of Data Science

Chapter Summary

Data Science is a composite of a number of pre-existing disciplines. It is a young profession and academic discipline. The term was first coined in 2001. Its popularity has exploded since 2010, pushed by the need for teams of people to analyze the big data that corporations and governments are collecting. The Google search engine is a classic example of the power of data science.

 

Discussion

Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Math, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data analysis problems.

 

Origins

The term "Data Science" was coined at the beginning of the 21st Century. It is attributed to William S. Cleveland who, in 2001, wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". About a year later, the International Council for Science: Committee on Data for Science and Technology started publishing the CODATA Data Science Journal beginning in April 2002. Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.


Development

During the "dot-com" bubble of 1998-2000, hard drives became very cheap, so corporations and governments started buying large numbers of them. One corollary of Parkinson's Law is that data always expands to fill the disk space available. This "disk-data" interaction is a positive feedback cycle: buying ever more disks leads to accumulating ever more data, which in turn demands still more disks. This cycle produces big data. Big data is a term used to describe data sets so large and complex that they become awkward to work with using conventional database management tools.

Once acquired, we have to do something with the big data besides just storing it. We need big computing architectures. Companies like Google, Yahoo!, and Amazon developed a new computing architecture, which we now call cloud computing. One of the most important inventions within cloud computing is the programming model called MapReduce, which has been codified in the open-source software framework known as Hadoop. We use Hadoop to do big computing on big data in the cloud.

  • The normal computing paradigm is that we move data to the algorithm. For example, we read data off a hard drive and load it into a spreadsheet program to process. The MapReduce computing paradigm is just the opposite: the data are too big to move into the algorithm, so instead we push many copies of the algorithm out to the data.
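To make the idea concrete, here is a minimal sketch of the map-and-reduce pattern written in plain R (not Hadoop itself); the sample lines of text are made up for illustration. The "map" step turns each input line into partial word counts, and the "reduce" step merges the partial counts into totals.

# A toy illustration of the map/reduce pattern in plain R (not Hadoop).
lines <- c("big data needs big computing",
           "data science teams analyze big data")

# Map: turn each line into its own table of word counts.
map_step <- lapply(lines, function(line) table(strsplit(line, " ")[[1]]))

# Reduce: merge the partial counts into one set of totals.
reduce_step <- Reduce(function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}, map_step)

reduce_step   # total count for each word across all lines

In a real Hadoop cluster, the map step would run in parallel on the machines that already hold the data, which is the whole point of moving the algorithm to the data.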

It turns out that Hadoop is difficult to use; it requires advanced computer science capabilities. This opens up a market for the creation of analytics tools - with simpler interfaces - that run on top of Hadoop. This class of tools is called "mass analytic tools" - that is, tools for the analysis of massive data. Examples include "recommender systems," "machine learning," and "complex event processing". These tools, while presenting a simpler interface to Hadoop, have complex mathematical underpinnings that also require specialization.

So, with the advent of mass analytic tools, we need people who understand the tools and can actually do the analysis of big data. We call these people "Data Scientists". They are able to tease out new analytic insights never before possible in the world of small data. The scale of problems solved by analyzing big data is such that no single person can do all the data processing and analytic synthesis required. Therefore, data science is best practiced in teams.

In sum, cheap disks --> big data --> cloud computing --> mass analytic tools --> data scientists --> data science teams --> new analytic insights.


Popularization

Mike Loukides, Vice President of Content Strategy for O'Reilly Media, helped to bring data science into the mainstream vernacular in 2010 with his article "What is data science?" In the last few years, data science has increasingly been associated with the analysis of big data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value from the extremely large data sets being generated by their websites.

There are now several ongoing conferences devoted to big data and data science, such as O'Reilly's Strata Conferences and Greenplum's Data Science Summits.

The job title has similarly become very popular. On one heavily used employment site, the number of job postings for "data scientist" increased more than 10,000 percent between January 2010 and July 2012.

 

Academic Programs

Several universities have begun graduate programs in data science, such as at the Institute for Advanced Analytics at North Carolina State University, the McCormick School of Engineering at Northwestern University, and the now-discontinued six-week summer program at the University of Illinois.

 

Professional Organizations

A few professional organizations have sprung up recently. Data Science Central and Kaggle are two such examples. Kaggle is an interesting case: it crowdsources data science solutions to difficult problems. For example, a company posts a hard problem on Kaggle. Data scientists from around the world sign up with Kaggle, then compete with each other to find the best solution. The company then pays for the best solution. There are over 30,000 data scientists registered with Kaggle.

 

Case Study

In the mid- to late-1990s, AltaVista was the most popular search engine on the internet. It sent "crawlers" to extract the text from all the pages on the web. The crawlers brought the text back to AltaVista, which indexed all of it. So, when a person searched for a keyword, AltaVista could find the web pages that contained that word. AltaVista then presented the results as an ordered list of web pages, with the pages that mentioned the term most frequently at the top. This was a straightforward computer science solution, though, at the time, they solved some very difficult scaling problems.

In the late 1990s, the founders of Google invented a different way to do searches. They combined math, statistics, data engineering, advanced computation, and the hacker spirit to create a search engine that displaced AltaVista. The algorithm is known as PageRank. PageRank looks not only at the words on the page but at the hyperlinks as well. PageRank assumes that an inbound hyperlink is an indicator that some other person thought the page was important enough to link to from their own page. Thus the pages with the most inbound hyperlinks (weighted by the importance of the pages that link to them) end up at the top of the list of search results. PageRank captures human knowledge about web pages, in addition to their content.

Google not only crawled the web, it ingested the web. That is big data. Google then had to calculate the PageRank algorithm across that big data, which requires massive computation. Then it had to make search and search results fast for everyone. Google search is a triumph of data science (though it was not called data science when it started).

 

Assignment/Exercise

Get into groups of 2 or 3 students. Download and install the R programming language on your computer. Help each other get R up and running.

  1. Go to the R website: http://www.r-project.org/
  2. Click on the CRAN mirror link
  3. Click on the Linux, Mac OS X, or Windows link
Linux

Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing in sudo aptitude install r-base or sudo apt-get install r-base, or installing the package r-base using your favorite package manager, for example, Synaptic.

There are also a number of packages extending R for different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, as it depends on r-base.

Installation with apt-get (Debian, Ubuntu, and all Linux distributions based on Debian)

sudo apt-get install r-base
sudo apt-get install r-recommended



Installation with aptitude (Debian, Ubuntu, and all Linux distributions based on Debian)

sudo aptitude install r-base
sudo aptitude install r-recommended

 

 

Mac OS

Installation: Download the disk image (dmg file) and install R.

The default graphical user interface for Mac is much better than the one for Windows. It includes

  • a data frame manager,
  • a history of all commands,
  • a program editor which supports syntax highlighting.

 

Windows

To install R under the Windows operating system you have to download the binaries from the web. First, go to the R-Project website (listed above), click CRAN under the download section on the left panel, and select a mirror site from which to download the required content. The best idea is to pick a mirror close to your actual geographical location, but other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the .exe file, in the form R-x.x.x-win32.exe, where x denotes the actual version of the program. Regardless of the version, the setup has the same steps.

[Screenshots: R for Windows setup wizard, steps 1-5]

As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.

  1. On the welcome screen click Next.
  2. Read or just notice the GNU license, and click Next.
  3. Select the location where R should be installed. If you don't have a preference for a particular location on your hard disk, the default choice will be fine.
  4. During the next step, you can specify which parts of R you want to install. The choices are User installation, Minimal user installation, Full installation, and Custom installation. Notice the required space under the selection panel (it varies between 20 and 66 MB). If you are a beginner in R, choose the default User installation.
  5. In this step, you can choose whether to accept the default startup options. If you accept the defaults, you skip the three "extra" customization steps described below.
  6. You can specify the Start menu folder.
  7. In the next step, you can choose between shortcut options (desktop icon and/or quick launch icon) and specify registry entries.

With these steps, you can customize the R graphical user interface.

  • You can choose whether you want the R graphical user interface to cover the whole screen (MDI) or to run in a smaller window (SDI).
  • You can select the style in which the Help screen is displayed in R. You will use help a lot, so this may be an important decision. It is up to you which style you prefer. Please note that the content of the help files will be the same regardless of your choice; here you specify just the appearance of that particular window.
  • In the next step, you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.

 

Portable R for Windows

If you want to install R on your USB stick, go to the Portable R website. This is useful if you don't have admin rights on a computer. The basic installation requires about 115 MB, but you may need more if you want to install add-on packages.


Source: Calvin Andrus, Jon Cook, and Suresh Sood, https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/A_History_of_Data_Science
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

A Mash-up of Disciplines

Chapter Summary

This is a very quick overview of the eight "parent" disciplines that contribute to the new Data Science discipline. It suggests generic questions that data scientists should ask as they work through solving problems.


Discussion

As mentioned in Chapter 1, Data Science is a mash-up of several different disciplines. We also noted that an individual data scientist is most likely an expert in one or two of these disciplines and proficient in another two or three. There is probably no living person who is an expert in all of these disciplines, and only an extremely rare person would be proficient in five or six of them. This means that data science must be practiced as a team, where, across the membership of the team, there is expertise and proficiency in all the disciplines. Let us explore what these disciplines are and how they contribute to data science.

[Figure: the eight disciplines that contribute to data science]


Data Engineering

Data Engineering is the data part of data science. According to Wikipedia, Data Engineering involves acquiring, ingesting, transforming, storing, and retrieving data. Data engineering also includes adding metadata to the data. Because all these activities are interrelated, a data engineer must solve these issues as an integrated whole. For example, we must understand how we plan to store and retrieve the data in order to create a good ingestion process. Data engineering requires a thorough understanding of the general nature of the data science problems to be solved in order to formulate a robust data acquisition and management plan. Once the plan is well developed, data engineers can begin to implement it into data management systems.

Acquiring - This is the process of laying our hands on the data. The data engineer part of the data scientist needs to ask the questions, "where is the data coming from?", "what does the data look like?", and "how does our team get access to the data?" The data could come from many places, such as RSS feeds, a sensor network, or a preexisting data repository. The data could be numbers, text documents, images, or videos. The data can be collected by the team or purchased from a vendor. For example, if we are going to investigate highways, we could have sensors on a stretch of freeway that measure how fast cars are going. These sensors send us the data as text messages that include the date, time, lane, and speed of every car that crosses the sensors.

Ingesting - This is the process of getting the data from the source into the computer systems we will use for our analysis. The data engineer part of the data scientist needs to ask the questions, "how much data is coming?", "how fast is it coming?", "where are we going to put the data?", "do we have enough disk space for the data?", and "do I need to filter the incoming data in any way?" Data is measured in bytes. A byte is roughly equivalent to one character of a written word. A one-page document is about 1,000 bytes or one kilobyte (1K). For example, if we are going to investigate highways, we could be receiving car speed data at a rate of 10,000 bytes per second for a 1-week period. There are 604,800 seconds in a week. This means you will receive 6,048,000,000 bytes (6 gigabytes) of data in one week. No problem. That will fit on a thumb drive.
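The back-of-the-envelope arithmetic above is easy to check in R; the rate and duration are the hypothetical figures from the highway example.

# Hypothetical ingestion rate from the highway example.
bytes_per_second <- 10000
seconds_per_week <- 60 * 60 * 24 * 7        # 604,800 seconds
total_bytes <- bytes_per_second * seconds_per_week
total_bytes           # 6,048,000,000 bytes
total_bytes / 1e9     # about 6 gigabytes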

Transforming - This is the process of converting the data from the form in which it was collected to the form it needs to be in for our analysis. The data engineer part of the data scientist needs to ask the questions, "what is the form of the raw data?" and "what does the form of the processed data need to be?" A common raw data format is comma-separated values (CSV) which looks like this:

20120709,135214,157,3,57.4
20120709,135523,13,2,62.1

For example, if we are investigating highways, we might receive data that looks like the example above. The fields in the first row are the date, July 9, 2012; the time, 1:52:14 pm; the sensor, #157; the lane, #3; and the speed, 57.4 mph. The data needs to be transformed from CSV format into something akin to a spreadsheet format like the following:

Year Month Day 24-Hour Minute Second Sensor # Lane # MPH
2012 07 09 13 52 14 157 3 57.4
2012 07 09 13 55 23 13 2 62.1

Understanding the various "from" and "to" formats is very important for a data scientist.
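Here is a minimal sketch of that transformation in R. The two raw records are the ones shown above; in practice they would be read from a file rather than typed inline, so treat the variable names as illustrative.

# Raw records in the CSV form shown above (inline here so the sketch runs on its own).
raw_text <- "20120709,135214,157,3,57.4
20120709,135523,13,2,62.1"

raw <- read.csv(text = raw_text, header = FALSE,
                col.names = c("date", "time", "sensor", "lane", "mph"),
                colClasses = c("character", "character",
                               "integer", "integer", "numeric"))

# Split the packed date and time strings into spreadsheet-like columns.
speeds <- data.frame(
  year   = substr(raw$date, 1, 4),
  month  = substr(raw$date, 5, 6),
  day    = substr(raw$date, 7, 8),
  hour   = substr(raw$time, 1, 2),
  minute = substr(raw$time, 3, 4),
  second = substr(raw$time, 5, 6),
  sensor = raw$sensor,
  lane   = raw$lane,
  mph    = raw$mph
)
speeds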

Metadata - Wikipedia says that metadata is commonly called data about data. In our case above, the data is the MPH and the Lane. The Sensor is a proxy for "where" on the surface of the earth the data was collected, and the date and time are data about "when" it was collected. We could add other metadata to our data, like weather conditions at the time and the quality of the road. We could derive other metadata, such as whether it was a weekday, holiday, or weekend, and whether it was a rush hour or not. We might also add metadata that indicates who may see the data under what conditions, like "not for public dissemination until 1 year after collected". Metadata is often added both at ingestion time and at transformation time.
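Derived metadata of this kind can be computed directly from the existing fields. A small sketch, reusing the hypothetical speeds data frame built in the previous example (the rush-hour definition is an assumption for illustration):

# Derive metadata from the fields already in 'speeds' (see the previous sketch).
speeds$obs_date <- as.Date(paste(speeds$year, speeds$month, speeds$day, sep = "-"))
speeds$weekday  <- weekdays(speeds$obs_date)                     # e.g. "Monday"
speeds$weekend  <- speeds$weekday %in% c("Saturday", "Sunday")
speeds$rush     <- as.integer(speeds$hour) %in% c(7, 8, 16, 17)  # assumed rush hours
speeds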

Storing - This is the process of putting the data into a data management system. The data engineer part of the data scientist needs to ask the questions, "what kind of system is best for storing our data?", "how fast will the system be?", and "how much extra space will this system need?" We can store data in files in a file system. File systems are generally very fast but have very little functionality. We can store data in a database. Databases are often slower than a file system but have much more functionality. In our highway example, we might have 60 million lines of data in CSV format (at 100 bytes per line, that would be about 6 gigabytes). We could store it in one big file in the file system. It would be fast to read, but in that format, we could not compute averages by time and location. Alternatively, we could store it in a database, where it would be easy to compute averages by location and time, though it would take more time to read through the data.

Retrieving - This is the process of getting the data back out. The data engineer part of the data scientist needs to ask the questions, "how will we ask questions about our data?" and "how will we display our data?" We can search the data through a query system, and we can display subsets of the data in a table. In our highway example, we might want to search for only those measurements from one sensor during the morning rush hour. We might then want to display a table that shows the average rush hour speed by day. In this case, we would be better off if the data had been stored in a database. Thus, knowing what kind of analysis we want to perform will help us with our data storage strategy.
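A sketch of that retrieval in R, using a week of simulated hourly readings for a single sensor (all values made up):

# One week of simulated hourly average speeds for a single sensor.
set.seed(7)
readings <- data.frame(
  day  = rep(as.Date("2012-07-09") + 0:6, each = 24),
  hour = rep(0:23, times = 7),
  mph  = rnorm(7 * 24, mean = 58, sd = 6)
)

# Keep only the morning rush hour, then average speed by day.
rush <- subset(readings, hour %in% c(7, 8))
aggregate(mph ~ day, data = rush, FUN = mean)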


Scientific Method

The Scientific Method is the scientific part of data science. According to Wikipedia, the Scientific Method is a process for acquiring new knowledge by applying the principles of reasoning to empirical evidence derived from testing hypotheses through repeatable experiments. When a scientist hears someone make an assertion about a fact, they naturally want to know what the evidence is and what the standard of acceptance for that evidence is.

Reasoning Principles - There are two general forms of logical reasoning: inductive and deductive. Simply stated, inductive reasoning arrives at general principles from specific observations, while deductive reasoning arrives at specific conclusions based on general principles. Consider the following two examples:

Inductive argument:

  1. Every life form that we know of depends on liquid water to exist.
  2. Therefore, all life, including life forms not yet discovered, probably depends on liquid water to exist.
Deductive argument:

  1. All men are mortal.
  2. Socrates is a man.
  3. Therefore, Socrates is mortal.
Most of the scientific knowledge we have is based on inductive reasoning. The scientist part of a data scientist needs to ask the question, "what is the reasoning behind a particular conclusion?"

Empirical Evidence - Evidence that is empirical is data produced by observation or experiment. This is in contrast to data that is derived from logical arguments or conclusions that are propagated by myths and legends.

  • The classic example is the trial of Galileo. At the time (1633), the Catholic Church held to Aristotle's logical argument that the earth was the center of the cosmos. Galileo's observations with his newly built telescope provided evidence for Copernicus's assertion that the earth revolved around the sun. The outcome of the trial was that Galileo was sentenced to house arrest for heresy. In 1992, Pope John Paul II formally acknowledged that the Church had erred in condemning Galileo.

The scientist part of a data scientist needs to ask the question, "what is the evidence that leads to a particular conclusion?"

Hypothesis Testing - This process generally asserts two propositions, only one of which can be true. The scientist gathers empirical evidence for and against each proposition, and then accepts one and rejects the other. Often, one of the hypotheses is known as the null hypothesis, and the other as the alternative hypothesis. The null hypothesis is usually a proposition about the way we currently understand the universe to work. The alternative is a proposition about how we think the universe really works. A criminal trial is a classic analogy to understanding hypothesis testing.

  • A defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough incriminating evidence is the defendant convicted. At the start of the procedure, there are two hypotheses: "the defendant is not guilty" and "the defendant is guilty". The first one is called the null hypothesis and is accepted for the time being. The second one is called the alternative hypothesis. It is the hypothesis one tries to prove. The hypothesis of innocence is rejected only when an erroneous conviction is very unlikely, because one doesn't want to convict an innocent defendant.
The scientist part of the data scientist needs to ask the question, "what were the null and alternative hypotheses examined to come to a particular conclusion?"
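As a concrete sketch of the two competing hypotheses, the R fragment below simulates speed readings at two hypothetical sensors and runs a two-sample t-test. The null hypothesis is that the mean speeds are equal; the alternative is that they differ.

# Simulated speeds (mph) at two hypothetical sensors.
set.seed(42)
sensor_a <- rnorm(100, mean = 58, sd = 5)
sensor_b <- rnorm(100, mean = 61, sd = 5)

# H0: the mean speeds are equal.  H1: they are not equal.
t.test(sensor_a, sensor_b)
# A small p-value (conventionally below 0.05) would lead us to reject the null hypothesis.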

Repeatable Experiments - According to Wikipedia, an experiment is a methodical trial and error procedure carried out with the goal of verifying, falsifying, or establishing the validity of a hypothesis. Experiments vary greatly in their goal and scale but always rely on repeatable procedures and logical analysis of the results. A child may carry out basic experiments to understand the nature of gravity, while teams of scientists may take years of systematic investigation to advance the understanding of subatomic particles.

  • One prominent example is the "inclined plane," or "ball and ramp experiment". In this experiment, Galileo used an inclined plane and several steel balls of different weights. With this design, Galileo was able to slow down the falling motion and record, with reasonable accuracy, the times at which a steel ball passed certain markings on a beam. Galileo disproved Aristotle's assertion that weight affects the speed of an object's fall. According to Aristotle's Theory of Falling Bodies, the heavier steel ball would reach the ground before the lighter steel ball. Galileo's hypothesis was that the two balls would reach the ground at the same time.
The scientist part of a data scientist needs to ask the question, "is there enough information about the methods and data of this experiment that I can replicate it?"


Math

Mathematics (along with statistics) is the cerebral part of Data Science. According to Wikipedia, mathematics is the study of quantity, structure, space, and change. When these are used to solve practical problems, it is called applied mathematics.

Quantity - By this we simply mean numbers. The mathematician part of the data scientist needs to ask the questions, "how will the thing I am interested in be represented by numbers?" and "what kind of numbers will best represent the thing I am interested in?" The numbers could be integers, fractions, real numbers, or complex numbers. For example, if we are going to investigate highways, we could measure the length of highways in miles as represented by integers. We also need to think about the kinds of operations we will perform on numbers. We use arithmetic to operate on and represent the quantities in our data.

Structure - Most sets of mathematical objects exhibit internal structure. The mathematician part of the data scientist needs to ask the questions, "what sort of internal structure does the thing I am interested in have?" and "what set of equations will expose the structure?" The structures could be a constant progression like 3, 6, 9, 12, ..., or a simple linear relationship like Y = X + 3. For example, if we are going to investigate highways, we might like to know the structure of speed limits or the structure of lane widths. We use algebra to operate on and represent the structure of our data.

Space - The things we investigate often have some relationship to two- or three-dimensional space. When thinking like a mathematician, a data scientist needs to ask the questions, "does the thing I am interested have a spatial component, either actual or theoretical?" and "how do I capture and represent that spatial component?" The spatial component could be latitude and longitude or it could have a surface that is important. For example, if we are going to investigate highways, we might like to know exactly where particular highway segments are located or how smooth the surface of the highway is. We use geometry and trigonometry to operate on and represent the spatial components of our data.

Change - The things we investigate often change - possibly over time or over distance. The mathematician part of the data scientist needs to ask the questions, "does the relationship between the things I am interested in change?" and "how will I describe the changing relationship?" For example, if we are investigating highways, the sharpness of curves in the road may change with the speed limit at that part of the highway, or the depth of the asphalt may change the number of cars per hour that may safely drive in that location. We use calculus to operate on and represent the changing relationships within our data.
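As a tiny illustration of representing change, the sketch below uses made-up average speeds at successive mile markers and computes the mile-to-mile change with diff(), a discrete stand-in for the derivative that calculus makes exact.

# Made-up average speeds (mph) at successive mile markers.
mile  <- 1:6
speed <- c(55, 57, 60, 52, 48, 50)

diff(speed)                 # change in speed from one mile to the next
diff(speed) / diff(mile)    # rate of change per mile (same here, since markers are 1 mile apart)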

Applied Math - This is math with specialized knowledge. Generally speaking, this is the kind of math that Data Scientists practice.


Statistics

Statistics (along with mathematics) is the cerebral part of Data Science. Wikipedia states that statistics is the study of the collection, organization, analysis, and interpretation of data. It involves methods for exploring data, discovering patterns and relationships, creating models, and making inferences about the future. Statistics is the discipline that has the straightest-line pedigree to data science. The statistician is responsible for understanding the analysis that will be done on the data so that it can be collected and organized appropriately.

Collection - A statistician, working with data engineers, ensures that data generation and collection are undertaken in a way that allows valid conclusions to be drawn. The statistician creates the research design, including, if appropriate, the experimental design, that governs the collection of data. The statistician part of the data scientist needs to ask, "what research procedures will be used to generate the data?"

Organization - A statistician, working with data engineers, ensures that data is coded and archived so that information is retained and made useful not just for analysis internal to the project, but also for sharing with others. The statistician is responsible for creating a data dictionary, which is database neutral. A data engineer would create a database schema, which is database specific, based on the data dictionary compiled by the statistician. The data dictionary specifies the variables, the valid values, and the format of the data. The database schema describes how the particular database management system will store the data. The statistician part of the data scientist needs to ask, "are the data stored in such a way as to facilitate the statistical analysis that will be done?"

Analysis - A statistician, working with a mathematician, summarizes, aggregates, correlates, and creates models of the data. The statistician is an expert in analyzing data using descriptive and inferential statistics. This includes creating summaries of the data (such as averages) as well as testing for differences (is this average significantly higher than that average). The statistician part of the data scientist needs to ask, "given the data, which descriptive and inferential statistics ought to be used to test the hypotheses?"

Interpretation - A statistician, working with both a subject matter expert and a visual artist, reports results and summarized data (tables and graphs) in ways that are comprehensible to those who need to make use of them. The statistician part of the data scientist needs to ask, "who is going to get the results, and what do they want to know?"


Advanced Computing

Advanced computing is the heavy lifting of data science. According to Wikipedia, computer programming (often shortened to programming or coding) is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a set of instructions that computers use to perform specific operations or to exhibit desired behaviors. The process of writing source code often requires expertise in many different subjects, including knowledge of the application domain, specialized algorithms, and formal logic.

Software Design - According to Wikipedia, software design is a process of turning the purpose and specifications of software into a plan that includes low-level components and algorithm implementations in an overall architectural view. Programmers implement the software design by writing source code. Software designers will often use a modeling language, such as UML, to create designs - for example, a UML deployment diagram.

[Figure: UML deployment diagram]

The programmer part of the data scientist needs to ask the question, "what components and algorithms do we need in order to solve the problem we are working on?"

Programming Language - According to Wikipedia, a programming language is an artificial language designed to communicate instructions to a computer. Programming languages are used to create programs that control the behavior of the computer and external devices such as printers, disk drives, and robots. Programs also express algorithms precisely. Programming languages can be thought of as "low-level," such as "assembly languages" that have a nearly one-to-one correspondence to the machine language functions built into the hardware central processing unit (CPU). More commonly, programmers use "high-level" languages, such as Java, Python, and C++, which aggregate many machine-level functions together into human-level functions such as "read data" and "print". The programmer part of the data scientist needs to ask the question, "which programming language should I use to solve the problem at hand?"

Source Code - According to Wikipedia, source code is any collection of computer instructions (with comments) written using some human-readable computer language, usually as text. When executed, the source code is translated into machine code that the computer can directly read and execute. Programmers often use an integrated development environment (IDE) that allows them to type in, debug, and execute the source code. Here are examples of source code for the traditional "Hello World" program as written in Java and Python:

/**
 * Traditional "Hello World" program in Java.
 */
class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}



#
# Traditional "Hello World" program in Python 2.x
#

print "Hello World!"

The programmer part of the data scientist needs to ask the question, "what source code already exists to help solve the problem we are working on?"


Visualization

Visualization is the pretty face of data science. According to Wikipedia, information visualization is the visual representation of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information. Wikipedia also describes graphic design as a creative process undertaken in order to convey a specific message to a targeted audience. A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.

Creative Process - Wikipedia defines creativity as the process of producing something that is both original and worthwhile. The process includes divergent thinking, which involves the generation of multiple answers to a problem; conceptual blending, in which solutions arise from the intersection of two quite different frames of reference; and honing, in which an acceptable solution emerges from iterating over many successive unacceptable versions of the solution. The visual artist part of the data scientist needs to ask, "what are several different ways we can show this data?" and "how can we improve this visualization over the next several iterations?"

Data Abstraction - Wikipedia defines data abstraction as handling data bits in meaningful ways. This implies that we do not want to visualize all the raw data, but that we need to visualize manipulations of the data (aggregations, summarizations, correlations, predictions) that are meaningful in the context of the problem we are trying to solve. The visual artist part of the data scientist needs to ask, "how can we simplify the content of the data so it can be visualized meaningfully?"

Informationally Interesting - According to Wiktionary, humans pay attention to things that are interesting and/or attractive. Something that is attractive or beautiful is pleasing to the senses. While beauty is in the eye of the beholder, there are some more or less agreed-upon principles of beauty, such as symmetry and harmony. Surprise, within the context of harmony, is especially interesting to humans. The visual artist part of the data scientist needs to ask, "how can we visualize the content of the data so it is pleasing, with a touch of surprise?"
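As a small sketch of putting abstraction before decoration, the R fragment below reduces a pile of made-up speed readings to one average per hour and plots only that summary, rather than every raw point.

# Made-up hourly speed readings; slower speeds during assumed rush hours.
set.seed(1)
hour <- rep(6:20, each = 50)
mph  <- rnorm(length(hour), mean = 60 - 5 * (hour %in% c(7, 8, 16, 17)), sd = 4)

# Abstract first (one average per hour), then visualize the abstraction.
avg_by_hour <- tapply(mph, hour, mean)
barplot(avg_by_hour, xlab = "Hour of day", ylab = "Average speed (mph)",
        main = "Average speed by hour")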

Consider the following graphic. It is a partial map of the Internet in early 2005. Each line represents two IP addresses. Notice that it abstracts only a subset of data about the internet. It clearly went through a number of iterations to arrive at such a harmonious color scheme. It has an overall symmetry, with some surprises in the details (the bright "stars"). Finally, it is meaningful in the context of understanding the World Wide Web.

[Figure: partial map of the Internet, early 2005]

Hacker mindset

Hacking is the secret sauce of data science. According to Wikipedia, hacking is modifying one's own computer system, including building, rebuilding, modifying, and creating software, electronic hardware, or peripherals, in order to make it better, make it faster, give it added features, and/or make it do something it was never intended to do. For the data scientist, hacking goes beyond the computer system to the whole enterprise of solving data problems. Think of it as an advanced do-it-yourself (DIY) mode of working.

Data science hacking involves inventing new models, exploring new data structures, and mashing up the eight parent disciplines in unconventional ways. Hacking requires boldness, creativity, vision, and persistence. Here are two examples. (Even though they involve hardware, they are presented because they are readily understandable in a few sentences. More complex data science examples are given in Chapter 4.)

[Photo: Steve Wozniak's hand-built Apple I computer]

  • A famous example is Steve Wozniak's hand-made Apple I computer. It was built from parts scrounged from Hewlett-Packard's trash and from electronic surplus supply stores. Wozniak wanted to give the plans away, but his partner, Steve Jobs, convinced him that they should sell ready-made machines. The rest, as they say, is history.
  • Another example is the Carnegie Mellon Internet Coke Machine. In the early days of the internet before the web, students at Carnegie Mellon instrumented and wired their local Coke Machine to the internet. The students could check to see which internal dispenser columns had been loaded most recently, so they could be sure to buy cold, not warm, sodas. This was important because the machine sold one Coke every 12 minutes and was re-loaded several times a day.
Data scientists often need the data equivalent of a hackerspace, where they can congregate to help each other invent new analytic solutions. The hacker part of a data scientist needs to ask, "do we need to modify our tools or create anything new to solve our problem?" and "how do we combine our different disciplines to come up with an insightful conclusion?"


Domain Expertise

Domain Expertise is the glue that holds data science together. According to Wikipedia, subject matter or domain expertise is proficiency, with special knowledge or skills, in a particular area or topic. Spoken references to subject matter experts sometimes spell out the acronym "SME" ("S-M-E") and other times it is voiced as a word ("smee"). Any domain of knowledge can be subject to a data science inquiry, including - but not limited to - medicine, politics, the physical and biological sciences, marketing, information security, demographics, and even literature. Every data science team must include at least one person who is a subject matter expert on the problem being solved.

Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know, and how best to package the knowledge so it can be readily absorbed by their customers. For example,

  • Edwin Chen, a data scientist at Twitter, computed and visualized the geographic distribution of tweets that refer to soft drinks as "soda," as "pop," and as "coke". Just observing that the Midwest uses "pop" and the Northeast uses "soda" is interesting but lacks an explanation. In order to understand WHY these geographic divisions exist, we would need to consult with domain experts in sociology, linguistics, US history, and maybe anthropology - none of whom may know anything about data science. Why do you think these geographic linguistic differences exist?
  • Nate Silver is a statistician and domain expert in US politics. His blog regularly combines both the data and an explanation of what it means. In his posting, "How Romney’s Pick of a Running Mate Could Sway the Outcome," he not only tells us what the differences are based on his mathematical model, he explains why those outcomes fell out the way they did.
The domain expert part of the data scientist needs to ask, "what is important about the problem we are solving?" and "what exactly should our customers know about our findings?"


Assignment/Exercise

Become familiar with the R programming environment. Get into a group of 3 to 4 students from the class. Work in study sessions together as a team on the following items. See if you can explain to each other what you are doing. Help each other understand what is going on. You will have to try some things several ways until it works right. That is ok. Some of you will "get it" faster than others. Please help each other so you all "get it".

  • Print a copy and read over Google's R Style Guide. Right now, most of the guide will not make a lot of sense, but it will make more sense as we progress through the book. Keep the printed copy for future reference.
  • Search the web for "introduction to R," "R tutorial," "R basics," and "list of R commands". Pick four or five of these websites to work on. Try working through the first few examples of each site. Many of the introductions go too fast or assume too much prior knowledge, so when it gets too confusing just try another site.
  • Try the commands:
  • library(help="utils")
  • library(help="stats")
  • library(help="datasets")
  • library(help="graphics")
  • demo()
  • demo(graphics)
  • demo(persp)
  • Write a short 5- to 7-line program that will execute without errors and save it. Be sure to include the names of all those who contributed in the comment section.
  • Make a list of the sites the team worked from, and indicate which was the most helpful.
  • Make a list of the top 10 unanswered questions the team has at the end of the study session.

Definitions of Data

Chapter Summary

The word "data" is a general purpose word denoting a collection of measurements. "Data points" refer to individual instances of data. A "data set" is a well-structured set of data points. Data points can be of several "data types," such as numbers, text, or date-times. When we collect data on similar objects in similar formats, we bundle the data points into a "variable". We could give a variable a name such as 'age,' which could represent the list of ages of everyone in a room. The data points associated with a variable are called the "values" of the variable. These concepts are foundational to understanding data science. There is some quirkiness in the way variables are treated in the R programming language.


Discussion

What is Data?
Wiktionary defines data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another.

Wiktionary defines a datum as a measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device) - the scale being arbitrarily defined, such as 1 to 10 by ones, 1 to 100 by tenths, or simply true or false, on or off, yes, no, or maybe - and as a fact known from direct observation.

For our purposes, the key components of these definitions are that data are observations that are measured and communicated in such a way as to be intelligible to both the recorder and the reader. So, you as a person are not data, but recorded observations about you are data. For example, your name when written down is data; or the digital recording of you speaking your name is data, or a digital photograph of your face or video of you dancing is data.

What is a Data Point?
Rather than use the formal word "datum" for a single measurement, we will use what Wikipedia calls a data point. We may talk about a single data point or several data points. Just remember that when we talk of "data," what we mean is a set of aggregated data points.

What is a Data Set?
The Wiktionary, unhelpfully, defines a data set as a "set of data". Let us define a data set as a collection of data points that have been observed on similar objects and formatted in similar ways. Thus, a compilation of the written names and the written ages of a room full of people is a data set. In computing, a data set is stored in a file on a disk. Storing the data set in a file makes it accessible for analysis.

What are Data Types?
As illustrated earlier, data can exist in many forms, such as text, numbers, images, audio, and video. People who work with data have taken great care to very specifically define different data types. They do this because they want to compute various operations on the data, and those operations only make sense for particular data types. For example, addition is an operation we can compute on integer data types (2+2=4), but not on text data types ("two"+"two"=???). Concatenation is an operation we can compute on text. To concatenate means to put together, so concatenate("two", "two") = "twotwo". For the purposes of this introduction, we will just concern ourselves with simple numeric and simple text data types and leave more complex data types - like images, audio, and video - to more advanced courses. Data scientists use various data types from mathematics, statistics, and computer science to communicate with each other.
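In R, for instance, the difference between the two operations looks like this (a quick sketch):

# Addition works on numbers but not on character strings.
2 + 2                    # 4
# "two" + "two"          # error: non-numeric argument to binary operator
paste0("two", "two")     # "twotwo" - concatenation
paste0("2", "2")         # "22"     - still text, not the number 22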

Data Types in Mathematics

We will introduce just the most commonly used data types in Mathematics. There are many more, but we'll save those for more advanced courses.

  1. Integers - According to Wikipedia, integers are numbers that can be written without a fractional or decimal component, and fall within the set {..., −2, −1, 0, 1, 2, ...}. For example, 21, 4, and −2048 are integers; 9.75, 5½, and √2 are not integers.
  2. Rational Numbers - According to Wikipedia, rational numbers are those that can be expressed as the quotient or fraction p/q of two integers, with the denominator q not equal to zero. Since q may be equal to 1, every integer is a rational number. The decimal expansion of a rational number always either terminates after a finite number of digits or begins to repeat the same finite sequence of digits over and over. For example, 9.75, 2/3, and 5.8144144144… are rational numbers.
  3. Real Numbers - According to Wikipedia, real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, plus all the irrational numbers such as √2 (1.41421356..., the square root of two), π (3.14159265...), and e (2.71828...).
  4. Imaginary Numbers - According to Wikipedia, imaginary numbers are those whose square is less than or equal to zero. For example, √(−25) is an imaginary number and its square is −25. An imaginary number can be written as a real number multiplied by the imaginary unit i, which is defined by its property i^2 = −1. Thus, √(−25) = 5i.
Data scientists understand that the kind of mathematical operations they may perform depends on the data types reflected in their data.

Data Types in Statistics

We will introduce just the most commonly used data types in statistics, as defined in Wikipedia. There are a few more data types in statistics, but we'll save those for more advanced courses.

  1. Nominal - Nominal data are recorded as categories. For this reason, nominal data is also known as categorical data. For example, rocks can be generally categorized as igneous, sedimentary, and metamorphic.
  2. Ordinal - Ordinal data are recorded in the rank order of scores (1st, 2nd, 3rd, etc). An example of ordinal data is the result of a horse race, which says only which horses arrived first, second, or third but includes no information about race times.
  3. Interval - Interval data are recorded not just about the order of the data points, but also the size of the intervals in between data points. A highly familiar example of interval scale measurement is temperature with the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the temperature difference between the freezing and boiling points of water. The zero point, however, is arbitrary.
  4. Ratio - Ratio data are recorded on an interval scale with a true zero point. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value. For example, the Kelvin temperature scale has a non-arbitrary zero point of absolute zero.
Data scientists know that the kind of statistical analysis they will perform is determined by the kinds of data types they will be analyzing.

Data Types in Computer Science

We will introduce just the most commonly used data types in Computer Science, as defined in Wikipedia. There are many more, but we'll save those for more advanced courses.

  1. Bit - A bit (a contraction of a binary digit) is the basic unit of information in computing and telecommunications; a bit represents either 1 or 0 (one or zero) only. This kind of data is sometimes also called binary data. When 8 bits are grouped together we call that a byte. A byte can have values in the range 0-255 (00000000-11111111). For example, the byte 10110100 = 180.
    • Hexadecimal - Bytes are often represented as base-16 numbers. Base 16 is known as hexadecimal (commonly shortened to hex). Hex uses sixteen distinct symbols, most often 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a–f) to represent values ten to fifteen. Each hexadecimal digit represents four bits, so two hex digits fully represent one byte. As we mentioned, byte values can range from 0 to 255 (decimal), but may be more conveniently represented as two hexadecimal digits in the range 00 to FF. A two-byte number would also be called a 16-bit number. Rather than writing a number as 16 bits (0010101011110011), we would write it as 2AF3 (hex) or 10995 (decimal). With practice, computer scientists become proficient in reading and thinking in hex. Data scientists must understand and recognize hex numbers. There are many websites that will translate numbers between binary, decimal, and hexadecimal (see also the short R sketch after this list).
  2. Boolean - The Boolean data type encodes logical data, which has just two values (usually denoted "true" and "false"). It is intended to represent the truth values of logic and Boolean algebra. It is used to store the evaluation of the logical truth of an expression. Typically, two values are compared using logical operators such as .eq. (equal to), .gt. (greater than), and .le. (less than or equal to). For example, b = (x .eq. y) would assign the boolean value of "true" to "b" if the value of "x" was the same as the value of "y," otherwise it would assign the logical value of "false" to "b".
  3. Alphanumeric - This data type stores sequences of characters (a-z, A-Z, 0-9, and special characters) in a string, drawn from a character set such as ASCII for Western languages or Unicode for the world's other writing systems. Because most character sets include numeric digits, it is possible to have a string such as "1234". However, this would still be an alphanumeric value, not the integer value 1234.
  4. Integers - This data type has essentially the same definition as the mathematical data type of the same name. In computer science, however, an integer can be either signed or unsigned. Consider a 16-bit (two-byte) integer. In its unsigned form, it can have values from 0 to 65535 (2^16 − 1). However, if we reserve one bit for the sign, then the range becomes −32768 to +32767 (8000 to 7FFF in hex, using two's complement representation).
  5. Floating Point - This data type is a method of representing real numbers in a way that can support a wide range of values. The term floating point refers to the fact that the decimal point can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. In scientific notation, the given number is scaled by a power of 10 so that it lies within a certain range - typically between 1 and 10, with the decimal point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 × 10^5 seconds. Floating-point representation is similar in concept to scientific notation. The base part of the number is called the significand (or sometimes the mantissa) and the exponent part of the number is unsurprisingly called the exponent.

    • The two most common ways in which floating point numbers are represented are either in 32-bit (4 byte) single precision or in 64-bit (8 byte) double precision. Single precision devotes 24 bits (about 7 decimal digits) to its significand. Double precision devotes 53 bits (about 16 decimal digits) to its significand.
  6. List - This data type is used to represent complex data structures. In its most simple form, it has a key-value pair structure. For example, think of a to-do list:
Key Value
1 Get haircut
2 Buy groceries
3 Take shower

Lists can, and often do, become very complex. The keys do not have to be numeric but could be words, such as "one," "two," and "three". The values do not have to be a single data point. The value could be a series of numbers, a matrix of numbers, or a paragraph. For example, the first key in a list could be "Romeo and Juliet," and the first value in the list could be the entire play of Romeo and Juliet. The second key in the list could be "Macbeth," and the second value in the list could be the entire play of Macbeth. Finally, a value in a list could even be another list. At this point, do not go down the rabbit hole of "a list within a list within a list…". We will leave that to graduate students in computer science.
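As promised in the hexadecimal note above, here is a short sketch of moving between decimal, hexadecimal, and binary in R, using the 10995 / 2AF3 and 180 / 10110100 values from that note.

# Convert between decimal and hexadecimal.
strtoi("2AF3", base = 16L)     # 10995
sprintf("%X", 10995)           # "2AF3"
as.hexmode(180)                # "b4" - the byte 10110100 from the example

# Show 10995 as a 16-bit binary string.
paste(rev(as.integer(intToBits(10995))[1:16]), collapse = "")    # "0010101011110011"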

Data scientists understand the importance of how data is represented in computer science because it affects the results they are generating. This is especially true when small rounding errors accumulate over a large number of iterations.
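A classic demonstration in R: the decimal fraction 0.1 has no exact binary floating-point representation, and the tiny error grows when the operation is repeated many times. This is a generic illustration, not tied to any particular data set.

# 0.1 cannot be represented exactly in binary floating point.
0.1 + 0.2 == 0.3                  # FALSE
print(0.1 + 0.2, digits = 17)     # 0.30000000000000004

# The tiny error accumulates over many iterations.
x <- 0
for (i in 1:1000000) x <- x + 0.1
x - 100000                        # small, but not exactly zero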

Data Types in R

There are at least 24 data types in the R language. We will just introduce you to the 9 most commonly used data types. As you will see, they are a blend of the data types that exist in Mathematics, Statistics, and Computer Science - just what a data scientist would expect. The nine are:

  1. NULL - for something that is nothing
  2. logical - for something that is either TRUE or FALSE (on or off; 1 or 0)
  3. character - for alphanumeric strings
  4. integer - for positive, negative, and zero whole numbers (no decimal place)
  5. double - for real numbers (with a decimal place)
  6. complex - for complex numbers that have both real and imaginary parts (e.g., square root of -1)
  7. date - for dates only
  8. POSIX - for dates and times (internally represented as the number of seconds since midnight on 1970-01-01, with negative values for earlier times)
  9. list - for storing complex data structures, including the output of most of the built-in R functions
You can get R to tell you what type a particular data object is by using the typeof() command. If you want to know what a particular data object was called in the original definition of the S language you can use the mode() command. If you want to know what object class a particular data object is in the C programming language that was used to write R, you can use the class() command. For the purposes of this book, we will mostly use the typeof() command.

  • Just a note about lists in R. R likes to use the list data type to store the output of various procedures. We generally do not perform statistical procedures on data stored in list data types - with one big exception. In order to do statistical analysis on lists, we need to convert them to tables with rows and columns. R has a number of functions to move data back and forth between table-like structures and list data types. The exception we just referred to is the data.frame object. List objects of the class data.frame store rows and columns of data in such a specifically defined way as to facilitate statistical analysis. We will explain data frames in more detail below.
Data scientists must know exactly how their data are being represented in the analysis package, so they can apply the correct mathematical operations and statistical analysis.


What are Variables and Values?

Let us start by noting that the opposite of a variable is a constant. If we declare that the symbol "X" is a constant and assign it a value of 5, then X = 5. It does not change; X will always be equal to 5. Now, if we declare the symbol "Y" to be a variable, that means Y can have more than one value (see the Wiktionary entry for "variable"). For example, in the mathematical equation Y^2 = 4 (Y squared equals 4), the variable Y can have the value 2 or -2 and satisfy the equation.

Imagine we take a piece of paper and make two columns. At the top of the first column we put the label "name" and at the top of the second column we put the label "age". We then ask a room full of 20 people to each write down their name and age on the sheet of paper in the appropriate columns. We will end up with a list of 20 names and 20 ages. Let us use the label "name" to represent the entire list of 20 names and the label "age" to represent the entire list of 20 ages. This is what we mean by the term "variable". The variable "name" has 20 data points (the list of 20 names), and the variable "age" has 20 data points (the list of 20 ages). A variable is a symbol that represents multiple data points which we also call values. Other words that have approximately the same meaning as "value" are measurement and observation. Data scientists use these four terms (data point, value, measurement, and observation) interchangeably when they communicate with each other.

The word "variable" is a general-purpose word used in many disciplines. However, various disciplines also use more technical terms that mean approximately the same thing. In mathematics another word that approximates the meaning of the term "variable" is vector. In computer science, another word that approximates the meaning of the term "variable" is array. In statistics, another word that approximates the meaning of the term "variable" is distribution. Data scientists will often use these four words (variable, vector, array, and distribution) interchangeably when they communicate with each other.

Let us think again of the term data set (defined above). A data set is usually two or more variables (and their associated values) combined together. Once our data is organized into variables, combined into a data set, and stored in a file on a disk, it is ready to be analyzed.

The R programming language is a little quirky when it comes to data types, variables, and data sets. In R we sometimes use the term "vector" instead of "variable". When we combine and store multiple vectors (variables) into a data set in R, we call it a data frame. When R stores vectors in a data frame, it assigns each one a role to indicate how the data will be used in subsequent statistical analyses. So in R data frames, for example, categorical "character" data are assigned the role of Factor (by default in older versions of R), while "logical" and "date/time" columns keep their own roles (logi, Date, or POSIXct). The "double" data type is assigned the role of num, and "integers" are assigned the role of int. (The "complex" data type is assigned the role of "cplx," but don't worry about that now.) These roles correspond to the statistical data types as follows: Factor = nominal, int = ordinal, and num = interval. (We usually transform the ratio data type into an interval data type before doing statistical analysis, normally by taking the logarithm of the ratio data. More on this in later chapters.) We can discover the role each variable will play within a data frame by using the structure command in R: str(). We will explain what "factors" are in later chapters.


Assignment/Exercise

This assignment should be done in a group of 3 or 4 students. The groups need to be composed of different people from the previous two homework groups. Everyone should interact with the R programming language. The group can help each other both learn the concepts and figure out how to make R work. Practice with R by trying out different ways of using the commands described below.

Find Data Types in R
Use the typeof() command to verify data types. See if you can guess what the output will look like before you press the enter key.

 > a <- as.integer(1)
 > typeof(a)
 > a

 > b <- as.double(1)
 > typeof(b)
 > b

 > d <- as.character(1)
 > typeof(d)
 > d

 > e <- as.logical("true")
 > typeof(e)
 > e

 > f <- as.complex(-25)
 > typeof(f)
 > f

 > g <- as.null(0)
 > typeof(g)
 > g

 > h <- as.Date("2012-07-04")
 > typeof(h)
 > class(h)
 > h

 > i <- as.POSIXct("2012/07/04 10:15:59")
 > typeof(i)
 > class(i)
 > i

 > j <- as.POSIXlt("2012/07/04 10:15:59")
 > typeof(j)
 > class(j)
 > j

 > k <- list("Get haircut", "Buy Groceries", "Take shower")
 > typeof(k)
 > k

If you don't explicitly specify a data type through the as.* commands, R tries to figure out what data type you intended. It does not always guess correctly. Play around with R, assigning some values to some variables, and then use the typeof() command to see the automatic assignments of data types that R made for you. Then see if you can convert a value from one data type to another.
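
For example (a minimal sketch; the object names below are just illustrative):

 > m <- 1           # R guesses "double", not "integer"
 > typeof(m)
 > n <- "1"         # the quotes make this a character string
 > typeof(n)
 > as.integer(n)    # converts the character "1" into the integer 1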

Objects, Variables, Values, and Vectors in R
The R language has object-oriented features, so things in R are called objects. So, when we assign a value to the letter "X," in R we would say we have assigned a value to the object "X". Objects in R may have different properties from each other, depending on how they are used. For this exercise, we will concern ourselves with objects that behave like variables. Those types of objects are called vector objects. So, when we talk - in the language of data science - about the variable "X," in R we could call it the vector "X". As you remember, a variable is something that varies. Let's create a character vector in R and assign it three values using the concatenate command, c(). Let's also create an integer vector using the same concatenate command.

 > name <- c("Maria", "Fred", "Sakura") 
 > typeof(name)
 > name

 > age <- as.integer(c(24,19,21))
 > typeof(age)
 > age

Both vectors now have three values each. The character string "Maria" is in the first position of the vector "name," "Fred" is in the second position, and "Sakura" is in the third position. Similarly, the integer 24 is in the first position of the vector "age," 19 is in the second position, and 21 is in the third position. Let's examine each of these individually.

 > name[1]
 > name[2]
 > name[3]
 > age[1]
 > age[2]
 > age[3]

The number within the brackets is called the index or the subscript.
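
The brackets will also accept more than one index at a time (a small sketch using the vectors created above):

 > name[c(1, 3)]   # the first and third names: "Maria" "Sakura"
 > age[2:3]        # the second and third ages: 19 21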


Data Sets and Data Frames

If we had observed the actual names and ages of three people so that name[1] corresponded to age[1], we would have a data set that looks like the following.

 Name     Age
 Maria    24
 Fred     19
 Sakura   21

Let us put our data set into an R data frame object. We need to think of a name for our data frame object; let's call it "project". After we put our data set into the data frame, we will inspect it using R's typeof(), class(), ls(), and structure, str(), commands. Remember, R is case-sensitive: upper and lower case letters are meaningful. (If you are running R 4.0.0 or later, add stringsAsFactors = TRUE to the data.frame() call below so that the Factor behavior described in the rest of this section appears; older versions of R did this by default.)

 > project <- data.frame(name, age)
 > typeof(project)
 > class(project)
 > ls(project)
 > str(project)

The typeof() function told us we had created a list object. The class() function told us it is a special type of list object known as a data.frame. The ls() function tells us what "key-value" pairs exist inside our list object. Please don't worry too much about all of that detail right now. What is important is what the str() function tells us.

The structure command tells us we have three observations and two variables. That is great. It tells us the names of the variables are $name and $age. This tells us that when we put a data set into an R data frame list object, we need to reference each variable WITHIN the data frame as follows: project$name and project$age. The structure command also tells us that project$name was assigned the role of a "Factor" variable and that project$age was assigned the role of "int". These correspond to the "nominal" and "ordinal" data types that statisticians use. R needs to know the role variables play in order to perform the correct statistical functions on the data. One might argue that the age variable is more like the statistical interval data type than the statistical ordinal data type. We would then have to change the R data type from integer to double, which changes its role to "num" within the data frame.

Rather than change the data type of project$age, it is good practice to create a new variable so the original is not lost. We will call the new variable project$age.n, so we can tell it is the transformed version of project$age.

 > project$age.n <- as.double(project$age)
 > str(project)

We can now see that project$age and the project$age.n variables play different roles in the data frame, one as "int" and one as "num". Now, confirm that the complete data set has been properly implemented in R by displaying the data frame object.

 > project
     name age age.n
 1  Maria  24    24
 2   Fred  19    19
 3 Sakura  21    21

Now let's double check the data types.

 > typeof(project$name)
 > typeof(project$age)
 > typeof(project$age.n)

Whoops! We see some of the quirkiness of R. When we created the variable "name," it had a data type of "character". When we put it into a data frame, R not only assigned it the role of a "Factor" but also changed its data type to "integer". What is going on here? This is more than you need to know right now; we will explain it here, but you do not really have to understand it until later.

  • Because all statistical computations are done on numbers, R gave each value of the variable "name" an arbitrary integer number. It calls these arbitrary numbers levels. It then labeled these levels with the original values, so we would know what is going on. So, under the covers, project$name has the values 2 (labeled "Maria"), 1 (labeled "Fred"), and 3 (labeled "Sakura"). We can convert project$name back into the character data type, but then we won't be able to perform statistical calculations on it.
 > project$name.c <- as.character(project$name)
 > typeof(project$name.c)
 > str(project)
 'data.frame':    3 obs. of  4 variables:
  $ name  : Factor w/ 3 levels "Fred","Maria",..: 2 1 3
  $ age   : int  24 19 21
  $ age.n : num  24 19 21
  $ name.c: chr  "Maria" "Fred" "Sakura"

We can now see that project$name.c has a data type of character, and has been assigned a data frame role of "chr."
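
If you want to peek at the arbitrary level numbers described in the note above (assuming project$name is still stored as a Factor), the levels() function will list them - a quick sketch:

 > levels(project$name)       # "Fred"  "Maria"  "Sakura"
 > as.integer(project$name)   # 2 1 3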

The Impact of Data Science

Chapter Summary

In this chapter, we explore how data science has revolutionized three different areas of our world: Baseball, Health, and Robotics.


Discussion

Moneyball

According to Wikipedia, Moneyball is a 2003 book by Michael Lewis about the Oakland Athletics baseball team and its general manager Billy Beane. Its focus is the team's analytical, evidence-based, sabermetric approach to assembling a competitive baseball team despite Oakland's disadvantaged revenue situation. A film based on the book, starring Brad Pitt, was released in 2011.

[Image: Moneyball chart]

The central premise of Moneyball is that the collected wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game and the statistics that were available at the time. The book argues that the Oakland A's' front office took advantage of more analytical gauges of player performance to field a team that could compete successfully against richer competitors in Major League Baseball (MLB).

Rigorous statistical analysis had demonstrated that on-base percentage and slugging percentage are better indicators of offensive success, and the A's became convinced that these qualities were cheaper to obtain on the open market than more historically valued qualities such as speed and contact. These observations often flew in the face of conventional baseball wisdom and the beliefs of many baseball scouts and executives.

  • By re-evaluating the strategies that produce wins on the field, the 2002 Oakland Athletics, with approximately $41 million in salary, were competitive with larger-market teams such as the New York Yankees, who spent over $125 million in payroll that same season. Because of the team's smaller revenues, Oakland was forced to find players undervalued by the market, and its system for finding value in undervalued players has proven itself thus far.
Several themes Lewis explored in the book include insiders vs. outsiders (established traditionalists vs. upstart proponents of sabermetrics), the democratization of information causing a flattening of hierarchies, and "the ruthless drive for efficiency that capitalism demands". The book also touches on Oakland's underlying economic need to stay ahead of the curve; as other teams begin mirroring Beane's strategies to evaluate offensive talent, diminishing the Athletics' advantage, Oakland begins looking for other undervalued baseball skills such as defensive capabilities.

Moneyball also touches on the A's methods of prospect selection. Sabermetricians argue that a college baseball player's chance of MLB success is much higher than that of a traditional high school draft pick. Beane maintains that high draft picks spent on high school prospects, regardless of talent or physical potential as evaluated by traditional scouting, are riskier than picks spent on more polished college players. Lewis cites A's minor leaguer Jeremy Bonderman, drafted out of high school in 2001 over Beane's objections, as but one example of precisely the type of draft pick Beane would avoid. Bonderman had all of the traditional "tools" that scouts look for, but thousands of such players have been signed by MLB organizations out of high school over the years and failed to develop. Lewis explores the A's approach to the 2002 MLB Draft, when the team had a nearly unprecedented run of early picks. The book documents Beane's often-tense discussions with his scouting staff (who favored traditional subjective evaluation of potential rather than objective sabermetrics), from the preparation for the draft through the draft itself, which defied all expectations and was considered at the time a wildly successful (if unorthodox) effort by Beane.

  • In addition, Moneyball traces the history of the sabermetric movement back to such people as Bill James (now a member of the Boston Red Sox front office) and Craig R. Wright. Lewis explores how James' seminal Baseball Abstract, an annual publication that was published from the late 1970s through the late 1980s, influenced many of the young, up-and-coming baseball minds that are now joining the ranks of baseball management.
Moneyball has made such an impact in professional baseball that the term itself has entered the lexicon of baseball. Teams that appear to value the concepts of sabermetrics are often said to be playing "Moneyball". Baseball traditionalists, in particular some scouts and media members, decry the sabermetric revolution and have disparaged Moneyball for emphasizing concepts of sabermetrics over more traditional methods of player evaluation. Nevertheless, the impact of Moneyball upon major league front offices is undeniable.

  • In its wake, teams such as the New York Mets, New York Yankees, San Diego Padres, St. Louis Cardinals, Boston Red Sox, Washington Nationals, Arizona Diamondbacks, Cleveland Indians, and the Toronto Blue Jays have hired full-time sabermetric data scientists.
Since the book's publication and success, Lewis has discussed plans for a sequel to Moneyball called Underdogs, revisiting the players and their relative success several years into their careers. When the New York Mets hired Sandy Alderson – Beane's predecessor and mentor with the A's – as their general manager after the 2010 season and hired Beane's former associates Paul DePodesta and J.P. Ricciardi to the front office, the team became known as the "Moneyball Mets". Michael Lewis has acknowledged that the book's success may have negatively affected the Athletics' fortunes as other teams have accepted the use of sabermetrics, reducing the edge that Oakland received from using sabermetric-based evaluations.


23andMe

According to Wikipedia, 23andMe is a privately held personal genomics and biotechnology company based in Mountain View, California, that provides rapid genetic testing. The company is named for the 23 pairs of chromosomes in a normal human cell. Its personal genome test kit was named "Invention of the Year" by Time magazine in 2008. The company was founded by Linda Avey and Anne Wojcicki after both recognized the need for a way to organize and study genetic data, the possibility for individual consumers to use the information, and the need for expertise to interpret the results.

[Image: animated DNA structure]

23andMe began offering DNA testing services in November 2007; the results are posted online and allow the subscriber to view an assessment of inherited traits, genealogy, and possible congenital risk factors. Customers provide a 2.5 mL saliva sample, which is analyzed on an Illumina DNA microarray for 960,000 specific single-nucleotide polymorphisms (SNPs). An eventual goal is to provide whole genome sequencing.

  • In June 2011, 23andMe announced it had accumulated a database of more than 100,000 individuals.

The organization also provides testing for certain research initiatives, supplying confidential customer datasets to, and partnering with, research foundations with the goal of establishing genetic associations with specific illnesses and disorders. Google co-founder Sergey Brin (whose mother suffers from Parkinson's disease and who is married to 23andMe co-founder Anne Wojcicki) underwrote the cost of the company's Parkinson's Disease Genetics Initiative to provide free testing for people suffering from the condition. An analysis of the results of research on Parkinson's disease comparing 23andMe with a National Institutes of Health (NIH) initiative suggested that the company's use of large amounts of computational power and data sets might offer comparable results, though in much less time.

The company gathers personal and social data from its subscribers via online surveys. Personal data includes a person's health history, environmental history, and such things as the ability to smell certain odors. The social data includes family histories and the sorts of activities one participates in. The company employs a number of data scientists to work on this wealth of data - a million genetic variables and many hundreds of personal and social variables for over 100,000 people.

  • The company's data scientists are able to correlate and cluster certain personal and social behaviors with genetic markers. When these correlations are shown to be significant, they report the results back to the subscribers, indicating that certain personal or social aspects of their lives may, indeed, have a genetic basis. They also publish results in scientific journals.
(If the instructor has a 23andMe account, she could log on and project the website to show the results of the 23andMe data scientists to the students in the class.)


Google's Driverless Car

According to Wikipedia, Google's Driverless Car is a project by Google engineer Sebastian Thrun, director of the Stanford Artificial Intelligence Laboratory and co-inventor of Google Street View. Thrun's team at Stanford created the robotic vehicle Stanley which won the 2005 DARPA Grand Challenge and its $2 million prize from the United States Department of Defense.

[Image: a Google driverless car]

The U.S. state of Nevada passed a law in June 2011 permitting the operation of driverless cars in Nevada. Google had been lobbying for driverless car laws. Google executives have not stated the precise reason they chose Nevada to be the maiden state for the driverless car. The Nevada law went into effect on March 1, 2012, and the Nevada Department of Motor Vehicles issued the first license for a self-driven car in May 2012. The license was issued to a Toyota Prius modified with Google's experimental driverless technology. In August 2012, the team announced that it had completed over 300,000 autonomous-driving miles accident-free, typically had about a dozen cars on the road at any given time, and was starting to test them with single drivers instead of in pairs. As of September 2012, three U.S. states had passed laws permitting driverless cars: Nevada, Florida, and California.

The system combines information gathered from Google Street View with artificial intelligence software that combines input from video cameras inside the car, a LIDAR sensor on top of the vehicle, radar sensors on the front of the vehicle, and a position sensor attached to one of the rear wheels that helps locate the car's position on the map. In 2009, Google obtained 3,500 miles of Street View images from driverless cars with minor human intervention. As of 2010, Google has tested several vehicles equipped with the system, driving 1,609 kilometers (1,000 mi) without any human intervention, in addition to 225,308 kilometers (140,000 mi) with occasional human intervention. Google expects that the increased accuracy of its automated driving system could help reduce the number of traffic-related injuries and deaths while using energy and space on roadways more efficiently.

The project team has equipped a test fleet of at least eight vehicles, each accompanied in the driver's seat by one of a dozen drivers with unblemished driving records and in the passenger seat by one of Google's engineers. The car has traversed San Francisco's Lombard Street, famed for its steep hairpin turns, and through city traffic. The vehicles have driven over the Golden Gate Bridge and on the Pacific Coast Highway, and have circled Lake Tahoe.

Google's driverless test cars have about $150,000 in equipment including a $70,000 lidar (laser radar) system. The system drives at the speed limit it has stored on its maps and maintains its distance from other vehicles using its system of sensors. The system provides an override that allows a human driver to take control of the car by stepping on the brake or turning the wheel, similar to cruise control systems already in cars.

While Google had no immediate plans to commercially develop the system, the company hopes to develop a business that would market the system and the data behind it to automobile manufacturers. An attorney for the California Department of Motor Vehicles raised concerns that "The technology is ahead of the law in many areas," citing state laws that "all presume to have a human being operating the vehicle". According to the New York Times, policymakers and regulators have argued that new laws will be required if driverless vehicles are to become a reality because "the technology is now advancing so quickly that it is in danger of outstripping existing law, some of which dates back to the era of horse-drawn carriages".

In August 2011, a human-controlled Google driverless car was involved in the project's first crash near Google headquarters in Mountain View, CA. Google has stated that the car was being driven manually at the time of the accident. A second incident involved a Google driverless car being rear-ended while stopped at a stoplight.

CONSIDER THIS: In 2005 the DARPA Grand Challenge driverless car winner went 123 miles at an average of 19 miles per hour. Just 5 years later, the Google driverless car had gone hundreds of thousands of miles at the speed limit of 55 to 65 miles per hour. Did the discipline of artificial intelligence advance that much in 5 years? No. The difference was the data science. The Google data scientists made a 3-D wire mesh model of every street the driverless car was going to drive. In real time, the car's data science algorithms compared actual observations against the model (including the white stripes on the road) and made corrections accordingly.


Assignment/Exercise

Get into groups of 4 or 5 students. Together, watch the movie Moneyball. While watching, take brief notes on how data science made a difference to the characters in the movie. After the movie is over, brainstorm as a group about other areas of life where data science could make a difference. Speculate on the arguments opponents might make against using data science. Pick one area for further consideration. As a group, create a four-slide presentation that introduces the area of life you picked, explains how data science would make a difference, summarizes the counter-arguments, and states whether or not your group thinks, in the end, it would be a good idea to introduce data science into that area of life.