A History of Data Science
Description
When learning any new field, context is everything; therefore, we will begin this course by introducing the history of data science. In this way, you will be able to understand how the field became an amalgamation of various areas of science dealing with data in many different forms. In this section, and as the course continues, pay close attention to the various ways data can be represented and analyzed.
A History of Data Science
Chapter Summary
Data Science is a composite of a number of pre-existing disciplines. It is a young profession and academic discipline. The term was first coined in 2001. Its popularity has exploded since 2010, pushed by the need for teams of people to analyze the big data that corporations and governments are collecting. The Google search engine is a classic example of the power of data science.
Discussion
Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Math, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data analysis problems.
Origins
The term "Data Science" was coined at the beginning of the 21st Century. It is attributed to William S. Cleveland who, in 2001, wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". About a year later, the International Council for Science: Committee on Data for Science and Technology started publishing the CODATA Data Science Journal beginning in April 2002. Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.
Development
During the "dot-com" bubble of 1998-2000, hard drives became really cheap. So corporations and governments started buying lots of them. One corollary of Parkinson's Law is that data always expands to fill the disk space available. The "disk-data" interaction is a positive exponential cycle between buying ever more disks and accumulating ever more data. This cycle produces big data. Big data is a term used to describe data sets so large and complex that they become awkward to work with using regular database management tools.
Once acquired, we have to do something with the big data besides just storing it. We need big computing architectures. Companies like Google, Yahoo!, and Amazon invented the new computing architecture, which we call cloud computing. One of the most important inventions within cloud computing is called MapReduce. MapReduce has been codified into the software known as Hadoop. We use Hadoop to do big computing on big data in the cloud.
- The normal computing paradigm is that we move data to the algorithm. For example, we read data off a hard drive and load it into a spreadsheet program to process. The MapReduce computing paradigm is just the opposite. The data are so big that we cannot move them all to the algorithm. Instead, we push many copies of the algorithm out to the data (a toy sketch of the idea follows).
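As a toy illustration of the map-then-reduce idea (not Hadoop, and nothing here is actually distributed across machines), here is a minimal word-count sketch using R's built-in Map() and Reduce() functions; the input "chunks" are invented.
# Toy illustration of the map/reduce idea using base R (not Hadoop).
# Hypothetical input: each element stands in for one chunk of a larger corpus.
chunks <- c("big data needs big computing",
            "data moves to the algorithm no more",
            "the algorithm moves to the data")
# Map step: each chunk is independently turned into word counts.
map_step <- Map(function(chunk) table(strsplit(chunk, " ")[[1]]), chunks)
# Reduce step: partial counts are merged into one overall count.
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}
word_counts <- Reduce(merge_counts, map_step)
print(sort(word_counts, decreasing = TRUE))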
It turns out that Hadoop is difficult to use. It requires advanced computer science capabilities. This opens up a market for the creation of analytics tools - with simpler interfaces - that run on top of Hadoop. This class of tools is called "mass analytic tools" - that is, tools for the analysis of massive data. Examples of these are "recommender systems," "machine learning," and "complex event processing". These tools, while having a simpler interface to Hadoop, have complex mathematical underpinnings, which also require specialization.
So, with the advent of mass analytic tools, we need people to understand the tools and actually do the analysis of big data. We call these people, "Data Scientists". These people are able to tease out new analytic insights never before possible in the world of small data. The scale of problems that are solved by analyzing big data is such that no single person can do all the data processing and analytic synthesis required. Therefore, data science is best practiced in teams.
In sum, cheap disks --> big data --> cloud computing --> mass analytic tools --> data scientists --> data science teams --> new analytic insights.
Popularization
Mike Loukides, Vice President of Content Strategy for O'Reilly Media, helped to bring Data Science into the mainstream vernacular in 2010 with his article "What is data science?" In the last few years, data science has increasingly been associated with the analysis of big data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value out of the extremely large data being generated by their websites.
There are now several ongoing conferences devoted to big data and data science, such as O'Reilly's Strata Conferences and Greenplum's Data Science Summits.
The job title has similarly become very popular. On one heavily used employment site, the number of job postings for "data scientist" increased more than 10,000 percent between January 2010 and July 2012.
Academic Programs
Several universities have begun graduate programs in data science, such as at the Institute for Advanced Analytics at North Carolina State University, the McCormick School of Engineering at Northwestern University, and the now-discontinued six-week summer program at the University of Illinois.
Professional Organizations
A few professional organizations have sprung up recently. Data Science Central and Kaggle are two such examples. Kaggle is an interesting case: it crowdsources data science solutions to difficult problems. For example, a company will post a hard problem on Kaggle. Data scientists from around the world sign up with Kaggle, then compete with each other to find the best solution. The company then pays for the best solution. There are over 30,000 data scientists registered with Kaggle.
Case Study
In the mid- to late-1990s, AltaVista was the most popular search engine on the internet. It sent "crawlers" to extract the text from all the pages on the web. The crawlers brought the text back to AltaVista. AltaVista indexed all the text. So, when a person searched for a keyword, Altavista could find the web pages that had that word. AltaVista then presented the results as an ordered list of web pages, with the pages that had the most frequent mentions of the term at the top. This is a straightforward computer science solution, though, at the time, they solved some very difficult scaling problems.
In the late 1990s, the founders of Google invented a different way to do searches. They combined math, statistics, data engineering, advanced computation, and the hacker spirit to create a search engine that displaced AltaVista. The algorithm is known as PageRank. PageRank looks not only at the words on a page but at the hyperlinks as well. PageRank assumes that an inbound hyperlink is an indicator that some other person thought the current page was important enough to put a link to it on their own page. Thus the pages with the most inbound hyperlinks end up at the top of the list of search results. PageRank captures human knowledge about web pages, in addition to their content.
Google not only crawled the web, it ingested the web. That is big data. Google then had to calculate PageRank across that big data, which requires massive computation. Then it had to make search, and search results, fast for everyone. Google search is a triumph of data science (though it was not called data science when it started).
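As an illustrative sketch only (not Google's actual implementation), the core PageRank idea can be written as a power iteration over a small, made-up link matrix in R:
# Toy PageRank by power iteration on a hypothetical 4-page web.
# links[i, j] = 1 means page j links to page i.
links <- matrix(c(0, 1, 1, 1,
                  1, 0, 0, 0,
                  1, 0, 0, 1,
                  0, 0, 1, 0), nrow = 4, byrow = TRUE)
out_degree <- colSums(links)
M <- sweep(links, 2, out_degree, "/")   # column-stochastic transition matrix
d <- 0.85                               # damping factor
n <- ncol(M)
rank <- rep(1 / n, n)                   # start with equal rank everywhere
for (step in 1:50) {
  rank <- (1 - d) / n + d * (M %*% rank)
}
round(rank, 3)                          # pages with more inbound links rank higher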
Assignment/Exercise
Get into groups of 2 or 3 students. Download and install the R programming language on your computer. Help each other get R up and running.
- Go to the R website: http://www.r-project.org/
- Click on the CRAN mirror link
- Click on the Linux, Mac OSX, or Windows link
Linux
Installing R on Debian-based GNU/Linux distributions (e.g., Ubuntu or Debian itself) is as simple as typing sudo aptitude install r-base or sudo apt-get install r-base, or installing the package r-base using your favorite package manager, for example, Synaptic.
There is also a bunch of packages extending R for different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, as it depends on r-base.
Installation with apt-get (Debian, Ubuntu, and all Linux distributions based on Debian)
sudo apt-get install r-base
sudo apt-get install r-recommended
Installation with aptitude (Debian, Ubuntu, and all Linux distributions based on Debian)
sudo aptitude install r-base
sudo aptitude install r-recommended
Mac OS
Installation: Download the disk image (dmg file) and install R.
The default graphical user interface for Mac is much better than the one for Windows. It includes
- a data frame manager,
- a history of all commands,
- a program editor which supports syntax highlighting.
Windows
To install R under the Windows operating system, you have to download the binaries from the web. First, go to the R-Project website (listed above), click CRAN under the download section on the left panel, and select a mirror site from which to download the required content. The best idea is to pick a mirror close to your actual geographic location, but other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the .exe file, of the form R-x.x.x-win32.exe, where x denotes the actual version of the program. Regardless of the version, the setup has the same steps.
As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.
- On the welcome screen click Next.
- Read or just notice the GNU license, and click Next.
- Select the location where R should be installed. If you have no preference for a particular location on your hard disk, the default choice will be fine.
- During the next step, you can specify which parts of R you want to install. Choices are User installation, Minimal user installation, Full installation, and Custom installation. Notice the required space under the selection panel (varies between 20 and 66 MB). If you are a beginner in R, choose the default User installation.
- In this step, you can choose between two paths. If you accept the defaults, you skip the three "extra" customization steps described below.
- You can specify the Start menu folder.
- In the next step, you can choose between shortcut possibilities (desktop icon and/or quick launch icon) and specify registry entries.
With these steps, you can customize the R graphical user interface.
- You can choose if you want an R graphic user interface covering the whole screen (MDI) or a smaller window (SDI).
- You can select the style in which the Help screens are displayed in R. You will use help a lot, so this may be an important decision. It is up to you which style you prefer. Please note that the content of the help files will be the same regardless of your choice; here you specify only the appearance of that particular window.
- In the next step, you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.
Portable R for Windows
If you want to install R on your USB stick go to the Portable R website. This is useful if you don't have admin rights on a computer. The basic installation requires something like 115 MB but you may need more if you want to install add-on packages.
Source: Calvin Andrus, Jon Cook, and Suresh Sood, https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/A_History_of_Data_Science
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
A Mash-up of Disciplines
Chapter Summary
This is a very quick overview of the eight "parent" disciplines that contribute to the new Data Science discipline. It suggests generic questions that data scientists should ask as they work through solving problems.
Discussion
As mentioned in Chapter 1, Data Science is a mash-up of several different disciplines. We also noted that an individual data scientist is most likely an expert in one or two of these disciplines and proficient in another two or three. There is probably no living person who is an expert in all these disciplines, and an extremely rare person would be proficient in 5 or 6 of these disciplines. This means that data science must be practiced as a team where, across the membership of the team, there is expertise and proficiency across all disciplines. Let us explore what these disciplines are and how they contribute to data science.

Data Engineering
Data Engineering is the data part of data science. According to Wikipedia, Data Engineering involves acquiring, ingesting, transforming, storing, and retrieving data. Data engineering also includes adding metadata to the data. Because all these activities are interrelated, a data engineer must solve these issues as an integrated whole. For example, we must understand how we plan to store and retrieve the data in order to create a good ingestion process. Data engineering requires a thorough understanding of the general nature of the data science problems to be solved in order to formulate a robust data acquisition and management plan. Once the plan is well developed, data engineers can begin to implement it into data management systems.
Acquiring - This is the process of laying our hands on the data. The data engineer part of the data scientist needs to ask the questions, "where is the data coming from?", "what does the data look like?". and "how does our team get access to the data?" The data could come from many places such as RSS feeds, a sensor network or a preexisting data repository. The data could be numbers, text documents, images, or videos. The data can be collected by the team or purchased from a vendor. For example, if we are going to investigate highways, we could have sensors on a stretch of freeway that measures how fast cars are going. These sensors send us the data as text messages that include the date, time, lane, and speed of every car that crosses the sensors.
Ingesting - This is the process of getting the data from the source into the computer systems we will use for our analysis. The data engineer part of the data scientist needs to ask the questions, "how much data is coming?", "how fast is it coming?", "where are we going to put the data?", "do we have enough disk space for the data?", and "do I need to filter the incoming data in any way?" Data is measured in bytes. A byte is roughly equivalent to one character of a written word. A one-page document is about 1,000 bytes or one kilobyte (1K). For example, if we are going to investigate highways, we could be receiving car speed data at a rate of 10,000 bytes per second for a 1-week period. There are 604,800 seconds in a week. This means you will receive 6,048,000,000 bytes (6 gigabytes) of data in one week. No problem. That will fit on a thumb drive.
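The arithmetic in that example is easy to verify directly; here is a minimal sketch in R using the same assumed rate of 10,000 bytes per second:
bytes_per_second <- 10000             # assumed sensor feed rate
seconds_per_week <- 60 * 60 * 24 * 7  # 604,800 seconds in a week
bytes_per_week <- bytes_per_second * seconds_per_week
bytes_per_week                        # 6,048,000,000 bytes
bytes_per_week / 10^9                 # about 6 gigabytes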
Transforming - This is the process of converting the data from the form in which it was collected to the form it needs to be in for our analysis. The data engineer part of the data scientist needs to ask the questions, "what is the form of the raw data?" and "what does the form of the processed data need to be?" A common raw data format is comma-separated values (CSV) which looks like this:
20120709,135214,157,3,57.4
20120709,135523,13,2,62.1
For example, if we are investigating highways, we might receive data that looks like the example above. The segments in the first row are date, July 9, 2012; time, 1:52.14pm; sensor, #157; lane, #3; and speed, 57.4 mph. The data needs to be transformed from CSV format to something akin to a spreadsheet format like the following:
Year | Month | Day | 24-Hour | Minute | Second | Sensor # | Lane # | MPH |
---|---|---|---|---|---|---|---|---|
2012 | 07 | 09 | 13 | 52 | 14 | 157 | 3 | 57.4 |
2012 | 07 | 09 | 13 | 55 | 23 | 13 | 2 | 62.1 |
Understanding the various "from" and "to" formats is very important for a data scientist.
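As a hedged illustration of that "from CSV to spreadsheet-like" transformation, the two sample lines above could be split apart in R roughly as follows (the column names are our own choices, not part of any standard):
# Hypothetical raw lines in the sensor's CSV format: date, time, sensor, lane, mph
raw <- c("20120709,135214,157,3,57.4",
         "20120709,135523,13,2,62.1")
fields <- read.csv(text = raw, header = FALSE,
                   col.names = c("date", "time", "sensor", "lane", "mph"),
                   colClasses = c("character", "character", "integer", "integer", "numeric"))
# Break the packed date and time strings into separate columns.
speeds <- data.frame(
  year   = substr(fields$date, 1, 4),
  month  = substr(fields$date, 5, 6),
  day    = substr(fields$date, 7, 8),
  hour   = substr(fields$time, 1, 2),
  minute = substr(fields$time, 3, 4),
  second = substr(fields$time, 5, 6),
  sensor = fields$sensor,
  lane   = fields$lane,
  mph    = fields$mph,
  stringsAsFactors = FALSE)
speeds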
Metadata - Wikipedia says that metadata is commonly called data about data. In our case above, the data is the MPH and the Lane. The Sensor is a proxy for "where" on the surface of the earth the data was collected, and the date and time are data about "when" it was collected. We could add other metadata to our data, like weather conditions at the time and the quality of the road. We could derive other metadata, such as whether it was a weekday, holiday, or weekend, and whether it was a rush hour or not. We might also add metadata that indicates who may see the data under what conditions, like "not for public dissemination until 1 year after collected". Metadata is often added both at ingestion time and at transformation time.
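Continuing the same hypothetical highway example, derived metadata such as "weekday or weekend" and "rush hour or not" could be added in R along these lines (the rush-hour window is an arbitrary assumption, and the speeds data frame comes from the sketch above):
# Assumes the 'speeds' data frame sketched above.
obs_date <- as.Date(paste(speeds$year, speeds$month, speeds$day, sep = "-"))
speeds$day_of_week <- weekdays(obs_date)
speeds$is_weekend  <- speeds$day_of_week %in% c("Saturday", "Sunday")
# Arbitrary rush-hour definition: 7-9 am and 4-6 pm.
hour <- as.integer(speeds$hour)
speeds$is_rush_hour <- (hour >= 7 & hour < 9) | (hour >= 16 & hour < 18)
speeds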
Storing - This is the process of putting the data into a data management system. The data engineer part of the data scientist needs to ask the questions "what kind of a system is best for storing our data?", "how fast will the system be?", and "how much extra space will this system need?" We can store data in files in a file system. File systems are generally very fast, but have very little functionality. We can store data in a database. These are often slower than a file system but have much more functionality. For example, in our highway example, we might have 60 million lines of data in CSV format. (At 100 bytes per line, that would be about 6 gigabytes). We could store it in one big file in the file system. It would be fast to read it, but in that format, we could not compute averages by time and location. Alternatively, we could store it in a database where it would be easy to compute averages by location and time, though it would take more time to read through the data.
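Here is a minimal sketch of the two storage choices for the hypothetical speeds data frame: a flat CSV file in the file system versus a table in an SQLite database (this assumes the optional DBI and RSQLite packages are installed):
# Option 1: a flat file -- fast and simple, but no query capability.
write.csv(speeds, "speeds.csv", row.names = FALSE)
# Option 2: a database table -- more overhead, but easy to query by time and location.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "speeds.db")
dbWriteTable(con, "speeds", speeds, overwrite = TRUE)
dbGetQuery(con, "SELECT lane, AVG(mph) AS avg_mph FROM speeds GROUP BY lane")
dbDisconnect(con)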
Scientific Method
The Scientific Method is the scientific part of data science. According to Wikipedia, the Scientific Method is a process for acquiring new knowledge by applying the principles of reasoning to empirical evidence derived from testing hypotheses through repeatable experiments. When a scientist hears someone make an assertion about a fact, they naturally want to know both what is the evidence and what is the standard of acceptance for that evidence.
Reasoning Principles - There are two general forms of logical reasoning: inductive and deductive. Simply stated, inductive reasoning arrives at general principles from specific observations, while deductive reasoning arrives at specific conclusions based on general principles. Consider the following two examples:
Inductive argument:
- Every life form that everyone knows of depends on liquid water to exist.
- Therefore, all known life depends on liquid water to exist.
Deductive argument:
- All men are mortal.
- Socrates is a man.
- Therefore, Socrates is mortal.
Empirical Evidence - Evidence that is empirical is data produced by observation or experiment. This is in contrast to data that is derived from logical arguments or conclusions that are propagated by myths and legends.
- The classic example is the trial of Galileo. At the time (1633), the Catholic church held to Aristotle's logical argument that the earth was the center of the cosmos. Galileo's observations with his newly invented telescope provided evidence of Copernicus's assertion that the earth revolved around the sun. The outcome of the trial was that Galileo was sentenced to house arrest for heresy. In 2000, Pope John Paul II apologized for the injustice done to Galileo.
The scientist part of a data scientist needs to ask the question, "what is the evidence that leads to a particular conclusion?"
Hypothesis Testing - This process generally asserts two propositions, only one of which can be true. The scientist gathers empirical evidence for and against each proposition, and then accepts one and rejects the other. Often, one of the hypotheses is known as the null hypothesis, and the other as the alternative hypothesis. The null hypothesis is usually a proposition about the way we currently understand the universe to work. The alternative is a proposition about how we think the universe really works. A criminal trial is a classic analogy to understanding hypothesis testing.
- A defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough incriminating evidence is the defendant convicted. At the start of the procedure, there are two hypotheses: "the defendant is not guilty" and "the defendant is guilty". The first one is called the null hypothesis and is accepted for the time being. The second one is called the alternative hypothesis. It is the hypothesis one tries to prove. The hypothesis of innocence is rejected only when an erroneous conviction is very unlikely, because one does not want to convict an innocent defendant.
Repeatable Experiments - According to Wikipedia, an experiment is a methodical trial and error procedure carried out with the goal of verifying, falsifying, or establishing the validity of a hypothesis. Experiments vary greatly in their goal and scale but always rely on repeatable procedures and logical analysis of the results. A child may carry out basic experiments to understand the nature of gravity, while teams of scientists may take years of systematic investigation to advance the understanding of subatomic particles.
- One prominent example is the "inclined plane," or "ball and ramp experiment". In this experiment, Galileo used an inclined plane and several steel balls of different weights. With this design, Galileo was able to slow down the falling motion and record, with reasonable accuracy, the times at which a steel ball passed certain markings on a beam. Galileo disproved Aristotle's assertion that weight affects the speed of an object's fall. According to Aristotle's Theory of Falling Bodies, the heavier steel ball would reach the ground before the lighter steel ball. Galileo's hypothesis was that the two balls would reach the ground at the same time.
Math
Mathematics (along with statistics) is the cerebral part of Data Science. According to Wikipedia, mathematics is the study of quantity, structure, space, and change. When these are used to solve practical problems, it is called applied mathematics.
Quantity - By this we simply mean numbers. The mathematician part of the data scientist needs to ask the questions, "how will the thing I am interested in be represented by numbers?" and "what kind of numbers will best represent the thing I am interested in?" The numbers could be integers, fractions, real numbers, or complex numbers. For example, if we are going to investigate highways, we could measure the length of highways in miles as represented by integers. We also need to think about the kinds of operations we will perform on numbers. We use arithmetic to operate on and represent the quantities in our data.
Structure - Most sets of mathematical objects exhibit internal structure. The mathematician part of the data scientist needs to ask the questions, "what sort of internal structure does the thing I am interested in have?" and "what set of equations will expose the structure?" The structures could be as simple as a constant progression of numbers or as complex as a web of relationships among the data. We use algebra to operate on and represent the structure within our data.
Space - The things we investigate often have some relationship to two- or three-dimensional space. When thinking like a mathematician, a data scientist needs to ask the questions, "does the thing I am interested have a spatial component, either actual or theoretical?" and "how do I capture and represent that spatial component?" The spatial component could be latitude and longitude or it could have a surface that is important. For example, if we are going to investigate highways, we might like to know exactly where particular highway segments are located or how smooth the surface of the highway is. We use geometry and trigonometry to operate on and represent the spatial components of our data.
Change - The things we investigate often change - possibly over time or over distance. The mathematician part of the data scientist needs to ask the questions, "does the relationship between the things I am interested in change?" and "how will I describe the changing relationship?" The changes could be . . . For example, if we are investigating highways, the sharpness of curves in the road may change with the speed limit at that part of the highway, or the depth of the asphalt may change the number of cars per hour that may safely drive in that location. We use calculus to operate on and represent the changing relationships within our data.
Statistics
Statistics (along with mathematics) is the cerebral part of Data Science. Wikipedia states that statistics is the study of the collection, organization, analysis, and interpretation of data. It involves methods for exploring data, discovering patterns and relationships, creating models, and making inferences about the future. Statistics is the discipline that has the straightest-line pedigree to data science. The statistician is responsible for understanding the analysis that will be done on the data so that it can be collected and organized appropriately.
Collection - A statistician, working with data engineers, ensures that data generation and collection are undertaken in a way that allows valid conclusions to be drawn. The statistician creates the research design, including, if appropriate, the experimental design, that governs the collection of data. The statistician part of the data scientist needs to ask, "what research procedures will be used to generate the data?"
Organization - A statistician, working with data engineers, ensures that data is coded and archived so that information is retained and made useful not just for analysis internal to the project, but also for sharing with others. The statistician is responsible for creating a data dictionary, which is database neutral. A data engineer would create a database schema, which is database specific, based on the data dictionary compiled by the statistician. The data dictionary specifies the variables, the valid values, and the format of the data. The database schema describes how the particular database management system will store the data. The statistician part of the data scientist needs to ask, "are the data stored in such a way as to facilitate the statistical analysis that will be done?"
Analysis - A statistician, working with a mathematician, summarizes, aggregates, correlates, and creates models of the data. The statistician is an expert in analyzing data using descriptive and inferential statistics. This includes creating summaries of the data (such as averages) as well as testing for differences (is this average significantly higher than that average). The statistician part of the data scientist needs to ask, "given the data, which descriptive and inferential statistics ought to be used to test the hypotheses?"
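As a small, invented illustration in R: descriptive statistics summarize car speeds in two lanes of the hypothetical highway, and an inferential test asks whether the two averages differ significantly (the numbers below are made up):
# Invented speed measurements for two lanes of the hypothetical highway.
lane2 <- c(62.1, 59.8, 64.3, 61.0, 63.5, 60.7)
lane3 <- c(57.4, 55.9, 58.8, 56.2, 59.1, 57.0)
# Descriptive statistics: summaries of each distribution.
mean(lane2); sd(lane2)
mean(lane3); sd(lane3)
# Inferential statistics: is the difference between the averages significant?
t.test(lane2, lane3)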
Advanced Computing
Advanced computing is the heavy lifting of data science. According to Wikipedia, computer programming (often shortened to programming or coding) is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a set of instructions that computers use to perform specific operations or to exhibit desired behaviors. The process of writing source code often requires expertise in many different subjects, including knowledge of the application domain, specialized algorithms, and formal logic.
Programming Language - According to Wikipedia, a programming language is an artificial language designed to communicate instructions to a computer. Programming languages are used to create programs that control the behavior of the computer and external devices such as printers, disk drives, and robots. Programs also express algorithms precisely. Programming languages can be thought of as "low-level," such as "assembly languages" that have a nearly one-to-one correspondence to the machine language functions built into the hardware central processing unit (CPU). More commonly, programmers use "high-level" languages, such as Java, Python, and C++, which aggregate many machine-level functions together into human-level functions such as "read data" and "print". The programmer part of the data scientist needs to ask the question, "which programming language should I use to solve the problem at hand?"
Source Code - According to Wikipedia, source code is any collection of computer instructions (with comments) written using some human-readable computer language, usually as text. When executed, the source code is translated into machine code that the computer can directly read and execute. Programmers often use an integrated development environment (IDE) that allows them to type in, debug, and execute the source code. Here are examples of source code for the traditional "Hello World" program as written in Java and Python:
/**
 * Traditional "Hello World" program in Java
 */
class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}
#
# Traditional "Hello World" program in Python 2.x
#
print "Hello World!"
Visualization
Visualization is the pretty face of data science. According to Wikipedia, information visualization is the visual representation of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information. Wikipedia also describes graphic design as a creative process undertaken in order to convey a specific message to a targeted audience. A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.
Creative Process - Wikipedia defines creativity as the process of producing something that is both original and worthwhile. The process includes divergent thinking, which involves the generation of multiple answers to a problem; conceptual blending, in which solutions arise from the intersection of two quite different frames of reference; and honing, in which an acceptable solution emerges from iterating over many successive unacceptable versions of the solution. The visual artist part of the data scientist needs to ask, "what are several different ways we can show this data?" and "how can we improve this visualization over the next several iterations?"
Data Abstraction - Wikipedia defines data abstraction as handling data bits in meaningful ways. This implies that we do not want to visualize all the raw data, but that we need to visualize manipulations of the data (aggregations, summarizations, correlations, predictions) that are meaningful in the context of the problem we are trying to solve. The visual artist part of the data scientist needs to ask, "how can we simplify the content of the data so it can be visualized meaningfully?"
Informationally Interesting - According to Wiktionary, humans pay attention to things that are interesting and/or attractive. Something that is attractive or beautiful is pleasing to the senses. While beauty is in the eye of the beholder, there are some more or less agreed-upon principles of beauty, such as symmetry and harmony. Surprise, within the context of harmony, is especially interesting to humans. The visual artist part of the data scientist needs to ask, "how can we visualize the content of the data so it is pleasing, with a touch of surprise?"
Consider the following graphic. It is a partial map of the Internet in early 2005. Each line represents a connection between two IP addresses. Notice that it abstracts only a subset of data about the internet. It clearly went through a number of iterations to arrive at such a harmonious color scheme. It has an overall symmetry, with some surprises in the details (the bright "stars"). Finally, it is meaningful in the context of understanding the World Wide Web.

Hacker mindset
Hacking is the secret sauce of data science. According to Wikipedia, hacking is modifying one's own computer system, including building, rebuilding, modifying, and creating software, electronic hardware, or peripherals, in order to make it better, make it faster, give it added features, and/or make it do something it was never intended to do. For the data scientist, hacking goes beyond the computer system to the whole enterprise of solving data problems. Think of it as an advanced do-it-yourself (DIY) mode of working.
Data science hacking involves inventing new models, exploring new data structures, and mashing up the eight parent disciplines in unconventional ways. Hacking requires boldness, creativity, vision, and persistence. Here are two examples. (Even though they involve hardware, they are presented because they are readily understandable in a few sentences. More complex data science examples are given in chapter four.)

- A famous example is Steve Wozniak's hand-made Apple I computer. It was built from parts scrounged from Hewlett-Packard's trash and from electronic surplus supply stores. Wozniak wanted to give the plans away, but his partner, Steve Jobs, convinced him that they should sell ready-made machines. The rest, as they say, is history.
- Another example is the Carnegie Mellon Internet Coke Machine. In the early days of the internet before the web, students at Carnegie Mellon instrumented and wired their local Coke Machine to the internet. The students could check to see which internal dispenser columns had been loaded most recently, so they could be sure to buy cold, not warm, sodas. This was important because the machine sold one Coke every 12 minutes and was re-loaded several times a day.
Domain Expertise
Domain Expertise is the glue that holds data science together. According to Wikipedia, subject matter or domain expertise is proficiency, with special knowledge or skills, in a particular area or topic. Spoken references to subject matter experts sometimes spell out the acronym "SME" ("S-M-E") and other times it is voiced as a word ("smee"). Any domain of knowledge can be subject to a data science inquiry, including - but not limited to - medicine, politics, the physical and biological sciences, marketing, information security, demographics, and even literature. Every data science team must include at least one person who is a subject matter expert on the problem being solved.
Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know, and how best to package the knowledge so it can be readily absorbed by their customers. For example,
- Edwin Chen, a data scientist at Twitter, computed and visualized the geographic distribution of tweets that refer to soft drinks as "soda," as "pop," and as "coke". Just observing that the Midwest uses "pop" and the Northeast uses "soda" is interesting but lacks an explanation. In order to understand WHY these geographic divisions exist, we would need to consult with domain experts in sociology, linguistics, US history, and maybe anthropology - none of whom may know anything about data science. Why do you think these geographic linguistic differences exist?
- Nate Silver is a statistician and domain expert in US politics. His blog regularly combines both the data and an explanation of what it means. In his posting, "How Romney’s Pick of a Running Mate Could Sway the Outcome," he not only tells us what the differences are based on his mathematical model, he explains why those outcomes fell out the way they did.
Assignment/Exercise
Become familiar with the R programming environment. Get into a group of 3 to 4 students from the class. Work in study sessions together as a team on the following items. See if you can explain to each other what you are doing. Help each other understand what is going on. You will have to try some things several ways until they work right. That is ok. Some of you will "get it" faster than others. Please help each other so you all "get it".
- Print a copy and read over Google's R Style Guide. Right now, most of the guide will not make a lot of sense, but it will make more sense as we progress through the book. Keep the printed copy for future reference.
- Search the web for "introduction to R," "R tutorial," "R basics," and "list of R commands". Pick four or five of these websites to work on. Try working through the first few examples of each site. Many of the introductions go too fast or assume too much prior knowledge, so when it gets too confusing just try another site.
- Try the commands:
library(help="utils")
library(help="stats")
library(help="datasets")
library(help="graphics")
demo()
demo(graphics)
demo(persp)
- Write a short 5- to 7-line program that will execute without errors and save it. Be sure to include the names of all those who contributed in the comment section.
- Make a list of the sites the team worked from, and indicate which was the most helpful.
- Make a list of the top 10 unanswered questions the team has at the end of the study session.
Definitions of Data
Chapter Summary
Discussion
What is Data?
Wiktionary defines data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another.
Wiktionary defines a datum as a measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from 1 to 10 by ones, 1 to 100 by 0.1, or simply true or false, on or off, yes, no, or maybe, etc.; and as a fact known from direct observation.
What is a Data Point?
What is a Data Set?
What are Data Types?
Data types determine which operations make sense on a value; for example, two text values can be joined together: concatenate(two, two) = twotwo. For the purposes of this introduction, we will just concern ourselves with simple numeric and simple text data types and leave more complex data types - like images, audio, and video - to more advanced courses. Data scientists use various data types from mathematics, statistics, and computer science to communicate with each other.
Data Types in Mathematics
We will introduce just the most commonly used data types in Mathematics. There are many more, but we'll save those for more advanced courses.
- Integers - According to Wikipedia, integers are numbers that can be written without a fractional or decimal component, and fall within the set {..., −2, −1, 0, 1, 2, ...}. For example, 21, 4, and −2048 are integers; 9.75, 5½, and √2 are not integers.
- Rational Numbers - According to Wikipedia, rational numbers are those that can be expressed as the quotient or fraction p/q of two integers, with the denominator q not equal to zero. Since q may be equal to 1, every integer is a rational number. The decimal expansion of a rational number always either terminates after a finite number of digits or begins to repeat the same finite sequence of digits over and over. For example, 9.75, 2/3, and 5.8144144144… are rational numbers.
- Real Numbers - According to Wikipedia, real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, plus all the irrational numbers such as √2 (1.41421356..., the square root of two), π (3.14159265...), and e (2.71828...).
- Imaginary Numbers - According to Wikipedia, imaginary numbers are those whose square is less than or equal to zero. For example, 5i is an imaginary number and its square is −25. An imaginary number can be written as a real number multiplied by the imaginary unit i, which is defined by its property i² = −1. Thus, (5i)² = 25 × i² = −25.
Data Types in Statistics
We will introduce just the most commonly used data types in statistics, as defined in Wikipedia. There are a few more data types in statistics, but we'll save those for more advanced courses. (A brief R sketch of how these types are represented follows the list.)
- Nominal - Nominal data are recorded as categories. For this reason, nominal data are also known as categorical data. For example, rocks can be generally categorized as igneous, sedimentary, and metamorphic.
- Ordinal - Ordinal data are recorded in the rank order of scores (1st, 2nd, 3rd, etc). An example of ordinal data is the result of a horse race, which says only which horses arrived first, second, or third but includes no information about race times.
- Interval - Interval data are recorded not just about the order of the data points, but also the size of the intervals in between data points. A highly familiar example of interval scale measurement is temperature with the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the temperature difference between the freezing and boiling points of water. The zero point, however, is arbitrary.
- Ratio - Ratio data are recorded on an interval scale with a true zero point. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. Informally, the distinguishing feature of a ratio scale is the possession of a zero value. For example, the Kelvin temperature scale has a non-arbitrary zero point of absolute zero.
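In R (the language used for the exercises in this book), the nominal and ordinal types above are usually represented as factors, while interval and ratio data are stored as plain numbers; here is a brief hypothetical sketch:
# Hypothetical sketch: nominal and ordinal data as R factors.
rock_type <- factor(c("igneous", "sedimentary", "metamorphic", "igneous"))   # nominal
race_finish <- factor(c("1st", "3rd", "2nd"),
                      levels = c("1st", "2nd", "3rd"), ordered = TRUE)       # ordinal
rock_type
race_finish
# Interval and ratio data are stored as plain numbers.
temperature_c <- c(21.5, 18.0, 25.3)   # interval: the zero point is arbitrary
length_m      <- c(0.0, 1.2, 3.4)      # ratio: zero means "no length"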
Data Types in Computer Science
We will introduce just the most commonly used data types in Computer Science, as defined in Wikipedia. There are many more, but we'll save those for more advanced courses. (A short R sketch illustrating a few of them follows the list.)
- Bit - A bit (a contraction of binary digit) is the basic unit of information in computing and telecommunications; a bit represents either 1 or 0 (one or zero) only. This kind of data is sometimes also called binary data. When 8 bits are grouped together, we call that a byte. A byte can have values in the range 0-255 (00000000-11111111). For example, the byte 10110100 = 180.
- Hexadecimal - Bytes are often represented as Base 16 numbers. Base 16 is known as Hexadecimal (commonly shortened to Hex). Hex uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a–f) to represent values ten to fifteen. Each hexadecimal digit represents four bits, thus two hex digits fully represent one byte. As we mentioned, byte values can range from 0 to 255 (decimal), but may be more conveniently represented as two hexadecimal digits in the range 00 to FF. A two-byte number would also be called a 16-bit number. Rather than representing a number as 16 bits (0010101011110011), we would represent it as 2AF3 (hex) or 10995 (decimal). With practice, computer scientists become proficient in reading and thinking in hex. Data scientists must understand and recognize hex numbers. There are many websites that will translate numbers from binary to decimal to hexadecimal and back.
- Boolean - The Boolean data type encodes logical data, which has just two values (usually denoted "true" and "false"). It is intended to represent the truth values of logic and Boolean algebra. It is used to store the evaluation of the logical truth of an expression. Typically, two values are compared using logical operators such as .eq. (equal to), .gt. (greater than), and .le. (less than or equal to). For example, the expression b = (x .eq. y) would assign the boolean value of "true" to "b" if the value of "x" was the same as the value of "y"; otherwise, it would assign the logical value of "false" to "b".
- Alphanumeric - This data type stores sequences of characters (a-z, A-Z, 0-9, special digits) in a string--from a character set such as ASCII for western languages or Unicode for Middle Eastern and Asian languages. Because most character sets include numeric digits, it is possible to have a string such as "1234". However, this would still be an alphanumeric value, not the integer value 1234.
- Integers - This data type has essentially the same definition as the mathematical data type of the same name. In computer science, however, an integer can be either signed or unsigned. Let us consider a 16-bit (two-byte) integer. In its unsigned form, it can have values from 0 to 65535 (2^16 − 1). However, if we reserve one bit for a (negative) sign, then the range becomes −32768 to +32767 (8000 to 7FFF in hex, using two's complement).
- Floating Point - This data type is a method of representing real numbers in a way that can support a wide range of values. The term floating point refers to the fact that the decimal point can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. In scientific notation, the given number is scaled by a power of 10 so that it lies within a certain range - typically between 1 and 10, with the decimal point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 × 10^5 seconds. Floating-point representation is similar in concept to scientific notation. The base part of the number is called the significand (or sometimes the mantissa) and the exponent part of the number is unsurprisingly called the exponent.
- The two most common ways in which floating point numbers are represented are either in 32-bit (4 byte) single precision or in 64-bit (8 byte) double precision. Single precision devotes 24 bits (about 7 decimal digits) to its significand. Double precision devotes 53 bits (about 16 decimal digits) to its significand.
- List - This data type is used to represent complex data structures. In its most simple form, it has a key-value pair structure. For example, think of a to-do list:
Key | Value |
---|---|
1 | Get haircut |
2 | Buy groceries |
3 | Take shower |
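To make a few of these concrete, here is a minimal R sketch covering hexadecimal conversion, Booleans, floating-point precision, and a simple key-value list. It is illustrative only and uses R's own operators rather than the .eq./.gt. notation mentioned above.
# Hexadecimal: convert between base 16 and base 10.
strtoi("2AF3", base = 16L)         # 10995
sprintf("%X", 180)                 # "B4", the byte 10110100
# Boolean: comparisons evaluate to TRUE or FALSE.
x <- 5; y <- 5
b <- (x == y)                      # TRUE
# Floating point: doubles carry roughly 16 significant decimal digits.
print(1.528535047e5, digits = 15)  # Io's revolution period in seconds
.Machine$double.eps                # smallest distinguishable relative difference
# List: a simple key-value structure (the to-do list above).
todo <- list("1" = "Get haircut", "2" = "Buy groceries", "3" = "Take shower")
todo[["2"]]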
Data Types in R
There are at least 24 data types in the R language. We will just introduce you to the 9 most commonly used data types. As you will see, they are a blend of the data types that exist in Mathematics, Statistics, and Computer Science - just what a Data Scientist would expect. The nine are:
- NULL - for something that is nothing
- logical - for something that is either TRUE or FALSE (on or off; 1 or 0)
- character - for alphanumeric strings
- integer - for positive, negative, and zero whole numbers (no decimal place)
- double - for real numbers (with a decimal place)
- complex - for complex numbers that have both real and imaginary parts (e.g., square root of -1)
- date - for dates only (internally represented as the number of days since 1970-01-01, with negative values for earlier dates)
- POSIX - for dates and times (internally represented as the number of seconds since 1970-01-01)
- list - for storing complex data structures, including the output of most of the built-in R functions
If you want to know the data type of a particular data object, you can use the typeof() command. If you want to know what a particular data object was called in the original definition of the S language, you can use the mode() command. If you want to know what object class a particular data object is in the C programming language that was used to write R, you can use the class() command. For the purposes of this book, we will mostly use the typeof() command.
- Just a note about lists in R. R likes to use the list data type to store the output of various procedures. We generally do not perform statistical procedures on data stored in list data types--with one big exception. In order to do statistical analysis on lists, we need to convert them to tables with rows and columns. R has a number of functions to move data back and forth between table-like structures and list data types. The exception we just referred to is called the data.frame list object. List objects of the class data.frame store rows and columns of data in such a specifically defined way as to facilitate statistical analysis. We will explain data frames in more detail below.
What are Variables and Values?
Let us start by noting that the opposite of a variable is a constant. If we declare that the symbol "X" is a constant and assign it a value of 5, then X = 5. It does not change; X will always be equal to 5. Now, if we declare the symbol "Y" to be a variable, that means Y can have more than one value (see the Wiktionary entry for "variable"). For example, in the mathematical equation Y^2 = 4 (Y squared equals 4), the variable Y can have the value of either 2 or −2 and satisfy the equation.
Imagine we take a piece of paper and make two columns. At the top of the first column we put the label "name" and at the top of the second column we put the label "age". We then ask a room full of 20 people to each write down their name and age on the sheet of paper in the appropriate columns. We will end up with a list of 20 names and 20 ages. Let us use the label "name" to represent the entire list of 20 names and the label "age" to represent the entire list of 20 ages. This is what we mean by the term "variable". The variable "name" has 20 data points (the list of 20 names), and the variable "age" has 20 data points (the list of 20 ages). A variable is a symbol that represents multiple data points, which we also call values. Other words that have approximately the same meaning as "value" are measurement and observation. Data scientists use these four terms (data point, value, measurement, and observation) interchangeably when they communicate with each other.
The word "variable" is a general-purpose word used in many disciplines. However, various disciplines also use more technical terms that mean approximately the same thing. In mathematics another word that approximates the meaning of the term "variable" is vector. In computer science, another word that approximates the meaning of the term "variable" is array. In statistics, another word that approximates the meaning of the term "variable" is distribution. Data scientists will often use these four words (variable, vector, array, and distribution) interchangeably when they communicate with each other.
Let us think again of the term data set (defined above). A data set is usually two or more variables (and their associated values) combined together. Once our data is organized into variables, combined into a data set, and stored in a file on a disk, it is ready to be analyzed.
A quick way to inspect the structure of a data set in R is the str() command, which we will use in the exercise below. We will explain what "factors" are in later chapters.
Assignment/Exercise
Find Data Types in R
Use the typeof() command to verify data types. See if you can guess what the output will look like before you press the enter key.
> a <- as.integer(1)
> typeof(a)
> a
> b <- as.double(1)
> typeof(b)
> b
> d <- as.character(1)
> typeof(d)
> d
> e <- as.logical("true")
> typeof(e)
> e
> f <- as.complex(-25)
> typeof(f)
> f
> g <- as.null(0)
> typeof(g)
> g
> h <- as.Date("2012-07-04")
> typeof(h)
> class(h)
> h
> i <-as.POSIXct("2012/07/04 10:15:59")
> typeof(i)
> class(i)
> i
> j <-as.POSIXlt("2012/07/04 10:15:59")
> typeof(j)
> class(j)
> j
> k <- list("Get haircut", "Buy Groceries", "Take shower")
> typeof(k)
> k
Now try some assignments of your own and use the typeof() command to see the automatic assignments of data types that R made for you. Then see if you can convert a value from one data type to another.
Objects, Variables, Values, and Vectors in R
The R language is based on an object-oriented programming language. Thus, things in R are called objects. So, when we assign a value to the letter "X," in R we would say we have assigned a value to the object "X". Objects in R may have different properties from each other, depending on how they are used. For this exercise, we will concern ourselves with objects that behave like variables. Those types of objects are called vector objects. So, when we talk - in the language of data science - about the variable "X," in R we could call it the vector "X". As you remember, a variable is something that varies. Let's create a character vector in R and assign it three values. We will use the concatenate c() command in R. Let's also create an integer vector using the same concatenate command.
> name <- c("Maria", "Fred", "Sakura")
> typeof(name)
> name
> age <- as.integer(c(24,19,21))
> typeof(age)
> age
Both vectors now have three values each. The character string "Maria" is in the first position of the vector "name," "Fred" is in the second position, and "Sakura" is in the third position. Similarly, the integer 24 is in the first position of the vector "age," 19 is in the second position, and 21 is in the third position. Let's examine each of these individually.
> name[1]
> name[2]
> name[3]
> age[1]
> age[2]
> age[3]
Data Sets and Data Frames
If we line up the two vectors side by side, so that name[1] corresponds to age[1], name[2] to age[2], and name[3] to age[3], we have a data set that looks like the following.
Name | Age |
---|---|
Maria | 24 |
Fred | 19 |
Sakura | 21 |
Let us put our data set into an R data frame object. We need to think of a name for our data frame object. Let's call it "project". After we put our data set into the data frame, we will inspect it using R's typeof(), class(), ls(), and structure (str()) commands. Remember, upper and lower cases are meaningful.
> project <- data.frame(name, age)
> typeof(project)
> class(project)
> ls(project)
> str(project)
The typeof() function told us we had created a list object. The class() function told us it is a special type of list object known as a data.frame. The ls() function tells us what "key-value" pairs exist inside our list object. Please don't worry too much about all of that detail right now. What is important is what the str() function tells us.
The structure command tells us we have three observations and two variables. That is great. It tells us the names of the variables are $name and $age. This tells us that when we put a data set into an R data frame list object, we need to reference the variable WITHIN the data frame as follows: project$name and project$age. The structure command also tells us that project$name was assigned the role of a "Factor" variable and that project$age was assigned the role of "int". These correspond to the "nominal" and "ordinal" data types that statisticians use. R needs to know the role variables play in order to perform the correct statistical functions on the data. One might argue that the age variable is more like the statistical interval data type than the statistical ordinal data type. We would then have to change the R data type from integer to double. This will change its role to "number" within the data frame.
Rather than change the data type of project$age, it is a good practice to create a new variable, so the original is not lost. We will call the new variable project$age.n, so we can tell that it is the transformed project$age variable.
> project$age.n <- as.double(project$age)
> str(project)
We can now see that the project$age and project$age.n variables play different roles in the data frame, one as "int" and one as "num". Now, confirm that the complete data set has been properly implemented in R by displaying the data frame object.
> project
name age age.n
1 Maria 24 24
2 Fred 19 19
3 Sakura 21 21
Now let's double check the data types.
> typeof(project$name)
> typeof(project$age)
> typeof(project$age.n)
Whoops! We see some of the quirkiness of R. When we created the variable "name," it had a data type of "character". When we put it into a data frame not only did R assign it the role of a "Factor" but it also changed its data type to "integer". What is going on here? This is more than you want to know right now. We will explain it now, but you really don't have to understand it until later.
- Because all statistical computations are done on numbers, R gave each value of the variable "name" an arbitrary integer number. It calls these arbitrary numbers levels. It then labeled these levels with the original values, so we would know what is going on. So under the covers, project$name has the values: 2 (labeled "Maria"), 1 (labeled "Fred"), and 3 (labeled "Sakura"). We can convert project$name back into the character data type, but we won't be able to perform statistical calculations on it.
> project$name.c <- as.character(project$name)
> typeof(project$name.c)
> str(project)
'data.frame': 3 obs. of 4 variables:
$ name : Factor w/ 3 levels "Fred","Maria",..: 2 1 3
$ age : int 24 19 21
$ age.n : num 24 19 21
$ name.c: chr "Maria" "Fred" "Sakura"
We can now see that project$name.c has a data type of character and has been assigned a data frame role of "chr".
The Impact of Data Science
Chapter Summary
Discussion
Moneyball
According to Wikipedia, Moneyball (film) refers to a book by Michael Lewis, published in 2003, about the Oakland Athletics baseball team and its general manager Billy Beane. Its focus is the team's analytical, evidence-based, sabermetric approach to assembling a competitive baseball team, despite Oakland's disadvantaged revenue situation. A film based on the book starring Brad Pitt was released in 2011.
The central premise of Moneyball is that the collected wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game and the statistics that were available at the time. The book argues that the Oakland A's' front office took advantage of more analytical gauges of player performance to field a team that could compete successfully against richer competitors in Major League Baseball (MLB).
Rigorous statistical analysis had demonstrated that on-base percentage and slugging percentage are better indicators of offensive success, and the A's became convinced that these qualities were cheaper to obtain on the open market than more historically valued qualities such as speed and contact. These observations often flew in the face of conventional baseball wisdom and the beliefs of many baseball scouts and executives.
- By re-evaluating the strategies that produce wins on the field, the 2002 Oakland Athletics, with approximately $41 million in salary, were competitive with larger-market teams such as the New York Yankees, who spent over $125 million in payroll that same season. Because of the team's smaller revenues, Oakland was forced to find players undervalued by the market, and its system for finding value in undervalued players has proven itself thus far.
- In addition, Moneyball traces the history of the sabermetric movement back to such people as Bill James (now a member of the Boston Red Sox front office) and Craig R. Wright. Lewis explores how James' seminal Baseball Abstract, an annual publication that was published from the late 1970s through the late 1980s, influenced many of the young, up-and-coming baseball minds that are now joining the ranks of baseball management.
- In its wake, teams such as the New York Mets, New York Yankees, San Diego Padres, St. Louis Cardinals, Boston Red Sox, Washington Nationals, Arizona Diamondbacks, Cleveland Indians, and the Toronto Blue Jays have hired full-time sabermetric data scientists.
23andMe
According to Wikipedia, 23andMe is a privately held personal genomics and biotechnology company based in Mountain View, California, that provides rapid genetic testing. The company is named for the 23 pairs of chromosomes in a normal human cell. Its personal genome test kit was named "Invention of the Year" by Time magazine in 2008. The company was founded by Linda Avey and Anne Wojcicki after both recognized the need for a way to organize and study genetic data, the possibility for individual consumers to use the information, and the need for expertise to interpret the results.
23andMe began offering DNA testing services in November 2007; the results are posted online and allow the subscriber to view an assessment of inherited traits, genealogy, and possible congenital risk factors. Customers provide a 2.5 mL saliva sample, which is analyzed on an Illumina DNA microarray for 960,000 specific single-nucleotide polymorphisms (SNPs). An eventual goal is to provide whole-genome sequencing.
- In June 2011, 23andMe announced it had accumulated a database of more than 100,000 individuals.
The organization also provides testing for certain research initiatives, provides confidential customer datasets, and partners with research foundations with the goal of establishing genetic associations with specific illnesses and disorders. Google co-founder Sergey Brin (whose mother suffers from Parkinson's disease and who is married to 23andMe co-founder Anne Wojcicki) underwrote the cost of the company's Parkinson's Disease Genetics Initiative to provide free testing for people suffering from the condition. An analysis comparing the results of 23andMe's Parkinson's disease research with a National Institutes of Health (NIH) initiative suggested that the company's use of large amounts of computational power and large data sets might offer comparable results, though in much less time.
The company gathers personal and social data from its subscribers via online surveys. Personal data includes a person's health history, environmental history, and such things as the ability to smell certain odors. The social data includes family histories and the sorts of activities one participates in. The company employs a number of data scientists to work on this wealth of data - a million genetic variables and many hundreds of personal and social variables for over 100,000 people.
- The company's data scientists are able to correlate and cluster certain personal and social behaviors with genetic markers. When these correlations are shown to be significant, they report the results back to the subscribers, indicating that certain personal or social aspects of their lives may, indeed, have a genetic basis. They also publish results in scientific journals.
Google's Driverless Car
According to Wikipedia, Google's Driverless Car is a project by Google engineer Sebastian Thrun, director of the Stanford Artificial Intelligence Laboratory and co-inventor of Google Street View. Thrun's team at Stanford created the robotic vehicle Stanley which won the 2005 DARPA Grand Challenge and its $2 million prize from the United States Department of Defense.
The U.S. state of Nevada passed a law in June 2011 permitting the operation of driverless cars; Google had been lobbying for such laws, although its executives have not stated precisely why they chose Nevada to be the maiden state for the driverless car. The law went into effect on March 1, 2012, and the Nevada Department of Motor Vehicles issued the first license for a self-driven car in May 2012, to a Toyota Prius modified with Google's experimental driverless technology. In August 2012, the team announced that it had completed over 300,000 autonomous-driving miles accident-free, typically had about a dozen cars on the road at any given time, and was starting to test them with single drivers instead of in pairs. As of September 2012, three U.S. states had passed laws permitting driverless cars: Nevada, Florida, and California.
The system combines information gathered from Google Street View with artificial intelligence software that combines input from video cameras inside the car, a LIDAR sensor on top of the vehicle, radar sensors on the front of the vehicle, and a position sensor attached to one of the rear wheels that helps locate the car's position on the map. In 2009, Google obtained 3,500 miles of Street View images from driverless cars with minor human intervention. As of 2010, Google has tested several vehicles equipped with the system, driving 1,609 kilometers (1,000 mi) without any human intervention, in addition to 225,308 kilometers (140,000 mi) with occasional human intervention. Google expects that the increased accuracy of its automated driving system could help reduce the number of traffic-related injuries and deaths while using energy and space on roadways more efficiently.
The project team has equipped a test fleet of at least eight vehicles, each accompanied in the driver's seat by one of a dozen drivers with unblemished driving records and in the passenger seat by one of Google's engineers. The car has traversed San Francisco's Lombard Street, famed for its steep hairpin turns, and through city traffic. The vehicles have driven over the Golden Gate Bridge and on the Pacific Coast Highway, and have circled Lake Tahoe.
Google's driverless test cars have about $150,000 in equipment including a $70,000 lidar (laser radar) system. The system drives at the speed limit it has stored on its maps and maintains its distance from other vehicles using its system of sensors. The system provides an override that allows a human driver to take control of the car by stepping on the brake or turning the wheel, similar to cruise control systems already in cars.
While Google had no immediate plans to commercially develop the system, the company hopes to develop a business that would market the system and the data behind it to automobile manufacturers. An attorney for the California Department of Motor Vehicles raised concerns that "The technology is ahead of the law in many areas," citing state laws that "all presume to have a human being operating the vehicle". According to the New York Times, policymakers and regulators have argued that new laws will be required if driverless vehicles are to become a reality because "the technology is now advancing so quickly that it is in danger of outstripping existing law, some of which dates back to the era of horse-drawn carriages".
In August 2011, a human-controlled Google driverless car was involved in the project's first crash near Google headquarters in Mountain View, CA. Google has stated that the car was being driven manually at the time of the accident. A second incident involved a Google driverless car being rear-ended while stopped at a stoplight.