A History of Data Science
When learning any new field, context is everything; therefore, we will begin this course by introducing the history of data science. In this way, you will be able to understand how the field became an amalgamation of various areas of science dealing with data in many different forms. In this section, and as the course continues, pay close attention to the various ways data can be represented and analyzed.
Data Science is a composite of a number of pre-existing disciplines. It is a young profession and academic discipline. The term was first coined in 2001. Its popularity has exploded since 2010, pushed by the need for teams of people to analyze the big data that corporations and governments are collecting. The Google search engine is a classic example of the power of data science.
Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Math, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data analysis problems.
The term "Data Science" was coined at the beginning of the 21st Century. It is attributed to William S. Cleveland who, in 2001, wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". About a year later, in April 2002, the International Council for Science: Committee on Data for Science and Technology started publishing the CODATA Data Science Journal. Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.
During the "dot-com" bubble of 1998-2000, hard drives became really cheap, so corporations and governments started buying lots of them. One corollary of Parkinson's Law is that data always expands to fill the disk space available. The "disk-data" interaction is a self-reinforcing cycle: more disks invite more data, and more data calls for more disks. This cycle produces big data. Big data is a term used to describe data sets so large and complex that they become awkward to work with using regular database management tools.
Once acquired, we have to do something with the big data besides just storing it. We need big computing architectures. Companies like Google, Yahoo!, and Amazon invented the new computing architecture, which we call cloud computing. One of the most important inventions within cloud computing is called MapReduce. MapReduce has been codified into the software known as Hadoop. We use Hadoop to do big computing on big data in the cloud.
The normal computing paradigm is that we move data to the algorithm. For example, we read data off a hard drive and load it into a spreadsheet program to process. The MapReduce computing paradigm is just the opposite. The data are so big that we cannot load them all into the algorithm. Instead, we push many copies of the algorithm out to the data.
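To make the idea of pushing the algorithm out to the data more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, written in Python. It is only an illustration under stated assumptions: the function names and the two sample text chunks are hypothetical, each chunk merely stands in for a block of data stored on a different machine, and this is not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(chunk):
    """Runs where the chunk lives: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group the emitted values by key across the outputs of all map calls."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # One map call per chunk: the same small algorithm is applied to every chunk.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical data: each string stands in for a block held on a different machine.
chunks = ["big data needs big computing", "data science uses big data"]
counts = word_count(chunks)
print(counts["big"], counts["data"])  # prints: 3 3
```

In a real cluster, the map calls would run in parallel on the machines that hold the chunks, and only the small (word, count) pairs would travel over the network.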
It turns out that Hadoop is difficult to use. It requires advanced computer science skills. This opens up a market for analytics tools, with simpler interfaces, that run on top of Hadoop. This class of tools is called "Mass Analytic Tools", that is, tools for the analysis of massive data. Examples of these are "recommender systems", "machine learning", and "complex event processing". These tools, while offering a simpler interface to Hadoop, have complex mathematical underpinnings, which also require specialization.
So, with the advent of mass analytic tools, we need people to understand the tools and actually do the analysis of big data. We call these people "Data Scientists". These people are able to tease out new analytic insights never before possible in the world of small data. The scale of problems that are solved by analyzing big data is such that no single person can do all the data processing and analytic synthesis required. Therefore, data science is best practiced in teams.
In sum, cheap disks --> big data --> cloud computing --> mass analytic tools --> data scientists --> data science teams --> new analytic insights.
Mike Loukides, Vice President of Content Strategy for O'Reilly Media, helped to bring Data Science into the mainstream vernacular in 2010 with his article "What is data science?" In the last few years, data science has increasingly been associated with the analysis of big data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value from the extremely large data sets being generated by their websites.
There are now several ongoing conferences devoted to big data and data science, such as O'Reilly's Strata Conferences and Greenplum's Data Science Summits.
The job title has similarly become very popular. On one heavily used employment site, the number of job postings for "data scientist" increased more than 10,000 percent between January 2010 and July 2012.
Several universities have begun graduate programs in data science, such as at the Institute for Advanced Analytics at North Carolina State University, the McCormick School of Engineering at Northwestern University, and the now-discontinued six-week summer program at the University of Illinois.
A few professional organizations have sprung up recently. Data Science Central and Kaggle are two such examples. Kaggle is an interesting case. They crowdsource data science solutions to difficult problems. For example, a company will put up a hard problem with Kaggle. Data scientists from around the world sign up with Kaggle, then compete with each other to find the best solution. The company then pays for the best solution. There are over 30,000 data scientists registered with Kaggle.
In the mid- to late-1990s, AltaVista was the most popular search engine on the internet. It sent "crawlers" to extract the text from all the pages on the web. The crawlers brought the text back to AltaVista. AltaVista indexed all the text. So, when a person searched for a keyword, AltaVista could find the web pages that had that word. AltaVista then presented the results as an ordered list of web pages, with the pages that had the most frequent mentions of the term at the top. This is a straightforward computer science solution, though, at the time, they solved some very difficult scaling problems.
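The keyword approach described above can be sketched in a few lines of Python. This is a toy illustration, not AltaVista's actual implementation: the page names and texts are made up, and the index is a plain in-memory dictionary that maps each word to the number of times it appears on each page.

```python
from collections import Counter, defaultdict

def build_index(pages):
    """Map each word to {url: number of times the word appears on that page}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, keyword):
    """Return the URLs that contain the keyword, most frequent mentions first."""
    hits = index.get(keyword.lower(), Counter())
    # Sort by descending count; break ties alphabetically so the order is stable.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked]

# Hypothetical crawled pages.
pages = {
    "a.example": "data science is the science of data",
    "b.example": "big data",
    "c.example": "cloud computing",
}
index = build_index(pages)
print(search(index, "data"))  # prints: ['a.example', 'b.example']
```

Notice that the ranking depends only on the words on each page, so a page can climb the list simply by repeating a term many times.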
In the late 1990s, the founders of Google invented a different way to do searches. They combined math, statistics, data engineering, advanced computation, and the hacker spirit to create a search engine that displaced AltaVista. The algorithm is known as PageRank. PageRank looks not only at the words on the page but at the hyperlinks as well. PageRank assumes that an inbound hyperlink is an indicator that some other person thought the current page was important enough to put a link to it on their own page. Thus the pages with the most inbound hyperlinks end up at the top of the list of search results. PageRank captures the human knowledge about web pages, in addition to the content.
Google not only crawled the web, it ingested the web. That is big data. Google then had to run the PageRank algorithm across that big data. That requires massive computation. Then it had to make search and search results fast for everyone. Google search is a triumph of data science (though it was not called data science when it started).
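As a rough illustration of the link-based idea, the Python sketch below computes PageRank-style scores for a tiny, made-up web of four pages by repeated averaging (power iteration). Several details are assumptions for the sake of the example and do not come from the text above: the damping factor of 0.85, the fixed number of iterations, the way pages without outbound links are handled, and the page names themselves. It is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page keeps a small baseline score (assumed damping model).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link structure: "home" receives links from every other page.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)  # prints: ['home', 'about', 'blog', 'orphan']
```

In this formulation, a link from a highly ranked page passes on more score than a link from an obscure one, so the ordering reflects more than a raw count of inbound links.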
Get into groups of 2 or 3 students. Download and install the R programming language on your computer. Help each other get R up and running.
- Go to the R website: http://www.r-project.org/
- Click on the CRAN mirror link
- Click on the Linux, Mac OSX, or Windows link
Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing in sudo aptitude install r-base or sudo apt-get install r-base, or installing the package r-base using your favorite package manager, for example, Synaptic.
There are also many packages that extend R for different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, since the core R packages are among its dependencies.
Installation with apt-get (Debian, Ubuntu, and all Linux distributions based on Debian)
sudo apt-get install r-base
sudo apt-get install r-recommended
Installation with aptitude (Debian, Ubuntu, and all Linux distributions based on Debian)
sudo aptitude install r-base
sudo aptitude install r-recommended
Installation on Mac OS X: Download the disk image (.dmg file) and install R.
The default graphical user interface for Mac is much better than the one for Windows. It includes
- a data frame manager,
- a history of all commands,
- a program editor which supports syntax highlighting.
To install R under the Windows operating system, you have to download the binaries from the web. First, go to the R-Project website (listed above), click CRAN under the download section in the left panel, and select a mirror site from which to download the required content. It is best to pick the mirror closest to your geographical location, but other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the .exe file, in the form R-x.x.x-win32.exe, where x.x.x denotes the version of the program. Regardless of the version, the setup has the same steps.
As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.
- On the welcome screen click Next.
- Read or just notice the GNU license, and click Next.
- Select the location where R should be installed. If you have no preference for a particular location on your hard disk, the default choice will be fine.
- During the next step, you can specify which parts of R you want to install. The choices are User installation, Minimal user installation, Full installation, and Custom installation. Notice the required space under the selection panel (it varies between 20 and 66 MB). If you are a beginner in R, choose the default User installation.
- In this step, you can choose whether to customize the startup options or accept the defaults. If you accept the defaults, you skip the 3 "extra" steps during installation (see below).
- You can specify the Start menu folder.
- In the next step, you can choose among the shortcut options (desktop icon and/or quick launch icon) and specify registry entries.
With these steps, you can customize the R graphical user interface.
- You can choose whether you want an R graphical user interface covering the whole screen (MDI) or a smaller window (SDI).
- You can select the style in which the Help screen is displayed in R. You will use help a lot, so this may be an important decision. It is up to you which style you prefer. Please note that the content of the help file will be the same regardless of your choice. Here you specify just the appearance of that particular window.
- In the next step, you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.
Portable R for Windows
If you want to install R on your USB stick, go to the Portable R website. This is useful if you don't have admin rights on a computer. The basic installation requires about 115 MB, but you may need more if you want to install add-on packages.
Source: Calvin Andrus, Jon Cook, and Suresh Sood, https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/A_History_of_Data_Science
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.