Thinking about the World


Description

There are two major approaches to data science: analytical mathematics (including statistics) and visualization. These two categories are not mutually exclusive. However, mathematical analysis would be considered more of a "left-brain" approach, while visualization would reflect a more "right-brain" approach. Both are powerful approaches for analyzing data, and we should not choose one to the exclusion of the other. Visualization is a sensible vehicle for introducing the field because data relationships become immediately apparent to the naked eye. Use the materials in this section to compare and contrast analytic approaches versus visualization approaches. In this course, we will try to strike a healthy balance between the two.

Thinking Like a Visual Artist

Chapter Summary

Five design elements - Harmony, Rhythm, Flow, Balance, and Focus - are essential considerations when thinking like a visual artist. A data scientist must think creatively to combine these elements in appropriate proportions to convey the essential conclusions to the audience.


Discussion

There are a dozen or more "Rules of Thumb" to guide the data scientist to think like a visual artist. We will examine just a few.

Harmony

The Wiktionary defines Harmony as a pleasing combination of elements. In visual art, harmony is the concordant use of shape and color. Humans easily consume, and thus are attracted to, harmonious visualizations. A data scientist will want to make sure the visual representations of the solution blend together well. One aspect of harmony is to use a coherent color scheme consistently across an entire presentation. One popular approach is to take 3 or 4 adjacent colors from the color wheel, such as the "earth" tones on the upper right or the "cool" tones on the upper left. A "complementary" color scheme will take colors from opposite sides of the color wheel.

[Figure: BYR color wheel]

Similarly, when building a series of charts and graphs, the data scientist will want to have a consistent layout across them all, including using the same symbols and typefaces.
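
One simple way to enforce this in R is to define a single palette vector and reuse it in every chart. The sketch below uses invented values, and the hex codes are merely plausible earth tones, not an official scheme.

#Define one "earth tone" palette and reuse it across every chart
earth <- c("#8B5A2B", "#C19A6B", "#DAA520", "#6B8E23")   #hypothetical earth tones
sales <- c(12, 18, 9, 15)                                #invented values
barplot(sales, col=earth, names.arg=LETTERS[1:4], main="Chart 1")
pie(sales, labels=LETTERS[1:4], col=earth, main="Chart 2")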


Rhythm

The Wikipedia defines Rhythm as a pattern marked by the regulated succession of strong and weak elements, or of opposite or different conditions across time and/or space. When a data scientist discovers rhythms in the data, she must then consider how to communicate these visually. A sine wave shows a consistent rhythm over time.

[Figure: Sine wave amplitude]
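
A few lines of R reproduce this kind of regular rhythm.

#Plot two periods of a sine wave to show a consistent rhythm over time
t <- seq(0, 4*pi, length.out=200)
plot(t, sin(t), type="l", xlab="Time", ylab="Amplitude",
     main="A Consistent Rhythm: The Sine Wave")
abline(h=0, lty=2)   #reference line for the background level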

One useful device is to show deviations (blue line) in the data from the background rhythm (black line), such as this chart showing climate change over the last 550 million years.

[Figure: Phanerozoic climate change]

Another example is a depiction of the simulation of two waves interfering with each other.

[Figure: Simulation of two interfering waves]

A final example shows the correlated distributions between two variables. Because the highs and lows are off center but repeat symmetrically, the correlation is not linear; rather, it is regular and offset by a constant amount.

[Figure: Correlated distributions between two variables]

Flow

The Wiktionary defines Flow as the movement of a fluid. In visual art, flow is the illusion of movement through the manipulation of color and form. The human eye will naturally follow visual cues in the image. A data scientist will want to make sure the visual representations of the solution tell a "story" by inviting the eye to move from start to finish. Peter Paul Rubens was a Flemish Baroque painter who perfected artistic flow. In his painting, "Fall of Phaeton," one cannot help but move the eye toward the upper right.

[Figure: Rubens, "Fall of Phaeton"]

So it is with data scientists. Their analysis of the data leads to some conclusion. The trick is to depict the data in a way that helps lead the recipient to that conclusion. Sometimes flow is an inherent component of the data, such as the depiction of a fluid moving around a solid, with the resulting turbulence.

[Figure: Fluid flow around a sphere]
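
As a rough sketch, a vector field drawn with arrows() can suggest this kind of flow; the field below is invented for illustration.

#Sketch a simple vector field to suggest flow
pts <- expand.grid(x=seq(0, 2*pi, length.out=10),
                   y=seq(-1, 1, length.out=8))
u <- 0.3                    #horizontal component of each arrow
v <- 0.1 * sin(pts$x)       #vertical component varies across x
plot(pts$x, pts$y, type="n", xlab="x", ylab="y", main="A Simple Flow Field")
arrows(pts$x, pts$y, pts$x + u, pts$y + v, length=0.05)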

Many times, the data represent concepts that are inter-related, and those relationships are shown as a flow diagram, such as this one from the July 1987 Psychological Review.



Balance: Symmetry

Balance is an important visual cue. The simplest form of balance is symmetry. The Wikipedia defines Symmetry as self-similarity across time, space, or scale. Human beings find symmetry comforting. Many living things show symmetry in space, such as a butterfly.

[Figure: Bilateral symmetry of a butterfly]

Humans also create objects with symmetry, such as buildings, like the Taj Mahal.

[Figure: The Taj Mahal, Agra]

A data scientist will play to the human preference for symmetry in visually presenting the solutions to problems. In the following example, the four graphs show size symmetry - and three of them show scale symmetry. They are also placed so as to be both horizontally and vertically symmetric. This multiple symmetry aids in the understanding of the information being presented.

[Figure: Four symmetrically arranged graphs of training data]
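
In R, this kind of symmetric arrangement can be sketched with a 2 x 2 plotting grid; the curves below are placeholders.

#Arrange four related plots in a symmetric 2 x 2 grid
par(mfrow=c(2, 2))
x <- seq(-3, 3, length.out=100)
plot(x, dnorm(x), type="l", main="Panel 1")
plot(x, dnorm(x, sd=1.5), type="l", main="Panel 2")
plot(x, dnorm(x, sd=0.5), type="l", main="Panel 3")
plot(x, dnorm(x, sd=2), type="l", main="Panel 4")
par(mfrow=c(1, 1))   #restore the default single-plot layout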

Balance: Rule of Thirds

While symmetrical balance can help an audience to feel comfortable with data presentations, asymmetrical balance is often more visually appealing. We are not talking about just any random asymmetry, but two particular asymmetries. The first one we will examine is the Rule of Thirds and the second is the Golden Ratio. According to the Wikipedia, the "Rule of Thirds" proposes that an image should be imagined as divided into nine equal parts by two equally spaced horizontal lines and two equally spaced vertical lines, and that important compositional elements should be placed along these lines or their intersections. Proponents of the technique claim that aligning a subject with these points creates more tension, energy, and interest in the composition than simply centering the subject would.

In the example below, the picture has been cropped without and with the rule of thirds. The right-hand picture has the mountaintops along the axis of the bottom third, the stone outcropping positioned along the axis of the left third, and the clouds hovering just above the top third.

[Figure: The same photograph cropped without and with the rule of thirds]

Thinking like a visual artist means thinking of visualizing data in terms of the "rule of thirds." For example, the chart below takes advantage of the rule of thirds by placing the horizontal rules in thirds and adding the comment on the upper horizontal rule at about the position of the (imagined) right-most vertical rule.

[Figure: Chart using the rule of thirds]
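
As a rough sketch of the same idea in R, one can place the horizontal rules at the thirds of the y-axis and position a comment near the imagined right-hand vertical rule. The trend data here are invented.

#Chart using the rule of thirds (invented data)
set.seed(42)
y <- cumsum(rnorm(30, mean=0.5))
plot(y, type="l", ylim=c(0, 30), xlab="Time", ylab="Value",
     main="Chart Using the Rule of Thirds")
abline(h=c(10, 20), lty=3)                       #horizontal rules at thirds of the 0-30 range
text(x=20, y=20, labels="Comment here", pos=3)   #near the right-hand third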

Balance: Golden Ratio

Although somewhat similar to the Rule of Thirds, the Golden Ratio is a much better developed concept, both in mathematical theory and in application to real-world problems. According to the Wikipedia, the "Golden Ratio" is defined by two quantities where the ratio of the sum of the quantities to the larger quantity is equal to the ratio of the larger quantity to the smaller one. The figures below illustrate the relationship.

[Figure: Golden ratio line segments]

and

[Figure: Similar golden rectangles]


The golden ratio is expressed algebraically as:

\dfrac{a+b}{a} = \dfrac{a}{b} = \phi

where the Greek letter "Phi" ( \phi ) represents the golden ratio. Its value is:

\phi = \dfrac{1 + \sqrt{5}}{2} = 1.6180339887\ldots

Many artists and architects have proportioned their works to approximate the golden ratio - especially in the form of the golden rectangle, in which the ratio of the longer side to the shorter is the golden ratio - believing this proportion to be aesthetically pleasing. A golden rectangle can be cut into a square and a smaller rectangle with the same aspect ratio. Mathematicians since Euclid have studied the golden ratio because of its unique and interesting properties. The golden ratio is also used in the analysis of financial markets, in strategies such as Fibonacci retracement. The golden ratio is commonly used in everyday design, for example in the shapes of postcards, playing cards, posters, wide-screen televisions, photographs, and light-switch plates. Studies by psychologists have been devised to test the idea that the golden ratio plays a role in human perception of beauty, such as female waist-to-hip ratios, male shoulder-to-hip ratios, and forehead-to-face ratios. While some early studies showed support for this hypothesis, later attempts to carefully test the hypothesis have been inconclusive.
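
As a quick numerical check, the defining property can be verified in R. This is a minimal sketch; the segment lengths are arbitrary.

#Compute phi and verify the defining ratio numerically
phi <- (1 + sqrt(5)) / 2
a <- phi                        #longer segment (arbitrary units)
b <- 1                          #shorter segment
(a + b) / a                     #prints 1.618034
a / b                           #also prints 1.618034
all.equal((a + b) / a, a / b)   #TRUE, up to floating-point error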

  • One classic example of the application of the golden ratio is the Great Mosque of Kairouan, built in 670 AD in Tunisia. The golden ratio is repeated from the overall design to the individual rooms and columns. Not only is it one of the oldest places of worship in the Islamic world, it is also one of the most impressive and largest Islamic monuments in North Africa. The mosque is a masterpiece of both architecture and Islamic art.
[Figure: 3D computer model of the Great Mosque of Kairouan]

  • Another example is Salvador Dalí's The Sacrament of the Last Supper. The dimensions of the canvas are a golden rectangle. A huge dodecahedron, in perspective so that edges appear in golden ratio to one another, is suspended above and behind Jesus and dominates the composition. See an image of the painting at the National Gallery of Art.
As a data scientist develops more sophisticated analytical skills, she will also need to develop more sophisticated visual presentation skills. The Golden Ratio is one way to put analytical and visual sophistication in harmony with each other. One simple way is to divide the presentation charts into sections like the example below:

[Figure: Golden spiral in rectangles]
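
One way to sketch such a division in R is with the layout() function, splitting the page into two sections whose widths are in the golden ratio. This is a minimal illustration with placeholder plots, not a prescription.

#Divide the plotting page into golden-ratio sections
phi <- (1 + sqrt(5)) / 2
layout(matrix(c(1, 2), nrow=1), widths=c(phi, 1))   #widths in ratio phi:1
plot(1:10, main="Major Section")                    #placeholder content
plot(10:1, main="Minor Section")
layout(1)                                           #restore the default layout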

Focus

The Wiktionary defines Focus as the concentration of attention. The data scientist will want to create visualizations to draw the audience's attention to the important point. Visual artists create focus by contrasting size (scale), color, and page position. It is important to make sure the visual elements have a function that supports the content. For example, in Francisco de Goya's painting, "The Vintage," he wants to highlight the grape harvest. First, he uses a triangular positioning of the people so that the grapes are at the apex of the triangle. He also places the grapes in the center of a bright (triangular) cloud that is surrounded by a dark cloud. He masterfully uses both design and color to bring focus to the subject of his painting.

[Figure: Francisco de Goya, "The Vintage"]

Here is a good use of focus to highlight conclusions based on data. The following charts depict above- and below-average GDP (PPP) per capita by country from the World Factbook, and malaria risk by country from the Centers for Disease Control.

[Figure: Above- and below-average GDP (PPP) per capita by country]
[Figure: Malaria risk by country]
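
As a small illustration in R, focus can be created by contrasting the size and color of one point against a muted background. The data are invented.

#Draw attention to one observation by contrasting size and color
set.seed(7)
x <- rnorm(30)
y <- rnorm(30)
plot(x, y, pch=16, col="gray60", main="Focus on One Point")
points(x[which.max(y)], y[which.max(y)], pch=16, col="red", cex=2.5)
text(x[which.max(y)], y[which.max(y)], "Key point", pos=4)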


Creativity

The data scientist must use creativity to combine these five elements (Harmony, Rhythm, Flow, Balance, and Focus) together in appropriate proportions to convey the messages in interesting and informative ways to the audience. The creative process includes divergent thinking, which involves the generation of multiple answers to a problem; conceptual blending, in which solutions arise from the intersection of two quite different frames of reference; and, honing, in which an acceptable solution emerges from iterating over many successive unacceptable versions of the solution. In practice, creativity is often a team sport. When several people from diverse backgrounds come together to solve a problem, they can more easily engage in divergent thinking and conceptual blending. Honing is just good, old-fashioned elbow grease. Please do not fall in love with your first attempt. Think of your first attempt as the beginning (not the end) of a conversation that will engage both members of the data science team, as well as members of the potential audience for the team's findings.

The following graph is a good example of a science graphic using all five visual elements. See if you can identify how harmony, rhythm, flow, balance, and focus are used.

  • The image shows that the atoms in a molecule can be modeled as charged spheres connected by springs that maintain bond lengths and angles. The charged atoms interact with each other (via Coulomb's law) and with the solvent. The shroud represents the region of hydrophobic repulsion, where the strength of the hydrophobic effect is approximately proportional to the surface area of the shroud. The shroud, shown extending only over the back of the molecule, actually extends all the way around it. The model shown is called a molecular mechanics potential energy function, and it is used by programs like Folding@Home to simulate how molecules move and behave. The molecule shown is an alanine dipeptide.
[Figure: Molecular mechanics potential energy function of an alanine dipeptide]


Assignment/Exercise

Use R to create some tables and plots. Get into groups of 2 to 3 students. Try to work with at least one other person you have not been in a group with before. Be sure everyone in your group understands all of the R code as you execute it. Let's start by examining a categorical variable.

#Generate Table for Categorical Variable

#Remove Objects in workspace and print date
rm(list=ls())
paste ("Today is:", date())

#Create a nominal (categorical) variable with 10,000 observations
#Use the sample() function
temp.a <- sample( LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )

#List the values
temp.a

#Get a summary - for nominal variables it gives a count
summary(temp.a)

#Produce a simple table showing the frequency distribution of the values of the variable
table(temp.a)

#Do a summation across all the categories, a total that we would put in the margin
#Note that in order for the margin.table() function to work properly it needs a numeric input
#The table(temp.a) produces four numbers, which are passed to margin.table()
margin.table(table(temp.a))

#Now let's get R to print the frequencies and margins together
addmargins(table(temp.a))

#Similarly, in order to calculate the percentage distribution, we need to pass a numeric argument
prop.table(table(temp.a))
addmargins(prop.table(table(temp.a)))

#The Hmisc library has a function that simply puts all of these together
library(Hmisc)
describe(temp.a)

Let's try exploring a continuous variable.

#Generate Table for Continuous Variable

#Remove Objects in workspace and print date
rm(list=ls())
paste ("Today is:", date())

#Create a continuous (numerical) variable with 10,000 observations
#Use the rnorm() function to generate 10,000 random numbers with a mean of 0 and a standard deviation of 1
temp.a <- rnorm(10000, mean = 0, sd = 1)

#List the values
temp.a

#Get a summary - for continuous variables it gives quartiles
summary(temp.a)

#Get descriptive statistics
min(temp.a)      #minimum
max(temp.a)      #maximum
range(temp.a)    #range
median(temp.a)   #median
mean(temp.a)     #mean
var(temp.a)      #variance
sd(temp.a)       #standard deviation

# The "describe" function in the "psych" library gives all these with one command
# item name ,item number, nvalid, mean, sd, median, mad (median absolute deviation),
# min, max, skew, kurtosis, se (standard error of the mean)
library(psych)
describe(temp.a)

Now try this simple plot.

#Generate plots

#Remove Objects in workspace and print date
rm(list=ls())
paste ("Today is:", date())

#The plot character (pch=) parameter specifies which symbol will be used in the plot
#Need x, y, and pch vectors from 1 to 25
temp.x<-1:25
temp.y<-1:25
temp.p<-1:25

#Set up plot of y on x using default plot character (no pch= specified)
plot(
     main="Simple Plot of Y on X",
     x=temp.x, xlim=c(0,26), xlab="Independent Variable",
     y=temp.y, ylim=c(0,26), ylab="Dependent Variable"
    )

You will need to add the "calibrate" package to your R workspace in order to use the textxy() command in the following example. Use the "Package Manager" and "Package Installer" commands from the R console pull-down menus, or run install.packages("calibrate") at the R prompt. The packages chapter in the R Programming Wikibook is a place to start.

Now, let's replace that first plot command with the following.

#Set up plot of y on x specifying a different plot character for each point
plot(
     main="Simple Plot of Y on X with Plot Characters",
     x=temp.x, xlim=c(0,26), xlab="Independent Variable",
     y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
     pch=temp.p
    )

#Use textxy to label points with value of pch,
#Shift axes slightly so labels don't sit on top of points
temp.xshift <- temp.x-1.25
temp.yshift <- temp.y-0.2

#Use character expansion (cx=)
library(calibrate)
textxy(temp.xshift, temp.yshift, temp.p, cx=.6)

Be sure to print out a copy of that plot to use for reference.

Now, let's try changing the colors.
#Set up plot of y on x specifying different colors for each point
plot(
     main="Simple Plot of Y on X with Colored Characters",
     x=temp.x, xlim=c(0,26), xlab="Independent Variable",
     y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
     pch=16,
     col=temp.p
    )

#Use textxy to label points with value of pch,
#Shift axes slightly so labels don't sit on top of points
temp.xshift <- temp.x-1.25
temp.yshift <- temp.y-.2

#Use character expansion (cx=)
library(calibrate)
textxy(temp.xshift, temp.yshift, temp.p, cx=.6)

And, now let's change the size of the points in the plot
#Set up plot of y on x specifying different sizes for each point
#Use the character expansion (cex=) parameter
#A cex=1 is the default; a cex=2 is twice the default size; and a cex=.5 is half the default size

#So as not to make the points too big, let's
#Transform the temp.p variable from 1-25 to 0.2-5
temp.c = temp.p/5

plot(
     main="Simple Plot of Y on X with Sized Characters",
     x=temp.x, xlim=c(0,26), xlab="Independent Variable",
     y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
     pch=16,
     col=3,
     cex=temp.c
    )

#Use textxy to label points with value of pch,
#Shift axes slightly so labels don't sit on top of points
temp.xshift <- temp.x-1.25
temp.yshift <- temp.y-.2

#Use character expansion (cx=)
library(calibrate)
textxy(temp.xshift, temp.yshift, temp.p, cx=.6)

Finally, let's draw a line through the points.
#Set up plot of y on x drawing a line through the points

plot(
     main="Simple Plot of Y on X with Sized Characters",
     x=temp.x, xlim=c(0,26), xlab="Independent Variable",
     y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
     pch=15,
     col=5,
     cex=2
    )

#Find the slope and intercept of the line
#Use the linear model (lm()) function and store results into the R object (temp.line)
#The linear model is: Y = a + bX, where a is the intercept and b is the slope

temp.line <- lm(temp.y~temp.x)
temp.line

#Now plot a line with that intercept (a) and slope (b)
abline(temp.line)


Source: Calvin Andrus, Jon Cook, and Suresh Sood, https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/Thinking_Like_a_Visual_Artist
Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

Thinking Like a Data Engineer

Chapter Summary

When a data scientist thinks like a data engineer, they think in terms of tables. The tasks are to define the rows, columns, and cells of the tables; to associate tables with one another; and to create systems to ingest, store, and retrieve tables.

Discussion

Data engineering is the data part of data science. According to Wikipedia, data engineering involves acquiring, ingesting, transforming, storing, and retrieving data. Data engineering is closely related to Data Collection, Information Engineering, Knowledge Engineering, Information Management and Knowledge Management.

Data engineering starts with an understanding of the general nature of the problems to be solved. A data acquisition and management plan must be formulated which specifies where the data are coming from (RSS feeds, sensor network, pre-existing data repository), the format of the incoming data (text, numbers, images, video), and how the data will be stored and retrieved (file system, database management system). Raw data is "dirty". There will be records in the raw data that do not conform to data definitions that have been agreed upon. For example, in one hospital data set, several young boys aged 7 to 11 gave birth to babies. Clearly, there are mistakes in this data. Part of the data acquisition and management plan is deciding what to do with dirty data (leave it, erase it, infer corrections).
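
As a minimal sketch of that decision in R, with invented records, one can flag rows that violate an agreed-upon definition and then decide whether to leave, erase, or correct them.

#Flag records that violate a data definition:
#patients recorded as giving birth should be at least 12 years old (invented rule and data)
patients <- data.frame(id=1:5,
                       age=c(7, 25, 9, 31, 11),
                       gave_birth=c(TRUE, TRUE, FALSE, TRUE, TRUE))
dirty <- patients$gave_birth & patients$age < 12
patients[dirty, ]              #inspect the dirty records
clean <- patients[!dirty, ]    #one option: erase them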

Most of the time, raw data is not in the format that the analytical tools are expecting to see. Indeed, each tool will want to see data in its own particular way. Therefore, one task of data engineering is to transform the data so that it can be consumed by the analytical tools the data science team will use. For example, a team might receive egg laying data with each observation in its own row like the following:

Chicken  Day  Eggs
A        1    3
A        2    4
A        3    2
B        1    1
B        2    0
B        3    2

But what the analysis the team wants to do requires all of the observations about each chicken to be in one row only as follows:

Chicken  Day1  Day2  Day3
A        3     4     2
B        1     0     2

Good data engineering requires both the ability to manipulate data and an understanding of the analytic purposes to which the data are going to be used.

In the egg-laying example above, the first table is in a normalised form that enables further analysis; the second table is formatted to present data to the user. Often the formatting makes implicit assumptions about the questions being asked of the data - such as "what are the trends in egg laying by chicken over time?". Other questions, such as "on how many occasions did a chicken lay no eggs?", are easier to answer with the data in the normalised form.

Often the sources for an analysis are outputs from another system - so, for example, an egg-laying database may well internally store data in the 3-column format but export a report in the "many columns" format. One of the tasks of a data engineer is to transform captured data, which may well involve re-normalising data from output reports.
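
As a sketch of this transformation in R, the base reshape() function can pivot the egg-laying data from the normalised "one observation per row" format to the "one chicken per row" report format and back.

#Transform the egg-laying data between long and wide formats
eggs <- data.frame(Chicken=rep(c("A", "B"), each=3),
                   Day=rep(1:3, times=2),
                   Eggs=c(3, 4, 2, 1, 0, 2))
wide <- reshape(eggs, idvar="Chicken", timevar="Day", direction="wide")
wide                    #columns: Chicken, Eggs.1, Eggs.2, Eggs.3
long <- reshape(wide)   #reverses the pivot, re-normalising the report
long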

Wikipedia defines database normalization as the process of organizing the fields and tables of a relational database to minimize redundancy and dependency – usually by dividing larger tables into smaller (and less redundant) tables and defining relationships between them. The main objectives of normalisation are to:

  • avoid update and deletion anomalies
  • minimize redesign when extending the database structure
  • support general-purpose querying, including future queries that are not anticipated at design time
Suppose the egg-laying data is extended to store the age and colour of each chicken. This could be represented in a table like this:

Chicken  Age  Colour  Day  Eggs
A        2    Brown   1    3
A        2    Brown   2    4
A        2    Brown   3    2
B        1    White   1    1
B        1    White   2    0
B        1    White   3    2

This table now contains redundant information, since we are storing the age and colour of each chicken 3 times. This becomes inefficient if we are storing data for hundreds of days for each chicken. Furthermore, if Chicken B turns 2 years old we would have to synchronise the changes to records 4, 5 and 6 in order to update the age data. The normalised solution would be to have a separate "chicken" table for chicken-related facts that is linked to the "egg laying" table by a unique identifier or key.

Wikipedia defines a primary key as a unique identifier for a record in a table in a relational database. Some data sets have naturally unique keys (such as employee_id for an employee table); in other cases a unique key will need to be system generated, either as an internal 'one-up' counter or by combining several attributes to create one (such as Chicken_Day in the example above). Other tables can cross-reference a table by using its primary key. For example, a 'project' table could have a column with the employee_id of each team member associated with the project. This 'cross-referencing' column is known as a foreign key.
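
As a sketch in R, the normalised design can be represented as two data frames linked by the Chicken key; merge() joins them back together when an analysis needs both sets of facts.

#Chicken facts in their own table, linked to the egg-laying table by a key
chickens <- data.frame(Chicken=c("A", "B"),
                       Age=c(2, 1),
                       Colour=c("Brown", "White"))
laying <- data.frame(Chicken=rep(c("A", "B"), each=3),
                     Day=rep(1:3, times=2),
                     Eggs=c(3, 4, 2, 1, 0, 2))
#Chicken is the primary key of 'chickens' and a foreign key in 'laying'
chickens$Age[chickens$Chicken == "B"] <- 2   #one update, no anomalies
merge(laying, chickens, by="Chicken")        #join recreates the combined view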

Entity relationship diagrams (also known as logical data models) are used to design relational databases and can be a good way of understanding the structures in a data set. The 3 building blocks of an Entity Relationship model are entities, attributes and relationships. An entity is a discrete and recognisable 'thing', either a physical object such as a car (or a chicken), or concept such as a bank transaction or a phone call. Each entity can be physically represented as a table, where each column of the table is an attribute of the entity (such as employee_id, forename, surname, date of joining). A relationship is a verb that links two or more entities. For example, a chicken 'lays' eggs or an employee 'belongs to' a department. Importantly, relationships also have a cardinality that can be 'one to one', 'many to one', 'one to many' or 'many to many'. For example, a chicken can lay many eggs but each egg is laid by only one chicken so the 'lays' relationship is one to many. Many to many relationships are often a sign that a design needs to be further elaborated. For example, the 'teaches' relationship between teachers and students at a university would be many to many and would require the introduction of entities like class and date to fully understand the relationship. An example entity relationship diagram is shown below:

[Figure: Example entity relationship diagram showing the relationship between students and lecturers]

More advanced data engineering also requires knowledge of computer programming and the Structured Query Language (SQL), as well as relational and NoSQL database management systems. For the purposes of this book, we will use the R programming language for simple data engineering tasks.


Assignment/Exercise

This assignment is about reading data sets into R data frames. Assemble into groups of 3 or 4 students. Every single student must do every part of this exercise. The purpose of grouping is to help each other understand what is going on. Some of these assignments require some trial and error. Different students will try different things, and thus all will learn from each other's trials and errors.

Part 1 of 3: Within R, create 4 variables, each with 12 observations.

#Create data frame
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#30 August 2012

#Remove Objects in workspace
rm(list=ls())

#Create four variables with 12 observations each
#Weather data for Sterling, VA from http://www.weather.com/weather/wxclimatology/monthly/USVA0735
#Retrieved 30 August 2012
#Average Temperature (Fahrenheit)
#Average Precipitation (inches)
Num <- 1:12
Month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
AveTemp <-c(32, 35, 43, 53, 62, 72, 76, 75, 67, 55, 46, 36)
AvePrcp <-c(2.85, 2.86, 3.60, 3.62, 4.72, 3.92, 3.70, 3.49, 4.00, 3.59, 3.58, 3.09)

#List out the objects that exist in the R environment
ls()

#Verify each variable
Num
Month
AveTemp
AvePrcp

#Link these four variables together into a dataset where the 12 observations correspond to one another
#Give the dataset a name (Wthr) using the data.frame() function

Wthr <- data.frame(Num, Month, AveTemp, AvePrcp)

#List out the objects that exist in the R environment
ls()

#Notice that the 4 variables are still part of the R environment in addition to the dataframe
#The variables are now also part of the data frame
#Verify the contents of the dataset
Wthr

#Verify the formats within the data frame using the "structure" (str) command
str(Wthr)

#Notice that as part of the data frame the variables have a dollar sign ($) as a prefix
#Compare the Month variable inside and outside the data frame
str(Month)
str(Wthr$Month)

#Whoops! What happened? When we inserted the character variable Month into the data frame, it was converted to a factor variable.
#We call the values of a factor variable "levels"
#Factor variables are nominal variables, which means the default is that order does not matter; this is called an "unordered" factor.
#Therefore R does two things by default:
#  1) R sorts the levels into alphabetical order
#  2) R assigns an integer to each level in that alphabetical order, so in calendar order the codes appear scrambled: 5, 4, 8, 1, 9, etc.
#For this particular problem the order of the months does matter.
#We can force an order on a factor by using the factor() function
#This is called an "ordered" factor
levels(Wthr$Month)
Wthr$Month <- factor(Wthr$Month, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

#Note that we could also have specified levels=Month; can you explain why?
#Verify that the factor levels are now ordered properly, with the assigned integers in order
levels(Wthr$Month)
str(Wthr$Month)
Wthr

#We can now remove the redundant variables from the R workspace
rm("AvePrcp", "AveTemp", "Month", "Num")
ls()

#The dataframe is the only object left
#Now let's do some plots
plot(x=Wthr$Month, y=Wthr$AveTemp)
lines(Wthr$Month,fitted(loess(Wthr$AveTemp~Wthr$Num)))
plot(x=Wthr$Month, y=Wthr$AvePrcp)
plot(x=Wthr$AveTemp, y=Wthr$AvePrcp, pch=16, cex=1.5)
abline(lm(Wthr$AvePrcp~Wthr$AveTemp))

Part 2 of 3. Load an example data set into a data frame.

#Put Example Data into Data Frame 
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#31 August 2012

#Remove Objects in workspace
rm(list=ls())

#Find out the available datasets
data()

#Pick a dataset and get the help file
?state

#Load the dataset into the R workspace
data(state)

#Find out what got loaded
ls()

#Examine the objects that were loaded
str(state.abb)
str(state.area)
str(state.x77)

#Notice that the last object was not a simple variable with a single set of observations, but
#it is a matrix that is 50 rows long and 8 columns wide
#Inspect a summary of these data
summary(state.abb)
summary(state.x77)

#Print out the contents of these objects
state.abb
state.x77

#Now let's put these objects into a data frame called "state" and inspect it
state <- data.frame(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
ls()
str(state)

#Remove the old objects, now that we have put the data set into a data frame
rm(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
ls()

#Print out the data frame
state

#Examine the relationships among the variables using table() and plot(), then
#Try about 10 different variations on both the table() and the plot() functions
table(state$state.region,state$state.division)
plot(state$Illiteracy,state$Murder)

Part 3 of 3 - Import an external data set.

  1. Find Fisher's Iris Data Set in the Wikipedia.
  2. Copy the data table and paste it into Microsoft Excel, Apple Numbers, or Google Docs Spreadsheet
  3. Save the dataset in Comma Separated Value (CSV) format on your desktop, with a filename of "iris.csv"
  4. Read the dataset into R
  5. Inspect the data, make sure it is all there, then look at the data using the summary(), table(), and plot() functions
#Read External Data into Data Frame 
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#30 August 2012

#Remove Objects in workspace
rm(list=ls())

#Find out what our default working directory is
getwd()

#Set your working directory to the "desktop" and verify
#You will need to use your own directory structure
setwd("/Users/Calvin/Desktop/")
getwd()

#Read the iris.csv into a dataframe -- and verify
#  The first line of the file should be the variable names, hence header=TRUE
#  Tell R that the separator is a comma
#  If there are other lines above the variable names, then you will need to skip them
iris <- read.table("iris.csv", header=TRUE, sep=",", skip=0)
str(iris)
iris

#You should have gotten 150 observations on 5 variables
#Explore the data using summary(), table(), and plot()
summary(iris)
table(iris$Species)
plot(iris$Sepal.length,iris$Sepal.width)

#Create a character variable to match a color to the factor variable Species
#Note how the R code implements the following English statement
#  If the variable "iris$Species" has the value "I. setosa", then set the "iris$plotcolor" variable to the value "blue"
iris$plotcolor <- as.character("black")
iris$plotcolor [iris$Species == "I. setosa"] <- "blue"
iris$plotcolor [iris$Species == "I. versicolor"] <- "green"
iris$plotcolor [iris$Species == "I. virginica"] <- "red"

plot(
   main="Plot of Sepal Size for Three Iris Species",
   x=iris$Sepal.width, xlim=c(1,5), xlab="Sepal Width",
   y=iris$Sepal.length, ylim=c(3,8), ylab="Sepal Length",
   pch=16,
   col=iris$plotcolor
  )
legend(1.5, 3.5,"Setosa=Blue, Versicolor=Green, Virginica=Red")

#Now, plot the Petal Length and Width
#Compare Sepal Width with Petal Width
#Compare Sepal Length with Petal Length

Thinking Like a Mathematician

Chapter Summary

When a data scientist thinks like a mathematician, they think in terms of measurement and models. The tasks are to decompose the problem into its basic components; represent those components numerically; and combine the components into an accurate expression of the problem and its solution.


Discussion

According to the Wikipedia, mathematics is the study of quantity, structure, space, and change. When these are used to solve practical problems, it is called applied mathematics. In addition to these main concerns, there are also topics dedicated to exploring links from the heart of mathematics to other fields: to logic, to set theory, and more recently to the study of uncertainty. For the purposes of this book, we will not explore these last three aspects of mathematics.


Quantity

The study of quantity starts with numbers, first the familiar natural numbers and integers ("whole numbers") and the basic arithmetical operations on them, which are characterized in arithmetic. As the number system is further developed, the integers are recognized as a subset of the rational numbers ("fractions"). These, in turn, are contained within the real numbers, which are used to represent continuous quantities. Real numbers are generalized to complex numbers.

Natural numbers: 1, 2, 3, \ldots
Integers: \ldots, -2, -1, 0, 1, 2, \ldots
Rational numbers: -2, \dfrac{2}{3}, 1.21
Real numbers: -e, \sqrt{2}, 3, \pi
Complex numbers: 2, i, -2 + 3i, 2e^{i \dfrac{4\pi}{3}}

When thinking like a mathematician, a data scientist needs to ask the questions, "how will the thing I am interested in be represented by numbers?" and "what kind of numbers will best represent the thing I am interested in?"
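
R can represent several of these kinds of numbers directly, as this short sketch shows.

#Representing different kinds of numbers in R
n <- 3L                               #integer
r <- 2/3                              #rational, stored as a double
x <- sqrt(2)                          #real, approximated as a double
z <- complex(real=-2, imaginary=3)    #complex number -2+3i
c(class(n), class(r), class(x), class(z))
z * Conj(z)                           #complex arithmetic: 13+0i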

Structure

Many sets of mathematical objects exhibit internal structure. Mathematics exposes these structures by applying rules (axioms and operations) to the objects. Algebra is a powerful tool to understand mathematical structures. It combines the concept of variables with arithmetic to solve equations. Algebra is applied to many different, and seemingly unrelated, problems. Some of the structures it studies include sets, rings, groups, graphs, and fields.







[Figures: Sets, Rings, Groups, Graphs, Fields]

When thinking like a mathematician, a data scientist needs to ask the questions, "what sort of internal structure does the thing I am interested in have?" and "what set of equations will expose the structure?"
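
As a small illustration, R can expose the structure of a system of linear equations by solving it directly; the equations here are invented.

#Solve the system 2x + 3y = 8 and x - y = 1
A <- matrix(c(2,  3,
              1, -1), nrow=2, byrow=TRUE)
b <- c(8, 1)
solve(A, b)   #returns x = 2.2, y = 1.2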

Space

The study of space originates with geometry, in particular, Euclidean geometry. Trigonometry is the branch of mathematics that deals with relationships between the sides and the angles of triangles; it combines space and numbers, and encompasses the well-known Pythagorean theorem. The advanced study of space includes higher-dimensional geometry, non-Euclidean geometries, differential geometry, topology, fractal geometry, and measure theory. For the purposes of this book, we will not cover these more advanced geometries.








[Figures: Geometry, Trigonometry, Differential geometry, Topology, Fractal geometry, Measure theory]

When thinking like a mathematician, a data scientist needs to ask the questions, "does the thing I am interested in have a spatial component, either actual or theoretical?" and "how do I capture and represent that spatial component?"
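
As a tiny example of capturing a spatial component, R can apply the Pythagorean theorem directly.

#Compute a distance in space with the Pythagorean theorem
a <- 3
b <- 4
hyp <- sqrt(a^2 + b^2)            #hypotenuse = 5
hyp
dist(rbind(c(0, 0), c(3, 4)))     #distance between two points: also 5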

Change

Understanding and describing change is a common theme in science, and calculus was developed as a powerful tool to investigate it. Functions are a central concept describing a changing quantity. Many problems lead naturally to relationships between a quantity and its rate of change. That is, for a non-straight line, the slope is different at every point on the line. These changing slopes are studied in differential calculus. Finding the area under a curve is called integral calculus. Calculus is beyond the scope of this book.



[Figure: Tangent line at (x, f(x)). The derivative f′(x) of a curve at a point is the slope (rise over run) of the line tangent to that curve at that point.]

[Figure: Integration can be thought of as measuring the area S under a curve, defined by f(x), between two points (here a and b).]

Thinking as a mathematician, the data scientist must ask, "does the relationship between the things I am interested in change (over time or over distance)?" and "how will I describe the changing relationship?"
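
Although calculus itself is beyond our scope, R can approximate a rate of change and an area under a curve numerically, as this sketch shows.

#Approximate a derivative and an integral numerically for f(x) = x^2
f <- function(x) x^2
x <- seq(0, 2, by=0.001)
slope <- diff(f(x)) / diff(x)      #numerical slope between adjacent points
slope[1000]                        #about 2 near x = 1 (exact: f'(x) = 2x)
integrate(f, lower=0, upper=2)     #area under the curve: 8/3 = 2.667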


Applied Math

Applied mathematics concerns itself with mathematical methods that are typically used in science, engineering, business, and industry. Thus, "applied mathematics" is math with specialized knowledge. Generally speaking, this is the kind of math that Data Scientists practice.



[Figure: Efficient solutions to the vehicle routing problem require tools from mathematics.]

[Figure: Modelling financial markets is done with mathematics.]

Assignment/Exercise

This is Project #2, which spans four chapters. Assemble into groups of 3 or 4 students. A group of three may not have the same members as the group for Project #1. A group of four may have no more than two students repeating from the group for Project #1. This group will do the entire project together.

  1. Replicate Galileo's "inclined plane" experiment. Start by designing the research and write down your plan. List materials needed, specify methods to be used, identify variables to be measured, create data recording sheets, etc.
  2. Conduct the experiment according to the design. Take pictures. Record your data results.
  3. Enter the data into R. Use R to produce tables and draw plots of your data. See if you can draw the theoretical curve Galileo was trying to discover on your data plots.
  4. Prepare a slide presentation that includes a description of your methods, pictures of your apparatus, a table of your raw data, a table of your analyzed results, plots of your results, a list of several things the group learned on its own about data science during the course of this project.
Note: Your group can divide up the tasks, but everyone needs to participate in all phases of the assignment. Also, the chapters covered to this point do not teach you everything you need to know to do this assignment. Please do the best you can with what you know. This assignment is not just to show the instructor how much of the previous chapters you have learned; the assignment is a learning experience in and of itself. It is designed for the students to discover knowledge not contained in the chapters.