Topic: Unit 5: The pandas Module | CS250: Python for Data Science | Saylor Academy

This unit introduces the pandas module, which generalizes numpy array operations to dataframes that hold data such as the contents of a spreadsheet. When you finish this unit, you will be able to process pandas dataframes and perform the calculations outlined in the earlier units.

The pandas module contains many methods that greatly simplify processing data files relevant to data science. Data collected from real-world situations is often messy, and can contain observations you might want to discard. The pandas module offers useful methods that can deal with such situations. We will also discuss methods for visualizing data using pandas.

Completing this unit should take you approximately 4 hours.

Upon successful completion of this unit, you will be able to:

explain similarities and differences between dataframes and arrays;

apply instructions for cleaning data sets;

implement operations on dataframes;

write Python instructions for interacting with spreadsheet files; and

apply the built-in pandas visualization methods to visualize pandas dataframe data.

5.1: Dataframes
- pandas Dataframes Book
  
  The next step in our data science journey is to move from arrays to other data formats commonly encountered in practical applications. For example, data is very often housed in a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical; it may instead hold text, currency, boolean values, and so on. Additionally, columns are given names for the sake of identification. You will also often encounter files with the ".csv" extension, which indicates comma-separated values. CSV files are simply text files in which the entries of each row are separated by commas, and they can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize dataframes since, in practical applications, data usually consists of several columns with different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
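
To make the comparison with dictionaries and arrays concrete, here is a minimal sketch. The column names and values are hypothetical and are not taken from the textbook; they only illustrate a dataframe whose columns have different data types.

```python
import pandas as pd

# Hypothetical data: each column has its own data type.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cara"],          # text
        "salary": [52000.0, 48500.0, 61000.0],   # currency stored as floats
        "full_time": [True, False, True],        # boolean
    }
)

# Dictionary-style access: a column name acts like a key and returns a series.
salaries = df["salary"]

# Array-style access: iloc slices by integer position, as with numpy arrays.
first_two = df.iloc[0:2, 0:2]

print(type(salaries).__name__)
print(first_two)
print(df.dtypes)
```

A single column of a dataframe is a series, which is why the two structures share most of their syntax.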
  
  As you work through the Python code in this resource, you will find that the instruction pd.read_csv('data/mtcars.csv') raises an exception (a FileNotFoundError) because it assumes that the data file mtcars.csv is stored in a local folder named data. Assuming
  import pandas as pd
  
  has been executed, you can instead load the data directly from the textbook URL as follows:
  url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
  df = pd.read_csv(url)
  
  which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
  df.head(10)
  
  which will print out the first 10 rows of the dataframe.
  
  However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:
  
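  # Read a CSV file from your local drive (works only inside Google Colab)
  import io
  import pandas as pd
  from google.colab import files
  # Open a file picker; the result maps each uploaded filename to its contents
  uploaded = files.upload()
  # Wrap the uploaded bytes in a file-like object so read_csv can parse them
  df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))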
  
  This extra care must be taken for local files because Google Colab runs on a remote server and cannot read your local drive directly. This set of commands opens a basic file interface from which you can select your file. You will need to replace "filename.csv" in the last line with the name of the file you actually upload.
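
The io.BytesIO step works because pd.read_csv accepts a file-like object as well as a path or URL. The sketch below imitates that step outside Colab; the bytes in raw_bytes are a hypothetical stand-in for the contents of an uploaded file.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for an uploaded CSV file.
raw_bytes = b"model,mpg,cyl\nMazda RX4,21.0,6\nDatsun 710,22.8,4\n"

# read_csv accepts a file-like object, so the bytes are wrapped in BytesIO.
df_upload = pd.read_csv(io.BytesIO(raw_bytes))

print(df_upload.head())
```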
- How pandas Dataframes Work Page
  
  There is no substitute for practice when learning a new programming tool. Use this video to exercise, reinforce and fill in any gaps in your understanding of indexing and manipulating dataframes.
5.2: Data Cleaning
- Data Cleaning Book
  
  Data cleaning is one of the initial steps in the data science pipeline. In practical applications, data is not always collected in pristine form, so the associated dataframe can contain anomalies. There can be missing cells, cells that hold nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios.
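
As a minimal sketch of what such methods look like, the example below builds a small hypothetical dataframe with one missing cell per column and one nonsensical value (a negative age). It then shows two common strategies: discarding incomplete rows and filling the gaps. The data and the choice of the column mean as a fill value are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with missing cells and one nonsensical value.
df_raw = pd.DataFrame(
    {
        "age": [34, np.nan, 29, -5],
        "height_cm": [170.0, 165.0, np.nan, 180.0],
    }
)

# Count the missing cells in each column.
missing_counts = df_raw.isna().sum()

# Treat negative ages as missing, since they cannot be valid observations.
df_marked = df_raw.copy()
df_marked.loc[df_marked["age"] < 0, "age"] = np.nan

# Option 1: discard every row that contains a missing cell.
df_dropped = df_marked.dropna()

# Option 2: fill missing cells with the mean of their column instead.
df_filled = df_marked.fillna(df_marked.mean())

print(missing_counts)
print(df_dropped)
print(df_filled)
```

Which option is appropriate depends on the data set: dropping rows loses observations, while filling introduces values that were never measured.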
- More on Data Cleaning Page
  
  These videos give a few more key examples of applying data cleaning methods. They are meant to serve as a summary and review of all pandas concepts we have discussed in this unit.
5.3: pandas Operations: Merge, Join, and Concatenate
- pandas Data Structures Book
  
  With the basics of pandas dataframes and series in hand, we can now begin to operate on these data structures. If data is numerical, it is acceptable to think of a dataframe or series as an array, and arithmetic operations obey rules similar to those of numpy. In addition, pandas has been designed with database programming features such as the merge, join, and concatenate operations. Study these sections to practice and get an introduction to some basic pandas operations.
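
The following sketch shows elementwise arithmetic alongside the three database-style operations. The two small tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Arithmetic is elementwise, as with numpy arrays.
doubled = left["x"] * 2

# merge: database-style inner join on the shared column.
merged = pd.merge(left, right, on="key", how="inner")

# join: combines on the index, so the key is moved into the index first.
joined = left.set_index("key").join(right.set_index("key"), how="left")

# concat: stacks dataframes; here one set of rows is appended to another.
stacked = pd.concat([left, left], ignore_index=True)

print(merged)
print(joined)
print(stacked)
```

The inner merge keeps only keys present in both tables, while the left join keeps every key from the left table and marks unmatched entries as missing.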
- pandas Dataframe Operations Page
  
  Use these materials to practice slicing, indexing, merging, and filtering dataframes. You should recognize a measure of consistency with syntax you already know, such as accessing values by dictionary key or by array index. At this point, it should also be clear that the pandas module offers a much larger set of capabilities.
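
A minimal sketch of slicing, indexing, and filtering follows. The small dataframe is an illustrative stand-in modeled loosely on the mtcars columns; treat its values as hypothetical.

```python
import pandas as pd

# Illustrative car data with row labels; values are for demonstration only.
cars = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7, 14.3], "cyl": [6, 4, 8, 8]},
    index=["Mazda RX4", "Datsun 710", "Hornet Sportabout", "Duster 360"],
)

# Label-based access with loc, similar to looking up a dictionary key.
datsun_mpg = cars.loc["Datsun 710", "mpg"]

# Position-based slicing with iloc, similar to slicing a numpy array.
first_rows = cars.iloc[:2]

# Filtering with a boolean mask: keep rows where the condition is True.
efficient = cars[cars["mpg"] > 20]

# Conditions are combined with & and |, each wrapped in parentheses.
eight_cyl_low = cars[(cars["cyl"] == 8) & (cars["mpg"] < 15)]

print(datsun_mpg)
print(efficient)
print(eight_cyl_low)
```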
5.4: Data Input and Output
- Importing and Exporting Page
  
  We have already shown some initial examples of how to read a CSV file into pandas. This video gives a bit more depth regarding how to import and export dataframes by interacting with your local drive.
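
As a minimal sketch of a round trip, the example below writes a dataframe to a CSV file and reads it back. The data and the file name example.csv are hypothetical, and a temporary folder is used so that nothing is left behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data to export.
df_out = pd.DataFrame({"city": ["Lima", "Oslo"], "temp_c": [19.5, 4.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")

    # index=False keeps the row labels out of the file.
    df_out.to_csv(csv_path, index=False)

    # Read the file back into a new dataframe.
    df_in = pd.read_csv(csv_path)

print(df_in)
```

Without index=False, the row labels are written as an extra unnamed column, which then reappears as a data column when the file is read back.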
- Loading Data into pandas Dataframes Page
  
  When using Google Colab, some important differences must be considered when handling files (as discussed in the Python review unit). This video discusses various ways to import and export files when dealing with a web-based Python engine.
5.5: Visualization Using the pandas Module
- Using pandas to Plot Data Page
  
  pandas comes with its own set of visualization and plotting methods, which are mainly derived from matplotlib. A good rule of thumb is that if your data is confined to lists, dictionaries, and numpy arrays, then matplotlib is a good choice for basic plotting. If your data is held in series or dataframes, pandas offers comparable plotting capabilities. In the next unit, we will discuss the seaborn module for advanced visualization techniques.
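
A minimal sketch of plotting directly from a dataframe is shown below. The data is hypothetical, matplotlib is assumed to be installed (pandas relies on it for plotting), and the "Agg" backend line is only an assumption that lets the sketch run without a display; it can be omitted in a notebook such as Colab.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook

import pandas as pd

# Hypothetical monthly totals.
sales = pd.DataFrame(
    {"month": [1, 2, 3, 4], "units": [10, 14, 9, 17], "returns": [1, 2, 1, 3]}
)

# DataFrame.plot returns a matplotlib Axes, so matplotlib methods still apply.
ax = sales.plot(x="month", y=["units", "returns"], kind="line", title="Monthly totals")
ax.set_ylabel("count")
```

Because the returned object is a matplotlib Axes, labels and other details can be adjusted with the matplotlib methods covered earlier in the course.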
- Plotting with pandas Page
  
  Practice this code to see more examples of plotting using pandas methods. With these fundamentals in place, you will be well positioned for the next unit dealing with advanced visualization techniques.
Unit 5 Assessment
- Unit 5 Assessment Quiz
  
  Take this assessment to see how well you understood this unit.
  
  This assessment does not count towards your grade. It is just for practice!
  
  You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
  
  You can take this assessment as many times as you want, whenever you want.
