CS250 Study Guide

Unit 5: The pandas Module

5a. Explain similarities and differences between dataframes and arrays

  • What is a series?
  • What is a dataframe?
  • How does indexing a pandas dataframe compare with indexing a numpy array?

The pandas module is built on top of the numpy module to handle data in the style of a spreadsheet or a table within a relational database. You can think of a pandas series as holding one-dimensional data. A pandas dataframe can be thought of as a container for two-dimensional data where the column names can be used to refer to data within a row. Generally speaking, a numpy one-dimensional array serves much the same purpose as a series: both data structures are designed to hold homogeneous data.

While the bulk of the course focuses on dataframes, you should feel comfortable creating series using the Series command and creating dataframes using the DataFrame command. You should be aware of the flexibility of using various data structures such as lists, dictionaries, and numpy arrays for initializing a series or a dataframe.
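
As a minimal sketch (using made-up values and column names), the following shows a series and a dataframe being initialized from a list, a dictionary, and a numpy array:

    import numpy as np
    import pandas as pd

    # A one-dimensional numpy array and a pandas Series both hold homogeneous data
    arr = np.array([10, 20, 30])
    s = pd.Series([10, 20, 30], index=["a", "b", "c"])

    # A dataframe can be initialized from a dictionary of lists...
    df_from_dict = pd.DataFrame({"height": [1.62, 1.75, 1.80],
                                 "weight": [61.0, 72.5, 80.1]})
    # ...or from a two-dimensional numpy array with explicit column names
    df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

    print(s["b"])                  # refer to a series value by its index label
    print(df_from_dict["height"])  # refer to a dataframe column by name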

Once you have created a dataframe, the pandas module offers several different ways of referencing and operating on the data. You should be familiar with referencing data using the column names and the index. Additionally, you should understand how slicing works with dataframes. Label-based slicing in pandas works differently than numpy slicing with indices: the last value is included when slicing with labels. This is critical to understand when applying loc (which extracts elements using the dataframe index and column names with pandas slicing) versus iloc (which extracts elements as if the dataframe were an array, using row and column indices with numpy slicing).
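
As a small illustration of the difference (with hypothetical labels and values), compare a label-based loc slice with a position-based iloc slice:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3, 4],
                       "b": [5, 6, 7, 8],
                       "c": [9, 10, 11, 12]},
                      index=["w", "x", "y", "z"])

    # loc slices by labels, and the last label IS included
    print(df.loc["w":"y", "a":"b"])   # rows w, x, y and columns a, b

    # iloc slices by integer position, and the stop value is EXCLUDED (numpy-style)
    print(df.iloc[0:2, 0:2])          # rows 0, 1 and columns 0, 1 only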

To review, see Dataframes.

 

5b. Apply instructions for cleaning data sets 

  • What is data cleaning?
  • What are some methods helpful in identifying and counting missing values?
  • What are some pandas approaches available for cleaning a data set?

In practice, sampling a population and creating a dataset can easily leave you with a dataframe containing missing values. If such values are not dealt with at the start of a data science endeavor, they can lead to spurious calculations with unintended consequences (such as an arithmetic exception). Data cleaning simply means that an approach has been taken to identify missing values and either remove them or fill them in, in a principled way. As a first step towards identifying missing and non-missing data, methods such as isna and notna should be reviewed. Recall that you can use these methods in tandem with other methods, such as sum, to count the number of missing values along a given axis.
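
A brief sketch (with made-up data) of counting missing values by combining isna with sum:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": [np.nan, np.nan, 6.0]})

    print(df.isna())              # True marks a missing entry; notna gives the opposite
    print(df.isna().sum())        # count of missing values per column
    print(df.isna().sum(axis=1))  # count of missing values per row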

Once missing values have been identified, a couple of options are at your disposal. If the percentage of missing values is small, then the dropna method can remove them without compromising the overall statistical content of the data. It is important to know how to apply the axis, how, and inplace input parameters to be very specific about how the missing values are to be removed. If the percentage of missing values is too high for dropping to be viable, then another option is to fill in those values in a principled manner using the fillna method. With this approach, there are many options to choose from. For example, you can use the method input parameter to perform forward or backward fills. Additionally, you could choose a fill value based upon, for example, the column mean or median. The choice of fill is quite data-dependent and, in reality, may not always lead to a positive scientific outcome. Regardless, as a data scientist, you must be aware of the tools at your disposal for dealing with missing data.
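
The following sketch (again with made-up data) shows typical dropna and fillna calls; the specific parameter choices are illustrative rather than prescriptive:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                       "b": [5.0, 6.0, np.nan, 8.0]})

    # Drop any row containing a missing value; how="all" would require the
    # entire row to be missing, and axis=1 would drop columns instead
    cleaned = df.dropna(axis=0, how="any")

    # Forward fill propagates the last valid value ("bfill" fills backward)
    filled_ffill = df.fillna(method="ffill")

    # Fill each column's missing values with that column's mean
    filled_mean = df.fillna(df.mean())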

To review, see Data Cleaning.

 

5c. Implement operations on dataframes

  • What types of models have dataframes been designed to work with?
  • What types of operations are important for working with dataframes?
  • What are the syntax details for implementing dataframe operations?

The concept of the dataframe exists to encompass spreadsheet and relational database models. Therefore, it is essential to understand how to operate on dataframes within this context. On the spreadsheet side of dataframes, arithmetic operations on (sets of) columns or (sets of) rows need to be mastered. On the database side, one thinks of performing queries and operations such as join and concatenate for merging tables.

You can use the query method for dataframe queries. Although any query can also be expressed using boolean indexing and other pandas methods, phrasing it with the query method can often simplify the syntax. For example, you can reduce an isin method call to the in set operation.
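
For instance (using a hypothetical dataframe), the same selection can be written with isin and boolean indexing or with query and the in operation:

    import pandas as pd

    df = pd.DataFrame({"city": ["Oslo", "Lima", "Cairo", "Lima"],
                       "temp": [12, 25, 30, 24]})

    # Boolean indexing with isin
    subset1 = df[df["city"].isin(["Lima", "Cairo"])]

    # The equivalent selection phrased with the query method and the in operation;
    # additional conditions can be chained with "and"
    subset2 = df.query("city in ['Lima', 'Cairo']")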

When performing arithmetic operations on dataframes, your indexing skills must be solid enough to refer to the appropriate columns. You must also be aware of the subtleties and default behavior of dataframe operations. For example, consider the addition of a series to a dataframe. In this case, you can think of the series as a row vector that will be broadcast and added elementwise to each row in the dataframe. Additionally, if the dataframe has columns that do not appear in the series index, those extra columns will end up with NaN (missing) values in the result.
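
As a small illustration (hypothetical values), adding a series to a dataframe aligns the series index with the dataframe columns:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
    s = pd.Series({"a": 10, "b": 100})

    # The series acts like a row vector broadcast to every row;
    # column "c" has no matching entry in the series, so it becomes NaN
    print(df + s)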

Finally, you should be clear about the subtleties of merging dataframes. By default, the concat method concatenates dataframes along the row dimension. As with database operations, you can apply the join method in various contexts, such as inner, outer, left, and right, using the how input parameter. After performing these join operations, you should know if and how you will fill in any missing values. If you understand the inner, outer, left, and right join operations, you can predict where missing values will appear.
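
A minimal sketch (with invented keys) of concat and of inner versus outer joins:

    import pandas as pd

    left = pd.DataFrame({"a": [1, 2, 3]}, index=["k0", "k1", "k2"])
    right = pd.DataFrame({"b": [4, 5, 6]}, index=["k1", "k2", "k3"])

    # concat stacks dataframes along the rows by default (axis=0)
    stacked = pd.concat([left, right])

    # join aligns on the index; how controls which keys survive
    inner = left.join(right, how="inner")   # only k1 and k2 remain
    outer = left.join(right, how="outer")   # k0 through k3, with NaN where data is missing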

To review, see pandas Operations: Merge, Join, and Concatenate.

 

5d. Write Python instructions for interacting with spreadsheet files 

  • What are some common file formats for handling spreadsheet data?
  • What are some useful pandas methods for reading from and writing to spreadsheet files?
  • What is the syntax for invoking relevant input parameters?

Spreadsheet programs are instrumental for organizing data. Two popular formats for interacting with spreadsheets are the Excel format (.xls, .xlsx) and the comma-separated values (.csv) format. The pandas module is designed to handle these formats (and others, such as SQL and JSON). The pandas read_csv and read_excel methods can read spreadsheet files, and the to_csv and to_excel methods write to them in the appropriate format. These methods are indispensable for interacting with the outside world. You can use them both for local file storage and for accessing files via a URL.
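
As a sketch (the filenames are hypothetical), reading and writing both formats looks like this:

    import pandas as pd

    # Reading works for local paths as well as URLs
    df = pd.read_csv("measurements.csv")
    # df = pd.read_excel("measurements.xlsx")

    # Writing back out in either format
    df.to_csv("measurements_copy.csv", index=False)
    df.to_excel("measurements_copy.xlsx", index=False)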

In addition to the syntax for calling spreadsheet methods and specifying filenames, it is also important to understand the syntax for various input parameters. Practical experience with spreadsheets is useful for understanding the utility of multiple sheets. The sheet_name parameter of the to_excel method allows you to name the sheet being created (and an ExcelWriter object lets you write several sheets to the same file). On the other hand, when applying the read_excel method, the sheet_name parameter is used to select a sheet. Finally, when writing to a spreadsheet using the to_excel method, it is also helpful to know how to suppress the dataframe index from being included within the file using the index parameter.
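
The following sketch (with hypothetical filenames and sheet names) writes two sheets to one workbook and reads one of them back:

    import pandas as pd

    df1 = pd.DataFrame({"a": [1, 2, 3]})
    df2 = pd.DataFrame({"b": [4, 5, 6]})

    # An ExcelWriter lets to_excel place each dataframe on its own sheet;
    # index=False suppresses the dataframe index in the output file
    with pd.ExcelWriter("report.xlsx") as writer:
        df1.to_excel(writer, sheet_name="first", index=False)
        df2.to_excel(writer, sheet_name="second", index=False)

    # read_excel uses sheet_name to select which sheet to load
    df_first = pd.read_excel("report.xlsx", sheet_name="first")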

To review, see Data Input and Output.

 

5e. Apply the built-in pandas visualization methods to visualize pandas dataframe data 

  • Why does pandas include visualization methods?
  • What is the syntax for invoking visualization methods in pandas?
  • How is this syntax similar to other visualization modules such as matplotlib?

There is a fair amount of overlap between pandas and other data science modules such as numpy, seaborn, and matplotlib. This is for the sake of convenience, as it is sometimes simpler, from a programming standpoint, to connect commonly used method calls directly to a dataframe.

Like matplotlib, pandas allows for a spectrum of plotting methods such as line plots (line), box plots (box), bar plots (bar), histograms (hist), scatter plots (scatter), and so on. As with other visualization modules such as seaborn, these techniques can also be invoked using the plot method and the kind input parameter. One important distinction is that dataframe plotting methods allow the option to handle missing values by either dropping them or filling them in. Hence, data cleaning can be applied in tandem with plot method calls.
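
As a brief sketch (using randomly generated data), the same dataframe can be plotted in several ways, optionally cleaning missing values first:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": np.arange(10),
                       "y": np.random.randn(10).cumsum()})
    df.loc[4, "y"] = np.nan                # introduce a missing value

    df.plot(x="x", y="y")                  # line plot (the default kind)
    df.plot(x="x", y="y", kind="scatter")  # same data as a scatter plot
    df["y"].dropna().plot(kind="hist")     # drop missing data before plotting
    plt.show()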

To review, see Visualization Using the pandas Module.

 

Unit 5 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • concat
  • concatenate
  • data cleaning
  • DataFrame
  • dropna
  • fillna
  • iloc
  • isin
  • isna
  • join
  • loc
  • notna
  • query
  • read_csv
  • read_excel
  • Series
  • to_csv
  • to_excel