Topic outline

  • Unit 5: The pandas Module

    This unit introduces the pandas module, which extends numpy array operations to dataframes that can hold spreadsheet-style data. When you finish this unit, you will be able to process pandas dataframes and perform the kinds of calculations outlined in the earlier units.

    The pandas module contains many methods that greatly simplify processing data files relevant to data science. Data collected from real-world situations is often messy, and can contain observations you might want to discard. The pandas module offers useful methods that can deal with such situations. We will also discuss methods for visualizing data using pandas.

    Completing this unit should take you approximately 4 hours.

    • Upon successful completion of this unit, you will be able to:

      • explain similarities and differences between dataframes and arrays;
      • apply instructions for cleaning data sets;
      • implement operations on dataframes;
      • write Python instructions for interacting with spreadsheet files; and
      • apply the built-in pandas visualization methods to visualize pandas dataframe data.
    • 5.1: Dataframes

      • The next step in our data science journey deals with elevating data sets from arrays to other formats typically encountered in practical applications. For example, it is very common for data to be housed in a spreadsheet (such as in Excel). In such applications, a given column of data need not be numerical (e.g., it may contain text, currency, or boolean values). Additionally, columns are given names for the sake of identification. You will also typically encounter files with the ".csv" extension, which indicates comma-separated values. CSV files are simply text files whose row elements are separated by commas, and they can usually be read by spreadsheet software. The pandas module is designed to handle these forms of data with indexing and slicing syntax similar to that of Python dictionaries and numpy arrays. The analogous data structures in pandas are series and dataframes. This course will emphasize dataframes since, in practical applications, data typically consists of several columns with different data types. Work through sections 4.1-4.5 of Chapter 4 and familiarize yourself with the basics of the pandas module.
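
        As a quick preview of this analogy, here is a minimal sketch of a series and a dataframe built from invented values; the column names and values are placeholders for illustration, not taken from the textbook:

        import pandas as pd

        # A Series behaves like a labeled numpy array
        s = pd.Series([21.0, 22.8, 18.7], index=['a', 'b', 'c'])
        print(s['b'])                # dictionary-style access by label

        # A DataFrame is a collection of named columns, like a spreadsheet
        df = pd.DataFrame({'model': ['A', 'B', 'C'],
                           'mpg': [21.0, 22.8, 18.7],
                           'cyl': [6, 4, 8]})
        print(df['mpg'])             # select one column by name
        print(df.iloc[0:2])          # select rows by position, numpy-style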

        As you work through the Python code in this resource, you will find that the instruction pd.read_csv('data/mtcars.csv') will generate an exception because that path assumes the data file mtcars.csv is stored on your local drive. Assuming

        import pandas as pd

        has been invoked, you can download the data from the textbook URL as follows
        url = 'https://raw.githubusercontent.com/araastat/BIOF085/master/data/mtcars.csv'
        df = pd.read_csv(url)

        which will create a dataframe named df. You can double-check that the correct data has been loaded by executing
        df.head(10)

        which will print out the first 10 rows of the dataframe.

        However, if you have downloaded a .csv file to your local drive and wish to load the data into a dataframe, the following instructions can be used:

        # read the data from your local drive (Google Colab)
        import io
        from google.colab import files
        uploaded = files.upload()    # opens a basic file-selection interface
        df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))

        This extra care must be taken for local files because Google Colab is a web-based engine and does not have direct access to your local drive. This set of commands will generate a basic file interface from which you can select your file named "filename.csv". You will, of course, need to edit the filename to match the name of the file you wish to upload.
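
        If your data file is stored in Google Drive rather than on your computer, an alternative (sketched here with a hypothetical path that you would edit to match your own Drive) is to mount your Drive inside Colab and then read the file with an ordinary path:

        from google.colab import drive
        drive.mount('/content/drive')
        df = pd.read_csv('/content/drive/MyDrive/filename.csv')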

      • There is no substitute for practice when learning a new programming tool. Use this video to practice, reinforce, and fill in any gaps in your understanding of indexing and manipulating dataframes.

    • 5.2: Data Cleaning

      • Data cleaning is one of the initial steps in the data science pipeline. In practical applications, we cannot always collect data in pristine form, and the associated dataframe can therefore contain anomalies. There can be missing cells, cells with nonsensical values, and so on. The pandas module offers several methods to deal with such scenarios.
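
        As a taste of what these methods look like, here is a minimal sketch using a small invented dataframe containing missing values and a nonsensical (negative) age; the column names and values are made up for illustration:

        import pandas as pd
        import numpy as np

        df = pd.DataFrame({'age': [25, np.nan, 31, -4],
                           'score': [88, 92, np.nan, 75]})

        df.isna().sum()                  # count missing values in each column
        df_dropped = df.dropna()         # discard rows containing missing values
        df_filled = df.fillna(df.mean()) # or fill missing values with column means
        df_valid = df[df['age'] > 0]     # filter out nonsensical ages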

      • These videos give a few more key examples of applying data cleaning methods. They are meant to serve as a summary and review of all pandas concepts we have discussed in this unit.

    • 5.3: pandas Operations: Merge, Join, and Concatenate

      • With the basics of pandas dataframes and series in hand, we can now begin to operate on these data structures. If the data is numerical, it is reasonable to think of a dataframe or series as an array, and arithmetic operations obey rules similar to those of numpy. In addition, pandas has been designed with database programming features such as the merge, join, and concatenate operations. Study these sections to practice and get an introduction to some basic pandas operations.
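
        To make these operations concrete, here is a minimal sketch using two small invented dataframes; the column names and values are placeholders:

        import pandas as pd

        left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
        right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

        merged = pd.merge(left, right, on='key', how='inner')          # database-style join on 'key'
        stacked = pd.concat([left, right], ignore_index=True)          # stack the rows of both dataframes
        joined = left.set_index('key').join(right.set_index('key'))    # join on the index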

      • Use these materials to practice slicing, indexing, and applying syntax for merging and filtering dataframes. You should notice that much of the syntax is consistent with what you already know from dictionary keys and array indexing. At this point, it should also be clear that the pandas module offers a much larger set of capabilities.
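
        For reference, the kind of selection and filtering syntax you will practice looks roughly like this, assuming df is the mtcars dataframe loaded earlier:

        df.loc[0:3, ['mpg', 'cyl']]              # rows by label, columns by name
        df.iloc[0:3, 0:2]                        # rows and columns by position
        df[df['cyl'] == 4]                       # boolean filtering, as with numpy masks
        df[(df['mpg'] > 20) & (df['cyl'] < 8)]   # combine conditions with & and |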

    • 5.4: Data Input and Output

      • We have already shown some initial examples of how to read a CSV file into pandas. This video gives a bit more depth regarding how to import and export dataframes by interacting with your local drive.
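
        For reference, the basic input and output calls look roughly like this; the filenames are placeholders for files on your own drive:

        import pandas as pd

        df = pd.read_csv('mydata.csv')            # read a CSV file from the working directory
        df.to_csv('output.csv', index=False)      # write a dataframe back out as CSV

        # Excel files work similarly (the openpyxl package is needed for .xlsx files)
        df_xl = pd.read_excel('mydata.xlsx')
        df_xl.to_excel('output.xlsx', index=False)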

      • When using Google Colab, some important differences must be considered when handling files (as discussed in the Python review unit). This video discusses various ways to import and export files when dealing with a web-based Python engine.
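
        For example, here is a short sketch of saving a dataframe inside a Colab session and downloading the result to your computer; the filename is a placeholder:

        from google.colab import files

        df.to_csv('results.csv', index=False)    # write the file inside the Colab session
        files.download('results.csv')            # trigger a browser download to your computer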

    • 5.5: Visualization Using the pandas Module

      • In the next unit, we will discuss the seaborn module for advanced visualization techniques. pandas comes with its own set of visualization and plotting methods (mainly derived from matplotlib). A good rule of thumb is that if your data is confined to lists, dictionaries, and numpy arrays, then matplotlib is a good choice for basic plotting. For series and dataframes, pandas offers analogous plotting capabilities of its own.
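
        As a preview, here is a minimal sketch of these built-in plotting methods; it assumes df is the mtcars dataframe loaded earlier:

        import matplotlib.pyplot as plt

        df['mpg'].plot(kind='hist', title='Miles per gallon')                # histogram of one column
        df.plot(kind='scatter', x='wt', y='mpg', title='mpg versus weight')  # scatter plot of two columns
        plt.show()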

      • Practice this code to see more examples of plotting using pandas methods. With these fundamentals in place, you will be well positioned for the next unit dealing with advanced visualization techniques.

    • Unit 5 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count towards your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.