CS250 Study Guide

Unit 2: Python for Data Science

2a. Create a Python notebook using Google Colab 

  • What is Google Colab?
  • Why are Python notebooks useful?
  • How can Python modules be installed in Google Colab?
  • Where are Google Colab notebooks stored?

Python notebooks are useful for testing blocks of Python code and for annotating that code with text cells. You can easily share them with other users, who can then experiment with the code they contain. Google Colaboratory (Colab) is an online Python notebook environment that comes with many of the modules commonly used by the Python community already installed, so in most cases nothing needs to be installed. If it becomes necessary to install a module, the command

!pip install module_name

can be used to invoke the installation.
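Before installing anything, it can help to check whether a module is already available. The following is a minimal sketch using the standard library; the helper name is_installed is hypothetical and chosen only for illustration, and the check is meant for top-level module names.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    # find_spec returns None when no module with this name can be located.
    return importlib.util.find_spec(module_name) is not None


# "math" ships with Python, so it is always found.
print(is_installed("math"))

# A made-up name (an assumption for this example) is not found.
print(is_installed("surely_not_a_real_module_xyz"))
```

If the check reports that a module is missing, the !pip install command shown above can be run in a Colab code cell.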

Using Google as the online service requires an awareness of where and how notebooks are stored. You must be familiar with your Google Drive. Colab creates a folder for your notebooks on your Google Drive and uses it as the default destination. Notebooks can also be downloaded from Colab so that they can be stored or used locally on a platform such as Jupyter Notebook. Within the Colab environment, you must understand how to create and delete code cells. In addition, there are several options for running code cells, depending on how you have arranged your code. Navigating these options must be part of your Google Colab skill set.

To review, see Introduction to Google Colab.

 

2b. Execute instructions using built-in Python data and control structures 

  • What are some important built-in Python data types?
  • What are some important built-in Python data structures?
  • What kind of loop structures does the Python language support?
  • Why is operator precedence important for if-else-elif statements?

This course depends on a basic working knowledge of the Python programming language. Python supports basic data types such as integers (int), floating-point numbers (float), strings (str), and booleans (bool). Lists and tuples are ordered containers, meaning their elements can be referred to using an index. Lists are mutable objects, meaning you can modify their elements. Tuples are immutable objects, meaning their elements cannot be modified once initialized. Sets and dictionaries are not indexed by position: a set is an unordered collection of unique elements, while a dictionary stores "key:value" pairs whose values are referenced by their keys.
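The data types and containers described above can be sketched in a few lines. All variable names and values here are made up for illustration.

```python
# Basic data types
count = 3          # int
ratio = 0.75       # float
name = "Colab"     # str
ready = True       # bool

# Lists are ordered and mutable: elements can be replaced by index.
scores = [90, 85, 70]
scores[0] = 95

# Tuples are ordered but immutable: assignment by index raises TypeError.
point = (1, 2)
tuple_error = False
try:
    point[0] = 5
except TypeError:
    tuple_error = True

# Sets keep only unique elements and have no positional index.
unique = {1, 2, 2, 3}

# Dictionaries map keys to values; values are looked up by key.
ages = {"ana": 30, "ben": 25}
ana_age = ages["ana"]

print(scores, point, unique, ana_age)
```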

It is important to understand relational, boolean, and arithmetic operators. Often, when applying if-elif-else control structures, complex boolean expressions are necessary. Constructing these expressions requires understanding that arithmetic operators are evaluated before relational operators (==, !=, >, <, >=, <=), that relational operators take precedence over the boolean logical operators (not, and, or), and that among the logical operators not takes precedence over and, which in turn takes precedence over or.
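A short sketch of how precedence changes the result of a boolean expression. The values of x and y and the function describe are hypothetical, chosen so that the two groupings give different answers.

```python
x = 5
y = 10

# Without parentheses, "and" binds more tightly than "or",
# and the comparisons are evaluated before either of them.
implicit = x > 3 or y < 5 and x == 0

# The same expression with the default grouping written out.
explicit = (x > 3) or ((y < 5) and (x == 0))

# Forcing a different grouping with parentheses changes the result.
regrouped = (x > 3 or y < 5) and x == 0

print(implicit, explicit, regrouped)


def describe(temperature, raining):
    """Hypothetical example of an if-elif-else chain with compound tests."""
    if temperature > 30 and not raining:
        return "hot and dry"
    elif temperature > 30 or raining:
        return "hot or wet"
    else:
        return "mild and dry"


print(describe(35, False))
print(describe(20, True))
print(describe(20, False))
```

When in doubt, adding parentheses makes the intended grouping explicit and costs nothing.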

Although this course does not emphasize object-oriented programming and class design, you should understand the syntax for defining functions and methods with the def keyword, calling them, and accessing class data attributes. Finally, Python supports "for" loops and "while" loops, so familiarity with iteration using these loop structures is a necessary part of your Python programming skills.
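The following sketch puts these pieces together: a function defined with def, a small class with a data attribute and a method, a for loop, and a while loop. The names mean, Counter, and halvings are hypothetical and exist only for this illustration.

```python
def mean(values):
    """Average a list of numbers using a for loop."""
    total = 0
    for v in values:
        total += v
    return total / len(values)


class Counter:
    """Tiny class with one data attribute (count) and one method."""

    def __init__(self, start=0):
        self.count = start

    def increment(self):
        self.count += 1
        return self.count


def halvings(n):
    """Count how many integer halvings bring n down to 1, using a while loop."""
    steps = 0
    while n > 1:
        n = n // 2
        steps += 1
    return steps


c = Counter()
c.increment()              # method call
c.increment()
print(mean([2, 4, 6]))     # function call
print(c.count)             # data attribute access
print(halvings(16))
```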

To review, see Data Types in Python, Functions, Loops, and Logic, Data Structures in Python, and Sets, Tuples, and Dictionaries.

 

2c. Apply methods for random numbers within the random module 

  • What is the random module?
  • What are some useful methods contained in the random module?
  • How can the random module be applied?

Random number generation is a significant facet of computer science. Simulations of random phenomena, such as data communications traffic, sunspot activity, and the weather, as well as simulations in data science, often require some form of random number generation (RNG). This course introduces several approaches to generating random numbers, but the starting point is the random module.

While the random module contains many methods, there are some that you must be familiar with. The seed method sets the random seed so that the RNG starts from the same point each time, which makes results reproducible. Data scientists must understand two of the most basic probability distributions: the uniform distribution and the normal distribution. The random method generates numbers from a uniform distribution within the interval [0.0, 1.0), and the uniform method generates numbers from a uniform distribution within the interval [a, b] (the endpoint b may or may not be included, depending on floating-point rounding). The randint method uniformly generates integers in the interval [a, b]. The gauss method generates numbers from a normal distribution, with the mean and standard deviation as input parameters. The getstate and setstate methods allow you to read the state of the RNG so that it can be saved, and to set the state so that a saved state can be restored later. These methods are the minimal set needed to run basic simulations in applied statistics.
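A minimal sketch of the methods named above. The seed value 42 and the interval endpoints are arbitrary choices for illustration.

```python
import random

# Seeding twice with the same value reproduces the same sequence.
random.seed(42)
first = [random.random() for _ in range(3)]
random.seed(42)
second = [random.random() for _ in range(3)]
print(first == second)

u = random.uniform(2.0, 5.0)   # float drawn uniformly between 2.0 and 5.0
n = random.randint(1, 6)       # integer from 1 to 6, both ends included
g = random.gauss(0.0, 1.0)     # normal distribution: mean 0.0, std dev 1.0

# Save the generator state, draw a number, restore the state, draw again.
state = random.getstate()
a = random.random()
random.setstate(state)
b = random.random()
print(a == b)
```

Restoring a saved state makes the generator repeat the same draw, which is why a and b are equal.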

To review, see Python's random Module.

 

2d. Implement basic plotting and data rendering instructions using the matplotlib module 

  • What is matplotlib?
  • How can matplotlib be applied?
  • What are some important methods in the matplotlib module?

A critical component of data science is visualization. This course introduces several modules for this purpose, but matplotlib is the starting point. Specifically, this unit emphasizes the pyplot portion of matplotlib and a set of introductory commands for getting a plot up and running. Furthermore, because numpy has not yet been introduced, lists are used as the data structure for creating two-dimensional plots; therefore, you must be clear on the syntax for plotting list data.

While the matplotlib.pyplot module contains many methods, there are some that you must be familiar with. Line plots using the plot method and scatter plots using the scatter method are fundamental. You should also know how to choose colors, line styles (such as dashed lines), and plot markers. Annotating plots requires methods like title, xlabel, ylabel, grid, and legend. These methods are the minimal set needed to begin the journey of data visualization.
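The sketch below plots list data using the pyplot methods named above. It assumes matplotlib is installed (as it is in Colab); the data values, labels, and the function name draw_example_plot are made up for illustration.

```python
import matplotlib.pyplot as plt


def draw_example_plot():
    """Draw a dashed line plot and a scatter plot from plain Python lists."""
    x = [0, 1, 2, 3, 4]
    y_line = [0, 1, 4, 9, 16]
    y_points = [1, 2, 5, 8, 15]

    fig = plt.figure()
    # Line plot: blue dashed line
    plt.plot(x, y_line, color="blue", linestyle="--", label="x squared")
    # Scatter plot: red circular markers
    plt.scatter(x, y_points, color="red", marker="o", label="sample points")

    # Annotations
    plt.title("Line and scatter plot from lists")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.grid(True)
    plt.legend()
    return fig


fig = draw_example_plot()
plt.show()
```

In a notebook, the figure appears directly below the code cell once the cell is run.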

To review, see Precision Data Plotting with matplotlib.

 

Unit 2 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • boolean
  • def
  • dictionary
  • floating point
  • for loop
  • function
  • gauss
  • getstate
  • if-elif-else
  • integer
  • list
  • logical operator
  • matplotlib
  • normal distribution
  • random
  • relational operator
  • set
  • setstate
  • string
  • tuple
  • uniform
  • while loop