CS250 Study Guide

Unit 4: Applied Statistics in Python

4a. Apply methods for random numbers within the numpy module

  • How does vectorized programming work with numpy random number generation?
  • What are some similarities between the Python random module and the numpy.random module?
  • What are some important numpy.random methods?

The key to mastering programming using numpy is to migrate from scalar thinking to vectorized thinking. While methods from the Python random module generate a single random number per call, the numpy.random module is designed to generate vectors and arrays of random numbers. Both contain methods to generate random numbers from basic distributions such as uniform, normal, lognormal, exponential, beta, and gamma. However, numpy.random offers a much larger set of distributions and is designed for vectorized programming. Because its output is an array, you should be familiar with computing important quantities such as the sum, max, min, and mean along rows or columns using the axis parameter.
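The contrast between scalar and vectorized generation, and the effect of the axis parameter, can be sketched as follows. This is a minimal illustration only: the seed, the array shape, and the variable names are arbitrary choices made for the example.

```python
import random
import numpy as np

# Scalar thinking: the standard-library random module returns one number per call.
single_value = random.uniform(0.0, 1.0)

# Vectorized thinking: numpy.random fills a whole array in one call.
np.random.seed(0)  # arbitrary fixed seed so the run is repeatable
samples = np.random.uniform(0.0, 1.0, size=(3, 4))  # 3 rows, 4 columns

# axis=0 collapses the rows, giving one result per column (shape (4,)).
column_means = samples.mean(axis=0)

# axis=1 collapses the columns, giving one result per row (shape (3,)).
row_sums = samples.sum(axis=1)
row_max = samples.max(axis=1)

# Omitting axis reduces over the whole array and returns a single number.
overall_min = samples.min()

print(column_means.shape, row_sums.shape, row_max.shape)
```

A useful habit is to check the shape of each result: the axis you name is the one that disappears.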

The set of random number generators for numpy.random is quite large. As a data scientist, you may not have an immediate need for all of them, but there are some distributions that you should be aware of (such as uniform, normal, lognormal, logistic, Poisson, and binomial). This means not only understanding what the methods compute, but also being highly familiar with their input parameters, their syntax, and the order in which the parameters appear in a method call. For example, randint generates uniformly distributed random integers from low (inclusive) up to high (exclusive). Additionally, the array dimensions can be specified with the size parameter as a tuple. The normal method generates array data from a normal distribution whose mean and standard deviation are specified with the loc and scale parameters. The main goal is to connect your understanding of basic probability distributions with the necessary input parameters. For instance, the poisson method takes the input parameter lam, the expected number of events per interval. Other methods for random sampling such as shuffle and choice should also be reviewed.
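The calls below sketch the parameter order and meaning for a few of these methods. The specific values (a six-sided die, a mean of 170, a rate of 3) are made up for illustration and are not taken from the course materials.

```python
import numpy as np

np.random.seed(42)  # arbitrary fixed seed so the run is repeatable

# randint(low, high, size): integers drawn uniformly from low up to,
# but not including, high. Here that means the values 1 through 6.
dice = np.random.randint(1, 7, size=(2, 5))

# normal(loc, scale, size): loc is the mean, scale is the standard deviation.
heights = np.random.normal(loc=170.0, scale=10.0, size=1000)

# poisson(lam, size): lam is the expected number of events per interval.
calls = np.random.poisson(lam=3.0, size=1000)

# choice draws random elements from a 1-D array; replace=False avoids repeats.
picked = np.random.choice(np.arange(10), size=3, replace=False)

# shuffle reorders an array in place and returns None.
deck = np.arange(5)
np.random.shuffle(deck)

print(dice.shape, heights.mean(), calls.mean(), picked, deck)
```

Note that shuffle modifies its argument rather than returning a new array, so assigning its result to a variable is a common mistake.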

To review, see Random Number Generation and Using np.random.normal.

 

4b. Apply statistical methods within the scipy.stats module 

  • How is scipy.stats similar to numpy?
  • What are some useful methods for applying the scipy.stats module?
  • What are some important statistical tests that can be implemented using the scipy.stats module?

The scipy.stats module is built upon the numpy module. Both modules can generate random numbers for random simulations from various probability distributions. The scipy.stats module goes a bit further as it can perform a wide range of statistical tests and build statistical models.

Concerning scipy.stats usage, you should be comfortable with methods for generating summary statistics such as mode, tmin, tmax, tmean, tvar, skew, kurtosis, moment, and entropy. You should also recognize the consistency of the syntax amongst various random variables for method calls, such as rvs, mean, std, ppf, pmf (for discrete distributions), and pdf (for continuous distributions). When using a method such as rvs for a given distribution, you must be familiar with the input parameter syntax relevant to the specific distribution. In other words, the rvs method will require different parameters for generating random data using distributions such as chi2, norm, f, binom, and lognorm.
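The consistency of the method names, and the way the shape parameters change from one distribution to the next, can be seen in a short sketch. The sample data and parameter values here are invented for the example.

```python
import numpy as np
from scipy import stats

data = np.array([2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 12.0])

# Summary statistics computed from a sample.
sample_skew = stats.skew(data)
sample_kurtosis = stats.kurtosis(data)  # excess (Fisher) kurtosis by default
trimmed_mean = stats.tmean(data, limits=(3.0, 8.0))  # mean of values in [3, 8]

# Continuous distribution: norm takes loc (mean) and scale (standard deviation).
normal_draws = stats.norm.rvs(loc=0.0, scale=1.0, size=5, random_state=0)
density_at_zero = stats.norm.pdf(0.0)   # height of the density curve at 0
upper_quantile = stats.norm.ppf(0.975)  # value with 97.5% of the mass below it

# Discrete distribution: binom takes n (trials) and p (success probability),
# and uses pmf where a continuous distribution uses pdf.
binom_draws = stats.binom.rvs(n=10, p=0.5, size=5, random_state=0)
prob_five = stats.binom.pmf(5, n=10, p=0.5)
binom_mean = stats.binom.mean(n=10, p=0.5)  # equals n * p

# chi2 is parameterized by its degrees of freedom, df.
chi2_draws = stats.chi2.rvs(df=3, size=5, random_state=0)

print(trimmed_mean, upper_quantile, prob_five, binom_mean)
```

The method names (rvs, pdf or pmf, ppf, mean) stay the same across distributions; only the distribution-specific parameters differ.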

Lastly, scipy.stats can perform a broad range of statistical tests. You should be aware of tests such as the t-test for the mean of one group of scores (ttest_1samp), the t-test for the means of two independent samples of scores (ttest_ind), the Shapiro-Wilk test for normality (shapiro), the one-way chi-square test (chisquare), and skewtest, which tests whether the skewness of a sample differs from that of a normal distribution.
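A brief sketch of how these tests are called is given below. The two groups are synthetic data generated just for this example, and the observed counts are invented; each test returns a statistic and a p-value.

```python
import numpy as np
from scipy import stats

np.random.seed(1)  # arbitrary fixed seed so the run is repeatable
group_a = np.random.normal(loc=50.0, scale=5.0, size=40)
group_b = np.random.normal(loc=53.0, scale=5.0, size=40)

# One-sample t-test: is the mean of group_a different from 50?
t_one, p_one = stats.ttest_1samp(group_a, popmean=50.0)

# Two-sample t-test: do two independent groups have different means?
t_two, p_two = stats.ttest_ind(group_a, group_b)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
w_stat, p_shapiro = stats.shapiro(group_a)

# One-way chi-square: do the observed counts differ from equal expected counts?
observed = np.array([18, 22, 20, 25, 15, 20])
chi2_stat, p_chi2 = stats.chisquare(observed)

# skewtest: does the sample skewness differ from that of a normal distribution?
# It requires at least 8 observations.
z_skew, p_skew = stats.skewtest(group_a)

print(p_one, p_two, p_shapiro, p_chi2, p_skew)
```

In each case a small p-value is evidence against the null hypothesis; the threshold (often 0.05) is a choice made by the analyst, not by the library.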

To review, see Descriptive Statistics in Python and Statistical Modeling with scipy.

 

4c. Apply the scipy.stats module for solving data science problems 

  • What statistical quantities are important for solving data science problems?
  • How are such quantities computed using the scipy.stats module?
  • What types of problems can be modeled and simulated using the scipy.stats module?

In addition to understanding the syntax for invoking scipy.stats methods, as a data scientist, your goal is to understand how to apply them to data science problems. This means you should be clear about computing quantities such as the Z-score using the zscore method and confidence intervals based upon the empirical mean and standard deviation.
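As a minimal sketch of both quantities, the example below standardizes a small set of made-up scores and builds a 95% confidence interval for their mean. The use of the t distribution for the critical value is an assumption suited to small samples; with large samples a normal critical value is commonly used instead.

```python
import numpy as np
from scipy import stats

scores = np.array([62.0, 70.0, 75.0, 81.0, 88.0, 90.0, 94.0])

# zscore standardizes each value as (x - mean) / std, using ddof=0 by default.
z = stats.zscore(scores)

# 95% confidence interval for the mean, built from the empirical mean
# and the standard error of the mean.
n = scores.size
mean = scores.mean()
std_err = scores.std(ddof=1) / np.sqrt(n)  # sample standard deviation / sqrt(n)

# Critical value from the t distribution with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low = mean - t_crit * std_err
ci_high = mean + t_crit * std_err

print(z)
print(ci_low, ci_high)
```

The same interval can be obtained from stats.t.interval by passing the confidence level, the degrees of freedom, and the loc and scale values computed above.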

Part of your skill set as a data scientist is knowing how to apply your knowledge of probability distributions to model a given set of data. According to the Central Limit Theorem, sums (or averages) of many independent random variables are approximately normally distributed, so problems involving such sums can be modeled using the normal distribution. The binomial distribution can be useful for modeling sums of Bernoulli variables (such as the number of heads in a series of coin flips). Counts of events occurring at random within a fixed interval (such as the number of phone calls arriving per hour) can be modeled using the Poisson distribution. Dice rolls can be modeled using a discrete uniform distribution. Mastering this aspect of data science will enable you to create simulations that can model and explain real data.
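These modeling ideas can be tried out with a small simulation. The sketch below is illustrative only: the seed, the number of repetitions, the rate of 4 calls per hour, and the choice of 30 dice per experiment are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

seed = 7  # arbitrary fixed seed so the run is repeatable

# Binomial as a sum of Bernoulli trials: 10 fair coin flips, repeated 10,000 times.
flips = stats.bernoulli.rvs(p=0.5, size=(10_000, 10), random_state=seed)
heads = flips.sum(axis=1)  # one head count per experiment
print(heads.mean(), stats.binom.mean(n=10, p=0.5))

# Poisson: number of phone calls per hour with an average rate of 4.
calls_per_hour = stats.poisson.rvs(mu=4.0, size=10_000, random_state=seed)
print(calls_per_hour.mean())

# Dice: discrete uniform on 1..6 (scipy's randint excludes the high value).
rolls = stats.randint.rvs(low=1, high=7, size=(10_000, 30), random_state=seed)

# Central Limit Theorem: the average of 30 rolls is approximately normal,
# centered at 3.5 with standard deviation sqrt(35/12) / sqrt(30).
roll_means = rolls.mean(axis=1)
clt_std = np.sqrt(35.0 / 12.0) / np.sqrt(30.0)
print(roll_means.mean(), roll_means.std(), clt_std)
```

Comparing the simulated means and standard deviations with the theoretical values is a quick check that the chosen distribution and parameters match the problem being modeled.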

To review, see Statistics in Python and Probabilistic and Statistical Risk Modeling.

 

Unit 4 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • Bernoulli distribution
  • binomial distribution
  • Central Limit Theorem
  • chisquare
  • choice
  • confidence interval
  • poisson
  • rvs
  • shapiro
  • shuffle
  • skewtest
  • summary statistics
  • ttest_1samp
  • Z-score