Data Cleaning

Data cleaning is one of the first steps in the data science pipeline. In practice, we cannot always collect data in pristine form, so the associated dataframe may contain anomalies: missing cells, cells with nonsensical values, and so on. The pandas module offers several methods for dealing with such scenarios.
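For instance, here is a minimal sketch of two such methods, dropna and fillna, applied to a small made-up dataframe (the names and values here are illustrative, not part of the survey data used below):

import pandas as pd

# a tiny made-up dataframe with one missing cell
scores = pd.DataFrame({'student': ['Ana', 'Ben', 'Cam'],
                       'score': [91, None, 78]})

print(scores.dropna())   # drop any row containing a missing value
print(scores.fillna(0))  # or fill missing values with a default instead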

Introduction

Cleaning data is one part of preparing real-world data for analysis. Python and pandas are excellent tools for this work, and even a basic background in coding can make cleaning faster, more efficient, more accurate, and more replicable than just about any other approach.

Dirty data encompasses a host of issues that undermine the accuracy and reliability of your data. Examples include missing values, incorrectly entered values, incorrectly repeated values, values with extraneous characters, improperly formatted values, and misidentified data types.
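To make these concrete, here is a minimal sketch of how a few of these issues surface in pandas; the dataframe and its values are made up for illustration:

import pandas as pd

# a tiny made-up dataframe exhibiting several dirty-data issues
dirty = pd.DataFrame({
    'name': ['Ana', 'Ana', 'Ben ', None],  # repeated row, extraneous space, missing value
    'age': ['24', '24', 'thirty', '19'],   # numbers stored as strings, nonsensical value
})

print(dirty.isna().sum())        # count missing values in each column
print(dirty.duplicated().sum())  # count exactly repeated rows
print(dirty.dtypes)              # spot misidentified data types (age is object, not int)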

Before we get into approaches for fixing these issues, it's important to recognize that many people who work with data are not in the habit of checking whether their data needs cleaning before they analyze it. This is deeply problematic: any of the issues listed above can substantially alter the conclusions we draw from the data and the actions we take, and the analyses themselves will not necessarily reveal the underlying problems. You have to stop and verify your data before you start your analysis.
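What does verifying your data look like in practice? A few quick pandas calls go a long way. This sketch uses a made-up dataframe standing in for whatever data you are about to analyze:

import pandas as pd

# made-up data: a 1-5 rating column with an impossible value and a label column with a typo
checkme = pd.DataFrame({'rating': [1, 2, 5, 5, 99],
                        'label': ['cat', 'cat', 'dog', 'dog', 'dgo']})

checkme.info()                          # data types and non-null counts per column
print(checkme.describe())               # summary statistics; a max of 99 on a 1-5 scale stands out
print(checkme['label'].value_counts())  # category frequencies; the lone 'dgo' stands out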

At the beginning of the semester, students took an anonymous survey with questions designed to produce both clean and dirty data. We will use those survey responses to learn some new data cleaning tools.

# import pandas
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# ignore this code entirely
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
plt.rcParams['figure.figsize'] = [8, 6]
plt.rcParams['figure.dpi'] = 100


# results url
url1 = 'https://docs.google.com/spreadsheets/d/e/'
url2 = '2PACX-1vRkK73xD192AdP0jZe6ac9cnVPSeqqbYZmSPnhY2hnY8ANROAOCS'
url3 = 'tRFdvjwFoapv3j2rzMtZ91KXPFm/pub?output=csv'

# create data frame from url
df = pd.read_csv(url1 + url2 + url3)

# assign original headers to list
survey_questions = df.columns.to_list()

# replace with column names easier to work with
renamelist = ['Timestamp', 'musicartist', 'height', 'city', '30min', 'travel',
              'likepizza', 'deepdish', 'sport', 'spell', 'hangout', 'talk',
              'year', 'quote']
df.columns = renamelist

# print new column labels and original
for new_label, question in zip(renamelist, survey_questions):
    print(f'{new_label:15} {question}')
Timestamp       Timestamp
musicartist     Who is your favorite music artist (broadly defined)?
height          What physical height would you like to be?
city            If you had to live in city, but could pick any city in the world, what city would you live in?
30min           If you could have 30 minutes to talk with any person, living or dead, who would you pick?
travel          If you could travel to any location in the world for vacation, where would you go?
likepizza       On a scale of 1 (gross) to five (awesome) how much do you like pizza?
deepdish        Is Chicago-style deep dish actually pizza or is it really casserole?
sport           What sport do you most enjoy watching?
spell           Which is the most difficult to spell?
hangout         What is the optimal number of people to hang out with?
talk            Do you think you talk more or less than the average person?
year            If you had a time machine and could visit any year you like, which year would you pick?
quote           What's your favorite inspirational quote? You don't need to do this from memory.
                Feel free to look something up and then paste it in here.
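
As an aside, assigning to df.columns, as above, replaces all of the column labels at once. An equivalent way to do the same renaming, sketched here, is to give rename() a mapping from old labels to new ones; this also works when only some columns need new names:

# same result as the assignment above, built from the two lists we already have
df = df.rename(columns=dict(zip(survey_questions, renamelist)))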


Source: Mark Liffiton and Brad Sheese, https://snakebear.science/10-DataCleaning/index.html
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.