Data Structures and Types

Missing Data

Both NumPy and pandas allow for missing values, which are a reality in data science. Missing values are coded as np.nan. Let's create some data and force some missing values:

import numpy as np
import pandas as pd

# pre-specify index and column names
df = pd.DataFrame(np.random.randn(5, 3), index = ['a','c','e', 'f','g'], columns = ['one','two','three'])
df['four'] = 20 # add a column named "four", which will all be 20
df['five'] = df['one'] > 0 # add a Boolean column: True where "one" is positive
df
      one       two     three  four   five
a -0.706987 -0.821679  1.441257    20  False
c  1.297128  0.501395  0.572570    20   True
e -0.761507  1.469939  0.400255    20  False
f -0.910821  0.449404  0.588879    20  False
g -0.718350 -0.364237  1.793386    20  False
df2 = df.reindex(['a','b','c','d','e','f','g'])
df2.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cbd6040>

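The code above rebuilds the data frame on a new index. Labels that already exist in df (a, c, e, f, g) keep their rows, while labels that do not (b and d) produce new rows in which every value is missing (NaN). The styling call highlights those two new rows.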
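A side effect of reindexing worth checking is that column types can change, because integer and Boolean columns cannot hold NaN. The following is a minimal sketch using a small frame with fixed values; the names small and small2 are hypothetical and are not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with fixed values, mirroring the structure of df
small = pd.DataFrame(
    {"one": [-0.5, 1.2, 0.3], "four": [20, 20, 20]},
    index=["a", "c", "e"],
)
small["five"] = small["one"] > 0

# Reindex with labels 'b' and 'd', which are absent from the original index
small2 = small.reindex(["a", "b", "c", "d", "e"])

print(small.dtypes)   # "four" is an integer column, "five" is Boolean
print(small2.dtypes)  # "four" is now float64 and "five" is object, so they can hold NaN

# Every value in the two new rows is missing
new_rows_all_missing = bool(small2.loc[["b", "d"]].isna().all().all())
print(new_rows_all_missing)
```

This is why the column four is displayed as 20.0 rather than 20 once missing rows have been introduced.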

We can create masks of the data indicating where missing values reside in a data set.

df2.isna()
    one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g  False  False  False  False  False
df2['one'].notna()
a     True
b    False
c     True
d    False
e     True
f     True
g     True
Name: one, dtype: bool
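Because these masks are Boolean, they can be summed or used to select rows. The sketch below assumes a small hypothetical frame called demo (not the df2 above) and shows how we might count missing values per column and pick out complete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# True counts as 1, so summing the mask counts missing values in each column
missing_per_column = demo.isna().sum()

# True for rows that contain at least one missing value
rows_with_missing = demo.isna().any(axis=1)

# Use the complementary mask to keep only rows with no missing values
complete_rows = demo[demo.notna().all(axis=1)]

print(missing_per_column)
print(rows_with_missing)
print(complete_rows)
```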

We can obtain complete data by dropping any row that has any missing value. This is called complete case analysis, and you should be very careful using it. It is only valid if we believe that the values are missing completely at random, that is, the missingness is not related to some characteristic of the data or the data gathering process.

df2.dropna(how='any')
       one       two     three  four   five
a -0.706987 -0.821679  1.441257  20.0  False
c  1.297128  0.501395  0.572570  20.0   True
e -0.761507  1.469939  0.400255  20.0  False
f -0.910821  0.449404  0.588879  20.0  False
g -0.718350 -0.364237  1.793386  20.0  False
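Dropping every row with any missing value is the strictest choice. As a hedged illustration, the sketch below uses a hypothetical frame called demo to show three less aggressive options that dropna offers: how='all', subset, and thresh.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row a is complete, row c is entirely missing
demo = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan, 4.0],
        "two": [1.0, 2.0, np.nan, 4.0],
        "three": [1.0, np.nan, np.nan, np.nan],
    },
    index=["a", "b", "c", "d"],
)

# Drop a row only if all of its values are missing
drop_if_all = demo.dropna(how="all")

# Drop a row only if it is missing a value in the listed columns
drop_on_subset = demo.dropna(subset=["one", "two"])

# Keep rows that have at least 2 non-missing values
keep_thresh = demo.dropna(thresh=2)

print(drop_if_all)
print(drop_on_subset)
print(keep_thresh)
```

Which rule is appropriate depends on which variables the later analysis actually needs.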

You can also fill in, or impute, missing values. This can be done using a single value.

out1 = df2.fillna(value = 5) 
out1.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cf5fca0>

Missing values can also be filled with a computed value, such as each column's mean.

df3 = df2.copy()
df3 = df3.select_dtypes(exclude=[object])   # remove object (non-numeric) columns; "five" became object after reindexing
out2 = df3.fillna(df3.mean())  # df3.mean() computes column-wise means

out2.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cf830d0>
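The fill value does not have to be the same for every column. As a minimal sketch, assuming a hypothetical frame called demo, fillna also accepts a dictionary that maps column names to fill values, so each column can get its own constant or summary statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# Column means are computed from the non-missing values only
mean_filled = demo.fillna(demo.mean())

# A dictionary gives a separate fill value for each column
per_column = demo.fillna({"one": 0.0, "two": demo["two"].median()})

print(mean_filled)
print(per_column)
```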

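You can also impute based on the principle of last observation carried forward (LOCF), which is common in time series. This means that the missing value is imputed with the previous recorded value. The reverse, backward filling, imputes the missing value with the next recorded value.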

out3 = df2.ffill() # Fill forward; replaces fillna(method = 'ffill'), which is deprecated in recent pandas
out3.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11cbeca60>
out4 = df2.bfill() # Fill backward; replaces fillna(method = 'bfill'), which is deprecated in recent pandas
out4.style.applymap(lambda x: 'background-color:yellow', subset = pd.IndexSlice[['b','d'],:])
<pandas.io.formats.style.Styler object at 0x11c
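To make the direction of filling concrete, here is a small sketch on a hypothetical daily series (the name series and its values are made up for illustration). It also uses the limit argument, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps
dates = pd.date_range("2021-01-01", periods=6, freq="D")
series = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0], index=dates)

# Carry the previous recorded value forward into each gap
filled_forward = series.ffill()

# Fill at most one consecutive missing value; longer gaps stay partly missing
filled_forward_limited = series.ffill(limit=1)

# Pull the next recorded value backward into each gap
filled_backward = series.bfill()

print(filled_forward)
print(filled_forward_limited)
print(filled_backward)
```

Note that forward filling cannot fill a missing value at the very start of the data, and backward filling cannot fill one at the very end, since there is no neighbor to copy from.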