Unit 3: Data Preprocessing
3a. Apply data cleaning techniques to improve dataset quality
- What are the three primary methods for handling missing data, and when should each be applied?
- How does partial match deduplication differ from an exact match, and why is careful column selection critical?
- What risks do outliers pose to statistical analysis, and how can their impact be quantified?
- Why are validation rules essential for preventing data entry errors during collection?
Data cleaning is the systematic process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets, turning raw, error-prone data into reliable inputs for analysis. For missing data (absent or unavailable values), three core methods exist: removal, which deletes rows or columns when missingness is minimal or random (for example, eliminating records with empty salary fields); imputation, which replaces gaps with estimated substitutes such as mean or median values (filling missing ages with the average age); and flagging, which creates binary indicator variables ("Missing"/"Not Missing") that preserve the pattern of data absence for later analysis.
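The three methods can be sketched in plain Python. The records below are hypothetical, with None standing in for a missing value:

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"name": "Ana", "age": 34, "salary": 52000},
    {"name": "Ben", "age": None, "salary": 48000},
    {"name": "Cal", "age": 29, "salary": None},
]

# Removal: drop rows whose salary is missing.
removed = [r for r in records if r["salary"] is not None]

# Imputation: fill missing ages with the mean of the observed ages.
observed_ages = [r["age"] for r in records if r["age"] is not None]
mean_age = statistics.mean(observed_ages)
imputed = [{**r, "age": r["age"] if r["age"] is not None else mean_age}
           for r in records]

# Flagging: add a binary indicator that preserves the missingness pattern.
flagged = [{**r, "salary_missing": r["salary"] is None} for r in records]
```

Note that removal shrinks the dataset, imputation keeps its size but invents values, and flagging changes nothing except adding a column, which is why it is the safest choice when the pattern of missingness itself may matter.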
Duplicates, identical or near-identical records that appear multiple times in a dataset, require careful handling. Exact-match removal deletes fully identical rows, while partial-match deduplication targets specific columns (such as deduplicating by ID only); choosing the wrong columns risks deleting valid records, such as two distinct Alices who happen to share a name but have different IDs. Normalizing formats standardizes how values are represented across the dataset, such as through date standardization (converting "Dec 5, 2023" to "2023-12-05"), text case normalization (lowercasing "Banana" to "banana"), and unit conversion (transforming "160 lbs" to "72.57 kg").
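A small sketch, using made-up rows, shows why column selection matters for partial-match deduplication and how a date format can be normalized:

```python
import datetime

rows = [
    {"id": 1, "name": "Alice", "date": "Dec 5, 2023"},
    {"id": 1, "name": "Alice", "date": "Dec 5, 2023"},   # exact duplicate
    {"id": 2, "name": "Alice", "date": "Dec 6, 2023"},   # a distinct Alice
]

# Exact-match dedup: keep the first occurrence of each fully identical row.
seen, exact = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        exact.append(r)

# Partial-match dedup on "name" alone wrongly merges the two distinct Alices.
by_name_only = list({r["name"]: r for r in rows}.values())

# Format normalization: "Dec 5, 2023" -> "2023-12-05".
def normalize_date(s):
    return datetime.datetime.strptime(s, "%b %d, %Y").strftime("%Y-%m-%d")
```

Exact-match dedup correctly keeps two rows, while deduplicating on the name column alone collapses everything to one, losing a valid record.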
Outlier detection identifies data points that deviate sharply from a dataset's typical pattern or distribution. Visual methods spot anomalies graphically (a box plot revealing income anomalies), while statistical methods apply numerical criteria (flagging values more than 1.5×IQR below Q1 or above Q3, or values with extreme z-scores). Handling choices include keeping verified anomalies (such as valid executive salaries), removing distorting values, or log-transforming skewed distributions, and the choice directly affects summary metrics (for example, a single $199k outlier inflating mean income by $10.4k).
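The IQR rule and its effect on the mean can be demonstrated with a small, invented income sample (values in $k):

```python
import statistics

incomes = [42, 45, 48, 51, 53, 55, 58, 199]  # $k; 199 is the suspect value

# IQR rule: flag values more than 1.5*IQR below Q1 or above Q3.
q = statistics.quantiles(incomes, n=4)       # returns [Q1, Q2, Q3]
q1, q3 = q[0], q[2]
iqr = q3 - q1
outliers = [x for x in incomes
            if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Quantify the impact: compare the mean with and without flagged values.
mean_all = statistics.mean(incomes)
mean_clean = statistics.mean([x for x in incomes if x not in outliers])
```

Here the single 199 value is flagged, and dropping it moves the mean by well over $18k, illustrating how strongly one extreme point can distort a summary statistic.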
Finally, data entry errors are minimized through validation rules, predefined constraints applied at input time (blocking future birth dates), manual review for small datasets, and automated format and range checks. Partial-match risks and outlier-handling trade-offs are easy to underestimate. Context determines the technique: mean imputation is simple but shrinks variance, while flagging preserves the missingness pattern for bias analysis.
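A minimal validator sketch, assuming a record with hypothetical birth_date and postal_code fields, shows how a range rule and a format rule combine:

```python
import datetime
import re

def validate(record):
    """Apply simple validation rules; return a list of rule violations."""
    errors = []
    # Range rule: birth dates may not lie in the future.
    if record["birth_date"] > datetime.date.today():
        errors.append("birth_date is in the future")
    # Format rule: postal code must be exactly five digits.
    if not re.fullmatch(r"\d{5}", record["postal_code"]):
        errors.append("postal_code is not five digits")
    return errors
```

Returning a list of violations rather than raising on the first one lets an entry form surface every problem at once.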
To review, see:
3b. Apply normalization techniques to preprocess datasets for effective analysis
- What are the four key benefits of normalizing numerical features?
- When should you choose Z-score scaling over linear scaling?
- Why is log scaling particularly effective for power law distributions?
- How does clipping mitigate the impact of extreme outliers without removing them?
Normalization transforms numerical features onto comparable scales to enhance model effectiveness, offering four key benefits: it accelerates convergence during training by stabilizing gradient descent, improves prediction accuracy by balancing feature influence, prevents NaN errors from floating-point overflow, and ensures equitable weight assignment across features. For features with roughly uniform distributions (like ages ranging 0–100), linear scaling, which maps values from their original range to a target range such as [0, 1] (min-max normalization), is effective. However, Z-score scaling, which standardizes data by subtracting the mean and dividing by the standard deviation, is preferable for normally or near-normally distributed data (like adult height): it centers the data at a mean (μ) of 0 with a standard deviation (σ) of 1, making outliers stand out while standardizing the bulk of values.
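Both transformations can be sketched on a small, illustrative height sample:

```python
import statistics

heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # cm, illustrative values

# Linear (min-max) scaling to the range [0, 1].
lo, hi = min(heights), max(heights)
minmax = [(x - lo) / (hi - lo) for x in heights]

# Z-score scaling: subtract the mean, divide by the standard deviation.
mu = statistics.mean(heights)
sigma = statistics.pstdev(heights)
zscores = [(x - mu) / sigma for x in heights]
```

After min-max scaling the smallest value is 0 and the largest is 1; after Z-score scaling the transformed values have mean 0 and standard deviation 1, as the definitions require.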
Log scaling, which applies the natural logarithm to compress a feature's range, is especially effective for power law distributions such as income, book sales, or movie ratings, where most values are small and a few are extremely large. It compresses the range (for example, ln(100) ≈ 4.6 vs. ln(1,000,000) ≈ 13.8), enabling better model learning from skewed data. For extreme outliers (such as a roomsPerPerson variable with a long-tailed distribution up to 17), clipping caps values at set thresholds (like max = 4.0) to limit their influence without deleting them. Crucially, the same normalization logic must be applied to both training and inference data; otherwise, model predictions become invalid.
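Both techniques are one-liners; the sample values below are made up to mirror the roomsPerPerson and book-sales examples:

```python
import math

# Clipping: cap a long-tailed feature at a threshold without deleting rows.
rooms_per_person = [0.8, 1.1, 1.3, 2.0, 17.0]  # long tail up to 17
clipped = [min(x, 4.0) for x in rooms_per_person]

# Log scaling: compress a power-law range with the natural logarithm.
book_sales = [100, 1_000, 10_000, 1_000_000]
log_scaled = [math.log(x) for x in book_sales]
```

After clipping, the 17.0 becomes 4.0 while every other row is untouched; after log scaling, a 10,000x spread in sales collapses to roughly a 3x spread (about 4.6 to 13.8).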
Be sure you understand the trade-offs: linear scaling distorts skewed data, Z-score normalization assumes Gaussian-like distributions, and aggressive clipping can suppress valid signals. Log transformation combined with clipping is often the most robust strategy for skewed industrial datasets such as housing prices, sales, or server loads.
To review, see:
3c. Apply encoding techniques to transform categorical data for machine learning models
- Why can't machine learning models directly use raw string values like "Red" or "Blue"?
- How does one-hot encoding represent categorical variables numerically?
- When should you use sparse representation instead of full one-hot vectors?
- How are high-dimensional categorical features (like words or postal codes) handled differently from low-dimensional ones?
Since models only process floating-point numbers, categorical data encoding converts non-numerical features into machine-readable formats. For low-dimensional features (when there are few categories, like car_color with eight colors), a vocabulary encoding assigns each category a unique integer index (like Red=0, Blue=2).
However, raw indices imply false ordinal relationships (like Blue being "greater than" Red), so one-hot encoding transforms each index into a binary vector where only the category's position is 1.0, and others are 0.0; for example, "Blue" becomes [0,0,1,0,0,0,0,0]. This allows models to learn distinct weights per category.
For efficiency, sparse representation stores only the position of the 1.0 (like "Blue" → 2) rather than the full vector, saving memory while ensuring identical model input. Outliers (rare categories like "Mauve") are binned into an out-of-vocabulary (OOV) bucket (like "Other"), sharing a single weight.
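The whole pipeline for a low-dimensional feature (vocabulary index, one-hot vector, sparse form, and an OOV bucket) fits in a few lines. The eight-color vocabulary below is hypothetical, chosen so that "Blue" lands at index 2 as in the example above; an extra ninth slot serves as the OOV bucket:

```python
# Hypothetical 8-color vocabulary for a car_color feature.
vocab = ["Red", "Orange", "Blue", "Green", "Yellow", "Black", "White", "Gray"]
index = {color: i for i, color in enumerate(vocab)}
OOV = len(vocab)  # extra bucket for rare/unseen categories like "Mauve"

def one_hot(color):
    """Full one-hot vector: 1.0 at the category's index, 0.0 elsewhere."""
    vec = [0.0] * (len(vocab) + 1)
    vec[index.get(color, OOV)] = 1.0
    return vec

def sparse(color):
    """Sparse representation: store only the position of the 1.0."""
    return index.get(color, OOV)
```

The sparse form carries exactly the information of the full vector in a single integer, which is why it yields identical model input at a fraction of the memory.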
High-dimensional features (such as words_in_english with 500k categories) make one-hot encoding impractical due to excessive memory use. Instead, embeddings, which are learned dense vectors that map categories into a low-dimensional space where similar categories have similar values, or hashing, which applies a hash function to bin categories into a fixed number of buckets, reduce dimensionality, improving training speed and inference latency. One-hot encoding is often confused with label encoding: one-hot prevents false ordinal assumptions but scales poorly beyond ~10k categories.
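As a rough sketch of the hashing approach, any category string can be mapped deterministically into a fixed number of buckets; the bucket count of 1024 below is an arbitrary illustration, and distinct categories that collide simply share a learned weight:

```python
import hashlib

NUM_BUCKETS = 1024  # fixed feature space; chosen here only for illustration

def hash_bucket(category):
    """Deterministically map an arbitrary category string into one of
    NUM_BUCKETS buckets, regardless of vocabulary size."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS
```

A 500k-word vocabulary thus collapses to 1024 model inputs. The trade-off against embeddings is that hashing needs no training and no stored vocabulary, but collisions are uncontrolled, whereas embeddings learn which categories should be close together.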
To review, see:
Unit 3 Vocabulary
This vocabulary list includes terms you will need to know to successfully complete the final exam.
- clipping
- data cleaning
- duplicate
- embedding
- flagging
- hashing
- imputation
- linear scaling
- log scaling
- missing data
- normalizing format
- one-hot encoding
- outlier detection
- removal
- sparse representation
- statistical method
- validation rule
- visual method
- Z-score scaling