Unit 2: Machine Learning Workflow


2a. Explain the ML pipeline from data collection to evaluation

  • What are the three main artifacts in a machine learning project, and how do they correspond to workflow phases?
  • Why is data preparation considered the most resource-intensive phase?
  • How do model serving and performance monitoring ensure real-world effectiveness post-deployment?

The ML pipeline is a structured, end-to-end process that systematically transforms raw data into predictive insights through three core phases, each aligned with a primary artifact: data engineering, model engineering, and code engineering.

In the data engineering phase, teams collect, clean, and prepare data for machine learning. The phase begins with data acquisition, the process of collecting data from various sources such as APIs or CSV files. It continues with data preparation, the comprehensive process of transforming raw data into a format suitable for machine learning, and the most time-intensive stage of the entire pipeline. Data preparation includes:

  • exploratory analysis
  • data validation: checking data format, structure, and quality to ensure it meets requirements
  • data wrangling: cleaning and transforming data by handling missing, incorrect, or inconsistent values
  • data labeling: assigning target categories or values to data points for supervised learning
  • data splitting: dividing a dataset into separate portions, such as training, validation, and test sets, to enable proper model development and evaluation

This phase ensures that only clean, relevant data reaches the next stage, reducing the risk of flawed outcomes.
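A minimal sketch of the preparation steps above in plain Python. The field names, validation rule, imputation default, and 75/25 split ratio are all illustrative assumptions, not prescribed by any particular library:

```python
import random

# Raw records as collected, e.g. parsed from a CSV (field names are hypothetical).
raw = [
    {"age": "34", "income": "72000", "label": "yes"},
    {"age": "29", "income": "", "label": "no"},        # missing income
    {"age": "abc", "income": "51000", "label": "yes"}, # malformed age
    {"age": "45", "income": "88000", "label": "no"},
]

def validate(rec):
    """Data validation: keep only records whose fields parse as expected."""
    return rec["age"].isdigit() and (rec["income"] == "" or rec["income"].isdigit())

def wrangle(rec, default_income=0):
    """Data wrangling: convert types and impute a default for missing income."""
    return {
        "age": int(rec["age"]),
        "income": int(rec["income"]) if rec["income"] else default_income,
        "label": rec["label"],
    }

clean = [wrangle(r) for r in raw if validate(r)]

# Data splitting: shuffle, then hold out ~25% of records for testing.
random.seed(0)
random.shuffle(clean)
split = int(len(clean) * 0.75)
train, test = clean[:split], clean[split:]
print(len(clean), "clean records:", len(train), "train /", len(test), "test")
```

Note the ordering: validation filters out records that cannot be repaired, wrangling normalizes the survivors, and splitting happens only after the data is clean, so the same transformations apply to every partition.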

Model engineering focuses on developing and optimizing machine learning models. Teams select and apply ML algorithms through steps such as feature engineering, model training, hyperparameter tuning, and model evaluation against performance metrics. After evaluation, the model testing step uses holdout data to check generalization, and model packaging prepares the trained model for deployment, typically in a portable format such as ONNX.
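To make the training-versus-testing distinction concrete, here is a toy sketch: a one-parameter threshold "model" is fit on training data, then its generalization is checked on a holdout set. The data and the model are illustrative inventions, not an algorithm named in this unit:

```python
# Toy dataset: (feature, label) pairs; label 1 means the feature is "high".
train_data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
holdout = [(0.2, 0), (0.7, 1), (0.5, 1)]  # unseen data for model testing

def accuracy(data, t):
    """Model evaluation: fraction of correct predictions at threshold t."""
    return sum((x >= t) == bool(y) for x, y in data) / len(data)

def fit_threshold(data):
    """Model training: pick the threshold that maximizes training accuracy."""
    best_t, best_acc = 0.0, 0.0
    for t in sorted(x for x, _ in data):
        acc = accuracy(data, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = fit_threshold(train_data)
print("threshold:", t)
print("train accuracy:", accuracy(train_data, t))
print("holdout accuracy:", accuracy(holdout, t))
```

The holdout accuracy is lower than the perfect training accuracy, which is exactly the generalization gap that the model testing step is meant to expose before deployment.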

In the final code engineering phase, models are deployed and maintained in production environments. A model is deployed via model serving, integrated into applications, and then watched through two complementary activities: performance monitoring, the ongoing process of tracking model behavior to detect issues such as data drift or accuracy degradation, and performance logging, the systematic recording and storage of model inference results and metadata for analysis and auditing purposes. Continuous feedback enables re-training if the model's accuracy degrades in real-world settings.
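The logging and monitoring loop described above can be sketched as follows. The logger class, the 0.8 accuracy threshold, and the window size are hypothetical choices made for illustration:

```python
import json
import time

class PerformanceLogger:
    """Performance logging: record each inference with metadata for later audit."""

    def __init__(self):
        self.records = []

    def log(self, features, prediction, actual=None):
        self.records.append({
            "timestamp": time.time(),
            "features": features,
            "prediction": prediction,
            "actual": actual,  # filled in once ground truth arrives
        })

    def rolling_accuracy(self, window=100):
        """Performance monitoring: accuracy over the most recent labeled records."""
        labeled = [r for r in self.records if r["actual"] is not None][-window:]
        if not labeled:
            return None
        return sum(r["prediction"] == r["actual"] for r in labeled) / len(labeled)

logger = PerformanceLogger()
for features, pred, actual in [([0.2], 0, 0), ([0.7], 1, 1), ([0.5], 0, 1)]:
    logger.log(features, pred, actual)

acc = logger.rolling_accuracy()
NEEDS_RETRAINING = acc is not None and acc < 0.8  # illustrative cutoff
print(json.dumps({"rolling_accuracy": acc, "retrain": NEEDS_RETRAINING}))
```

Because ground truth often arrives late in production, the log stores predictions first and accuracy is computed only over records whose actual outcomes are known; crossing the threshold is what triggers the re-training feedback loop.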

Data preparation can take 60% to 80% of the total effort, but it is foundational to successful modeling. Post-deployment, monitoring and retraining are critical because real-world data evolves, and models must adapt to remain effective.

To review, see:


2b. Explain the significance of each stage in the pipeline

  • Why is data engineering considered the most critical and time-intensive phase in ML workflows?
  • How does model evaluation prevent flawed deployments, and what techniques ensure reliability?
  • What risks does performance monitoring address in production environments?

Each phase of the machine learning (ML) pipeline plays a vital role in keeping models accurate and reliable. Data engineering stands out as the most labor-intensive phase, consuming up to 80% of total effort, and the most foundational: it transforms raw data into clean, consistent inputs through validation (checking formats and distributions), wrangling (fixing missing values), and splitting. This prevents "garbage in, garbage out" outcomes, in which even sophisticated algorithms fail on flawed data. Model engineering then converts the curated data into predictive models through training and rigorous evaluation techniques such as cross-validation (testing on multiple data subsets) and holdout datasets (testing on unseen data), which act as quality control to detect overfitting (memorizing the training data) or bias before deployment.
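The cross-validation idea mentioned above can be sketched in a few lines of plain Python: the data is divided into k folds, and the model is evaluated k times, each time holding out a different fold as the test set. The fold count and dataset size here are arbitrary examples:

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# With 12 samples and 3 folds, each sample appears in exactly one test fold.
folds = list(k_fold_splits(12, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Averaging the k evaluation scores gives a more stable estimate of generalization than a single train/test split, which is why cross-validation serves as a quality-control step against overfitting.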

Finally, code engineering operationalizes models via model serving. Once in production, performance monitoring tracks real-world accuracy decay caused by data drift (evolving user behaviors), and performance logging records predictions, enabling timely re-training to combat model degradation.
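One simple way to operationalize a data drift check, offered here as an illustrative sketch rather than a standard method: compare the mean of a live feature batch against the training-time baseline, measured in baseline standard deviations. The threshold of 3.0 and all the data values are assumptions:

```python
import statistics

def drift_score(baseline, live):
    """Simple drift check: how many baseline standard deviations the live
    feature mean has moved away from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(live) - mu) / sigma

baseline = [10, 11, 9, 10, 12, 10, 11, 9]  # feature values seen at training time
live_ok = [10, 9, 11, 10]                  # similar distribution: no alarm
live_drifted = [18, 20, 19, 21]            # user behavior has shifted

DRIFT_THRESHOLD = 3.0  # illustrative cutoff
print(drift_score(baseline, live_ok) > DRIFT_THRESHOLD)       # → False
print(drift_score(baseline, live_drifted) > DRIFT_THRESHOLD)  # → True
```

When the score crosses the threshold, the monitoring system flags the model for re-training on fresher data, closing the feedback loop described above.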

Skipping any stage risks systemic failures: neglected data engineering propagates hidden errors, weak model evaluation permits biased deployments, and absent monitoring creates "zombie models" that produce increasingly inaccurate results.

To review, see:


Unit 2 Vocabulary


This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • code engineering
  • cross-validation
  • data acquisition
  • data drift
  • data engineering
  • data labeling
  • data preparation
  • data splitting
  • data validation
  • data wrangling
  • ML pipeline
  • model engineering
  • performance logging
  • performance monitoring