Machine Learning Fundamentals

What You’ll Learn

By the end of this chapter, you’ll understand three fundamental ML approaches: regression, classification, and clustering. You’ll see how each method maps to specific utility use cases like load forecasting, failure prediction, and customer segmentation. You’ll learn to evaluate model performance using appropriate metrics such as MSE, accuracy, and classification reports. Most importantly, you’ll recognize when to use supervised versus unsupervised learning, and you’ll build practical models using scikit-learn on utility data.


The Business Problem: Predicting and Classifying in Complex Environments

Utilities operate vast networks of physical assets that must balance supply and demand in real time. Predicting how these systems behave under different conditions is critical. Grid operators must forecast tomorrow’s load to schedule generation. Maintenance planners must decide which transformers are at greatest risk of failure. Customer engagement teams need to identify which customers are likely to participate in demand response programs.

Traditionally, these tasks have relied on deterministic engineering models or static business rules. These approaches work in stable, predictable conditions but struggle in the face of variability and uncertainty. Weather changes hourly, demand fluctuates daily, and aging equipment deteriorates in nonlinear ways. The complexity of modern power systems makes it impractical to encode every rule explicitly or to manually sift through massive datasets.

Machine learning addresses this by learning patterns directly from data rather than relying solely on pre-programmed rules. Instead of hand-coding equations to model every scenario, we allow algorithms to find statistical relationships between inputs and outputs. This is particularly powerful in utilities, where data from smart meters, SCADA systems, asset registries, and customer programs contains rich but underutilized signals about system behavior and risks.

Here’s what I’ve learned: you don’t need to understand every possible scenario upfront. Let the data tell you what matters.


The Analytics Solution: Core Learning Methods

Machine learning is not a single technique but a collection of methods that fall into several broad categories. This chapter focuses on foundational approaches that recur throughout utility applications.

Regression predicts continuous outcomes like future load in megawatts, transformer oil temperature, or customer energy usage. Linear regression can relate temperature and time of day to hourly demand, providing forecasts that help balance supply and demand.

Classification assigns categories—determining whether equipment is healthy or likely to fail, or whether a pattern is normal or anomalous. This underpins predictive maintenance, cybersecurity detection, and many operational workflows.

Clustering groups similar observations together without labels. Clustering smart meter profiles can reveal natural customer segments—those with high evening peaks versus flat daytime usage—informing rate design and demand response targeting.

Understanding these core methods and their differences is essential before tackling more advanced techniques. They provide a common language between data science and engineering teams and form the backbone of most practical machine learning pipelines in utilities.


Connecting Methods to Real Utility Scenarios

Consider transformer failure prediction. We may have sensor data on temperature, vibration, and load, combined with asset attributes like age and manufacturer. A classification model trained on historical failure records can learn to distinguish healthy transformers from those approaching failure. By scoring current assets, it flags those at highest risk for inspection or replacement.
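To make the scoring step concrete, here is a minimal sketch of ranking a fleet by predicted failure probability. Everything in it is an assumption for illustration: the feature names (age_years, peak_temp_c), the asset IDs, and the synthetic relationship used to generate the historical labels. A real model would be trained on actual failure records.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical historical records; all values are synthetic.
n = 1000
age = rng.uniform(0, 40, n)
peak_temp = rng.normal(75, 10, n)
logit = 0.15 * age + 0.08 * peak_temp - 10  # assumed failure relationship
failed = rng.binomial(1, 1 / (1 + np.exp(-logit)))
history = pd.DataFrame(
    {"age_years": age, "peak_temp_c": peak_temp, "failed": failed}
)

features = ["age_years", "peak_temp_c"]
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(history[features], history["failed"])

# Score the current fleet and rank by predicted failure probability.
fleet = pd.DataFrame({
    "asset_id": ["T-101", "T-102", "T-103", "T-104"],
    "age_years": [5, 38, 22, 30],
    "peak_temp_c": [60, 95, 75, 70],
})
fleet["risk"] = model.predict_proba(fleet[features])[:, 1]
ranked = fleet.sort_values("risk", ascending=False)
print(ranked[["asset_id", "risk"]])
```

The ranked table is what a maintenance planner would actually consume: a short list of assets ordered by risk, rather than a yes/no label for each one.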

For load forecasting, regression models link weather variables, calendar effects, and historical demand patterns to predict consumption at different time horizons. These forecasts drive market bidding strategies and generator commitment decisions. Even basic models can deliver meaningful operational improvements over heuristic forecasts.

In customer analytics, clustering can segment households based on usage profiles from AMI data. These segments inform demand response outreach, such as targeting high-peak households with incentives for load shifting. Clustering can also uncover emerging patterns, like neighborhoods adopting electric vehicles, before they show up in feeder overloading alarms.

These examples illustrate how simple machine learning concepts map directly to real problems. By framing utility questions in terms of prediction, classification, and grouping, we create clear pathways from business needs to analytic solutions.


Model Evaluation Primer

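Before diving into the code, it’s important to understand how we assess model quality. For regression, we use Mean Squared Error (MSE, the average squared difference between predicted and actual values) and the R² score (the fraction of variance in the target that the model explains). For classification, we use accuracy, precision, recall, and F1-score. The last three are especially important when classes are imbalanced, as when failures are rare.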
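As a quick illustration of these metrics, the sketch below computes each one with scikit-learn on tiny hand-made arrays. The numbers are invented purely so the arithmetic is easy to check by hand; they don’t come from any real system.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression: hypothetical actual vs. predicted load in MW.
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_hat = np.array([102.0, 108.0, 123.0, 129.0])
mse = mean_squared_error(y_true, y_hat)  # mean of squared errors
r2 = r2_score(y_true, y_hat)             # 1 - SS_res / SS_tot
print(f"MSE={mse:.2f}, R2={r2:.3f}")     # MSE=4.50, R2=0.964

# Classification: 1 = failure, 0 = healthy (failures are the rare class).
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
preds = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
accuracy = accuracy_score(labels, preds)    # (TP + TN) / total
precision = precision_score(labels, preds)  # TP / (TP + FP)
recall = recall_score(labels, preds)        # TP / (TP + FN)
f1 = f1_score(labels, preds)                # harmonic mean of the two
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, F1={f1:.2f}")
# accuracy=0.80, precision=0.50, recall=0.50, F1=0.50
```

Notice that accuracy looks respectable at 0.80 even though the model catches only one of the two failures. That gap is the reason precision and recall get so much attention later in the chapter.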

We split data into training and testing sets to detect overfitting. For small datasets, we use k-fold cross-validation to get more reliable performance estimates.
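Here is a minimal sketch of both ideas side by side: a single hold-out split and 5-fold cross-validation. The data is synthetic, and the coefficients (50 MW base, 2 MW per degree) and the choice of five folds are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic data (assumed values): load rises ~2 MW per degree C plus noise.
temp = rng.normal(20, 5, 200).reshape(-1, 1)
load = 50 + 2.0 * temp.ravel() + rng.normal(0, 3, 200)

# Hold-out split: fit on 80% of the rows, score on the unseen 20%.
X_train, X_test, y_train, y_test = train_test_split(
    temp, load, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R2={train_r2:.3f}  test R2={test_r2:.3f}")

# 5-fold cross-validation: each fold takes one turn as the test set.
cv_r2 = cross_val_score(LinearRegression(), temp, load, cv=5, scoring="r2")
print(f"5-fold R2: mean={cv_r2.mean():.3f}, std={cv_r2.std():.3f}")
```

A large gap between the training score and the test score is the usual warning sign of overfitting. The spread across folds gives a rough sense of how much a single split could mislead you.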


When to Use Each Method

Use regression for continuous values like load forecasting or temperature prediction. Use classification for categories like failure prediction or anomaly detection. Use clustering to find patterns without labels, like customer segmentation or asset grouping.


Building ML Models for Utilities

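Let’s walk through fundamental ML approaches using realistic utility scenarios. Each example shows the complete workflow from data preparation to model evaluation. The snippets rely on the imports and the config dictionary loaded at the top of the full script, which is listed in the Code section at the end of this chapter. These are the foundation; everything else builds on them.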

Regression: Temperature to Load

First, we generate regression data and fit a linear model:

def generate_regression_data():
    """Generate synthetic regression data: temperature vs. daily load (MW)."""
    samples = config["data"]["regression_samples"]
    temp = np.random.normal(config["regression"]["temp_mean"], 
                           config["regression"]["temp_std"], samples)
    load = (config["regression"]["base_load"] + 
            config["regression"]["temp_coef"] * temp + 
            np.random.normal(0, config["regression"]["noise_std"], samples))
    return pd.DataFrame({"Temperature_C": temp, "Load_MW": load})


def regression_example(df):
    """Fit and visualize linear regression (Temperature -> Load)."""
    X = df[["Temperature_C"]]
    y = df["Load_MW"]
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)

    fig, ax = plt.subplots(figsize=config["plotting"]["figsize_regression"])
    ax.scatter(X, y, color=config["plotting"]["colors"]["observed"], 
                alpha=0.7, label="Observed")
    ax.plot(X, y_pred, color=config["plotting"]["colors"]["regression"], 
             label="Regression Line")
    ax.set_title(f"Load (MW) vs Temperature (°C) - Linear Regression (MSE = {mse:.2f})")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.legend()
    plt.tight_layout()
    plt.savefig(config["plotting"]["output_files"]["regression"])
    plt.close()

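This demonstrates how continuous predictions work. The regression line’s slope indicates how strongly temperature drives load (e.g., “each degree Celsius increases load by X MW”). Points scattered far from the line indicate prediction error, which we quantify with MSE. Note that this example computes MSE on the same data used to fit the model, so it is an optimistic estimate; for a real forecast you would score a held-out test set, as the classification example below does. I’m starting with regression because it’s simple, interpretable, and often works better than you’d expect.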
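To read the “X MW per degree” figure off a fitted model, you can inspect its coefficients directly. The sketch below is self-contained and uses assumed generating values (500 MW base load, 12 MW per degree Celsius) in place of the chapter’s config file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Assumed generating process: 500 MW base load, +12 MW per degree C.
temp = rng.normal(25, 6, 300)
load = 500 + 12 * temp + rng.normal(0, 20, 300)
df = pd.DataFrame({"Temperature_C": temp, "Load_MW": load})

model = LinearRegression().fit(df[["Temperature_C"]], df["Load_MW"])
slope = model.coef_[0]        # MW per degree C
intercept = model.intercept_  # extrapolated load at 0 degrees C

print(f"Each additional degree C adds about {slope:.1f} MW")
print(f"Intercept: {intercept:.1f} MW")

# Predict load for a hypothetical 30 degree C day.
forecast = model.predict(pd.DataFrame({"Temperature_C": [30.0]}))[0]
print(f"Forecast at 30 C: {forecast:.0f} MW")
```

With enough data the fitted slope should land close to the value used to generate the data. That makes this a handy sanity check before trusting the same workflow on real measurements.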

Classification: Equipment Failure Prediction

Next, we generate classification data and train a logistic regression model:

def generate_classification_data():
    """Generate synthetic classification data: equipment age, load -> failure probability."""
    samples = config["data"]["classification_samples"]
    age = np.random.uniform(config["classification"]["age_min"], 
                           config["classification"]["age_max"], samples)
    load = np.random.uniform(config["classification"]["load_min"], 
                            config["classification"]["load_max"], samples)
    failure_prob = 1 / (1 + np.exp(-(0.1 * age + 0.002 * load - 7)))
    failure = np.random.binomial(1, failure_prob)
    return pd.DataFrame({"Age_Years": age, "Load_kVA": load, "Failure": failure})


def classification_example(df):
    """Train and evaluate logistic regression for equipment failure prediction."""
    X = df[["Age_Years", "Load_kVA"]]
    y = df["Failure"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config["model"]["test_size"], 
        random_state=config["model"]["random_state"], stratify=y
    )
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=["Healthy", "Failure"]))

This shows how to handle binary classification problems. The classification report displays precision, recall, and F1-score for each class (Healthy vs. Failure). High precision means few false alarms; high recall means we catch most failures. The balance depends on operational priorities—utilities often prefer higher recall to avoid missing actual failures. I’ve seen teams get excited about 95% accuracy, then realize the model just predicts “healthy” for everything. That’s why precision and recall matter more than accuracy for imbalanced problems.
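The accuracy trap is easy to reproduce. The sketch below reuses the same logistic form as generate_classification_data, but with feature ranges I’ve assumed (age 0 to 40 years, load 100 to 1000 kVA), and compares an always-healthy baseline with two decision thresholds. The 0.1 threshold is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 2000

# Assumed feature ranges; the logistic form matches generate_classification_data.
age = rng.uniform(0, 40, n)
load = rng.uniform(100, 1000, n)
p_fail = 1 / (1 + np.exp(-(0.1 * age + 0.002 * load - 7)))
failure = rng.binomial(1, p_fail)
X = np.column_stack([age, load])

X_train, X_test, y_train, y_test = train_test_split(
    X, failure, test_size=0.25, random_state=7, stratify=failure
)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Baseline: always predict "healthy" (class 0).
baseline_acc = accuracy_score(y_test, np.zeros_like(y_test))
print(f"Always-healthy baseline accuracy: {baseline_acc:.3f}")

# Compare the default 0.5 cutoff with a lower, recall-friendly cutoff.
proba = model.predict_proba(X_test)[:, 1]  # predicted P(failure)
results = {}
for threshold in (0.5, 0.1):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
    }
    print(threshold, results[threshold])
```

In this synthetic setup the true failure probability never exceeds about 0.27, so I would expect the 0.5 cutoff to flag few or no assets while still posting high accuracy. Lowering the threshold trades some false alarms for caught failures, which is often the trade utilities want.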

Clustering: Customer Segmentation

Finally, we group customers by their daily load profiles using K-means:

def clustering_example():
    """Apply clustering to synthetic smart meter load profiles (daily kWh)."""
    np.random.seed(config["model"]["random_state"])
    means = config["clustering"]["cluster_means"]
    stds = config["clustering"]["cluster_stds"]
    n_samples = config["clustering"]["samples_per_cluster"]
    
    clusters = [np.random.normal(m, s, (n_samples, 24)) for m, s in zip(means, stds)]
    data = np.vstack(clusters)
    
    kmeans = KMeans(n_clusters=config["model"]["n_clusters"], 
                    random_state=config["model"]["random_state"])
    labels = kmeans.fit_predict(data)

    fig, ax = plt.subplots(figsize=config["plotting"]["figsize_clustering"])
    for i in range(config["model"]["n_clusters"]):
        cluster_profiles = data[labels == i]
        ax.plot(cluster_profiles.T, color="gray", alpha=0.2)
        ax.plot(cluster_profiles.mean(axis=0), label=f"Cluster {i+1}", linewidth=2)
    ax.set_title("Daily Load Profiles (kWh) by Hour - Customer Segmentation")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.legend()
    plt.tight_layout()
    plt.savefig(config["plotting"]["output_files"]["clustering"])
    plt.close()

This demonstrates unsupervised learning. The clustering plot displays multiple daily load profiles overlaid, with distinct cluster centroids highlighted. Each cluster represents a customer segment with similar usage patterns. This visualization helps identify which segments to target for demand response programs. I’ve seen utilities discover EV adoption patterns this way—clusters that look different from historical norms often signal new technology adoption.
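Once the clusters exist, a short summary of each centroid helps turn the plot into a targeting decision. This sketch assumes two made-up archetypes (an evening-peaking profile and a flat profile) and reports each cluster’s size, peak hour, and peak-to-mean ratio. The archetype shapes and noise level are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
hours = np.arange(24)

# Two assumed archetypes: evening-peaking homes and flat daytime users.
evening = 1.0 + 2.0 * np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
flat = np.full(24, 1.5)
profiles = np.vstack([
    evening + rng.normal(0, 0.2, (100, 24)),
    flat + rng.normal(0, 0.2, (60, 24)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=3)
labels = kmeans.fit_predict(profiles)

# Summarize each centroid: size, hour of peak usage, and peakiness.
summary = []
for k, centroid in enumerate(kmeans.cluster_centers_):
    summary.append({
        "cluster": k,
        "size": int((labels == k).sum()),
        "peak_hour": int(centroid.argmax()),
        "peak_to_mean": float(centroid.max() / centroid.mean()),
    })
    print(summary[-1])
```

A segment with a high peak-to-mean ratio and an evening peak hour is a natural candidate for load-shifting incentives. A flat segment probably has less flexible load to offer.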

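The complete, runnable script is at content/c3/ML4U.py, and it is also listed in the Code section at the end of this chapter. It reads its parameters from a config.yaml file in the same directory as the script. Run all three examples and see how they differ.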
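The script expects its settings in a config.yaml file next to it, and that file isn’t reproduced in this chapter. As a rough guide, here is the structure the code appears to need, written as a Python dictionary with the same nesting the YAML would have. The key names are taken from the config lookups in the script; every value is a placeholder I’ve assumed, not the book’s actual setting. The check_config helper is likewise hypothetical.

```python
# Hypothetical configuration mirroring the keys the script reads.
# All values are illustrative assumptions.
config = {
    "data": {"regression_samples": 200, "classification_samples": 500},
    "regression": {
        "temp_mean": 25.0,
        "temp_std": 6.0,
        "base_load": 500.0,
        "temp_coef": 12.0,
        "noise_std": 20.0,
    },
    "classification": {
        "age_min": 0,
        "age_max": 40,
        "load_min": 100,
        "load_max": 1000,
    },
    "clustering": {
        "cluster_means": [1.0, 2.0, 3.5],
        "cluster_stds": [0.2, 0.3, 0.4],
        "samples_per_cluster": 50,
    },
    "model": {"random_state": 42, "test_size": 0.2, "n_clusters": 3},
    "plotting": {
        "figsize_regression": [8, 5],
        "figsize_clustering": [10, 6],
        "colors": {"observed": "gray", "regression": "black"},
        "output_files": {
            "regression": "regression.png",
            "clustering": "clustering.png",
        },
    },
}


def check_config(cfg):
    """Return a list of problems found in the config (empty if none)."""
    problems = []
    clustering = cfg["clustering"]
    if len(clustering["cluster_means"]) != len(clustering["cluster_stds"]):
        problems.append("cluster_means and cluster_stds differ in length")
    if not 0 < cfg["model"]["test_size"] < 1:
        problems.append("test_size should be a fraction between 0 and 1")
    if cfg["classification"]["age_min"] >= cfg["classification"]["age_max"]:
        problems.append("age_min must be below age_max")
    return problems


print(check_config(config))  # prints []
```

Writing the same keys and values into config.yaml should be enough for the script to load them with yaml.safe_load. Adjust the numbers to match your own system before drawing any conclusions from the plots.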


What I Want You to Remember

Three methods cover most utility ML needs: regression for continuous predictions, classification for categorical outcomes, and clustering for pattern discovery without labels.

Model evaluation is essential. Always split data into train/test sets and use appropriate metrics. Overfitting is a constant risk. I’ve seen this happen too many times: a model that’s 99% accurate on training data but useless in production.

Utility context matters. The same ML technique, such as classification, applies differently to failure prediction versus customer churn. Domain knowledge guides feature selection and interpretation.

Start simple, evaluate thoroughly. Linear regression and logistic regression are interpretable and often perform well. Only move to complex models when simple ones prove insufficient. I’m bullish on starting simple: you can always add complexity later, but you can’t add interpretability.

Visualization aids interpretation. Plots reveal model behavior that metrics alone miss. Always visualize predictions, residuals, and clusters to understand what your model is actually learning. I’ve said this before, and I’ll keep saying it: always plot your data.


What’s Next

In Chapter 4, we’ll dive deeper into time series forecasting—a critical utility application. You’ll learn ARIMA models and see how to extend regression approaches to handle temporal dependencies in load forecasting. The principles are the same, but time series adds complexity.


Code

"""Chapter 3: Machine Learning Fundamentals for Power and Utilities."""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yaml
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, classification_report

# Load config
config_path = Path(__file__).parent / "config.yaml"
with open(config_path) as f:
    config = yaml.safe_load(f)

np.random.seed(config["model"]["random_state"])


def generate_regression_data():
    """Generate synthetic regression data: temperature vs. daily load (MW)."""
    samples = config["data"]["regression_samples"]
    temp = np.random.normal(config["regression"]["temp_mean"], 
                           config["regression"]["temp_std"], samples)
    load = (config["regression"]["base_load"] + 
            config["regression"]["temp_coef"] * temp + 
            np.random.normal(0, config["regression"]["noise_std"], samples))
    return pd.DataFrame({"Temperature_C": temp, "Load_MW": load})


def regression_example(df):
    """Fit and visualize linear regression (Temperature -> Load)."""
    X = df[["Temperature_C"]]
    y = df["Load_MW"]
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)

    fig, ax = plt.subplots(figsize=config["plotting"]["figsize_regression"])
    ax.scatter(X, y, color=config["plotting"]["colors"]["observed"], 
                alpha=0.7, label="Observed")
    ax.plot(X, y_pred, color=config["plotting"]["colors"]["regression"], 
             label="Regression Line")
    ax.set_title(f"Load (MW) vs Temperature (°C) - Linear Regression (MSE = {mse:.2f})")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.legend()
    plt.tight_layout()
    plt.savefig(config["plotting"]["output_files"]["regression"])
    plt.close()


def generate_classification_data():
    """Generate synthetic classification data: equipment age, load -> failure probability."""
    samples = config["data"]["classification_samples"]
    age = np.random.uniform(config["classification"]["age_min"], 
                           config["classification"]["age_max"], samples)
    load = np.random.uniform(config["classification"]["load_min"], 
                            config["classification"]["load_max"], samples)
    failure_prob = 1 / (1 + np.exp(-(0.1 * age + 0.002 * load - 7)))
    failure = np.random.binomial(1, failure_prob)
    return pd.DataFrame({"Age_Years": age, "Load_kVA": load, "Failure": failure})


def classification_example(df):
    """Train and evaluate logistic regression for equipment failure prediction."""
    X = df[["Age_Years", "Load_kVA"]]
    y = df["Failure"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config["model"]["test_size"], 
        random_state=config["model"]["random_state"], stratify=y
    )
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=["Healthy", "Failure"]))


def clustering_example():
    """Apply clustering to synthetic smart meter load profiles (daily kWh)."""
    np.random.seed(config["model"]["random_state"])
    means = config["clustering"]["cluster_means"]
    stds = config["clustering"]["cluster_stds"]
    n_samples = config["clustering"]["samples_per_cluster"]
    
    clusters = [np.random.normal(m, s, (n_samples, 24)) for m, s in zip(means, stds)]
    data = np.vstack(clusters)
    
    kmeans = KMeans(n_clusters=config["model"]["n_clusters"], 
                    random_state=config["model"]["random_state"])
    labels = kmeans.fit_predict(data)

    fig, ax = plt.subplots(figsize=config["plotting"]["figsize_clustering"])
    for i in range(config["model"]["n_clusters"]):
        cluster_profiles = data[labels == i]
        ax.plot(cluster_profiles.T, color="gray", alpha=0.2)
        ax.plot(cluster_profiles.mean(axis=0), label=f"Cluster {i+1}", linewidth=2)
    ax.set_title("Daily Load Profiles (kWh) by Hour - Customer Segmentation")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.legend()
    plt.tight_layout()
    plt.savefig(config["plotting"]["output_files"]["clustering"])
    plt.close()


if __name__ == "__main__":
    # Regression
    df_reg = generate_regression_data()
    regression_example(df_reg)

    # Classification
    df_class = generate_classification_data()
    retries = 0
    while df_class["Failure"].nunique() < 2 and retries < 5:
        df_class = generate_classification_data()
        retries += 1
    classification_example(df_class)

    # Clustering
    clustering_example()