Machine Learning Fundamentals

The Business Problem: Predicting and Classifying in Complex Environments

Utilities operate vast networks of physical assets that must balance supply and demand in real time. Predicting how these systems behave under different conditions is critical. Grid operators must forecast tomorrow’s load to schedule generation. Maintenance planners must decide which transformers are at greatest risk of failure. Customer engagement teams need to identify which customers are likely to participate in demand response programs.

Traditionally, these tasks have relied on deterministic engineering models or static business rules. These approaches work in stable, predictable conditions but struggle in the face of variability and uncertainty. Weather changes hourly, demand fluctuates daily, and aging equipment deteriorates in nonlinear ways. The complexity of modern power systems makes it impractical to encode every rule explicitly or to manually sift through massive datasets.

Machine learning addresses this by learning patterns directly from data rather than relying solely on pre-programmed rules. Instead of hand-coding equations to model every scenario, we allow algorithms to find statistical relationships between inputs and outputs. This is particularly powerful in utilities, where data from smart meters, SCADA systems, asset registries, and customer programs contains rich but underutilized signals about system behavior and risks.

The Analytics Solution: Core Learning Methods

Machine learning is not a single technique but a collection of methods that fall into several broad categories. In this chapter, we focus on three foundational approaches that recur throughout utility applications: regression, classification, and clustering.

Regression predicts a continuous outcome. It is used when we want to estimate a value, such as future load in megawatts, transformer oil temperature, or customer energy usage. Linear regression, for example, can relate temperature and time of day to hourly demand, providing forecasts that help balance supply and demand.

Classification predicts a discrete outcome. It is used to assign categories, such as determining whether a piece of equipment is healthy or likely to fail, or whether an observed pattern is normal or anomalous. This underpins predictive maintenance, cybersecurity detection, and many operational workflows.

Clustering is an unsupervised technique that groups similar observations together. It is particularly useful when labels are not available. For example, clustering smart meter profiles can reveal natural segments of customers—such as those with high evening peaks versus those with flat daytime usage—informing rate design and demand response targeting.

Understanding these core methods and their differences is essential before tackling more advanced techniques. They provide a common language between data science and engineering teams and form the backbone of most practical machine learning pipelines in utilities.

Connecting Methods to Real Utility Scenarios

Consider transformer failure prediction. We may have sensor data on temperature, vibration, and load, combined with asset attributes like age and manufacturer. A classification model trained on historical failure records can learn to distinguish healthy transformers from those approaching failure. By scoring current assets, it flags those at highest risk for inspection or replacement.

For load forecasting, regression models link weather variables, calendar effects, and historical demand patterns to predict consumption at different time horizons. These forecasts drive market bidding strategies and generator commitment decisions. Even basic models deliver significant operational improvements compared to heuristic forecasts.

In customer analytics, clustering can segment households based on usage profiles from AMI data. These segments inform demand response outreach, such as targeting high-peak households with incentives for load shifting. Clustering can also uncover emerging patterns, like neighborhoods adopting electric vehicles, before they show up in feeder overloading alarms.

These examples illustrate how simple machine learning concepts map directly to real problems. By framing utility questions in terms of prediction, classification, and grouping, we create clear pathways from business needs to analytic solutions.

Transition to the Demo

In this chapter’s demo, we will build three small but representative models using synthetic utility datasets. We will:

Apply regression to model the relationship between weather and load, producing short-term demand forecasts.
Train a classification model to predict transformer failures using sensor and asset attributes.
Use clustering to group customers based on their smart meter profiles, revealing distinct consumption behaviors.

Each example mirrors a real-world utility use case while remaining accessible enough to understand the mechanics. Through these exercises, we will see how data is transformed into predictions, how models are evaluated, and how results connect back to operational decision-making.

This chapter bridges the gap between theory and applied practice, grounding machine learning fundamentals in the context of utility operations. It demonstrates that these techniques are not abstract—they are tools directly applicable to solving persistent industry challenges using the data utilities already collect.

pyfile shortcode: missing param 'file'. Example: {{< pyfile file="script.py" >}}

Code

"""
Chapter 3: Machine Learning Fundamentals for Power and Utilities
Covers supervised regression/classification, clustering, and model evaluation.
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, classification_report

def generate_regression_data(samples=200):
    """
    Generate synthetic regression data: temperature vs. daily load (MW).
    """
    temp = np.random.normal(20, 8, samples)  # Ambient temperature (C)
    load = 900 + 15 * temp + np.random.normal(0, 30, samples)  # Linear relationship
    return pd.DataFrame({"Temperature_C": temp, "Load_MW": load})

def regression_example(df):
    """
    Fit and visualize linear regression (Temperature -> Load).
    """
    X = df[["Temperature_C"]]
    y = df["Load_MW"]
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)

    plt.figure(figsize=(8, 5))
    plt.scatter(X, y, color="gray", alpha=0.7, label="Observed")
    plt.plot(X, y_pred, color="black", label="Regression Line")
    plt.xlabel("Temperature (°C)")
    plt.ylabel("Load (MW)")
    plt.title(f"Linear Regression (MSE = {mse:.2f})")
    plt.legend()
    plt.tight_layout()
    plt.savefig("chapter3_regression.png")
    plt.show()

def generate_classification_data(samples=200):
    """
    Generate synthetic classification data: equipment age, load -> failure probability.
    """
    age = np.random.uniform(1, 30, samples)  # Equipment age (years)
    load = np.random.uniform(500, 1000, samples)  # Transformer load (kVA)
    failure_prob = 1 / (1 + np.exp(-(0.1 * age + 0.002 * load - 7)))  # Sigmoid risk
    failure = np.random.binomial(1, failure_prob)
    return pd.DataFrame({"Age_Years": age, "Load_kVA": load, "Failure": failure})

def classification_example(df):
    """
    Train and evaluate logistic regression for equipment failure prediction.
    """
    X = df[["Age_Years", "Load_kVA"]]
    y = df["Failure"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=["Healthy", "Failure"]))

def clustering_example():
    """
    Apply clustering to synthetic smart meter load profiles (daily kWh).
    """
    np.random.seed(42)
    cluster1 = np.random.normal(15, 2, (50, 24))  # Low consumption
    cluster2 = np.random.normal(25, 3, (50, 24))  # Medium consumption
    cluster3 = np.random.normal(40, 4, (50, 24))  # High consumption
    data = np.vstack([cluster1, cluster2, cluster3])
    kmeans = KMeans(n_clusters=3, random_state=42).fit(data)

    plt.figure(figsize=(10, 6))
    for i in range(3):
        plt.plot(data[kmeans.labels_ == i].T, alpha=0.2)
    plt.xlabel("Hour of Day")
    plt.ylabel("Load (kWh)")
    plt.title("Clustering Smart Meter Daily Load Profiles")
    plt.tight_layout()
    plt.savefig("chapter3_clustering.png")
    plt.show()

if __name__ == "__main__":
    # Regression
    df_reg = generate_regression_data()
    regression_example(df_reg)

    # Classification
    df_class = generate_classification_data()
    classification_example(df_class)

    # Clustering
    clustering_example()