Grid Operations Optimization
What You’ll Learn
By the end of this chapter, you will understand how reinforcement learning can optimize real-time grid control decisions. You’ll learn the basics of RL agents and how they interact with grid environments. You’ll see how Q-learning can learn voltage control policies through trial and error. You’ll recognize the importance of safety constraints when applying RL to critical infrastructure, and you’ll appreciate the difference between simulation and real-world deployment.
The Business Problem: Managing a Dynamic Grid in Real Time
The electric grid is an intricate balancing act. Voltage, frequency, and power flows must remain within tight tolerances to ensure reliability and safety. Historically, grid operations were simpler: centralized generation plants supplied predictable loads through a largely passive transmission and distribution network. Control rooms relied on deterministic engineering models, SCADA data, and operator experience to keep the system stable.
This paradigm is under strain. Distributed energy resources like rooftop solar and battery storage introduce bidirectional flows on distribution circuits. Electric vehicles add sudden, localized load spikes. Weather volatility drives rapid swings in both supply and demand. Operators must adjust reactive power, tap changers, capacitor banks, and feeder configurations more frequently and with less predictability than before. Manual interventions that once sufficed are no longer enough to keep pace with these dynamics.
Control rooms where operators are overwhelmed by the number of decisions they need to make in real time. The old way—set it and forget it—doesn’t work anymore. You need systems that can adapt.
Inefficient grid operations not only increase the risk of violations but also lead to unnecessary energy losses and wear on equipment. Voltage that is too high accelerates transformer aging, while voltage that is too low triggers customer complaints and equipment malfunctions. Operators are challenged to maintain stability while minimizing costs, all in an environment of growing complexity.
The Analytics Solution: Intelligent Control Through Optimization
Grid operations optimization uses analytics and machine learning to support real-time decision-making. Instead of relying solely on heuristic rules or static setpoints, optimization algorithms learn how to adjust controls dynamically to maintain system stability and efficiency.
One promising approach is reinforcement learning (RL). RL agents learn by interacting with a simulated grid environment, testing control actions and observing their effects on voltage, frequency, and power flow. Over time, the agent develops policies that stabilize the grid with minimal intervention, reducing voltage excursions, reactive power costs, and operator burden.
Beyond RL, optimization also includes predictive analytics that anticipate problems before they occur. For example, forecasting feeder voltage based on load and DER behavior enables preemptive adjustments. This approach shifts operators from reacting to alarms toward making proactive, data-driven decisions.
Understanding Reinforcement Learning
Reinforcement learning is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions, such as adjusting reactive power, observes the state, such as current voltage and load, and receives rewards, such as a negative penalty for voltage deviation from 1.0 per unit.
Key RL concepts include state, which represents current grid conditions like voltage and load, action, which is a control decision such as increasing or decreasing reactive power or holding steady, reward, which is a feedback signal like a penalty for voltage violations or a bonus for stability, policy, which is a strategy mapping states to actions, and the Q-table, which stores expected rewards for state-action pairs.
The agent learns through exploration, trying different actions, and exploitation, using learned knowledge. Over many episodes, it discovers which actions work best in different states.
Important safety note: RL for grid control requires extensive simulation and testing before deployment. Real-world grid failures can cause outages, equipment damage, or safety hazards. Always validate RL policies in simulation, test with human oversight, and implement safety guardrails that prevent dangerous actions.
Operational Benefits
By embedding analytics into grid operations, utilities can achieve tighter control with less manual effort. Optimized voltage regulation reduces losses and equipment stress. Automated adjustment of capacitor banks and inverters frees operators to focus on higher-level tasks. When deployed carefully, these tools act as decision-support systems, augmenting rather than replacing human judgment.
These capabilities also support the integration of DERs and advanced customer programs. Smart inverter controls coordinated by machine learning help manage voltage volatility caused by rooftop solar. As electrification accelerates, optimization tools will be critical for maintaining service quality without overbuilding infrastructure.
The scale challenge here is massive. TotalEnergies, one of the world’s top five energy companies with over 100,000 employees, faces what a lot of energy companies face: hundreds of legacy systems producing data, millions of measurements every day from time series sensor data coming from plants worldwide. The variety is huge—process data, sensor readings, operational logs from facilities across the globe.
Their strategic approach focuses on making data available to everyone in the company, not just data scientists. As one of their executives put it: “Data should not be only for data scientists, it should be for everybody in the company.” That democratization enables operational teams to make data-driven decisions in real time. They rely on a cloud strategy to extract data from legacy systems and make it widely available, which eases access for everybody while enabling advanced AI and machine learning usage.
The impact has been substantial. In just three years, they’ve put more than 100 applications into production and deployed more than 250 models. The millions of daily measurements from sensors worldwide feed into models that optimize energy production, transmission, and distribution operations. The result is more efficient operations, better resource utilization, and improved decision-making across the entire energy value chain. That’s the kind of scale that matters when you’re trying to solve one of the major challenges of the world.
Building a Reinforcement Learning Agent for Voltage Control
Let me show you a simplified reinforcement learning approach to voltage control. This is a toy example for educational purposes—real grid control requires much more sophisticated models, extensive safety validation, and integration with SCADA systems. I’m showing you this because RL is promising, but it’s also risky if you don’t do it right.
First, we create a grid environment that simulates voltage dynamics:
class VoltageControlEnv(gym.Env if HAS_GYM else object):
"""Custom environment simulating grid voltage with reactive power control."""
def __init__(self):
if HAS_GYM:
super().__init__()
self.voltage = config["environment"]["voltage_init"]
self.load = 0.8
self.reactive_power = 0.0
if HAS_GYM:
self.action_space = spaces.Discrete(3) # Decrease Q, Hold, Increase Q
self.observation_space = spaces.Box(
low=np.array([0.9, config["environment"]["load_min"]]),
high=np.array([1.1, config["environment"]["load_max"]]),
dtype=np.float32
)
def reset(self, seed=None, options=None):
self.voltage = config["environment"]["voltage_init"]
self.load = np.random.uniform(
config["environment"]["load_init_min"],
config["environment"]["load_init_max"]
)
self.reactive_power = 0.0
return np.array([self.voltage, self.load], dtype=np.float32), {}
def step(self, action):
step_size = config["environment"]["reactive_power_step"]
if action == 0: # decrease reactive power
self.reactive_power -= step_size
elif action == 2: # increase reactive power
self.reactive_power += step_size
# Voltage dynamics
self.voltage = (1.0 - 0.05 * (self.load - 0.8) + 0.04 * self.reactive_power +
np.random.normal(0, config["environment"]["voltage_noise"]))
reward = -abs(self.voltage - 1.0) # Penalize deviation from 1.0 pu
self.load += np.random.normal(0, config["environment"]["load_noise"])
self.load = np.clip(self.load, config["environment"]["load_min"],
config["environment"]["load_max"])
return np.array([self.voltage, self.load], dtype=np.float32), reward, False, False, {}
def render(self):
pass
This environment models basic voltage dynamics: higher load decreases voltage, reactive power injection increases voltage. The environment provides states (voltage, load) and rewards (penalty for voltage deviations from 1.0 per unit). In practice, you’d use a full power flow model, but the principles are the same.
Next, we implement and train a Q-learning agent:
def train_rl_agent():
"""Train Q-learning agent for voltage control."""
if not HAS_GYM:
print("gymnasium not available; skipping RL training.")
return
env = VoltageControlEnv()
q_table = np.zeros((10, 10, env.action_space.n)) # Discretized voltage/load
def discretize(obs):
v_bin = min(int((obs[0] - 0.9) / 0.02), 9)
l_bin = min(int((obs[1] - 0.5) / 0.07), 9)
return v_bin, l_bin
alpha = config["rl"]["alpha"]
gamma = config["rl"]["gamma"]
epsilon = config["rl"]["epsilon"]
episodes = config["rl"]["episodes"]
steps = config["rl"]["steps_per_episode"]
print_interval = config["rl"]["print_interval"]
for ep in range(episodes):
state, _ = env.reset()
v_bin, l_bin = discretize(state)
total_reward = 0
for _ in range(steps):
if random.random() > epsilon:
action = np.argmax(q_table[v_bin, l_bin])
else:
action = env.action_space.sample()
next_state, reward, _, _, _ = env.step(action)
nv_bin, nl_bin = discretize(next_state)
q_table[v_bin, l_bin, action] += alpha * (
reward + gamma * np.max(q_table[nv_bin, nl_bin]) - q_table[v_bin, l_bin, action]
)
v_bin, l_bin = nv_bin, nl_bin
total_reward += reward
if ep % print_interval == 0:
print(f"Episode {ep}: Total Reward {total_reward:.2f}")
The Q-learning algorithm learns which actions (decrease Q, hold, increase Q) work best in different voltage/load states. The agent explores the environment, receives rewards, and builds a Q-table storing learned knowledge. Episode rewards start highly negative (poor control) and improve as the agent learns. RL agents can learn surprisingly good policies, but they can also learn dangerous ones if you don’t constrain them properly.
Safety warning: Never deploy RL agents to control real grid equipment without extensive simulation testing, safety guardrails preventing violations, human operator oversight, gradual rollout with monitoring, and the ability to revert to manual control instantly. I can’t emphasize this enough—grid failures are unacceptable. Test extensively, deploy cautiously.
The complete, runnable script is at content/c7/grid_optimization.py. Run it, but remember: this is a simulation. Real grid control is much more complex.
What I Want You to Remember
Reinforcement learning offers adaptive control. RL agents can learn to optimize grid operations by interacting with simulated environments, potentially outperforming static control rules. Simulation is essential. RL requires extensive simulation to learn safely. Real grid failures are unacceptable, so agents must be thoroughly validated before deployment.
Safety is paramount. RL for critical infrastructure requires safety constraints, human oversight, and gradual deployment. The technology is promising but must be applied cautiously. RL complements, doesn’t replace, operators. The goal is decision support, not full automation. Operators retain final authority and can override agent decisions.
This is advanced material. RL is more complex than classification or regression. Consider this chapter an introduction—production RL systems require significant expertise and infrastructure. RL works well in simulation, but it can fail in production because the real world is messier than the simulation. Start simple, test extensively, deploy cautiously.
What’s Next
In Chapter 8, we’ll return to forecasting but focus on renewable energy—using PVLib and SARIMA to predict solar and wind generation, which is essential for managing variable resources on the grid. This is more practical than RL for most utilities right now.
Code
"""Chapter 7: Grid Operations Optimization."""
import numpy as np
import matplotlib.pyplot as plt
import yaml
from pathlib import Path
import random
# Optional dependency: gymnasium
HAS_GYM = True
try:
import gymnasium as gym
from gymnasium import spaces
except Exception:
HAS_GYM = False
# Load config
config_path = Path(__file__).parent / "config.yaml"
with open(config_path) as f:
config = yaml.safe_load(f)
random.seed(config["model"]["random_state"])
np.random.seed(config["model"]["random_state"])
class VoltageControlEnv(gym.Env if HAS_GYM else object):
"""Custom environment simulating grid voltage with reactive power control."""
def __init__(self):
if HAS_GYM:
super().__init__()
self.voltage = config["environment"]["voltage_init"]
self.load = 0.8
self.reactive_power = 0.0
if HAS_GYM:
self.action_space = spaces.Discrete(3) # Decrease Q, Hold, Increase Q
self.observation_space = spaces.Box(
low=np.array([0.9, config["environment"]["load_min"]]),
high=np.array([1.1, config["environment"]["load_max"]]),
dtype=np.float32
)
def reset(self, seed=None, options=None):
self.voltage = config["environment"]["voltage_init"]
self.load = np.random.uniform(
config["environment"]["load_init_min"],
config["environment"]["load_init_max"]
)
self.reactive_power = 0.0
return np.array([self.voltage, self.load], dtype=np.float32), {}
def step(self, action):
step_size = config["environment"]["reactive_power_step"]
if action == 0: # decrease reactive power
self.reactive_power -= step_size
elif action == 2: # increase reactive power
self.reactive_power += step_size
# Voltage dynamics
self.voltage = (1.0 - 0.05 * (self.load - 0.8) + 0.04 * self.reactive_power +
np.random.normal(0, config["environment"]["voltage_noise"]))
reward = -abs(self.voltage - 1.0) # Penalize deviation from 1.0 pu
self.load += np.random.normal(0, config["environment"]["load_noise"])
self.load = np.clip(self.load, config["environment"]["load_min"],
config["environment"]["load_max"])
return np.array([self.voltage, self.load], dtype=np.float32), reward, False, False, {}
def render(self):
pass
def train_rl_agent():
"""Train Q-learning agent for voltage control."""
if not HAS_GYM:
print("gymnasium not available; skipping RL training.")
return
env = VoltageControlEnv()
q_table = np.zeros((10, 10, env.action_space.n)) # Discretized voltage/load
def discretize(obs):
v_bin = min(int((obs[0] - 0.9) / 0.02), 9)
l_bin = min(int((obs[1] - 0.5) / 0.07), 9)
return v_bin, l_bin
alpha = config["rl"]["alpha"]
gamma = config["rl"]["gamma"]
epsilon = config["rl"]["epsilon"]
episodes = config["rl"]["episodes"]
steps = config["rl"]["steps_per_episode"]
print_interval = config["rl"]["print_interval"]
for ep in range(episodes):
state, _ = env.reset()
v_bin, l_bin = discretize(state)
total_reward = 0
for _ in range(steps):
if random.random() > epsilon:
action = np.argmax(q_table[v_bin, l_bin])
else:
action = env.action_space.sample()
next_state, reward, _, _, _ = env.step(action)
nv_bin, nl_bin = discretize(next_state)
q_table[v_bin, l_bin, action] += alpha * (
reward + gamma * np.max(q_table[nv_bin, nl_bin]) - q_table[v_bin, l_bin, action]
)
v_bin, l_bin = nv_bin, nl_bin
total_reward += reward
if ep % print_interval == 0:
print(f"Episode {ep}: Total Reward {total_reward:.2f}")
if __name__ == "__main__":
train_rl_agent()