Intelligent Control of Battery Thermal Management Systems Using Deep Reinforcement Learning

The imperative for sustainable transportation has positioned electric vehicles (EVs) at the forefront of the global energy revolution. However, the performance, safety, and longevity of the sole power source—the lithium-ion battery pack—are critically dependent on its operating temperature. An efficient battery management system (BMS), particularly its thermal management subsystem, is paramount to maintaining the battery within an optimal range (typically 20–50°C), preventing thermal runaway at high temperatures and mitigating performance degradation at low temperatures. Traditional control strategies, such as rule-based ON/OFF or Proportional-Integral-Derivative (PID) control, often struggle with precision, energy efficiency, and adaptability to complex, dynamic driving conditions and varying environmental factors.

This article explores an advanced, data-driven approach by applying Deep Reinforcement Learning (DRL) to the control of a liquid-cooled Battery Thermal Management System (BTMS). We construct a high-fidelity simulation environment integrating a battery electro-thermal coupling model with a vapor-compression refrigeration system model. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is employed to train an intelligent agent. This agent learns optimal control policies for the compressor and auxiliary heaters by interacting with the simulated environment, aiming to precisely regulate battery temperature while minimizing system energy consumption. Comparative analysis with conventional PID and ON/OFF controllers demonstrates the superior performance of the DRL-based strategy in terms of temperature uniformity, control stability, and significant energy savings across various seasonal and operational scenarios.

1. System Modeling of the Battery Thermal Management System

1.1 System Architecture

The studied BTMS features a dual-loop architecture: a refrigerant loop (using R134a) and a coolant loop (using an ethylene glycol-water mixture). The refrigerant loop is a standard vapor-compression cycle comprising a compressor, a condenser, an expansion valve, and a chiller (which acts as the evaporator for the battery cooling circuit). The coolant loop circulates the cooled glycol mixture through liquid cold plates attached to the battery modules to absorb heat.

1.2 Refrigeration System Modeling

1.2.1 Compressor Model
The compressor is modeled considering its volumetric and isentropic efficiencies. The refrigerant mass flow rate $\dot{m}_{comp}$ is given by:
$$\dot{m}_{comp} = N_{comp} \cdot V_d \cdot \rho_{suct} \cdot \eta_{vol}$$
where $N_{comp}$ is the compressor speed, $V_d$ is the displacement volume, $\rho_{suct}$ is the suction density, and $\eta_{vol}$ is the volumetric efficiency. The isentropic efficiency $\eta_{is}$ is defined as:
$$\eta_{is} = \frac{h_{is,out} - h_{in}}{h_{out} - h_{in}}$$
where $h$ represents enthalpy, and the subscripts $in$, $out$, and $is,out$ denote compressor inlet, actual outlet, and isentropic outlet states, respectively.
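
As a quick illustration of how these two relations combine in the simulation, the sketch below returns the refrigerant mass flow rate, the actual outlet enthalpy, and an estimate of compressor power. The efficiencies and enthalpies are assumed inputs; in the full model they would come from efficiency maps and refrigerant property tables (e.g., a library such as CoolProp).

```python
def compressor_model(n_comp, v_d, rho_suct, eta_vol, h_in, h_is_out, eta_is):
    """Minimal compressor sketch based on the volumetric/isentropic relations above.

    n_comp   : compressor speed [rev/s]
    v_d      : displacement volume per revolution [m^3]
    rho_suct : suction density [kg/m^3]
    eta_vol  : volumetric efficiency [-]
    h_in     : suction enthalpy [J/kg]
    h_is_out : isentropic outlet enthalpy at the discharge pressure [J/kg]
    eta_is   : isentropic efficiency [-]
    """
    # Refrigerant mass flow rate: m_dot = N * V_d * rho_suct * eta_vol
    m_dot = n_comp * v_d * rho_suct * eta_vol
    # Actual outlet enthalpy from the isentropic efficiency definition
    h_out = h_in + (h_is_out - h_in) / eta_is
    # Power absorbed by the refrigerant (compressor power estimate)
    p_comp = m_dot * (h_out - h_in)
    return m_dot, h_out, p_comp
```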

1.2.2 Heat Exchanger Model (Condenser/Evaporator/Chiller)
The heat exchangers are modeled using the ε-NTU method. For the refrigerant side, the flow is divided into three regions: subcooled liquid, two-phase, and superheated vapor. The heat transfer rate $Q$ for a region is:
$$Q = \varepsilon \cdot (\dot{m}c_p)_{min} \cdot (T_{hot,in} - T_{cold,in})$$
where $\varepsilon$ is the effectiveness, a function of the Number of Transfer Units ($NTU$):
$$NTU = \frac{z \cdot U \cdot A}{(\dot{m}c_p)_{min}}$$
Here, $z$ is the length fraction of the region, $U$ is the overall heat transfer coefficient, $A$ is the heat transfer area, and $(\dot{m}c_p)_{min}$ is the minimum heat capacity rate between the two fluids. The overall $U$ is calculated from convective heat transfer coefficients on both sides, obtained using appropriate Nusselt number correlations (e.g., Gnielinski for single-phase turbulent flow, Cavallini & Zecchin for two-phase flow).
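
The per-region calculation can be sketched as follows. The counterflow effectiveness relation and its $C_r \to 0$ limit for the two-phase region are standard ε-NTU results; the $U$, $A$, and $z$ values are assumed to come from the correlations mentioned above.

```python
import math

def region_heat_transfer(z, U, A, C_hot, C_cold, T_hot_in, T_cold_in, two_phase=False):
    """epsilon-NTU heat transfer rate for one region of the exchanger (sketch).

    C_hot, C_cold : heat capacity rates (m_dot * c_p) of the two streams [W/K]
    z             : length fraction of the region [-]
    """
    C_min, C_max = min(C_hot, C_cold), max(C_hot, C_cold)
    ntu = z * U * A / C_min                          # NTU = z*U*A / (m*cp)_min
    if two_phase:
        eps = 1.0 - math.exp(-ntu)                   # phase-changing side: C_r -> 0
    else:
        c_r = C_min / C_max
        if abs(1.0 - c_r) < 1e-6:                    # balanced counterflow limit
            eps = ntu / (1.0 + ntu)
        else:                                        # counterflow effectiveness
            eps = (1.0 - math.exp(-ntu * (1.0 - c_r))) / \
                  (1.0 - c_r * math.exp(-ntu * (1.0 - c_r)))
    return eps * C_min * (T_hot_in - T_cold_in)      # Q = eps * C_min * (T_h,in - T_c,in)
```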

1.2.3 Expansion Valve Model
The thermostatic expansion valve is modeled as a simple restrictive element:
$$\dot{m}_{v} = C_v \cdot a_v \cdot \sqrt{\rho_v \cdot (p_c - p_e)}$$
where $C_v$ is the flow coefficient, $a_v$ is the valve opening, $\rho_v$ is the inlet density, and $p_c$ and $p_e$ are the condenser and evaporator pressures, respectively.
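
A direct transcription of this relation, with the pressure difference clipped at zero to avoid a negative square-root argument:

```python
def expansion_valve_flow(c_v, a_v, rho_in, p_cond, p_evap):
    """Refrigerant mass flow through the thermostatic expansion valve (sketch)."""
    dp = max(p_cond - p_evap, 0.0)           # clip reverse pressure difference
    return c_v * a_v * (rho_in * dp) ** 0.5  # m_dot = C_v * a_v * sqrt(rho * dp)
```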

1.3 Battery Pack Thermal-Electrical Modeling

1.3.1 Battery Electrical Model
A simplified Rint equivalent circuit model is adopted to balance accuracy and computational efficiency for the battery management system control. The terminal voltage $U_{bat}$ is:
$$U_{bat} = U_{OCV}(SOC, T_{bat}) - I_{bat} \cdot R_{int}(SOC, T_{bat}, I_{bat})$$
where $U_{OCV}$ is the open-circuit voltage, $I_{bat}$ is the current (positive for discharge), $R_{int}$ is the internal resistance, and $SOC$ is the state of charge. The $SOC$ is updated as:
$$SOC(t) = SOC_0 - \frac{1}{C_{bat}} \int_0^t I_{bat}(\tau)\, d\tau$$
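
A minimal discrete-time sketch of the Rint model is given below; `ocv_lookup` and `rint_lookup` are hypothetical interpolation tables for $U_{OCV}(SOC, T)$ and $R_{int}(SOC, T, I)$, and the cell capacity is expressed in ampere-seconds.

```python
def rint_model_step(soc, t_bat, i_bat, dt, capacity_as, ocv_lookup, rint_lookup):
    """One step of the Rint equivalent-circuit model (sketch).

    i_bat > 0 denotes discharge; capacity_as is the cell capacity in A*s.
    ocv_lookup(soc, t) and rint_lookup(soc, t, i) are placeholder tables.
    """
    u_ocv = ocv_lookup(soc, t_bat)
    r_int = rint_lookup(soc, t_bat, i_bat)
    u_bat = u_ocv - i_bat * r_int              # terminal voltage
    soc_next = soc - i_bat * dt / capacity_as  # coulomb counting
    return u_bat, soc_next
```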
The parameters of the studied lithium-ion cell are summarized below.

Table 1. Key Parameters of the Studied Lithium-Ion Cell
Parameter Value
Nominal Voltage 3.65 V
Chemistry (Cathode/Anode) NMC / Graphite
Cut-off Voltage 2.75 – 4.25 V
Mass 0.895 kg
Specific Heat Capacity 1100 J·kg⁻¹·K⁻¹

1.3.2 Battery Thermal Model
The lumped-capacitance thermal model for a battery cell is:
$$c_{bat} m_{bat} \frac{dT_{bat}}{dt} = \dot{Q}_{gen} - \dot{Q}_{cool} - \dot{Q}_{air}$$
The heat generation $\dot{Q}_{gen}$ is primarily from joule heating and the entropic heat effect:
$$\dot{Q}_{gen} = I_{bat}^2 \cdot R_{int} + I_{bat} \cdot T_{bat} \cdot \frac{dU_{OCV}}{dT_{bat}}$$
The cooling by the liquid cold plate $\dot{Q}_{cool}$ is modeled as forced convection:
$$\dot{Q}_{cool} = h_{cool} \cdot A_{cool} \cdot (T_{bat} - T_{coolant})$$
The natural convection to ambient air $\dot{Q}_{air}$ is:
$$\dot{Q}_{air} = h_{air} \cdot A_{air} \cdot (T_{bat} - T_{amb})$$
The battery pack consists of 24 modules (3 cells in parallel, 4 in series). The thermal model is applied to each module, and their temperatures are coupled through the coolant flow.
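
The lumped model translates directly into an explicit-Euler temperature update, sketched below for a single cell. The mass and specific heat default to the values in Table 1, while the heat transfer coefficients, areas, and $dU_{OCV}/dT$ are assumed inputs; the pack model applies this update per module and couples the coolant temperature between modules.

```python
def cell_temperature_step(t_bat, dt, i_bat, r_int, docv_dt,
                          h_cool, a_cool, t_coolant,
                          h_air, a_air, t_amb,
                          m_bat=0.895, c_bat=1100.0):
    """Explicit-Euler update of the lumped cell temperature (sketch).

    Mass [kg] and specific heat [J/(kg*K)] default to the Table 1 values.
    """
    q_gen = i_bat ** 2 * r_int + i_bat * t_bat * docv_dt  # Joule + entropic heat
    q_cool = h_cool * a_cool * (t_bat - t_coolant)         # liquid cold plate
    q_air = h_air * a_air * (t_bat - t_amb)                # natural convection to ambient
    return t_bat + dt * (q_gen - q_cool - q_air) / (m_bat * c_bat)
```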

1.3.3 Coolant Pump Model
The power consumption of the variable-speed coolant pump is:
$$P_{pump} = \frac{1}{\eta_{pump}} \cdot \dot{m}_{coolant} \cdot \frac{\Delta p_{pump}}{\rho_{coolant}}$$
where $\Delta p_{pump}$ is a function of the mass flow rate $\dot{m}_{coolant}$.
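
As a minimal sketch, with an assumed pump efficiency and glycol-water density:

```python
def pump_power(m_dot_coolant, dp_pump, rho_coolant=1070.0, eta_pump=0.7):
    """Power drawn by the coolant pump (sketch); dp_pump is a function of m_dot_coolant.

    rho_coolant [kg/m^3] and eta_pump [-] are placeholder values, not calibrated ones.
    """
    return m_dot_coolant * dp_pump / (rho_coolant * eta_pump)
```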

2. Model Calibration and Validation

The component models were calibrated against experimental data to ensure simulation fidelity, a crucial step for training a reliable DRL agent for the BMS. Key calibration results for the condenser and chiller are summarized below.

Table 2. Condenser Calibration Error Analysis
Test Point Heat Transfer Error Pressure Drop Error
1 0.76% 4.91%
2 2.07% 8.23%
3 1.67% 4.03%
4 0.19% 0.07%
5 1.83% 5.85%
Average 1.30% 4.62%

Table 3. Chiller Calibration Error Analysis
Test Point Heat Transfer Error Pressure Drop Error
1 5.39% 14.57%
2 4.63% 5.25%
3 4.21% 2.15%
4 3.49% 5.00%
Average 4.43% 6.74%

3. Deep Reinforcement Learning Control Strategy Design

The core of the intelligent battery management system control lies in the DRL agent. We formulate the BTMS control as a Markov Decision Process (MDP) and employ the TD3 algorithm, known for its stability in continuous action spaces.

3.1 State Space, Action Space, and Reward Function
State ($s_t$): The observation provided to the agent includes the battery pack average temperature, its rate of change, and the coolant inlet/outlet temperatures.
Action ($a_t$): The control outputs are the normalized compressor speed and the heater power (for winter conditions). The coolant pump flow is held constant.
Reward ($r_t$): The reward function is designed primarily for precise temperature tracking:
$$ r(t) = \begin{cases}
-\alpha \cdot |T_{bat,avg}(t) - T_{target}|, & \text{if } |T_{bat,avg}(t) - T_{target}| \geq 0.5 \\
+10, & \text{if } |T_{bat,avg}(t) - T_{target}| < 0.5
\end{cases} $$
where $\alpha$ is a scaling factor. This function penalizes deviations from the target temperature $T_{target}$ (25°C for summer, 20°C for winter) in proportion to their magnitude and grants a fixed bonus whenever the average temperature is held within 0.5°C of the target, encouraging tight control.
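
In code, the reward reduces to a few lines; the scaling factor $\alpha$ below is an illustrative value rather than the tuned constant from the study.

```python
ALPHA = 1.0   # scaling factor alpha (illustrative value)
BAND = 0.5    # tolerance band around the target [deg C]
BONUS = 10.0  # flat bonus inside the band

def reward(t_bat_avg, t_target):
    """Temperature-tracking reward as defined above (sketch)."""
    err = abs(t_bat_avg - t_target)
    return BONUS if err < BAND else -ALPHA * err
```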

3.2 TD3 Algorithm
TD3 addresses the overestimation bias common in actor-critic methods like DDPG. Its key features are:
1. Clipped Double Q-Learning: Two independent critic (Q-value) networks are trained. The target Q-value for updates is taken as the minimum of the two target critic network outputs, reducing overestimation.
2. Delayed Policy Updates: The actor network and its target are updated less frequently than the critics, allowing for a more stable value function estimate before the policy is changed.
3. Target Policy Smoothing: Noise is added to the target action, which has a regularizing effect by making the value function harder to fit to incorrect, sharp peaks.
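
The first and third mechanisms appear together in the critic target computation, sketched below in PyTorch-style code. The noise and clipping constants follow the original TD3 paper rather than values reported here, and the actor/critic target networks are assumed to be defined elsewhere; delayed policy updates then simply mean the actor and target networks are refreshed only once every few critic updates.

```python
import torch

def td3_target(batch, actor_target, critic1_target, critic2_target,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target policy smoothing (sketch)."""
    state, action, reward, next_state, done = batch
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: take the minimum of the two target critics
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    # Both critics regress toward target_q; the actor is updated only every
    # d critic steps (delayed policy updates).
    return target_q
```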

3.3 Training and Baseline Controllers
The agent was trained using a multi-step constant-current charging profile. For performance comparison, two baseline controllers were implemented:
ON/OFF Control: The compressor and heater are switched on/off based on simple temperature hysteresis bands (e.g., 24-26°C for summer cooling).
PID Control: A PID controller adjusts the compressor speed based on the temperature error. Gains were tuned manually for stability.
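
For reference, minimal sketches of the two baselines are shown below; the hysteresis band matches the example above, while the PID gains and sample time are placeholder values rather than the manually tuned ones.

```python
class OnOffController:
    """Hysteresis (bang-bang) compressor control, e.g. the 24-26 deg C summer band."""
    def __init__(self, t_on=26.0, t_off=24.0):
        self.t_on, self.t_off, self.running = t_on, t_off, False

    def step(self, t_bat_avg):
        if t_bat_avg >= self.t_on:
            self.running = True
        elif t_bat_avg <= self.t_off:
            self.running = False
        return 1.0 if self.running else 0.0   # normalized compressor command


class PIDController:
    """PID on the temperature error; gains and sample time are placeholder values."""
    def __init__(self, kp=0.5, ki=0.01, kd=0.0, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, t_bat_avg, t_target):
        err = t_bat_avg - t_target            # positive error -> more cooling demand
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        u = self.kp * err + self.ki * self.integral + self.kd * deriv
        return min(max(u, 0.0), 1.0)          # clamp to normalized compressor speed
```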

4. Results and Analysis

4.1 Summer Operation

Under a 35°C ambient charging scenario, all three controllers brought the pack average temperature below 26°C. The TD3-controlled BMS exhibited the smallest maximum temperature spread between modules (0.56°C), indicating superior temperature uniformity. More importantly, the TD3 agent modulated the compressor speed more smoothly and proactively in response to changing current, leading to significant energy savings. The compressor energy consumption is compared below.

Table 4. Summer Compressor Energy Consumption Comparison
Control Strategy Charging (kWh) Discharging (CLTC) (kWh)
ON/OFF 1.20 0.81
PID 1.06 0.59
TD3 (Proposed) 1.01 0.55

Compared to ON/OFF and PID, the TD3 strategy achieved energy savings of 15.8% and 4.7% during charging, and 32.1% and 6.8% during discharging, respectively.

4.2 Winter Operation

In winter conditions (-10°C ambient), the system requires heating. The TD3 agent demonstrated intelligent coordination between the heat pump (compressor) and the auxiliary PTC heater. It learned to use the heater primarily for fast initial warm-up and then relied more on the more efficient heat pump mode to maintain temperature, whereas PID and ON/OFF used both concurrently. The total system energy consumption (compressor + heater) is summarized below.

Table 5. Winter Total System Energy Consumption Comparison
Control Strategy Charging (kWh) Discharging (CLTC) (kWh)
ON/OFF 1.90 2.41
PID 1.69 2.31
TD3 (Proposed) 1.40 2.00

The TD3-based battery management system achieved energy savings of 26.3% and 17.2% during winter charging, and 17.0% and 13.4% during discharging, compared to ON/OFF and PID controllers.

4.3 Environmental Adaptability

To test generalization, the trained TD3 policy was evaluated under conditions not seen during training. During summer charging, the agent successfully regulated the battery temperature to ~25°C across a wide ambient temperature range (25°C to 45°C), adjusting the compressor speed accordingly. In winter, the policy was tested under five different driving cycles (CLTC, WLTC, NEDC, UDDS, FTP-75), all with a -10°C start. The agent maintained the pack temperature within 19-21°C in all cases, with total system energy consumption varying by less than 0.2 kWh from the training cycle (CLTC), demonstrating strong robustness.

5. Conclusion

This study successfully developed and validated an intelligent control strategy for a liquid-cooled Battery Thermal Management System using Deep Reinforcement Learning. The TD3 algorithm was trained within a high-fidelity co-simulation environment that integrated detailed models of the battery’s electro-thermal behavior and the vapor-compression refrigeration cycle.

The results conclusively demonstrate that the DRL-based controller outperforms conventional ON/OFF and PID strategies. It provides more precise and stable temperature regulation, evidenced by better module-to-module temperature uniformity. Most significantly, it achieves substantial energy savings—up to 32.1% in summer and 26.3% in winter compared to rule-based methods—by learning an efficient, predictive control policy that optimally coordinates multiple actuators (compressor, heater). Furthermore, the intelligent agent exhibits excellent adaptability to varying environmental temperatures and dynamic load profiles, a critical requirement for real-world automotive battery management system applications.

Future work will focus on incorporating energy consumption directly into the reward function to explicitly trade off tracking precision against efficiency, and on deploying the trained policy to hardware-in-the-loop test benches to validate its real-time performance. This research affirms the feasibility and strong potential of reinforcement learning for advancing the intelligence and efficiency of thermal management in battery management systems.
