With the rapid growth in the number of battery EV cars, developing rational charging and discharging scheduling strategies for battery EV car clusters in distribution networks that integrate photovoltaic systems and multiple distributed charging stations has become a critical challenge in grid management. Traditional optimization methods often struggle with the nonlinearities and dynamic uncertainties involved, while heuristic algorithms face scalability issues in large-scale scenarios. To address this, we propose a multi-agent deep reinforcement learning approach based on dynamic grouping for battery EV car cluster charging scheduling. This paper presents our methodology, environmental modeling, algorithm design, and experimental validation, demonstrating that the approach reduces grid load fluctuations while meeting the individual charging demands of battery EV cars.

The proliferation of battery EV cars is driven by their environmental and economic advantages, but it introduces new challenges for power grids. Large-scale charging loads can cause peak demand and voltage fluctuations, potentially leading to equipment overload and instability. Moreover, the vehicle-to-grid capability of battery EV cars, while offering grid regulation, may increase operational complexity. Factors such as real-time electricity prices, photovoltaic generation variability, and diverse user behaviors add to the uncertainty, making charging scheduling a complex decision-making problem. In this context, we explore multi-agent deep reinforcement learning as a solution to optimize charging schedules for battery EV car clusters in a distributed network environment.
Our work focuses on a distribution network that includes photovoltaic generation and multiple charging stations. We model each battery EV car and charging station as an intelligent agent, enabling decentralized decision-making while considering global objectives. The key contributions include a dynamic grouping mechanism to reduce decision coupling, a commander-follower structure for efficient communication, and a global Q-value fusion framework to handle inter-group competition and cooperation. Through extensive experiments, we show that our approach outperforms existing methods in terms of grid stability, charging efficiency, and cost-effectiveness for battery EV car users.
Environmental Modeling for Battery EV Car Charging
To formulate the charging scheduling problem, we first model the physical environment. The system consists of a distribution network with photovoltaic systems and distributed charging stations, where battery EV cars arrive and request charging services. We define the state space, action space, state transition relationships, and reward functions to capture the dynamics and objectives.
State Space Design
The state space incorporates observations from both battery EV cars and charging stations. For each battery EV car, we consider factors such as state of charge, arrival and departure times, selected charging station, distance to the station, queue information, charging status, and real-time electricity price. For charging stations, we include node active power, reactive power, voltage, and photovoltaic output power. The combined state space is represented as:
$$ S = \{ o_i^t, O_j^t \} $$
where for battery EV car \( i \):
$$ o_i^t = \{ SOC_i^t, p_i^t, t_{a,i}, t_{b,i}, g_i^c, d_i, q_i, f_i^c, \omega_c^t \} $$
and for charging station \( j \):
$$ O_j^t = \{ P_{PV,j}^t, P_{c,j}^t, Q_{c,j}^t, V_j^t, \omega_c^t \} $$
Here, \( SOC_i^t \) is the state of charge, \( p_i^t \) is the charging/discharging power, \( t_{a,i} \) and \( t_{b,i} \) are arrival and departure times, \( g_i^c \) is the charging station selection flag, \( d_i \) is the distance, \( q_i \) is the queue information, \( f_i^c \) is the charging flag, and \( \omega_c^t \) is the real-time electricity price. The node voltage must satisfy constraints to ensure grid safety:
$$ V_{min} \leq V_j^t \leq V_{max}, \quad \forall j \in L $$
The arrival times of battery EV cars follow a bimodal normal distribution to reflect peak commuting patterns, as supported by studies on private car charging behavior:
$$ f(t_a) = p_a \cdot \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(t_a - \mu_1)^2}{2\sigma^2}\right) + (1 - p_a) \cdot \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(t_a - \mu_2)^2}{2\sigma^2}\right) $$
with parameters set to \( \mu_1 = 9 \), \( \mu_2 = 18 \), \( \sigma = 1 \), and \( p_a = 0.4 \). The initial SOC and parking duration also follow probability distributions, ensuring realistic modeling of battery EV car behavior.
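For concreteness, the following is a minimal sketch of how arrival times could be sampled from this bimodal mixture using the stated parameters; the function name and NumPy-based implementation are illustrative rather than part of the reference implementation.

```python
import numpy as np

def sample_arrival_times(n_ev, mu1=9.0, mu2=18.0, sigma=1.0, p_a=0.4, seed=None):
    """Sample arrival times (hours) from the bimodal normal mixture above."""
    rng = np.random.default_rng(seed)
    morning = rng.random(n_ev) < p_a            # morning peak with probability p_a
    means = np.where(morning, mu1, mu2)         # otherwise the evening peak
    t_a = rng.normal(loc=means, scale=sigma)
    return np.clip(t_a, 0.0, 24.0)              # keep times within one day

arrivals = sample_arrival_times(240, seed=42)   # e.g. the 240-vehicle baseline scenario
```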
Action Space Setting
We frame the problem as a Markov game \( \langle N, S, A, P, R, \gamma \rangle \), where \( N \) is the number of agents, \( S \) is the joint state space, \( A \) is the joint action space, \( P \) is the state transition probability, \( R \) is the global reward function, and \( \gamma \) is the discount factor. Each battery EV car agent and charging station agent has its own action space.
For a battery EV car agent \( i \), the action is the charging/discharging power request:
$$ A_i^t = \{ p_i^t \}, \quad i = 1,2,\dots,N $$
The power must adhere to constraints for both the charging station and the individual battery EV car:
$$ -p_{max} \leq \sum_{i=1}^{N} \eta_i \cdot p_i^t \leq p_{max} $$
$$ -p_{2max} \leq p_i^t \leq p_{2max} $$
where \( p_{max} \) is the maximum power of the charging station, \( p_{2max} \) is the maximum power of the charging pile, and \( \eta_i \) is the power exchange efficiency, with \( \eta_c \) for charging and \( \eta_d \) for discharging.
For a charging station agent \( j \), the action is the selection of battery EV cars and allocation of charging power:
$$ A_j^t = \{ select_j \} $$
where \( select_j \) is a vector indicating which battery EV cars are chosen for charging.
State Transition Relationships
After actions are executed, the environment updates its state. The SOC of a battery EV car evolves based on the charging power:
$$ SOC_i^{t+1} = SOC_i^t + \frac{\eta_i \cdot p_i^t \cdot \Delta t}{C_i} $$
where \( C_i \) is the battery capacity of battery EV car \( i \), and \( \Delta t \) is the time step.
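A minimal sketch of this SOC transition, together with the per-pile and station-level power constraints from the previous subsection, is given below. The station limit of 150 kW follows the experimental setup; the 7 kW pile limit and the 95% exchange efficiencies are placeholder assumptions.

```python
def update_soc(soc, p_req, dt_h, capacity_kwh, p2_max=7.0, eta_c=0.95, eta_d=0.95):
    """Clip one EV's requested power to the pile limit and apply the SOC transition.

    soc          current state of charge in [0, 1]
    p_req        requested charging (+) / discharging (-) power in kW
    dt_h         time step in hours (5 min = 1/12 h in the experiments)
    capacity_kwh battery capacity C_i in kWh
    """
    p = max(-p2_max, min(p2_max, p_req))        # per-pile constraint on p_i^t
    eta = eta_c if p >= 0 else eta_d            # exchange efficiency eta_i
    soc_next = soc + eta * p * dt_h / capacity_kwh
    return min(1.0, max(0.0, soc_next)), p

def station_power_ok(powers, etas, p_max=150.0):
    """Station-level constraint: -p_max <= sum(eta_i * p_i) <= p_max."""
    total = sum(e * p for e, p in zip(etas, powers))
    return -p_max <= total <= p_max
```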
Changes in charging power affect node voltages and power flows in the distribution network. The state transition for network nodes can be described using power flow equations:
$$
\begin{aligned}
P_{ij}^t &= P_{c,j}^t + P_{PV,j}^t + \sum_{k \in L} P_{jk}^t + (I_{ij}^t)^2 \cdot r_{ij} \\
Q_{ij}^t &= Q_{c,j}^t + \sum_{k \in L} Q_{jk}^t + (I_{ij}^t)^2 \cdot x_{ij} \\
I_{ij}^t &= \frac{\sqrt{(P_{ij}^t)^2 + (Q_{ij}^t)^2}}{V_i^t} \\
(V_j^t)^2 &= (V_i^t)^2 - 2 \cdot (r_{ij} P_{ij}^t + x_{ij} Q_{ij}^t) + (I_{ij}^t)^2 \cdot (r_{ij}^2 + x_{ij}^2)
\end{aligned}
$$
where \( P_{ij}^t \) and \( Q_{ij}^t \) are the active and reactive power flowing from node \( i \) to node \( j \), \( r_{ij} \) and \( x_{ij} \) are the line resistance and reactance, and \( I_{ij}^t \) is the line current. These branch-flow relations capture how node voltages and line losses change as battery EV cars charge or discharge.
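As an illustration of how these relations propagate along a single line of a radial feeder, here is a per-unit sketch; in the experiments the power flow is solved with the PandaPower library rather than a hand-rolled update like this.

```python
import math

def branch_voltage(v_i, p_flow, q_flow, r, x):
    """Propagate voltage along one line of a radial feeder (per-unit sketch).

    v_i              sending-end voltage magnitude at node i
    p_flow, q_flow   active/reactive power entering the line at node i
    r, x             line resistance and reactance
    Returns (receiving-end voltage V_j, line current I_ij).
    """
    i_line = math.sqrt(p_flow ** 2 + q_flow ** 2) / v_i
    v_j_sq = v_i ** 2 - 2.0 * (r * p_flow + x * q_flow) + (i_line ** 2) * (r ** 2 + x ** 2)
    return math.sqrt(max(v_j_sq, 0.0)), i_line
```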
Multi-Objective Game and Partial Cooperative Relationships
The interaction between charging stations and battery EV cars is characterized by a multi-objective game with partial cooperation. Charging stations aim to maintain grid stability, while battery EV car users seek fast and low-cost charging. However, during periods of high photovoltaic generation, both parties can benefit from cooperative behavior, such as utilizing excess solar power for charging. We model this as a potential game, where a global potential function \( \Phi(\pi_{EV}, \pi_{CS}) \) aligns individual rewards with system-level objectives. The Nash equilibrium condition ensures that no agent can improve its reward by unilaterally deviating from its strategy, and the potential-game structure drives the agents toward Pareto-efficient joint strategies.
The joint strategy space is \( \{\pi_{EV}, \pi_{CS}\} \), and the equilibrium satisfies:
$$
\begin{aligned}
R_{EV}(\pi_{EV}^*, \pi_{CS}^*) &\geq R_{EV}(\pi_{EV}, \pi_{CS}^*), \quad \forall \pi_{EV} \\
R_{CS}(\pi_{EV}^*, \pi_{CS}^*) &\geq R_{CS}(\pi_{EV}^*, \pi_{CS}), \quad \forall \pi_{CS}
\end{aligned}
$$
where \( R_{EV} \) and \( R_{CS} \) are the reward functions for battery EV cars and charging stations, respectively. Through multi-agent policy gradient methods, strategies converge to this equilibrium.
Reward Function Design
The reward function combines individual and global objectives to guide learning. For battery EV car agents, the reward includes components for charging completion, cost, distance, queue time, grid losses, and load balance. For charging station agents, the reward focuses on grid losses, load variance, and voltage stability.
Specifically, the reward for battery EV car \( i \) is:
$$ R_{EV,i} = r_1 + r_2 + r_3 + r_4 $$
where:
$$ r_1 =
\begin{cases}
-\left( (0.8 - SOC_i^t) - \frac{a}{t_{c,i} + \epsilon} \right)^2, & \text{if } SOC_i^t < 0.8 \\
0, & \text{if } SOC_i^t \geq 0.8
\end{cases} $$
$$ r_2 = -\omega_c^t \cdot p_i^t \cdot \mathbb{I}(p_i^t > 0 \text{ and } SOC_i^t < 0.8) - \mu \cdot \mathbb{I}(p_i^t < 0 \text{ and } SOC_i^t < 0.8) $$
$$ r_3 = -\xi \cdot d_i – \delta \cdot q_i $$
$$ r_4 = -b \cdot \sum_{j \in L} (I_{ij}^t)^2 \cdot r_{ij} $$
The reward for charging station \( j \) is:
$$ R_{CS,j} = r_4 + r_5 + r_6 $$
where:
$$ r_5 = -\frac{1}{N} \sum_{i=1}^{N} (P_i – \bar{P})^2 $$
$$ r_6 = -\lambda \cdot \left( \max(0, V_j^t – V_{max}) + \max(0, V_{min} – V_j^t) \right) $$
Parameters such as \( a, \mu, \xi, \delta, b, \lambda \) are tuned to balance the objectives. For instance, after experimentation, we set \( a = 2 \), \( \mu = 5 \), \( \xi = 0.1 \), \( \delta = 0.2 \), \( b = 0.5 \), and \( \lambda = 2 \).
| Parameter | Description | Value |
|---|---|---|
| \( a \) | SOC weight adjustment coefficient | 2 |
| \( \mu \) | Discharge penalty coefficient | 5 |
| \( \xi \) | Distance penalty weight | 0.1 |
| \( \delta \) | Queue time penalty weight | 0.2 |
| \( b \) | Grid loss penalty weight | 0.5 |
| \( \lambda \) | Voltage deviation penalty coefficient | 2 |
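The sketch below assembles the battery EV car reward \( R_{EV,i} \) and the charging station reward \( R_{CS,j} \) from the terms above using the tabulated coefficients. The voltage limits (0.95/1.05 pu) and the interpretation of \( t_{c,i} \) as the remaining parking time are illustrative assumptions, not values stated in the text.

```python
import numpy as np

# Coefficients follow the table above; V_min/V_max and the meaning of
# t_remaining (time left before departure) are illustrative assumptions.
A, MU, XI, DELTA, B, LAM = 2.0, 5.0, 0.1, 0.2, 0.5, 2.0

def ev_reward(soc, p, dist, queue, price, line_losses, t_remaining, eps=1e-6):
    """R_EV,i = r1 + r2 + r3 + r4 for one battery EV car agent (sketch)."""
    # r1: penalize the gap to the 0.8 SOC target, scaled by the remaining time.
    r1 = 0.0 if soc >= 0.8 else -((0.8 - soc) - A / (t_remaining + eps)) ** 2
    # r2: electricity cost while charging below target, plus a penalty for
    # discharging while still below the target SOC.
    r2 = -price * p if (p > 0 and soc < 0.8) else 0.0
    r2 -= MU if (p < 0 and soc < 0.8) else 0.0
    # r3: distance and queuing penalties.
    r3 = -XI * dist - DELTA * queue
    # r4: shared grid-loss penalty, sum of I^2 * r over lines (precomputed).
    r4 = -B * line_losses
    return r1 + r2 + r3 + r4

def cs_reward(node_powers, v_node, line_losses, v_min=0.95, v_max=1.05):
    """R_CS,j = r4 + r5 + r6 for one charging station agent (sketch)."""
    r4 = -B * line_losses
    r5 = -float(np.var(node_powers))                      # load-variance term
    r6 = -LAM * (max(0.0, v_node - v_max) + max(0.0, v_min - v_node))
    return r4 + r5 + r6
```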
Multi-Agent Deep Reinforcement Learning with Dynamic Grouping and Value Transformation
To address the scalability and complexity of coordinating numerous battery EV car agents, we propose a Multi-agent Dynamic Grouping Value Transformation (MADGVT) algorithm. MADGVT leverages dynamic grouping to reduce decision coupling, a commander-follower structure for efficient communication, and a global Q-value fusion framework for handling inter-group relationships.
Dynamic Grouping and Value Transformation
In each time step, battery EV car agents are dynamically grouped based on their selected charging station. Specifically, all battery EV cars choosing the same charging station form a group \( g_k \), where \( k = 1, 2, \dots, K \). This grouping adapts to real-time changes in battery EV car behavior and charging station availability. Within each group, a commander agent is designated—typically the charging station due to its resource allocation capabilities—while other battery EV car agents act as followers. The commander aggregates local observations from followers to form a group summary:
$$ \xi_k^t = \frac{1}{|g_k|} \sum_{i \in g_k} h_i^t $$
where \( h_i^t \) is a latent representation of agent \( i \)'s observation (distinct from the node voltage \( V_j^t \)), generated by a message summarizer. This reduces communication overhead and information redundancy.
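A minimal sketch of the per-step grouping and summary computation, assuming each agent's latent observation \( h_i^t \) is already available as a tensor produced by the message summarizer:

```python
from collections import defaultdict
import torch

def group_and_summarize(station_choice, latent_obs):
    """Group EV agents by their selected charging station and mean-pool their
    latent observations into the group summaries xi_k (minimal sketch).

    station_choice  dict: ev_id -> chosen station_id at this step
    latent_obs      dict: ev_id -> torch.Tensor h_i^t from the message summarizer
    """
    groups = defaultdict(list)
    for ev_id, station_id in station_choice.items():
        groups[station_id].append(ev_id)
    summaries = {
        sid: torch.stack([latent_obs[i] for i in members]).mean(dim=0)
        for sid, members in groups.items()
    }
    return dict(groups), summaries
```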
Commander-Follower Structure
The commander uses the group summary \( \xi_k^t \) to make decisions for the group, such as charging power allocation. The policy update for the commander follows the policy gradient method with a Gaussian distribution for continuous actions:
$$ \nabla_{\phi_k} J(\phi_k) = \mathbb{E} \left[ \nabla_{\phi_k} \log \pi_{\phi_k}(a_{g_k,t} | \xi_k^t) \cdot A_{g_k}(\xi_k^t, a_{g_k,t}) \right] $$
where \( \pi_{\phi_k} \) is the commander’s policy, \( a_{g_k,t} \) is the group action, and \( A_{g_k} \) is the advantage function.
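A sketch of the corresponding loss in PyTorch, assuming the commander's policy network returns the mean and log standard deviation of a Gaussian over the group's continuous action; this is one possible implementation, not necessarily the exact network used in the paper.

```python
import torch
from torch.distributions import Normal

def commander_pg_loss(policy_net, xi_k, action, advantage):
    """Policy-gradient loss for a commander with a Gaussian policy (sketch).

    policy_net(xi_k) is assumed to return the mean and log-std of the group's
    continuous action distribution; `advantage` comes from the group Critic.
    """
    mean, log_std = policy_net(xi_k)
    dist = Normal(mean, log_std.exp())
    log_prob = dist.log_prob(action).sum(dim=-1)
    # Minimize the negative of E[log pi(a | xi) * A(xi, a)].
    return -(log_prob * advantage.detach()).mean()
```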
A centralized Critic network within each group evaluates the group’s decision by learning a group Q-value:
$$ Q_{g_k}^{\theta_i}(\xi_k^t, a_{g_k,t}) $$
The Critic parameters are updated by minimizing the temporal difference error:
$$ L_{TD,k} = \left( R_{g_k,t} + \gamma \max_{a'} Q_{g_k}^{\theta_i'}(\xi_k^{t+1}, a') - Q_{g_k}^{\theta_i}(\xi_k^t, a_{g_k,t}) \right)^2 $$
where \( R_{g_k,t} \) is the group reward, and \( \theta_i' \) are the target network parameters.
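A sketch of this Critic update; for continuous group actions the max over \( a' \) is approximated here by the target policy's action, which is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def group_td_loss(critic, target_critic, xi_t, a_t, r_t, xi_next, a_next, gamma=0.99):
    """Temporal-difference loss for one group's centralized Critic (sketch).

    a_next is the action proposed by the target policy at t+1, standing in for
    the max over a' when the group action is continuous.
    """
    with torch.no_grad():
        td_target = r_t + gamma * target_critic(xi_next, a_next)
    return F.mse_loss(critic(xi_t, a_t), td_target)
```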
Global Q-Value Fusion
To ensure global coordination among groups, we fuse group Q-values into a global Q-value function. Since groups may have competitive or cooperative relationships, we use a non-monotonic mixing network to combine group Q-values:
$$ Q_{tot}^{\omega} = T_{\omega}(Q_{g_1}^{\theta_1}, Q_{g_2}^{\theta_2}, \dots, Q_{g_K}^{\theta_K}) $$
where \( T_{\omega} \) is a nonlinear mixing network with parameters \( \omega \).
Simultaneously, a global Critic learns a true global Q-value \( \bar{Q}_{tot}^{\phi} \) based on the global state and joint actions:
$$ \bar{Q}_{tot}^{\phi}(S^t, a_{g_1,t}, \dots, a_{g_K,t}) $$
To align local and global value estimates, we impose consistency constraints through loss functions. For suboptimal joint actions, the fused Q-value \( Q_{tot}^{\omega} \) should not fall below the true global Q-value:
$$ L_{C1} = \max \left( \bar{Q}_{tot}^{\phi}(S^t, a_{g_1,t}, \dots, a_{g_K,t}) – Q_{tot}^{\omega}, 0 \right)^2 $$
For optimal actions, we enforce that both Q-values achieve the same maximum:
$$ L_{C2} = \left( \max_{a} \bar{Q}_{tot}^{\phi}(S^t, a) – \max_{a} Q_{tot}^{\omega} \right)^2 $$
The total loss for training is:
$$ L = L_{TD} + L_{C1} + L_{C2} $$
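The two consistency terms can be computed as follows, assuming the mixed value \( Q_{tot}^{\omega} \), the global Critic value \( \bar{Q}_{tot}^{\phi} \), and their greedy-action counterparts have already been evaluated elsewhere in the training loop:

```python
import torch

def consistency_losses(q_mix, q_true, q_mix_max, q_true_max):
    """Consistency terms between the mixed Q-value and the true global Q-value.

    q_mix       T_omega(Q_g1, ..., Q_gK) for the executed joint action
    q_true      global Critic estimate for the same state / joint action
    *_max       the two estimates evaluated at their greedy joint actions
    """
    # L_C1: the mixed value should not fall below the true value for the
    # executed (possibly suboptimal) joint action.
    l_c1 = torch.clamp(q_true - q_mix, min=0.0).pow(2).mean()
    # L_C2: the two value functions should agree at their maxima.
    l_c2 = (q_true_max - q_mix_max).pow(2).mean()
    return l_c1, l_c2

# Total training loss: L = L_TD + L_C1 + L_C2 (with L_TD summed over groups).
```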
This framework enables efficient learning while maintaining global optimality for battery EV car cluster scheduling.
Application of MADGVT to Battery EV Car Cluster Charging Scheduling
We implement MADGVT for real-time charging scheduling of battery EV car clusters. The process begins with system initialization, where battery EV cars arrive dynamically based on probability distributions. At each time step, groups are formed or updated based on charging station selections. Commanders compute a recommendation score for their charging station to guide new battery EV car arrivals:
$$ Score_j = \alpha \cdot p_{remain,j} – \beta \cdot T_{queue,j} $$
where \( p_{remain,j} \) is the remaining available power at charging station \( j \), and \( T_{queue,j} \) is the estimated queue time. New battery EV cars use these scores to select charging stations, ensuring balanced load distribution.
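A minimal sketch of station selection using this score; the weights \( \alpha \) and \( \beta \) are not specified above, so the values here are placeholders.

```python
def recommend_station(stations, alpha=1.0, beta=1.0):
    """Pick the station with the highest Score_j = alpha * p_remain - beta * T_queue.

    stations     dict: station_id -> (remaining_power_kw, estimated_queue_time)
    alpha, beta  weighting factors (placeholder values; not specified above)
    """
    scores = {
        sid: alpha * p_remain - beta * t_queue
        for sid, (p_remain, t_queue) in stations.items()
    }
    return max(scores, key=scores.get), scores

# Example: a newly arriving battery EV car chooses among three stations.
best, scores = recommend_station({8: (60.0, 10.0), 15: (30.0, 2.0), 21: (90.0, 25.0)})
```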
After grouping, commanders make charging decisions, and the environment updates battery EV car SOC and grid states. The centralized Critic and global fusion modules then update network parameters through backpropagation. This cycle repeats throughout the scheduling horizon, optimizing for grid stability and user satisfaction.
Experimental Validation and Analysis
We validate our approach using a simulation environment built with Python and PyTorch, integrated with the PandaPower library for power flow analysis. The distribution network is based on the IEEE 33-node system, extended to larger systems for scalability tests. Key configurations include six charging stations, three photovoltaic systems, and 240 battery EV cars for the baseline scenario.
Experimental Setup
The IEEE 33-node system is modified to include charging stations at nodes 8, 15, 21, 25, 28, and 31, each with 20 charging piles and a maximum power of 150 kW. Photovoltaic systems are placed at nodes 11, 18, and 33, with time-varying output profiles based on real-world generation data. Electricity prices fluctuate throughout the day to reflect market conditions. Battery EV cars are generated with arrival times, initial SOC, and parking durations following the probability distributions described earlier. The time resolution is set to 5 minutes over a one-day scheduling horizon.
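For reference, here is a sketch of how such an environment could be wired up with PandaPower; the bus-index mapping (PandaPower's case33bw uses 0-based bus numbering) and the example power values are assumptions.

```python
import pandapower as pp
import pandapower.networks as pn

# Build the 33-bus feeder and attach charging-station loads and PV generators.
# case33bw uses 0-based bus indices, so the paper's node n maps to bus n - 1 here.
net = pn.case33bw()

cs_nodes = [8, 15, 21, 25, 28, 31]   # charging stations, 150 kW each
pv_nodes = [11, 18, 33]              # photovoltaic systems

for node in cs_nodes:
    pp.create_load(net, bus=node - 1, p_mw=0.0, name=f"CS_{node}")
for node in pv_nodes:
    pp.create_sgen(net, bus=node - 1, p_mw=0.0, name=f"PV_{node}")

# At each 5-minute step, write the scheduled powers and solve the power flow.
net.load.loc[net.load.name == "CS_8", "p_mw"] = 0.12    # e.g. 120 kW of charging
net.sgen.loc[net.sgen.name == "PV_11", "p_mw"] = 0.05   # e.g. 50 kW of PV output
pp.runpp(net)
voltages = net.res_bus.vm_pu    # per-unit node voltages used in states and rewards
```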
We compare MADGVT against several baseline algorithms: QMIX, MADDPG, NashQ, and a maximum power charging strategy. All algorithms are trained for 100,000 episodes with a learning rate of \( 10^{-4} \), discount factor \( \gamma = 0.99 \), and batch size of 256. The experiments are conducted on a system with an Intel Core i7-12700H CPU and an NVIDIA GeForce RTX 4060 GPU.
Training Process Comparison
The cumulative rewards during training are plotted for each algorithm. MADGVT converges faster and achieves higher rewards compared to QMIX, MADDPG, and NashQ. This indicates that MADGVT effectively learns cooperative strategies for battery EV car charging while managing grid constraints. For instance, after 10,000 episodes, MADGVT stabilizes at a reward around 800, while other methods plateau at lower levels. This demonstrates the advantage of dynamic grouping and global value fusion in complex multi-agent environments.
Grid Performance Analysis
We analyze the impact of different scheduling strategies on grid operations. The daily load curve under MADGVT shows smoother fluctuations compared to the maximum power strategy and MADDPG. Peak loads are reduced by approximately 15% with MADGVT, contributing to grid stability. Voltage profiles at charging station nodes also improve; under MADGVT, voltage deviations stay within 0.998–1.002 per unit, whereas the maximum power strategy causes wider swings. This is critical for preventing equipment damage and ensuring reliable power supply to battery EV car users.
Table 2 summarizes the voltage statistics for key nodes under different strategies. MADGVT maintains voltages closest to the nominal value, with minimal variance.
| Algorithm | Average Voltage (pu) | Voltage Variance | Minimum Voltage (pu) |
|---|---|---|---|
| Maximum Power | 1.001 | \( 2.5 \times 10^{-5} \) | 0.998 |
| MADDPG | 1.000 | \( 1.8 \times 10^{-5} \) | 0.999 |
| MADGVT | 1.000 | \( 1.2 \times 10^{-5} \) | 0.999 |
Furthermore, the active power distribution across charging stations is more balanced with MADGVT. For example, during evening peaks, the power difference between the highest and lowest loaded stations is reduced by 30% compared to the maximum power strategy. This load balancing alleviates congestion and enhances overall grid efficiency.
Battery EV Car User Experience
From the perspective of battery EV car users, we evaluate charging completion and costs. Under MADGVT, the average departure SOC is 94.1%, slightly lower than under the maximum power strategy but sufficient for user needs. The average charging cost is 0.857 yuan/kWh, compared with 0.882 yuan/kWh for the maximum power strategy. This cost reduction is achieved by shifting charging to low-price periods and utilizing photovoltaic generation, benefiting both users and the grid.
We also monitor the SOC evolution of 20 sample battery EV cars. MADGVT exhibits diverse charging rates, with some vehicles discharging during high-price periods to earn rewards, but without dropping below the target SOC. This flexibility supports grid services while meeting user requirements for battery EV cars.
| Algorithm | Average Departure SOC (%) | Charging Cost (yuan/kWh) | Cost Savings (%) |
|---|---|---|---|
| Maximum Power | 95.9 | 0.882 | — |
| MADDPG | 94.8 | 0.868 | 1.6 |
| MADGVT | 94.1 | 0.857 | 2.8 |
Robustness and Scalability Analysis
To test robustness, we consider two challenging scenarios: (1) significant reduction and fluctuation in photovoltaic output, and (2) random arrival patterns for battery EV cars instead of peak-based distributions. In both cases, MADGVT maintains voltage within safe limits, with all nodes staying above 0.998 pu. This resilience is attributed to the dynamic grouping mechanism, which adapts to changing conditions by reallocating battery EV car loads.
For scalability, we extend the system to IEEE 141-node and 300-node networks, with up to 960 battery EV cars. The training curves show that MADGVT continues to outperform other algorithms in larger environments, achieving higher rewards and stable convergence. Voltage distributions in the 141-node system under MADGVT have a narrower range compared to other methods, as shown in Table 4.
| Algorithm | Voltage Min (pu) | Voltage Max (pu) | Voltage Range (pu) |
|---|---|---|---|
| Maximum Power | 0.992 | 1.008 | 0.016 |
| QMIX | 0.994 | 1.005 | 0.011 |
| NashQ | 0.995 | 1.004 | 0.009 |
| MADDPG | 0.996 | 1.003 | 0.007 |
| MADGVT | 0.997 | 1.002 | 0.005 |
These results confirm that MADGVT scales effectively, making it suitable for real-world deployment in urban networks with high penetration of battery EV cars.
Conclusion
In this paper, we presented a multi-agent deep reinforcement learning approach for charging scheduling of battery EV car clusters in distribution networks with photovoltaic integration. Our method, MADGVT, incorporates dynamic grouping to reduce decision complexity, a commander-follower structure for efficient communication, and a global Q-value fusion framework to handle inter-agent relationships. Experimental results demonstrate that MADGVT significantly improves grid stability by reducing load fluctuations and voltage deviations, while also lowering charging costs for battery EV car users. The algorithm shows robustness under uncertain conditions and scalability to large networks.
Future work will focus on extending the framework to real-time scheduling with dynamic pricing and integrating additional factors such as battery aging and mobile charging for battery EV cars. We also plan to explore federated learning techniques to enhance privacy and reduce communication overhead in multi-station environments. Overall, this research contributes to the development of intelligent charging solutions that support the sustainable growth of battery EV car adoption and grid modernization.
