With the rapid growth of electric vehicle (EV) adoption in China, optimizing charging and discharging strategies has become critical for mitigating grid stress and supporting carbon neutrality goals. Randomness in user travel behavior and electricity prices makes real-time decision-making difficult, and traditional methods such as robust optimization and stochastic programming often lack adaptability in dynamic environments. In this paper, we propose an improved soft actor-critic (SAC) algorithm, termed TCSAC, which integrates a triplet-critics network and a comprehensive experience replay mechanism to improve EV charging and discharging strategies. This approach addresses the estimation bias and low sampling efficiency of deep reinforcement learning while accurately modeling state of charge (SOC) dynamics and user anxiety. Our contributions include a nonlinear efficiency model, an anxiety-based user behavior framework, and a real-time optimization method that reduces costs and ensures reliability. Simulations demonstrate the effectiveness of TCSAC in handling uncertainties for EV applications in China.
The proliferation of electric vehicles in China has led to a surge in demand for intelligent charging solutions. As of 2024, China’s pure electric vehicle fleet exceeded 22 million units, highlighting the urgency of efficient grid integration. However, uncertainties in user behavior, such as varying arrival and departure times and initial SOC, complicate real-time scheduling. Existing approaches, including day-ahead optimization, often fail to adapt to real-time fluctuations. Deep reinforcement learning (DRL) offers a promising alternative by learning optimal policies through interaction with the environment. In particular, the soft actor-critic algorithm balances exploration and exploitation but suffers from Q-value estimation bias. Our improved SAC algorithm, TCSAC, addresses these limitations by incorporating a triplet-critics network and a comprehensive experience replay mechanism that prioritizes samples, enabling more accurate and efficient decision-making for electric vehicle charging and discharging.

To model the electric vehicle charging and discharging process accurately, we consider nonlinear efficiency and SOC boundaries. The SOC update accounts for efficiency variations with power, as efficiency is not constant but depends on the charging or discharging rate. Based on empirical data, we fit the efficiency to a cubic polynomial function. For charging, the efficiency $\eta_{\text{ch}}(P)$ is given by:
$$\eta_{\text{ch}}(P) = 0.37 - 0.05P + 0.0027P^2 - 0.0001P^3$$
For discharging, the efficiency $\eta_{\text{dis}}(P)$ is:
$$\eta_{\text{dis}}(P) = 0.6 - 0.1116P - 0.0177P^2 - 0.001P^3$$
These equations capture the nonlinear relationship between power and efficiency, which is crucial for precise SOC updates. The SOC at time $t+1$ is computed as:
$$S_{t+1} = S_t + \frac{P_t \Delta t}{E} \cdot \eta_{\text{cond}}$$
where $\eta_{\text{cond}}$ is $\eta_{\text{ch}}$ for charging and $1/\eta_{\text{dis}}$ for discharging, $E$ is the battery capacity, and $\Delta t$ is the time interval. The SOC must remain within dynamic boundaries $S_t^L$ and $S_t^H$, which are derived from maximum charging and discharging capabilities to ensure feasibility. For instance, the upper bound $S_t^H$ considers the maximum charge rate, while the lower bound $S_t^L$ accounts for discharge limits, adjusted for user demand.
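The SOC update above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the sign convention (positive power charges, negative power discharges), the function name `soc_update`, and the clipping to the bounds are assumptions, and the efficiency curves are passed in as callables so that any fitted model can be plugged in.

```python
from typing import Callable


def soc_update(
    soc: float,
    power_kw: float,
    dt_h: float,
    capacity_kwh: float,
    eta_ch: Callable[[float], float],
    eta_dis: Callable[[float], float],
    soc_low: float = 0.0,
    soc_high: float = 1.0,
) -> float:
    """One-step SOC update S_{t+1} = S_t + P_t * dt / E * eta_cond.

    Assumed convention: power_kw > 0 charges, power_kw < 0 discharges.
    """
    if power_kw >= 0:
        eta_cond = eta_ch(power_kw)               # charging: multiply by eta_ch
    else:
        eta_cond = 1.0 / eta_dis(abs(power_kw))   # discharging: divide by eta_dis
    soc_next = soc + power_kw * dt_h / capacity_kwh * eta_cond
    # Assumption: keep the result inside [S^L, S^H] by simple clipping.
    return min(max(soc_next, soc_low), soc_high)


if __name__ == "__main__":
    flat = lambda p: 0.8  # placeholder efficiency, not a fitted curve
    print(soc_update(0.5, 5.0, 1.0, 50.0, flat, flat))   # charging step
    print(soc_update(0.5, -5.0, 1.0, 50.0, flat, flat))  # discharging step
```

Because discharging divides by the efficiency, delivering a given power to the grid drains more SOC than charging at the same power adds.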
User behavior is modeled to include anxiety effects, which influence charging decisions. Anxiety arises from range and time concerns, leading to irrational behavior if not addressed. We quantify anxiety using an exponential function that defines the desired SOC $S_t^a$ during anxiety periods:
$$S_t^a = s_0 + (1 - s_0) \cdot \left(1 - e^{-k_2 \cdot (t - t_a) / (t_d - t_a)}\right)$$
where $t_a$ is the anxiety start time, $t_d$ is the departure time, $s_0$ is the SOC at $t_a$, and $k_2$ is a parameter indicating anxiety level. Users are categorized into mild ($k_2 \in [-12, -1.5)$), moderate ($k_2 \in [-1.5, 1.5]$), and severe anxiety ($k_2 \in (1.5, 12]$) types. The anxiety cost $C_t^a$ is penalized if the actual SOC falls below $S_t^a$. Additionally, user travel uncertainty is represented using a Markov chain, with transition probabilities between home, work, and other locations based on real data.
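As a rough illustration, the desired-SOC curve and the three user categories can be written as follows. The function names `desired_soc` and `anxiety_type` are hypothetical. The sketch evaluates the expression for $S_t^a$ exactly as written, with no clipping or normalization, and assumes $t_a \le t \le t_d$ with $t_d > t_a$.

```python
import math


def desired_soc(t: float, t_a: float, t_d: float, s0: float, k2: float) -> float:
    """Desired SOC S_t^a during the anxiety period [t_a, t_d]."""
    if t_d <= t_a:
        raise ValueError("departure time must be later than anxiety start time")
    if not t_a <= t <= t_d:
        raise ValueError("t must lie inside the anxiety period")
    progress = (t - t_a) / (t_d - t_a)  # 0 at t_a, 1 at t_d
    return s0 + (1.0 - s0) * (1.0 - math.exp(-k2 * progress))


def anxiety_type(k2: float) -> str:
    """Map k2 to the user category using the intervals given in the text."""
    if not -12.0 <= k2 <= 12.0:
        raise ValueError("k2 is expected in [-12, 12]")
    if k2 < -1.5:
        return "mild"
    if k2 <= 1.5:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    for hour in (18, 19, 20, 21, 22):
        print(hour, round(desired_soc(hour, 18, 22, 0.4, 3.0), 3))
    print(anxiety_type(3.0))
```

At $t = t_a$ the exponent is zero, so the function returns $s_0$, matching the definition of $s_0$ as the SOC at the start of the anxiety period.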
The optimization problem is formulated as a Markov decision process (MDP) with state space, action space, transition probabilities, and reward function. The state $s_t$ includes historical electricity prices, anxiety time, departure time, current SOC, user demand SOC, and SOC boundaries:
$$s_t = [\lambda_{t-n+1}, \ldots, \lambda_t, t_a, t_d, S_t, S_t^a, S_d, S_{t+1}^H, S_{t+1}^L]$$
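A minimal sketch of how such a state vector might be assembled is shown below. The function name `build_state` and its argument names are illustrative assumptions. Only the ordering of the entries follows the definition of $s_t$.

```python
from typing import List, Sequence


def build_state(
    price_history: Sequence[float],
    n: int,
    t_a: float,
    t_d: float,
    soc: float,
    soc_anx: float,
    soc_dem: float,
    soc_high_next: float,
    soc_low_next: float,
) -> List[float]:
    """Concatenate the last n prices with the timing and SOC features of s_t."""
    if len(price_history) < n:
        raise ValueError("need at least n historical prices")
    recent_prices = list(price_history[-n:])  # lambda_{t-n+1}, ..., lambda_t
    return recent_prices + [t_a, t_d, soc, soc_anx, soc_dem, soc_high_next, soc_low_next]


if __name__ == "__main__":
    prices = [0.10, 0.12, 0.30, 0.25]
    print(build_state(prices, 3, 18, 22, 0.5, 0.6, 0.8, 0.7, 0.3))
```

The resulting vector has $n + 7$ entries, which fixes the input width of the actor and critic networks.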
The action $a_t$ is the charging/discharging power, bounded by maximum limits $a_{\text{ch}}^{\text{max}}$ and $a_{\text{dis}}^{\text{max}}$. The reward function combines charging cost, action penalty, and anxiety penalties:
$$r_t = c_p \cdot r_p + c_a \cdot r_a + c_x \cdot r_x + c_d \cdot r_d$$
where $r_p = -\lambda_t a_t^{\text{real}} \Delta t$ is the cost term, $r_a = -(a_t^{\text{real}} - a_t)^2$ penalizes action violations, $r_x = -\max(S_t^a - S_t, 0)$ applies during anxiety periods, and $r_d = -\max(S_d - S_{t_d}, 0)$ penalizes unmet demand at departure. The coefficients $c_p$, $c_a$, $c_x$, and $c_d$ are weights tuned for different user types.
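The four reward terms can be combined as in the sketch below. The names `RewardWeights` and `step_reward`, the unit default weights, and the boolean flags `in_anxiety` and `at_departure` are assumptions made for illustration. The source does not give the tuned weight values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RewardWeights:
    c_p: float = 1.0  # placeholder weights, not the tuned values
    c_a: float = 1.0
    c_x: float = 1.0
    c_d: float = 1.0


def step_reward(
    price: float,
    a_req: float,
    a_real: float,
    dt_h: float,
    soc: float,
    soc_anx: float,
    soc_dem: float,
    in_anxiety: bool,
    at_departure: bool,
    w: Optional[RewardWeights] = None,
) -> float:
    """r_t = c_p*r_p + c_a*r_a + c_x*r_x + c_d*r_d."""
    w = w or RewardWeights()
    r_p = -price * a_real * dt_h                                 # energy cost
    r_a = -(a_real - a_req) ** 2                                 # infeasible-action penalty
    r_x = -max(soc_anx - soc, 0.0) if in_anxiety else 0.0        # anxiety shortfall
    r_d = -max(soc_dem - soc, 0.0) if at_departure else 0.0      # unmet demand at t_d
    return w.c_p * r_p + w.c_a * r_a + w.c_x * r_x + w.c_d * r_d


if __name__ == "__main__":
    print(step_reward(0.2, 6.0, 5.0, 1.0, 0.5, 0.6, 0.8, True, False))
```

With a negative `a_real` (discharging), `r_p` becomes positive, which is how selling energy at high prices shows up as a reward.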
Our improved SAC algorithm, TCSAC, enhances the standard SAC by addressing Q-value estimation bias and improving sample efficiency. The triplet-critics network uses three Q-networks to compute the target value:
$$y = r + \gamma \left( \chi Q_1' + (1 - \chi) \min_{i=1,2} Q_i' - \alpha \log \pi(a' | s') \right)$$
where $Q_1'$ is the average of two target Q-values, $Q_i'$ for $i=1,2$ are from twin critics, and $\chi$ is a weighting parameter (set to 0.05 for optimal performance). This reduces both overestimation and underestimation biases. Additionally, the comprehensive experience replay mechanism prioritizes samples based on temporal difference errors:
$$\delta = \frac{1}{3} \sum_{i=1}^3 |Q_{\text{target}} - Q_{\theta_i}| + \epsilon$$
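The target value and the sample priority can be illustrated with scalar inputs as follows. This is a sketch under stated assumptions: the function names are hypothetical, the averaged term is passed in as `q_avg_next` rather than computed (how it is formed is left to the caller), the defaults for $\gamma$ and $\alpha$ are taken from the hyperparameter table, and the default value of $\epsilon$ is a placeholder.

```python
from typing import Sequence


def tcsac_target(
    reward: float,
    q_avg_next: float,
    q1_next: float,
    q2_next: float,
    log_prob_next: float,
    gamma: float = 0.99,
    chi: float = 0.05,
    alpha: float = 0.2,
) -> float:
    """y = r + gamma * (chi*Q_avg' + (1-chi)*min(Q_1', Q_2') - alpha*log pi(a'|s'))."""
    blended = chi * q_avg_next + (1.0 - chi) * min(q1_next, q2_next)
    return reward + gamma * (blended - alpha * log_prob_next)


def td_priority(q_target: float, q_values: Sequence[float], eps: float = 1e-6) -> float:
    """Mean absolute TD error over the critics, plus a small constant eps."""
    return sum(abs(q_target - q) for q in q_values) / len(q_values) + eps


if __name__ == "__main__":
    y = tcsac_target(1.0, 10.0, 8.0, 12.0, -1.0)
    print(y, td_priority(y, [8.5, 9.0, 9.5]))
```

With $\chi = 0$ the target reduces to the clipped double-Q target of standard SAC. A small positive $\chi$ pulls it slightly toward the averaged estimate.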
Samples are drawn from the most recent $c_k$ experiences, where $c_k = \min(N, \lfloor \delta \cdot k \rfloor)$ for $k$ updates, improving learning efficiency. The policy network uses a Gaussian distribution with reparameterization for continuous action sampling:
$$a_t = \mu(s_t) + \sigma(s_t) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$
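The reparameterized draw can be sketched without a deep learning framework. In the snippet below, `mu` and `sigma` stand in for the policy network outputs. Clipping the sample to the power limits is an assumption made for simplicity (SAC implementations commonly use a tanh squashing instead), and the function name `sample_action` is hypothetical.

```python
import random


def sample_action(mu: float, sigma: float, a_ch_max: float, a_dis_max: float, rng=random) -> float:
    """Draw a_t = mu + sigma * eps with eps ~ N(0, 1), then clip to the power limits.

    Assumed convention: positive values charge, negative values discharge.
    """
    eps = rng.gauss(0.0, 1.0)   # noise is independent of the policy parameters
    raw = mu + sigma * eps      # deterministic function of (mu, sigma) given eps
    return max(-a_dis_max, min(raw, a_ch_max))


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_action(2.0, 1.5, 7.0, 7.0, rng), 2) for _ in range(5)])
```

Because the randomness enters only through `eps`, the action is a differentiable function of `mu` and `sigma`, which is what lets gradients flow through the sampling step in a framework with automatic differentiation.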
The loss functions for the critics and policy are optimized using Adam, with automatic entropy adjustment to maintain exploration.
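One possible reading of the replay mechanism is priority-proportional sampling restricted to the most recent $c_k$ experiences. The sketch below follows that reading and is an assumption, not the authors' procedure. The multiplier in the expression for $c_k$ is passed in as `scale` because the source does not state its value, the lower bound of one sample is added to avoid an empty window, and both function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def recent_window(buffer_len: int, scale: float, k: int) -> int:
    """c_k = min(N, floor(scale * k)), with a floor of 1 (assumption)."""
    return max(1, min(buffer_len, math.floor(scale * k)))


def sample_recent(
    priorities: Sequence[float], scale: float, k: int, batch_size: int, rng=random
) -> List[int]:
    """Sample buffer indices from the newest c_k entries, weighted by priority."""
    n = len(priorities)
    c_k = recent_window(n, scale, k)
    start = n - c_k                         # oldest index still inside the window
    weights = list(priorities[start:])      # priorities are positive because of eps
    # Sampling is with replacement, so an index may appear more than once.
    return rng.choices(range(start, n), weights=weights, k=batch_size)


if __name__ == "__main__":
    prios = [0.1] * 90 + [1.0] * 10
    print(sample_recent(prios, scale=2.0, k=10, batch_size=8, rng=random.Random(0)))
```

As $k$ grows the window widens until it covers the whole buffer, so early updates concentrate on fresh transitions while later ones draw from the full history.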
We evaluate TCSAC using real-time electricity price data from the California ISO for 2022-2023, split into training and testing sets. The electric vehicle has a 50 kWh battery, with hourly consumption of 20% SOC during driving. User anxiety parameters $k_1$ and $k_2$ are sampled from truncated normal and uniform distributions, respectively. The table below summarizes the efficiency values at different power levels, based on empirical measurements:

| Power (kW) | Charging Efficiency | Discharging Efficiency |
|---|---|---|
| 0-1 | 0.15 | 0.65 |
| 1-2 | 0.43 | 0.74 |
| 2-3 | 0.64 | 0.79 |
| 3-4 | 0.74 | 0.82 |
| 4-5 | 0.78 | 0.83 |
| 5-6 | 0.81 | 0.85 |
| 6-7 | 0.83 | 0.86 |
| 7-8 | 0.85 | 0.88 |
| 8-9 | 0.85 | 0.89 |
| 9-10 | 0.88 | 0.91 |

Hyperparameters for TCSAC are listed in the following table:

| Hyperparameter | Value |
|---|---|
| Actor/Critic Hidden Layers | 3 |
| Hidden Layer Neurons | (128, 64, 32) |
| Activation Function | ReLU |
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Entropy Coefficient $\alpha$ | 0.2 |
| Discount Factor $\gamma$ | 0.99 |
| Target Update $\tau$ | 0.005 |
| Batch Size | 128 |
| Replay Buffer Size | 1e6 |

Simulation results show that TCSAC outperforms SAC and TD3 in cumulative reward and convergence speed. Over 240 test days, TCSAC achieved a total cost of \$48.15, compared to \$54.83 for SAC and \$56.39 for TD3, approaching the theoretical optimum of \$43.64 obtained from perfect-information optimization. The algorithm adapts to different user anxiety types: for mild anxiety ($k_2 = -12$), it delays charging to maximize discharge profits during high prices; for severe anxiety ($k_2 = 12$), it charges early to maintain a high SOC. Continuous five-day tests demonstrate reliable power and SOC management across home, work, and other locations, with real-time decisions made in 2 ms per step. This highlights TCSAC’s effectiveness in reducing costs and handling uncertainties for EV applications in China.
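To put the reported totals in relative terms, the short script below computes each method's gap to the perfect-information optimum and the saving of TCSAC over each baseline. It uses only the four cost figures quoted above. The helper names are illustrative.

```python
costs = {"TCSAC": 48.15, "SAC": 54.83, "TD3": 56.39}  # total cost over 240 test days
optimum = 43.64                                        # perfect-information benchmark


def gap_to_optimum(cost: float, best: float) -> float:
    """Relative excess cost over the benchmark."""
    return (cost - best) / best


def saving_vs(baseline: float, cost: float) -> float:
    """Relative cost reduction compared with a baseline method."""
    return (baseline - cost) / baseline


if __name__ == "__main__":
    for name, cost in costs.items():
        print(f"{name}: {100 * gap_to_optimum(cost, optimum):.1f}% above optimum")
    for name in ("SAC", "TD3"):
        print(f"TCSAC vs {name}: {100 * saving_vs(costs[name], costs['TCSAC']):.1f}% lower cost")
```

On these figures TCSAC is about 10.3% above the optimum, against roughly 25.6% for SAC and 29.2% for TD3, which corresponds to cost reductions of about 12.2% and 14.6% relative to the two baselines.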
In conclusion, our improved soft actor-critic algorithm provides a robust real-time charging and discharging strategy for electric vehicles. By modeling nonlinear efficiency, dynamic SOC boundaries, and user anxiety, we achieve accurate SOC updates and cost reductions. The triplet-critics network and comprehensive experience replay mitigate estimation biases and enhance sample efficiency. Future work could extend this approach to large-scale electric vehicle fleets and integrate it with renewable energy sources for sustainable grid support. The proposed method is particularly relevant for China's growing EV market, where it supports efficient and reliable operation amid behavioral and price uncertainties.
