Optimizing Pricing for Electric Vehicle Charging Stations with Carbon Emissions in Mind

The rapid growth of the global electric vehicle (EV) fleet presents a dual challenge: ensuring the economic and secure operation of power grids and charging infrastructure while maximizing the environmental benefits of electrification. Uncoordinated charging, particularly during evening peaks, can exacerbate grid stress and lead to “peak-on-peak” load problems. This necessitates intelligent strategies to guide the charging behavior of EV users. In the context of carbon neutrality goals, this article proposes a coordinated pricing strategy for EV charging stations that explicitly considers carbon emission reduction. The core innovation lies in formulating the optimization as a Markov Decision Process (MDP) and solving it with an enhanced Deep Reinforcement Learning (DRL) algorithm, specifically a modified Proximal Policy Optimization (PPO) method, to achieve real-time, robust decision-making.

The foundation of any pricing-based guidance strategy is understanding how EV users react to price signals. User behavior is not linearly sensitive to price changes: a small price differential may not trigger any response, representing a psychological “dead zone,” while beyond a certain threshold the willingness to change charging behavior grows with the price incentive until it saturates. This characteristic can be modeled with a piecewise function containing deadband and saturation zones. Let $\mu$ denote the user’s response willingness, $\Delta\lambda$ the price difference between the new time-of-use tariff and the original tariff, $\Delta\lambda_1$ and $\Delta\lambda_2$ the deadband and saturation thresholds, respectively, and $\mu_{\text{max}}$ the maximum response willingness.

$$
\mu(\Delta\lambda) =
\begin{cases}
0, & \Delta\lambda \leq \Delta\lambda_1 \\
\mu_{\text{max}} \frac{\Delta\lambda - \Delta\lambda_1}{\Delta\lambda_2 - \Delta\lambda_1}, & \Delta\lambda_1 < \Delta\lambda < \Delta\lambda_2 \\
\mu_{\text{max}}, & \Delta\lambda \geq \Delta\lambda_2
\end{cases}
$$

This model effectively captures the non-linear, bounded nature of consumer price elasticity for EV charging services.
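As a concrete illustration, the following Python sketch implements this deadband-saturation response curve. The default parameter values are the ones used in the case study later in the article; treat them as placeholders, not a fixed part of the model.

```python
def response_willingness(delta_lambda, d1=0.1, d2=1.5, mu_max=0.30):
    """Piecewise user response willingness with deadband and saturation zones.

    delta_lambda : price difference between the new tariff and the original one (CNY/kWh)
    d1, d2       : deadband and saturation thresholds (case-study values)
    mu_max       : maximum response willingness
    """
    if delta_lambda <= d1:        # deadband: differential too small to trigger a response
        return 0.0
    if delta_lambda >= d2:        # saturation: willingness cannot grow further
        return mu_max
    # linear region between the deadband and saturation thresholds
    return mu_max * (delta_lambda - d1) / (d2 - d1)


print(response_willingness(0.05))  # 0.0  (inside the deadband)
print(response_willingness(0.8))   # 0.15 (proportional region)
print(response_willingness(2.0))   # 0.3  (saturated)
```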

The Multi-Objective Optimization Model

To holistically address grid stability, economic viability, and environmental impact, we formulate a multi-objective optimization model for charging station pricing. The model coordinates the actions of multiple charging stations within a distribution network to achieve three primary goals (a code sketch of the objective functions follows the list):

1. Minimize Carbon Emissions from EV Charging: The carbon footprint of charging an EV depends on the grid’s carbon intensity at the time of charging. By shifting charging loads to periods with lower carbon intensity (e.g., high renewable generation), overall emissions can be reduced.
$$
\min f_1 = \sum_{t=1}^{T} \gamma_t^{\text{CO}_2} \cdot P_t^{\text{EV}} \cdot \Delta t
$$
where $T$ is the total number of time intervals, $\gamma_t^{\text{CO}_2}$ is the grid’s carbon emission factor (kg/kWh) at time $t$, $P_t^{\text{EV}}$ is the aggregate EV charging load, and $\Delta t$ is the time-step length.

2. Minimize the Peak-Valley Difference of the Distribution Network Load: A key objective is to “flatten” the load curve, alleviating grid congestion and improving stability.
$$
\min f_2 = P_{\text{max}}^{\text{DN}} - P_{\text{min}}^{\text{DN}}
$$
where $P_{\text{max}}^{\text{DN}}$ and $P_{\text{min}}^{\text{DN}}$ are the peak and valley loads of the distribution network, respectively.

3. Maximize Charging Station Revenue: The pricing strategy must ensure the economic sustainability of charging stations. Revenue is the difference between the income from selling electricity to EV users and the cost of purchasing electricity from the grid.
$$
\max f_3 = \sum_{t=1}^{T} \sum_{i=1}^{N} (\lambda_{i,t}^{\text{CS}} - \lambda_t^{\text{TOU}}) \cdot P_{i,t}^{\text{CS}} \cdot \Delta t
$$
where $N$ is the number of charging stations, $\lambda_{i,t}^{\text{CS}}$ is the charging price at station $i$ at time $t$, $\lambda_t^{\text{TOU}}$ is the industrial time-of-use electricity purchase price, and $P_{i,t}^{\text{CS}}$ is the charging load at station $i$ at time $t$.
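The sketch below evaluates the three objectives for a candidate pricing schedule. It is an illustrative NumPy implementation under assumed array shapes ($T$ time steps, $N$ stations), not the exact evaluation code of the model.

```python
import numpy as np

def objectives(p_ev, p_dn, p_cs, gamma_co2, price_cs, price_tou, dt=1.0):
    """Evaluate the three pricing objectives over one scheduling horizon.

    p_ev      : (T,)   aggregate EV charging load [kW]
    p_dn      : (T,)   total distribution network load [kW]
    p_cs      : (T, N) charging load at each station [kW]
    gamma_co2 : (T,)   grid carbon emission factor [kg/kWh]
    price_cs  : (T, N) charging price at each station [CNY/kWh]
    price_tou : (T,)   industrial time-of-use purchase price [CNY/kWh]
    dt        : time-step length [h]
    """
    f1 = np.sum(gamma_co2 * p_ev * dt)                          # carbon emissions, to minimize [kg]
    f2 = p_dn.max() - p_dn.min()                                # peak-valley difference, to minimize [kW]
    f3 = np.sum((price_cs - price_tou[:, None]) * p_cs * dt)    # station revenue, to maximize [CNY]
    return f1, f2, f3
```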

The model is subject to several constraints, including distribution network power flow equations, line capacity and voltage limits, bounds on charging station pricing, and a minimum revenue guarantee to ensure the strategy’s practical feasibility. For example, the charging price is constrained relative to the industrial tariff: $\lambda_t^{\text{TOU}} < \lambda_{i,t}^{\text{CS}} \leq \chi \lambda_t^{\text{TOU}}$, where $\chi$ is a price cap coefficient.
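One simple way to honor the pricing bound inside a learning agent is to project candidate prices back into the admissible band before they are applied. The helper below is hypothetical; the cap coefficient `chi` and the small offset `eps` are assumed illustrative values, not parameters from the model.

```python
import numpy as np

def enforce_price_bounds(prices, tou_price, chi=2.0, eps=1e-3):
    """Project candidate station prices into the admissible band (tou_price, chi * tou_price].

    prices    : array of candidate charging prices [CNY/kWh]
    tou_price : industrial TOU purchase price for the current period [CNY/kWh]
    chi       : price cap coefficient (assumed value)
    eps       : small offset keeping prices strictly above the purchase price
    """
    return np.clip(prices, tou_price + eps, chi * tou_price)
```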

Formulating the Problem as a Markov Decision Process

To enable real-time application and handle the uncertainties inherent in EV user behavior and grid conditions, the optimization problem is transformed into a Markov Decision Process (MDP). This framework is well suited to sequential decision-making under uncertainty and is the foundation for DRL algorithms.

1. State ($s_t$): The state represents the agent’s perception of the environment at time $t$. It includes information from both the distribution network and the charging stations:
$$
s_t = \{ P_t^{\text{DN}}, P_{i,t}^{\text{CS}}, \gamma_t^{\text{CO}_2}, \lambda_{i,t-1}^{\text{CS}} \}
$$
This includes the total distribution network load ($P_t^{\text{DN}}$), the load at each charging station ($P_{i,t}^{\text{CS}}$), the current grid carbon intensity ($\gamma_t^{\text{CO}_2}$), and each station’s price in the previous period ($\lambda_{i,t-1}^{\text{CS}}$).

2. Action ($a_t$): The action is the decision made by the agent, which in this case is setting the real-time charging prices for all stations in the next period.
$$
a_t = \{ \lambda_{i,t}^{\text{CS}} \}_{i=1}^{N}
$$

3. Reward ($r_t$): The reward is the immediate feedback returned by the environment after an action is executed. It is a composite signal designed to steer the agent toward the three objectives:
$$
r_t = r_t^{\text{CO}_2} + r_t^{\text{DN}} + r_t^{\text{CS}}
$$
where:
$$
\begin{aligned}
r_t^{\text{CO}_2} &= -\lambda^{\text{CO}_2} \cdot \gamma_t^{\text{CO}_2} \cdot P_t^{\text{EV}} \cdot \Delta t \quad \text{(Carbon cost penalty)} \\
r_t^{\text{DN}} &= \begin{cases} -\lambda^{\text{PD}} (P_{\text{max}}^{\text{DN}} - P_{\text{min}}^{\text{DN}}), & \text{if } t \text{ is the final timestep} \\ 0, & \text{otherwise} \end{cases} \quad \text{(Peak-valley difference penalty)} \\
r_t^{\text{CS}} &= \sum_{i=1}^{N} (\lambda_{i,t}^{\text{CS}} - \lambda_t^{\text{TOU}}) \cdot P_{i,t}^{\text{CS}} \cdot \Delta t \quad \text{(Charging station revenue reward)}
\end{aligned}
$$
Here, $\lambda^{\text{CO}_2}$ and $\lambda^{\text{PD}}$ are weighting coefficients that balance the importance of emission reduction versus load flattening; a code sketch of this composite reward follows the list.

4. State Transition ($\psi$): This defines the probability of moving from state $s_t$ to $s_{t+1}$ after taking action $a_t$: $s_{t+1} = \psi(s_t, a_t)$. The DRL agent learns this transition model implicitly through interaction with a simulation environment, without requiring an exact, pre-defined physical model.
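Both the composite reward and the implicit state transition can be illustrated with a toy, self-contained environment. The sketch below is not the power-flow simulator used in the case study: the network is reduced to a fixed background load, the demand response reuses the `response_willingness` function sketched earlier, and all numerical parameters (base loads, carbon factors, the simplified two-level tariff, unit weights) are assumed purely for illustration.

```python
import numpy as np

def composite_reward(t, T, p_ev_t, p_cs_t, prices_t, tou_t, gamma_co2_t,
                     p_dn_history, w_co2=1.0, w_pd=1.0, dt=1.0):
    """Composite reward r_t = r_CO2 + r_DN + r_CS (weights are illustrative)."""
    r_co2 = -w_co2 * gamma_co2_t * p_ev_t * dt                    # carbon cost penalty
    r_cs = float(np.sum((prices_t - tou_t) * p_cs_t * dt))        # station revenue reward
    r_dn = 0.0
    if t == T - 1:                                                # peak-valley penalty at the final step only
        r_dn = -w_pd * (max(p_dn_history) - min(p_dn_history))
    return r_co2 + r_dn + r_cs


class ChargingPricingEnv:
    """Toy MDP environment: price action -> load response -> next state and reward."""

    def __init__(self, n_stations=6, T=24, dt=1.0, seed=0):
        rng = np.random.default_rng(seed)
        hours = np.arange(T)
        self.N, self.T, self.dt = n_stations, T, dt
        self.base_cs = 50 + 30 * rng.random((T, n_stations))             # uncontrolled station demand [kW]
        self.p_base = 2000 + 800 * np.sin(2 * np.pi * hours / T)         # background network load [kW]
        self.gamma_co2 = 0.5 + 0.2 * np.cos(2 * np.pi * hours / T)       # carbon factor [kg/kWh]
        self.tou = np.where((hours >= 10) & (hours < 21), 1.025, 0.615)  # simplified two-level tariff

    def reset(self):
        self.t, self.p_dn_history = 0, []
        self.prev_prices = np.full(self.N, self.tou[0])
        return self._observe()

    def _observe(self):
        # state: background network load, station base demand, carbon intensity, previous prices
        return np.concatenate([[self.p_base[self.t]], self.base_cs[self.t],
                               [self.gamma_co2[self.t]], self.prev_prices])

    def step(self, prices):
        # toy demand response: load shed in proportion to the deadband-saturation willingness
        mu = np.array([response_willingness(p - self.tou[self.t]) for p in prices])
        p_cs = self.base_cs[self.t] * (1.0 - mu)
        p_ev = p_cs.sum()
        self.p_dn_history.append(self.p_base[self.t] + p_ev)

        reward = composite_reward(self.t, self.T, p_ev, p_cs, prices,
                                  self.tou[self.t], self.gamma_co2[self.t],
                                  self.p_dn_history, dt=self.dt)
        self.prev_prices = prices
        self.t += 1
        done = self.t == self.T
        return (None if done else self._observe()), reward, done
```

In the full model, price signals would also shift demand into other periods and the network state would come from a power-flow solution; those effects are deliberately omitted here to keep the MDP wiring visible.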

The Improved Proximal Policy Optimization Algorithm

We employ a Deep Reinforcement Learning agent based on the Actor-Critic architecture to solve the MDP. The Actor network selects actions (prices), and the Critic network evaluates the quality of those actions. The standard PPO algorithm is known for its training stability, achieved by clipping the policy update to prevent overly large, destabilizing steps. However, it treats all training samples equally, potentially slowing down learning. We propose an enhancement that dynamically adjusts the clipping parameter based on the sample’s quality.

The core update in PPO aims to maximize a surrogate objective $L^{\text{CLIP}}$:
$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( \tau_t(\theta) \hat{A}_t, \text{clip}(\tau_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
$$
where $\tau_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between new and old policies, $\hat{A}_t$ is the estimated advantage function, and $\epsilon$ is the clipping hyperparameter that limits the policy change.
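As a reference point for the improvement described next, here is a generic PyTorch sketch of the standard clipped surrogate; the sign is flipped so the objective can be minimized with gradient descent.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Standard clipped PPO surrogate (negated for gradient descent).

    log_probs_new : log pi_theta(a_t | s_t) under the current policy
    log_probs_old : log pi_theta_old(a_t | s_t), detached from the graph
    advantages    : advantage estimates A_hat_t
    eps           : fixed clipping parameter epsilon
    """
    ratio = torch.exp(log_probs_new - log_probs_old)              # tau_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))             # maximize L^CLIP
```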

Our improvement focuses on $\epsilon$. We use the Temporal Difference Error (TD-error), $\delta_t$, as a measure of sample quality or “surprise.” A large TD-error indicates the Critic’s value estimate was inaccurate, meaning the sample contains novel information that could lead to a significant policy improvement.

$$
\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
$$
We dynamically set the clipping parameter $\epsilon_t$ for each sample by comparing the magnitude of its TD-error with the mini-batch mean $\bar{\delta}$:
$$
\epsilon_t = \begin{cases}
\epsilon_{\text{max}}, & \text{if } |\delta_t| \geq \bar{\delta} \\
\epsilon_{\text{min}}, & \text{if } |\delta_t| < \bar{\delta}
\end{cases}
$$
When a sample has a high TD-error ($|\delta_t| \geq \bar{\delta}$), we use a larger $\epsilon_{\text{max}}$ to allow a more substantial policy update, accelerating learning. For samples with low TD-error, we use a smaller $\epsilon_{\text{min}}$ to enforce a conservative update, ensuring stability. This adaptive mechanism allows the agent to learn efficiently from informative experiences while cautiously fine-tuning its policy based on predictable ones.
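A minimal PyTorch sketch of this adaptive rule, selecting $\epsilon$ per sample from its TD-error relative to the mini-batch mean; the values $\epsilon_{\text{max}} = 0.25$ and $\epsilon_{\text{min}} = 0.1$ are those reported in the sensitivity analysis below.

```python
import torch

def adaptive_clip_loss(log_probs_new, log_probs_old, advantages, td_errors,
                       eps_min=0.1, eps_max=0.25):
    """PPO surrogate loss with a per-sample clipping parameter chosen from the TD-error.

    Samples whose |TD-error| is at or above the mini-batch mean receive the looser
    eps_max (faster learning from informative samples); the rest receive the
    conservative eps_min (stable fine-tuning on predictable samples).
    """
    abs_td = td_errors.abs()
    eps = torch.where(abs_td >= abs_td.mean(),
                      torch.full_like(abs_td, eps_max),
                      torch.full_like(abs_td, eps_min))
    ratio = torch.exp(log_probs_new - log_probs_old)                     # tau_t(theta)
    clipped_ratio = torch.minimum(torch.maximum(ratio, 1.0 - eps), 1.0 + eps)
    return -torch.mean(torch.minimum(ratio * advantages, clipped_ratio * advantages))
```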

Case Study and Analysis

We evaluate the proposed strategy on a modified IEEE 33-node distribution network with six EV charging stations, each equipped with 60 kW fast chargers. Key parameters include a maximum user response willingness ($\mu_{\text{max}}$) of 30%, a deadband threshold ($\Delta\lambda_1$) of 0.1 CNY/kWh, a saturation threshold ($\Delta\lambda_2$) of 1.5 CNY/kWh, and the following industrial time-of-use tariff as the baseline:

| Period | Time | Price (CNY/kWh) |
| --- | --- | --- |
| Peak | 10:00-15:00, 18:00-21:00 | 1.025 |
| Flat | 07:00-10:00, 15:00-18:00, 21:00-23:00 | 0.615 |
| Valley | 00:00-07:00, 23:00-24:00 | 0.305 |

Offline Training Performance: The improved PPO agent was trained over multiple episodes. The reward curve showed an initial phase of volatile but rapid improvement, facilitated by the larger clipping rate on informative samples. After approximately 400 episodes, the curve stabilized and converged, aided by the smaller clipping rate, indicating successful learning of an effective pricing policy.

Strategy Effectiveness: The deployed pricing strategy successfully guided EV charging loads. Spatially, it balanced the load across stations, reducing the standard deviation of station loads by 39.89% compared with the unguided scenario and alleviating congestion. Temporally, it reduced the peak-valley difference of the total distribution network load by 21.71%, effectively performing peak shaving and valley filling.

Economic and Environmental Results: The proposed strategy achieved a win-win outcome. Charging station revenue increased by 16.78%. Simultaneously, by shifting charging away from high-carbon-intensity peak periods, the total carbon emission cost associated with EV charging was reduced by 16.44%. The following table summarizes the key results:

| Metric | Before Optimization | After Optimization | Improvement |
| --- | --- | --- | --- |
| Carbon Emission Cost (CNY) | 903.24 | 754.74 | -16.44% |
| Grid Peak-Valley Difference (MW) | 4.10 | 3.21 | -21.71% |
| Charging Station Revenue (CNY) | 9,292.71 | 10,851.92 | +16.78% |

Algorithm Comparison: We compared our improved PPO algorithm against standard PPO and Deep Deterministic Policy Gradient (DDPG) algorithms. Our method achieved a higher final average reward (approximately 8.18% higher than standard PPO), demonstrating its superior optimization capability. DDPG converged faster initially but to a lower-performance policy. Furthermore, our strategy effectively mitigated voltage violations in the network during the evening peak that were present in the unguided case.

Sensitivity Analysis: An analysis of the clipping hyperparameters $\epsilon_{\text{max}}$ and $\epsilon_{\text{min}}$ confirmed that the chosen values (0.25 and 0.1, respectively) provided an optimal balance. Excessively high values led to unstable, low-reward policies, while excessively low values slowed learning.

Conclusion

This article presents a comprehensive framework for optimizing EV charging station pricing with explicit consideration of carbon emissions. By accurately modeling user price response and formulating a multi-objective MDP, we enable system-level optimization. The proposed improved PPO algorithm, with its TD-error-based adaptive clipping mechanism, solves this complex problem effectively and demonstrates superior efficiency and stability compared with standard DRL methods. The case study validates that the strategy can simultaneously enhance the economic benefits for charging station operators, improve grid stability by reducing load fluctuations, and promote environmental sustainability by lowering the carbon footprint of the growing EV fleet. Future work will integrate large-scale renewable energy generation and conventional unit dispatch to further explore the synergistic carbon reduction potential between the power system and EV charging networks.
