Advancing Battery Health Prediction for Electric Vehicles

Accurate prognostics of the State of Health (SOH) of lithium-ion batteries is a critical challenge in ensuring the safety, reliability, and longevity of battery electric vehicles (BEVs). As the primary energy source of a modern electric vehicle, the battery pack undergoes complex degradation over its lifecycle. Traditional data-driven approaches often falter under practical constraints, particularly the prevalent “small-sample” problem stemming from costly and time-consuming testing cycles. This work presents a novel, integrated framework designed to achieve high-accuracy SOH prediction for BEVs using limited real-world operational data.

The core of our methodology is a hybrid model that synergizes signal processing, feature engineering, and advanced ensemble learning. We leverage real-world charging data acquired from monitoring platforms of commercial battery electric vehicles. The proposed strategy systematically addresses data scarcity by enhancing the informational depth of available samples rather than merely increasing their quantity. This approach aligns with the philosophy of data quantity governance, which emphasizes balancing model complexity with the available data to achieve robust performance.

Methodological Framework

The overall framework comprises three main stages: 1) Data Preprocessing and SOH Label Generation, 2) Multi-Source Feature Engineering, and 3) Ensemble Model Construction. These stages are applied in sequence to transform raw operational data into a reliable SOH estimate.

1. Data Preprocessing and SOH Estimation

Real-world data from battery electric vehicles is inherently noisy and incomplete. Our initial step involves rigorous preprocessing: filtering abnormal values caused by sensor faults, imputing minor missing data segments, and deleting fragments with substantial gaps. Crucially, we focus on parking/charging segments due to their stable and reproducible conditions, which are more conducive to consistent feature extraction compared to highly dynamic driving data.
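As a rough illustration of this cleaning step, the sketch below (in Python) filters implausible readings, rejects fragments with long gaps, and interpolates short ones with pandas; the column names, thresholds, and resampling interval are entirely hypothetical placeholders, not values from the study.

```python
import pandas as pd

def clean_charging_segment(df: pd.DataFrame, max_gap_minutes: int = 5):
    """Minimal cleaning sketch for one parking/charging segment (illustrative only)."""
    # Drop physically implausible sensor readings (cell-level placeholder thresholds).
    df = df[df["voltage_V"].between(2.5, 4.5) & df["current_A"].between(0, 300)]

    # Reject fragments whose remaining records contain substantial time gaps.
    gaps_s = df["timestamp"].diff().dt.total_seconds().fillna(0)
    if (gaps_s > max_gap_minutes * 60).any():
        return None  # discard: too much missing data to impute reliably

    # Resample to a uniform grid and fill short gaps by linear interpolation.
    return (df.set_index("timestamp")
              .resample("10s").mean()
              .interpolate(limit=max_gap_minutes))
```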

The SOH is defined based on capacity fade, the most direct indicator of battery aging. The present maximum available capacity \(C_P\) is compared to the battery’s rated capacity \(C_R\):

$$ \text{SOH}_{\text{Cap}} = \frac{C_P}{C_R} \times 100\% $$

For a selected constant-current charging segment, \(C_P\) is calculated via Ampere-hour integration over a defined State of Charge (SOC) or voltage window. However, direct calculation from real-world data often yields unrealistic fluctuations. Therefore, we apply a correction method. Within clustered data segments identified by similar charging current and duration, the maximum charged capacity value \(C_{\text{max}}\) is selected. The corrected SOH is then computed as:

$$ \text{SOH} = \frac{C_{\text{max}}}{C_R} \times 100\% $$

This correction smooths the SOH trajectory, yielding a monotonic degradation trend more representative of actual battery health for the battery electric vehicle.
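A minimal sketch of the capacity and SOH computation is given below. The Ampere-hour integration and the within-cluster maximum follow the two equations above; it assumes the input arrays already cover the selected SOC/voltage window and that segments have been clustered by similar charging current and duration.

```python
import numpy as np

def charged_capacity_ah(t_s: np.ndarray, current_a: np.ndarray) -> float:
    """Ampere-hour integration of charging current over the selected SOC/voltage window."""
    dt = np.diff(t_s)  # time steps in seconds
    return float(np.sum(dt * (current_a[:-1] + current_a[1:]) / 2.0) / 3600.0)

def corrected_soh_percent(cluster_capacities_ah: np.ndarray, rated_capacity_ah: float) -> float:
    """Corrected SOH: maximum charged capacity within a cluster of comparable segments."""
    return float(cluster_capacities_ah.max() / rated_capacity_ah * 100.0)
```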

2. Feature Engineering and Expansion

2.1. Aging Feature Extraction via Incremental Capacity Analysis (ICA)

To capture the electrochemical signatures of aging, we employ Incremental Capacity Analysis (ICA). The Incremental Capacity (IC) curve is derived by differentiating the capacity (Q) with respect to voltage (V):

$$ IC = \frac{dQ}{dV} \approx \frac{\Delta Q}{\Delta V} $$

where \(\Delta Q\) is the charged capacity within a small voltage window \(\Delta V\). Raw IC curves are noisy; hence, we apply wavelet transform for effective smoothing while preserving critical peak information. From the smoothed IC curve, several health indicators (HIs) are extracted, such as the peak voltage, peak height, and valley positions of characteristic IC peaks, which are linked to specific phase transitions in electrode materials.
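The following sketch illustrates one way to compute and smooth an IC curve and read off simple peak indicators. It substitutes a Savitzky-Golay filter for the wavelet smoothing described above, assumes the voltage is monotonically increasing over the constant-current window, and uses placeholder values for the voltage grid and filter window.

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks

def ic_peak_features(voltage_v: np.ndarray, capacity_ah: np.ndarray, dv: float = 0.005) -> dict:
    """Smoothed IC curve (dQ/dV) and simple peak-based health indicators."""
    v_grid = np.arange(voltage_v.min(), voltage_v.max(), dv)
    q_interp = np.interp(v_grid, voltage_v, capacity_ah)   # resample Q on a uniform V grid
    ic = np.gradient(q_interp, dv)                         # dQ/dV
    ic_smooth = savgol_filter(ic, window_length=25, polyorder=3)  # stand-in for wavelet denoising

    peaks, props = find_peaks(ic_smooth, height=0)
    i_max = peaks[np.argmax(props["peak_heights"])]        # dominant characteristic peak
    return {"peak_voltage": float(v_grid[i_max]), "peak_height": float(ic_smooth[i_max])}
```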

2.2. Feature Selection via Grey Relational Analysis (GRA)

Not all extracted features are equally informative. We use Grey Relational Analysis (GRA) to quantify the correlation between each potential HI and the target SOH sequence. The reference sequence is \(Y = \{y(k) | k=1,2,…,n\}\) (SOH), and a comparative sequence is \(X_i = \{x_i(k) | k=1,2,…,n\}\) (an HI). The grey relational coefficient \(\gamma_i(k)\) at point \(k\) is calculated as:

$$ \gamma_i(k) = \frac{\min_i \min_k |y(k)-x_i(k)| + \rho \max_i \max_k |y(k)-x_i(k)|}{|y(k)-x_i(k)| + \rho \max_i \max_k |y(k)-x_i(k)|} $$

where \(\rho\) is the distinguishing coefficient, typically set to 0.5. The overall grey relational grade \(r_i\) for feature \(i\) is the average of all coefficients:

$$ r_i = \frac{1}{n} \sum_{k=1}^{n} \gamma_i(k) $$

Features with \(r_i > 0.6\) are considered strongly correlated and are selected for model input. This step reduces dimensionality and focuses the model on the most relevant signals from the battery electric vehicle’s operational data.
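A compact NumPy sketch of the grey relational grade follows. It min-max normalizes each sequence first (a common GRA convention not spelled out above) and mirrors the two equations, with the minimum and maximum taken over all candidate features and time points.

```python
import numpy as np

def grey_relational_grades(y: np.ndarray, X: np.ndarray, rho: float = 0.5) -> np.ndarray:
    """Grey relational grade r_i of each candidate HI (rows of X) against the SOH sequence y."""
    def norm(s):
        lo = s.min(axis=-1, keepdims=True)
        hi = s.max(axis=-1, keepdims=True)
        return (s - lo) / (hi - lo)

    diff = np.abs(norm(y) - norm(X))       # |y(k) - x_i(k)| for every feature i and point k
    d_min, d_max = diff.min(), diff.max()  # min_i min_k and max_i max_k
    gamma = (d_min + rho * d_max) / (diff + rho * d_max)
    return gamma.mean(axis=1)              # r_i: average of the coefficients over k

# Features whose grade exceeds 0.6 would be retained as model inputs.
```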

2.3. Feature Space Expansion via CEEMDAN

To combat the small-sample limitation, we enrich the input feature space by decomposing the SOH sequence itself. The Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) adaptively breaks down the non-linear and non-stationary SOH time series into a collection of Intrinsic Mode Functions (IMFs) and a residue. This process reveals multi-scale temporal patterns (e.g., short-term fluctuations, medium-term trends, and long-term degradation) inherent in the battery aging process. Let the original SOH sequence be \(D(n)\). By adaptively adding white noise and performing ensemble empirical mode decomposition, CEEMDAN obtains:

$$ D(n) = \sum_{k=1}^{L} \text{IMF}_k(n) + R(n) $$

where \(\text{IMF}_k(n)\) are the decomposed modes and \(R(n)\) is the final residue. These IMFs, which represent different frequency components of the SOH decay, are then used as additional input features alongside the physically-derived ICA features. This fusion creates a comprehensive feature set describing both the symptom (ICA peaks) and the manifested degradation pattern (SOH-IMFs) of the battery in the electric vehicle.
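A minimal decomposition sketch, assuming the third-party PyEMD (EMD-signal) package, is shown below. The residue is recovered by subtraction so that the reconstruction identity above holds regardless of how the library groups its outputs.

```python
import numpy as np
from PyEMD import CEEMDAN  # third-party "EMD-signal" package

def expand_soh_features(soh_series: np.ndarray) -> np.ndarray:
    """Decompose the SOH sequence with CEEMDAN; return IMFs plus residue as feature columns."""
    imfs = CEEMDAN()(soh_series)              # shape: (num_IMFs, len(soh_series))
    residue = soh_series - imfs.sum(axis=0)   # R(n), by subtraction from the original sequence
    return np.column_stack([imfs.T, residue]) # one column per IMF, plus the residue
```

These columns would then be concatenated with the selected ICA health indicators to form the model input.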

3. The CEEMDAN-LSBoostELM Prediction Model

3.1. Base Learner: Extreme Learning Machine (ELM)

The Extreme Learning Machine serves as our base regressor (weak learner). Its appeal lies in its remarkable speed and suitability for small-sample learning. For \(N\) arbitrary samples \((\mathbf{x}_i, \mathbf{y}_i)\), where \(\mathbf{x}_i \in \mathbb{R}^n\) and \(\mathbf{y}_i \in \mathbb{R}^m\), an ELM with \(L\) hidden nodes and activation function \(g(\cdot)\) models the output as:

$$ \mathbf{Y} = \mathbf{H}\boldsymbol{\beta} $$

where \(\mathbf{H}\) is the hidden layer output matrix:

$$
\mathbf{H} =
\begin{bmatrix}
g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L)\\
\vdots & \ddots & \vdots\\
g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L)
\end{bmatrix}
$$

Here, \(\mathbf{w}_i\) and \(b_i\) are randomly assigned input weights and biases. The output weight matrix \(\boldsymbol{\beta}\) is analytically determined by finding the least-squares solution:

$$ \hat{\boldsymbol{\beta}} = \mathbf{H}^\dagger \mathbf{Y} $$

with \(\mathbf{H}^\dagger\) being the Moore-Penrose generalized inverse of \(\mathbf{H}\). While fast, a single ELM’s performance can be unstable due to the randomness in its initial parameters.
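The ELM equations above translate almost directly into code. The sketch below is a bare-bones implementation with a ReLU activation (the choice reported later in the tuning section); it is illustrative rather than the authors' exact implementation.

```python
import numpy as np

class ELMRegressor:
    """Minimal single-hidden-layer ELM: random input weights, analytic output weights."""

    def __init__(self, n_hidden: int = 25, seed=None):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X: np.ndarray) -> np.ndarray:
        # Hidden layer output matrix H with ReLU activation g(.)
        return np.maximum(X @ self.W + self.b, 0.0)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "ELMRegressor":
        # Randomly assigned input weights w_i and biases b_i
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        # beta_hat = H^dagger Y via the Moore-Penrose pseudo-inverse
        self.beta = np.linalg.pinv(self._hidden(X)) @ y
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self._hidden(X) @ self.beta
```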

3.2. Ensemble Strategy: Least Squares Boosting (LSBoost)

To enhance robustness and accuracy, we integrate multiple ELM weak learners using the LSBoost algorithm. LSBoost is a boosting method designed for regression that sequentially adds new weak learners to correct the residual errors of the current ensemble. The process is as follows:

  1. Initialize the model with the mean of the target values: \(F_0(\mathbf{x}) = \bar{y}\).
  2. For \(m = 1\) to \(M\) (number of weak learners):
    • Compute the pseudo-residuals for each sample \(i\): \(r_{im} = y_i - F_{m-1}(\mathbf{x}_i)\).
    • Train a weak learner (an ELM model) \(h_m(\mathbf{x})\) on the data \(\{ \mathbf{x}_i, r_{im} \}\) to fit these residuals.
    • Update the ensemble model: \(F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot h_m(\mathbf{x})\).

Here, \(\nu\) is a shrinkage parameter (learning rate) that controls the contribution of each new learner, preventing overfitting. The final strong predictor after \(M\) rounds is \(F(\mathbf{x}) = F_M(\mathbf{x})\). This iterative focusing on errors allows the CEEMDAN-LSBoostELM model to build a highly accurate and stable composite model, ideal for the nuanced prediction task of SOH in battery electric vehicles.
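Putting the pieces together, the sketch below implements the three LSBoost steps with the ELMRegressor from the previous snippet as the weak learner. The default values of M and ν follow the tuning results reported in the next section; everything else is a simplified illustration rather than the authors' code.

```python
import numpy as np

def lsboost_elm_fit(X: np.ndarray, y: np.ndarray, M: int = 15, nu: float = 0.06):
    """Fit an LSBoost ensemble of ELM weak learners."""
    f0 = float(y.mean())                            # F_0(x) = mean of the targets
    pred = np.full_like(y, f0, dtype=float)
    learners = []
    for m in range(M):
        residuals = y - pred                        # pseudo-residuals r_im
        h = ELMRegressor(n_hidden=25, seed=m).fit(X, residuals)
        learners.append(h)
        pred = pred + nu * h.predict(X)             # F_m = F_{m-1} + nu * h_m
    return f0, learners

def lsboost_elm_predict(X: np.ndarray, f0: float, learners, nu: float = 0.06) -> np.ndarray:
    """Evaluate the final strong predictor F_M(x)."""
    pred = np.full(X.shape[0], f0)
    for h in learners:
        pred = pred + nu * h.predict(X)
    return pred
```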

Model Development and Parameterization

Developing a high-performance model requires careful tuning of its hyperparameters. We systematically determined the optimal configuration for our CEEMDAN-LSBoostELM model using real-world datasets from multiple battery electric vehicles.

1. Determining ELM Architecture

The number of hidden neurons and the activation function are key to ELM’s performance. We evaluated configurations across different vehicles. The table below summarizes the optimal choices based on comprehensive metrics (RMSE, MAE, R², MAPE).

| Parameter | Tested Options | Optimal Choice & Rationale |
| --- | --- | --- |
| Hidden Neurons | 15, 20, 25, 30, 35 | 25. Provided the best balance between underfitting and overfitting across all vehicle datasets, offering stable and superior prediction accuracy. |
| Activation Function | Sigmoid, Tanh, ReLU, Sin | ReLU. Consistently outperformed the others, yielding the lowest error metrics (RMSE, MAE, MAPE) and the highest R², indicating superior non-linear fitting capability for the SOH regression problem. |

2. Tuning the LSBoost Ensemble

The performance of the boosting ensemble depends on the number of weak learners (ELMs) and the learning rate.

| Parameter | Tested Options | Optimal Choice & Rationale |
| --- | --- | --- |
| Number of Weak Learners (M) | 5, 10, 15, 20, 25 | 15. Marked the point of diminishing returns: fewer learners led to underfitting, while more introduced unnecessary complexity without significant accuracy gains and even increased error for some vehicles. |
| Learning Rate (ν) | 0.02, 0.04, 0.06, 0.08, 0.10 | 0.06. Ensured stable and effective convergence; a lower rate slowed learning, while a higher rate risked volatile updates that harmed final model performance. |

3. Data Partitioning Strategy

To evaluate generalization under small-sample conditions, we tested various training-test set splits. The results, aggregated over several vehicles, are shown below.

| Train : Test Ratio | Average RMSE (%) | Average MAE (%) | Average R² | Average MAPE (%) | Assessment |
| --- | --- | --- | --- | --- | --- |
| 60 : 40 | 0.473 | 0.349 | 0.938 | 0.384 | Good generalization but slightly higher error. |
| 70 : 30 | 0.430 | 0.293 | 0.950 | 0.328 | Balanced and robust performance. |
| 80 : 20 | 0.351 | 0.221 | 0.960 | 0.254 | Optimal. Provided the best accuracy and stability across all vehicles. |
| 90 : 10 | 0.453 | 0.294 | 0.943 | 0.369 | Potential overfitting due to the very small test set. |

Based on these results, an 80:20 data partitioning strategy was adopted for final model training and evaluation, as it offered the optimal trade-off between sufficient training data and a credible test for generalization in the context of battery electric vehicle fleet management.

Performance Evaluation and Comparative Analysis

The proposed CEEMDAN-LSBoostELM model was rigorously evaluated against several benchmark models to validate its effectiveness for SOH prediction in battery electric vehicles. The baselines included:

  1. ELM: The single Extreme Learning Machine model.
  2. CEEMDAN-ELM: An ELM model using the combined ICA and CEEMDAN features, but without boosting.
  3. Random Forest (RF): A traditional ensemble method.
  4. Multi-layer Perceptron (MLP): A standard deep learning baseline.

The evaluation was conducted on real-world datasets from multiple battery electric vehicles not used during model development. Performance was measured using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Coefficient of Determination (R²), and Mean Absolute Percentage Error (MAPE):

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \quad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| $$
$$ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \quad \text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left| \frac{y_i - \hat{y}_i}{y_i} \right| $$
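These four metrics are straightforward to compute; the short sketch below evaluates them exactly as defined above, with SOH values expressed in percent.

```python
import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """RMSE, MAE, R², and MAPE as defined above (SOH values in percent)."""
    err = y_true - y_pred
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)),
        "MAPE": float(100.0 * np.mean(np.abs(err / y_true))),
    }
```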

The consolidated results demonstrate the clear superiority of the proposed integrated approach.

| Model | Avg. RMSE (%) | Avg. MAE (%) | Avg. R² | Avg. MAPE (%) |
| --- | --- | --- | --- | --- |
| CEEMDAN-LSBoostELM (Proposed) | 0.416 | 0.320 | 0.973 | 0.358 |
| CEEMDAN-ELM | 0.968 | 0.807 | 0.855 | 0.898 |
| Single ELM | 1.404 | 1.121 | 0.685 | 1.246 |
| Random Forest (RF) | 1.653 | 1.344 | 0.572 | 1.485 |
| Multi-layer Perceptron (MLP) | 2.086 | 1.764 | 0.324 | 1.962 |

Key Findings:

  • Feature Enhancement is Crucial: CEEMDAN-ELM significantly outperformed the single ELM (e.g., 0.968% vs. 1.404% RMSE), proving the value of ICA feature selection and CEEMDAN-based feature space expansion for characterizing battery electric vehicle battery degradation.
  • Boosting Delivers Stability and Precision: Our full CEEMDAN-LSBoostELM model reduced the RMSE of CEEMDAN-ELM by over 57% (from 0.968% to 0.416%). This dramatic improvement highlights LSBoost’s efficacy in stabilizing the ELM’s random initialization and sequentially refining prediction accuracy.
  • Superiority Over Traditional Models: The proposed model markedly outperformed well-established algorithms like RF and MLP. It achieved at least a 1.2 percentage point reduction in RMSE and a 0.935 reduction in MAPE, confirming its higher accuracy and robustness for the small-sample SOH estimation task in battery electric vehicles.
  • Practical Reliability: With an average MAPE below 0.36% and a maximum observed prediction error under 1.2% across all test vehicles, the model demonstrates reliability suitable for real-world diagnostic applications in battery electric vehicle management systems.

Conclusion and Future Perspectives

This work presents a comprehensive and effective solution for predicting the State of Health of lithium-ion batteries in battery electric vehicles under the practical constraint of limited data. The CEEMDAN-LSBoostELM framework successfully integrates physical insight from Incremental Capacity Analysis, signal decomposition for feature enrichment, and a robust ensemble learning strategy. By transforming the small-sample challenge into a problem of feature quality and model structure optimization, the method achieves exceptional prediction accuracy and stability, validated on real-world operational data from multiple electric vehicles.

The success of this approach underscores the importance of a holistic data strategy for battery electric vehicle prognostics. It is not merely about having more data, but about intelligently extracting more information from available data through multi-domain feature fusion and constructing models that are both powerful and resistant to overfitting. The LSBoost-enhanced ELM ensemble proves particularly effective in this context, offering a fast, accurate, and reliable predictor.

Future work will focus on enhancing the model’s generality and scope. This includes extending the feature set to incorporate parameters from fast-charging and dynamic discharge cycles, integrating other aging indicators like internal resistance and energy efficiency, and developing adaptive learning mechanisms that can update the model as new data streams in from the battery electric vehicle fleet. Ultimately, such advancements will contribute to more intelligent Battery Management Systems (BMS), enabling proactive health management, optimized usage strategies, and extended service life for the critical energy storage systems in electric mobility.
