
The rapid advancement of the new energy vehicle industry is profoundly transforming the energy and transportation sectors. In China, the adoption of battery electric vehicles (BEVs) has surged, with ownership reaching tens of millions. This widespread electrification of transport is a critical strategy for achieving national carbon peaking and carbon neutrality goals, promising significant reductions in fossil fuel consumption and greenhouse gas emissions. However, the large-scale and often uncoordinated integration of BEVs into the power grid poses formidable challenges to power system stability and operational efficiency. Accurate forecasting of charging and swapping loads is therefore paramount for grid planning, economic dispatch, renewable energy integration, and a reliable user experience.
Forecasting BEV charging load is inherently complex. User charging behavior exhibits high uncertainty and volatility, influenced by diverse factors such as travel patterns, personal habits, weather conditions, and policy incentives. Furthermore, charging stations serving different purposes, such as residential areas, public transit depots, taxi fleets, and heavy-duty truck hubs, exhibit vastly different load characteristics. Traditional forecasting methods, including statistical models (e.g., ARIMA) and conventional machine learning and deep learning techniques (e.g., SVM, LSTM), often struggle with the nonlinearity and randomness inherent in this domain. While recent approaches have incorporated multi-source data fusion and sophisticated neural architectures, they typically require extensive historical data and intricate feature engineering, and they generalize poorly to new stations with scarce data.
Meanwhile, Large Language Models (LLMs) have demonstrated remarkable capabilities in pattern recognition, reasoning, and contextual understanding within natural language processing. Recent explorations have shown their potential in time-series forecasting by “reprogramming” numerical sequences into a textual semantic space that LLMs can comprehend. Building on this premise, this paper introduces a novel forecasting framework for BEV charging and swapping loads based on the Time-LLM architecture. The core innovation lies in leveraging the powerful prior knowledge of a pre-trained LLM while keeping its parameters frozen: only lightweight input and output mapping modules are trained, enabling high-accuracy, low-cost adaptation to specific forecasting tasks. This approach effectively addresses the modality mismatch between time-series data and text, and it performs strongly even in few-shot and zero-shot scenarios.
The primary contributions of this work are threefold. First, we propose and detail a practical LLM-based framework for EV load forecasting, achieving efficient task adaptation by training minimal parameters. Second, we implement a patch reprogramming mechanism combined with a prompt-as-prefix strategy to inject domain knowledge and statistical features, enabling the LLM to effectively reason about temporal patterns. Third, we conduct comprehensive multi-scenario experiments using real-world charging session data from four distinct types of stations in a major Chinese city. The results validate the superiority of our method over strong benchmarks in day-ahead, ultra-short-term, few-shot, and zero-shot forecasting tasks, underscoring its robustness and practical engineering value.
The mathematical formulation of the forecasting problem is as follows. Given a historical multivariate time series \(\mathbf{X} \in \mathbb{R}^{N \times T}\), where \(N\) is the number of features (e.g., charging power) and \(T\) is the sequence length, the goal is to predict the future values \(\mathbf{Y} \in \mathbb{R}^{H}\), where \(H\) is the forecast horizon. The model \(f_\theta\) parameterized by \(\theta\) aims to minimize the discrepancy between predictions \(\mathbf{\hat{Y}}\) and ground truth \(\mathbf{Y}\):
$$ \mathbf{\hat{Y}} = f_\theta(\mathbf{X}), \quad \min_\theta \mathcal{L}(\mathbf{\hat{Y}}, \mathbf{Y}) $$
where \(\mathcal{L}\) is a loss function, typically the Mean Squared Error (MSE):
$$ \mathrm{MSE} = \frac{1}{H}\sum_{h=1}^{H}(\hat{Y}_h - Y_h)^2 $$
The proposed forecasting framework, illustrated conceptually, integrates several key components: data processing, model training based on Time-LLM, load prediction, and model evaluation. Raw charging transaction data undergoes cleaning, aggregation, normalization, and segmentation. The core Time-LLM model reprograms time-series patches into the LLM’s embedding space, prefixes them with task-specific prompts, and uses the frozen LLM for inference before mapping the output back to numerical predictions.
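As a concrete illustration of the data-processing stage, the sketch below aggregates a raw transaction log into a fixed-resolution load series and cuts normalized history/horizon windows. The column names ("timestamp", "energy_kwh") and the z-score scaler are assumptions for illustration; the paper's exact schema and scaling are not specified here.

```python
# Minimal preprocessing sketch (assumed schema, not the paper's exact pipeline).
import numpy as np
import pandas as pd

def build_load_series(df: pd.DataFrame, freq: str = "15min") -> pd.Series:
    """Aggregate charging transactions into a fixed-resolution load series."""
    df = df.set_index(pd.to_datetime(df["timestamp"])).sort_index()
    load = df["energy_kwh"].resample(freq).sum()   # energy delivered per interval
    return load.interpolate(limit=4).fillna(0.0)   # simple gap handling

def normalize_and_window(load: pd.Series, T: int = 288, H: int = 96):
    """Z-score normalize, then cut (history, horizon) training pairs."""
    values = load.to_numpy(dtype=np.float32)
    mean, std = values.mean(), values.std() + 1e-8
    values = (values - mean) / std
    X, Y = [], []
    for start in range(len(values) - T - H + 1):
        X.append(values[start:start + T])          # T-point history window
        Y.append(values[start + T:start + T + H])  # H-point forecast target
    return np.stack(X), np.stack(Y), (mean, std)
```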
The technical core of our method, the Time-LLM model, consists of five stages. First, Input Embedding: The normalized input series for each channel \(\mathbf{X}^{(i)}\) is divided into \(P\) patches \(\mathbf{X}_P^{(i)} \in \mathbb{R}^{P \times L_p}\) and projected via a linear layer to embeddings \(\mathbf{\hat{X}}_P^{(i)} \in \mathbb{R}^{P \times d_m}\).
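A minimal PyTorch sketch of this patching step is given below; the patch length, stride, and embedding width are illustrative assumptions rather than the paper's settings.

```python
# Stage 1 sketch: split a normalized channel into patches and project them.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_len: int = 16, stride: int = 8, d_model: int = 32):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)   # L_p -> d_m

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T) single-channel series; unfold into P patches of length L_p
        patches = x.unfold(dimension=-1, size=self.patch_len, step=self.stride)
        return self.proj(patches)                   # (batch, P, d_model)
```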
Second, Patch Reprogramming: This module aligns time-series patches with the LLM's textual semantic space. A small, learnable library of text prototypes \(\mathcal{E}' = \{\mathbf{e}'_1, \dots, \mathbf{e}'_{V'}\}\) is constructed from the LLM's full vocabulary embeddings \(\mathcal{E}\) (with \(V' \ll V\)). Using multi-head cross-attention, the patch embeddings \(\mathbf{\hat{X}}_P^{(i)}\) are reprogrammed:
$$ \mathbf{Q}^{(i)}_k = \mathbf{\hat{X}}_P^{(i)}\mathbf{W}^Q_k, \quad \mathbf{K}^{(i)}_k = \mathcal{E}'\mathbf{W}^K_k, \quad \mathbf{V}^{(i)}_k = \mathcal{E}'\mathbf{W}^V_k $$
$$ \mathbf{Z}^{(i)}_k = \text{Softmax}\left(\frac{\mathbf{Q}^{(i)}_k {\mathbf{K}^{(i)}_k}^\top}{\sqrt{d}}\right)\mathbf{V}^{(i)}_k $$
$$ \mathbf{O}^{(i)} = \text{Concat}(\mathbf{Z}^{(i)}_1, \dots, \mathbf{Z}^{(i)}_K)\,\mathbf{W}_o + \mathbf{b}_o $$
where \(\mathbf{O}^{(i)} \in \mathbb{R}^{P \times D}\) is the reprogrammed representation in the LLM’s embedding space.
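The cross-attention above can be sketched as follows. It assumes the prototype library is produced by a learnable linear mapping over the LLM's vocabulary embedding table and uses PyTorch's built-in multi-head attention in place of the explicit per-head projections; all dimensions are illustrative.

```python
# Stage 2 sketch: reprogram patch embeddings against learnable text prototypes.
import torch
import torch.nn as nn

class PatchReprogramming(nn.Module):
    def __init__(self, d_model: int, d_llm: int, vocab_size: int,
                 n_heads: int = 8, n_prototypes: int = 1000):
        super().__init__()
        # Map the full vocabulary embedding E (V x D) to V' prototypes E'.
        self.prototype_map = nn.Linear(vocab_size, n_prototypes, bias=False)
        self.q_proj = nn.Linear(d_model, d_llm)
        self.attn = nn.MultiheadAttention(embed_dim=d_llm, num_heads=n_heads,
                                          batch_first=True)

    def forward(self, patch_emb: torch.Tensor, word_emb: torch.Tensor):
        # patch_emb: (batch, P, d_model); word_emb: (V, D) frozen LLM embeddings
        prototypes = self.prototype_map(word_emb.T).T            # (V', D)
        prototypes = prototypes.unsqueeze(0).expand(patch_emb.size(0), -1, -1)
        q = self.q_proj(patch_emb)                               # (batch, P, D)
        out, _ = self.attn(query=q, key=prototypes, value=prototypes)
        return out                                               # (batch, P, D)
```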
Third, Prompt-as-Prefix (PaP) Construction: A natural language prompt is prefixed to the input sequence to provide context. A fixed template injects dataset background, domain knowledge about BEV charging behavior, the task instruction, and statistical features of the input series (e.g., maximum, minimum, mean, and periodicity).
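For illustration, a PaP template along these lines might look as follows; the wording and delimiter tokens are assumed paraphrases, not the exact prompt used in the paper.

```python
# Illustrative prompt-as-prefix template (assumed wording).
def build_prompt(station_type: str, stats: dict, horizon: int) -> str:
    return (
        f"<|start_prompt|>Dataset: charging load of a {station_type} station, "
        f"sampled every 15 minutes. "
        f"Domain knowledge: BEV charging load typically shows daily periodicity "
        f"driven by travel and tariff patterns. "
        f"Task: given the previous input series, forecast the next {horizon} steps. "
        f"Input statistics: min {stats['min']:.2f}, max {stats['max']:.2f}, "
        f"mean {stats['mean']:.2f}, dominant period {stats['period']} steps."
        f"<|end_prompt|>"
    )
```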
Fourth, LLM Reasoning: The combined sequence of prompt tokens and reprogrammed patch embeddings \(\mathbf{\hat{X}} = [\text{PaP}; \mathbf{O}^{(1)}; \dots; \mathbf{O}^{(N)}]\) is fed into the frozen, pre-trained LLM (e.g., Qwen2-7B). The LLM processes this input using its inherent attention mechanisms to generate an output sequence \(\mathbf{O}_{out}\):
$$ \mathbf{O}_{out} = \text{LLM}_{\theta_{LLM}}(\mathbf{\hat{X}}), \quad \mathbf{O}_{out} \in \mathbb{R}^{P \times D’} $$
Fifth, Output Mapping: The LLM’s output is truncated to the forecast-relevant part, flattened, and mapped back to the numerical prediction space via a linear layer:
$$ \mathbf{\hat{Y}} = \text{Flatten}(\mathbf{O}_{out}) \mathbf{W}_y + \mathbf{b}_y $$
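Stages four and five can be sketched together as below, assuming a Hugging Face-style transformer body (e.g., loaded with `AutoModel`) that accepts `inputs_embeds` and exposes its token embedding table; the truncation simply keeps the hidden states at the patch positions before flattening and projecting to the horizon.

```python
# Stages 4-5 sketch: frozen LLM forward pass plus linear output mapping.
import torch
import torch.nn as nn

class OutputProjection(nn.Module):
    def __init__(self, n_patches: int, d_llm: int, horizon: int):
        super().__init__()
        self.head = nn.Linear(n_patches * d_llm, horizon)

    def forward(self, llm_out: torch.Tensor) -> torch.Tensor:
        # llm_out: (batch, P, D') hidden states at the patch positions
        return self.head(llm_out.flatten(start_dim=1))        # (batch, H)

def forecast(llm, prompt_ids, reprogrammed, out_proj):
    # llm: transformer body (no LM head); prompt_ids: tokenized PaP prompt
    prompt_emb = llm.get_input_embeddings()(prompt_ids)        # (batch, P_txt, D)
    inputs = torch.cat([prompt_emb, reprogrammed], dim=1)      # prompt as prefix
    hidden = llm(inputs_embeds=inputs).last_hidden_state       # frozen LLM forward
    patch_hidden = hidden[:, prompt_emb.size(1):, :]           # drop the prompt part
    return out_proj(patch_hidden)                              # (batch, H)
```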
The model is trained by minimizing the MSE loss between \(\mathbf{\hat{Y}}\) and \(\mathbf{Y}\), updating only the parameters of the patch embedding, reprogramming, and output mapping modules, while \(\theta_{LLM}\) remains frozen.
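A minimal training-loop sketch under these assumptions is given below; `patch_embed`, `reprogram`, `out_proj`, and `forecast` refer to the illustrative modules above, `prompt_ids` is built from the PaP template, and the hyper-parameters are placeholders.

```python
# Training sketch: optimize only the lightweight modules, keep theta_LLM frozen.
import itertools
import torch

for p in llm.parameters():                           # freeze every LLM parameter;
    p.requires_grad_(False)                          # gradients still flow through

word_emb = llm.get_input_embeddings().weight         # frozen vocabulary embeddings
trainable = itertools.chain(patch_embed.parameters(),
                            reprogram.parameters(),
                            out_proj.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)
loss_fn = torch.nn.MSELoss()

for X, Y in train_loader:                            # history windows and targets
    optimizer.zero_grad()
    Y_hat = forecast(llm, prompt_ids,
                     reprogram(patch_embed(X), word_emb), out_proj)
    loss = loss_fn(Y_hat, Y)
    loss.backward()                                  # updates only the light modules
    optimizer.step()
```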
We evaluated our method using real-world charging session data from six stations in a Chinese city throughout 2024. The stations represent four typical types: residential (Site A), public bus (Sites B, E), taxi (Sites C, F), and heavy-duty truck swapping (Site D). The data were cleaned, aggregated to a 15-minute resolution (1 hour for the swapping station), and normalized. We compared Time-LLM against four strong benchmarks: LSTM, DLinear, TimesNet, and PatchTST. Evaluation metrics included Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²).
$$ \mathrm{MAE} = \frac{1}{H}\sum_{h=1}^{H}\left|\hat{Y}_h - Y_h\right|, \quad R^2 = 1 - \frac{\sum_{h=1}^{H}(\hat{Y}_h - Y_h)^2}{\sum_{h=1}^{H}(Y_h - \bar{Y})^2} $$
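These metrics can be computed with a small NumPy helper, shown below for completeness.

```python
# Evaluation metrics in minimal NumPy form.
import numpy as np

def evaluate(y_hat: np.ndarray, y: np.ndarray) -> dict:
    mse = np.mean((y_hat - y) ** 2)
    mae = np.mean(np.abs(y_hat - y))
    r2 = 1.0 - np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MSE": float(mse), "MAE": float(mae), "R2": float(r2)}
```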
Day-Ahead Forecasting: Using 3 days of history (288 points) to predict the next 24 hours (96 points), Time-LLM consistently outperformed all benchmarks across all station types. The results, averaged over multiple runs, are summarized below.
| Model | Site A MSE | Site A MAE | Site A R² | Site B MSE | Site B MAE | Site B R² | Site C MSE | Site C MAE | Site C R² | Site D MSE | Site D MAE | Site D R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM | 0.8874 | 0.8102 | 0.4406 | 0.1529 | 0.3014 | 0.8692 | 0.3348 | 0.4665 | 0.6068 | 0.7591 | 0.6916 | -0.1452 |
| DLinear | 0.3159 | 0.4683 | 0.7230 | 0.1277 | 0.2396 | 0.8908 | 0.2747 | 0.3903 | 0.7378 | 0.6615 | 0.6854 | 0.0765 |
| TimesNet | 0.3803 | 0.5081 | 0.6665 | 0.1050 | 0.2116 | 0.9102 | 0.3179 | 0.4336 | 0.6966 | 0.4566 | 0.5505 | 0.3112 |
| PatchTST | 0.5856 | 0.6421 | 0.4865 | 0.1240 | 0.2504 | 0.8939 | 0.2343 | 0.3820 | 0.7763 | 0.5365 | 0.5753 | 0.1906 |
| Time-LLM | 0.2034 | 0.3616 | 0.8216 | 0.1017 | 0.1994 | 0.9131 | 0.2155 | 0.3612 | 0.7943 | 0.4865 | 0.5466 | 0.3628 |
Ultra-Short-Term Forecasting: A rolling forecast strategy was employed: the next 4 hours (16 points) are predicted from the past 24 hours (96 points), the window is rolled forward by 4 hours, and the process repeats until the full day is covered (a minimal sketch of this rolling loop follows the results table). Time-LLM again achieved the best performance, particularly in capturing rapid fluctuations in BEV charging load.
| Model | Site A MSE | Site A MAE | Site A R² | Site B MSE | Site B MAE | Site B R² | Site C MSE | Site C MAE | Site C R² | Site D MSE | Site D MAE | Site D R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM | 0.1703 | 0.3122 | 0.7978 | 0.1554 | 0.2711 | 0.8663 | 0.1962 | 0.3632 | 0.7995 | 0.9899 | 0.8113 | 0.1423 |
| DLinear | 0.1702 | 0.3489 | 0.7980 | 0.1830 | 0.2867 | 0.8541 | 0.2949 | 0.4270 | 0.7223 | 0.9727 | 0.7989 | 0.0082 |
| TimesNet | 0.1281 | 0.1251 | 0.8667 | 0.0803 | 0.1908 | 0.9359 | 0.1895 | 0.3314 | 0.8372 | 1.0387 | 0.7993 | 0.1001 |
| PatchTST | 0.2520 | 0.3945 | 0.1554 | 0.1241 | 0.2306 | 0.9009 | 0.2432 | 0.4053 | 0.7515 | 1.0210 | 0.8126 | 0.1155 |
| Time-LLM | 0.1219 | 0.1110 | 0.8740 | 0.0517 | 0.1564 | 0.9556 | 0.1629 | 0.3180 | 0.8465 | 0.7354 | 0.6206 | 0.3628 |
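The rolling evaluation described above can be sketched as follows: the input window slides forward by the 16-point horizon after each prediction until the full day is covered. The `predict` callable is a placeholder for any of the compared models.

```python
# Rolling ultra-short-term evaluation sketch (T = 96-point history, H = 16-point horizon).
import numpy as np

def rolling_forecast(predict, series: np.ndarray,
                     T: int = 96, H: int = 16, steps: int = 6) -> np.ndarray:
    """`predict` maps a length-T history window to a length-H forecast."""
    outputs = []
    for k in range(steps):                      # 6 x 4 h = 24 h of forecasts
        start = k * H
        history = series[start:start + T]       # slide the window by H each step
        outputs.append(predict(history))        # forecast the next H points
    return np.concatenate(outputs)
```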
Few-Shot Forecasting: To simulate scenarios with limited historical data (e.g., a new charging station), models were trained using only 20% of the training data. We also compared against GPT4TS, another LLM-based model that requires fine-tuning. Time-LLM demonstrated superior generalization, outperforming all benchmarks on periodic stations (Bus, Taxi) and remaining competitive on others.
| Model | Site A MSE | Site A MAE | Site A R² | Site B MSE | Site B MAE | Site B R² | Site C MSE | Site C MAE | Site C R² | Site D MSE | Site D MAE | Site D R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM | 1.2724 | 0.9567 | -0.4125 | 0.5553 | 0.4894 | 0.5317 | 0.6302 | 0.5571 | 0.5212 | 0.9727 | 0.7989 | -0.7944 |
| DLinear | 1.3759 | 0.9994 | -0.6373 | 0.4246 | 0.4564 | 0.6649 | 0.4734 | 0.5955 | 0.4732 | 0.9381 | 0.7908 | -0.5577 |
| TimesNet | 1.4421 | 1.0201 | -0.2964 | 0.3858 | 0.4117 | 0.6955 | 0.3787 | 0.4854 | 0.5844 | 0.8909 | 0.7633 | -0.2297 |
| PatchTST | 1.0614 | 0.9409 | -0.4177 | 0.4687 | 0.4336 | 0.6048 | 0.5786 | 0.6106 | 0.5837 | 1.0298 | 0.7633 | -0.3641 |
| GPT4TS | 0.6938 | 0.6360 | 0.3602 | 0.2829 | 0.3265 | 0.8420 | 0.3833 | 0.5142 | 0.7004 | 0.6151 | 0.5475 | 0.1906 |
| Time-LLM | 0.7523 | 0.6454 | 0.4029 | 0.2598 | 0.3048 | 0.8667 | 0.3531 | 0.4570 | 0.7138 | 0.6920 | 0.6783 | 0.0923 |
Zero-Shot Forecasting: To test generalization, models trained on one station were directly applied to forecast load at a new, unseen station of the same type (Bus: B→E; Taxi: C→F). Time-LLM’s prompt-as-prefix mechanism, which encodes transferable domain knowledge, enabled it to significantly outperform all other models, including GPT4TS.
| Model | Site E MSE | Site E MAE | Site E R² | Site F MSE | Site F MAE | Site F R² |
|---|---|---|---|---|---|---|
| LSTM | 0.4954 | 0.5067 | 0.6264 | 0.6665 | 0.6464 | 0.4614 |
| DLinear | 0.3965 | 0.5001 | 0.6683 | 0.4820 | 0.5898 | 0.6612 |
| TimesNet | 0.5447 | 0.5024 | 0.5443 | 0.4625 | 0.5235 | 0.6080 |
| PatchTST | 0.3550 | 0.4730 | 0.7030 | 0.5087 | 0.5840 | 0.5652 |
| GPT4TS | 0.2202 | 0.3942 | 0.8158 | 0.3689 | 0.4925 | 0.6873 |
| Time-LLM | 0.1554 | 0.2497 | 0.8700 | 0.2963 | 0.4343 | 0.7618 |
An ablation study on the Prompt-as-Prefix component for Site B forecasting confirmed the importance of each element. Removing any part (background, domain knowledge, statistics) degraded performance, and removing the entire prompt caused the most significant drop, highlighting its crucial role in guiding the LLM.
In conclusion, this paper presents a novel and effective framework for predicting the charging and swapping load of BEVs by harnessing the power of large language models. The proposed Time-LLM-based method, through patch reprogramming and knowledge-informed prompting, successfully adapts a frozen LLM to the time-series forecasting domain while training only a minimal set of parameters. Comprehensive experiments on real-world data from diverse station types demonstrate its superior accuracy and robustness in day-ahead, ultra-short-term, few-shot, and zero-shot forecasting tasks compared to state-of-the-art benchmarks. The method's strong generalization capability makes it particularly valuable for practical engineering applications, such as forecasting for newly built charging infrastructure. Future work will focus on automating prompt engineering, incorporating multi-modal data (e.g., weather, traffic), and extending the framework to other power system forecasting problems such as multi-regional load and renewable generation prediction.
