With the increasing adoption of electric vehicles, the operational complexity of EV charging stations has grown significantly. Traditional fault diagnosis methods for EV charging stations often rely on specialized equipment and extensive labeled data, which are costly and impractical for large-scale deployment. In this work, we propose a novel approach leveraging Vision Transformer (ViT) for fault diagnosis in EV charging stations, using only voltage and current time-series data collected from the stations themselves. By converting these signals into time-series images and employing pre-trained ViT models, our method achieves high diagnostic accuracy while reducing dependency on labeled data. This approach supports online diagnosis and is scalable for real-world applications in EV charging station maintenance.

The rapid expansion of EV charging stations worldwide has highlighted the need for efficient and reliable fault diagnosis systems. Early fault detection in EV charging stations is crucial to prevent permanent failures and ensure uninterrupted service. Existing methods, such as those based on convolutional neural networks (CNNs) or recurrent neural networks (RNNs), often require high-frequency data sampling and manual feature extraction, which are not always feasible for EV charging station operations. Moreover, these methods depend heavily on large annotated datasets, which are expensive to acquire. Our method addresses these limitations by transforming low-frequency voltage and current signals from EV charging stations into time-series images and utilizing the powerful feature extraction capabilities of ViT models. This allows for effective fault diagnosis without the need for specialized hardware or extensive data labeling.
In this paper, we first review the theoretical foundations of Transformer models and Vision Transformers, including the self-attention mechanism that enables global context modeling. We then detail our proposed fault diagnosis model, which consists of a feature extraction network and a fault classification network. The feature extraction network employs a pre-trained ViT to process time-series images, while the classification network maps the extracted features to fault categories. We also describe our training strategy, which involves pre-training on large-scale image datasets to transfer cross-domain knowledge and fine-tuning on a smaller labeled dataset specific to EV charging stations. Experimental results on real-world data from EV charging stations demonstrate that our method achieves an average accuracy of 92.2% in binary fault classification and 88% in multi-class fault diagnosis, outperforming several baseline models. The effectiveness of our approach is further validated through comparisons with CNN, RNN, and other deep learning models, highlighting its superiority in handling the complexities of EV charging station fault diagnosis.
Related Work
Fault diagnosis in EV charging stations has been explored using various machine learning techniques. Early methods relied on traditional signal processing and feature engineering, which often required domain expertise and were not scalable. With the advent of deep learning, models like CNNs and RNNs have been applied to time-series data from EV charging stations. For instance, some studies used CNNs to extract spatial features from vibration signals or infrared images, while others employed RNNs to capture temporal dependencies in sensor data. However, these approaches typically demand high-quality, labeled datasets and may not generalize well to diverse operating conditions of EV charging stations. Recent advancements in Transformer-based models, particularly ViT, have shown promise in computer vision tasks by leveraging self-attention mechanisms to model long-range dependencies. In this work, we adapt ViT for fault diagnosis in EV charging stations, exploiting its ability to process image-like representations of time-series data and reduce the need for extensive labeled examples through pre-training.
Theoretical Foundations
The Transformer architecture, originally developed for natural language processing, has been successfully applied to visual tasks through Vision Transformer (ViT). At the core of Transformer models is the self-attention mechanism, which computes interactions between all elements in an input sequence. For an input sequence represented as a matrix \( X \in \mathbb{R}^{n \times d} \), where \( n \) is the sequence length and \( d \) is the feature dimension, self-attention is calculated as follows:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Here, \( Q \), \( K \), and \( V \) are query, key, and value matrices derived from \( X \) through linear transformations: \( Q = XW_q \), \( K = XW_k \), and \( V = XW_v \), with \( W_q, W_k, W_v \in \mathbb{R}^{d \times d_k} \) being learnable weight matrices. The scaling factor \( \sqrt{d_k} \) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into saturated regions and cause vanishing gradients. Multi-head self-attention extends this by performing multiple attention operations in parallel, allowing the model to capture different aspects of the input:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_o $$
where each head is computed as \( \text{head}_i = \text{Attention}(QW_q^i, KW_k^i, VW_v^i) \), and \( W_o \in \mathbb{R}^{h d_v \times d} \) is the output weight matrix. ViT adapts this mechanism for images by splitting an input image into fixed-size patches, flattening them into sequences, and adding positional encodings to retain spatial information. The positional encoding for a patch at position \( pos \) in the sequence is given by:
$$ PE_{(pos, i)} = \begin{cases} \sin\left(\frac{pos}{10000^{i/d_{\text{model}}}}\right) & \text{if } i \text{ is even} \\ \cos\left(\frac{pos}{10000^{(i-1)/d_{\text{model}}}}\right) & \text{if } i \text{ is odd} \end{cases} $$
where \( i \in \{0, 1, \ldots, d_{\text{model}}-1\} \) indexes the dimensions of the patch embedding and \( d_{\text{model}} \) is the embedding dimension. This enables ViT to process images as sequences and apply Transformer encoders for feature extraction, making it suitable for analyzing time-series images from EV charging stations.
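To make the preceding formulas concrete, the following is a minimal PyTorch sketch of scaled dot-product attention and multi-head self-attention. It follows the notation above but is an illustrative implementation rather than the exact code used in our experiments; the class and function names are ours.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # pairwise interaction scores
    weights = F.softmax(scores, dim=-1)           # attention weights per query
    return weights @ V

class MultiHeadSelfAttention(torch.nn.Module):
    """Multi-head self-attention: h parallel heads, concatenated and projected by W_o."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.W_q = torch.nn.Linear(d_model, d_model, bias=False)
        self.W_k = torch.nn.Linear(d_model, d_model, bias=False)
        self.W_v = torch.nn.Linear(d_model, d_model, bias=False)
        self.W_o = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                          # x: (batch, n, d_model)
        b, n, _ = x.shape
        def split(t):                               # (b, n, d) -> (b, heads, n, d_k)
            return t.view(b, n, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        heads = scaled_dot_product_attention(q, k, v)      # (b, heads, n, d_k)
        heads = heads.transpose(1, 2).reshape(b, n, -1)    # concatenate heads
        return self.W_o(heads)
```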
Proposed Method
Our fault diagnosis method for EV charging stations involves converting voltage and current time-series signals into grayscale images and using a ViT-based model for classification. The overall architecture comprises a feature extraction network and a fault classification network, as summarized in the following table:
| Component | Description |
|---|---|
| Feature Extraction Network | Embeds time-series images into sequences, applies positional encoding, and processes them through Transformer encoders with multi-head self-attention. |
| Fault Classification Network | Uses a linear layer and softmax activation to map extracted features to fault classes (e.g., normal charging, fault conditions). |
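Before the feature extraction network is applied, each charging cycle's voltage and current series must be rendered as a grayscale time-series image. The text above does not prescribe a specific mapping, so the sketch below shows one plausible conversion only: min-max normalization, reshaping into a square grid, and resizing to 32×32. The function name, padding scheme, and use of Pillow are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def series_to_grayscale_image(voltage, current, size=32):
    """Hypothetical conversion of one charging cycle's voltage/current series
    into a single grayscale image; the concrete mapping may differ in practice."""
    signal = np.concatenate([np.asarray(voltage, float), np.asarray(current, float)])
    # Min-max normalize to [0, 255] grayscale intensities.
    signal = (signal - signal.min()) / (signal.max() - signal.min() + 1e-12) * 255.0
    # Zero-pad to the next square length, then reshape to a 2-D grid.
    side = int(np.ceil(np.sqrt(signal.size)))
    padded = np.zeros(side * side)
    padded[:signal.size] = signal
    img = Image.fromarray(padded.reshape(side, side).astype(np.uint8), mode="L")
    return img.resize((size, size))               # 32x32 as in our experiments
```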
The feature extraction network begins by dividing an input image \( x \in \mathbb{R}^{h \times w \times c} \) into \( N \) patches of size \( p \times p \), where \( N = \frac{h}{p} \times \frac{w}{p} \). Each patch is flattened into a vector \( x_i \in \mathbb{R}^{p^2 c} \) and linearly projected to an embedding dimension \( d \). A learnable classification ([CLS]) token \( x_{\text{class}} \in \mathbb{R}^d \) is prepended to the patch embeddings, and positional encodings \( P \in \mathbb{R}^{(N+1) \times d} \) are added to form the input sequence \( Z_0 = [x_{\text{class}}; x_1 E; x_2 E; \ldots; x_N E] + P \), where \( E \in \mathbb{R}^{(p^2 c) \times d} \) is the embedding matrix. This sequence is processed by \( L \) Transformer encoder layers, each consisting of multi-head self-attention (MSA) and a feed-forward network (FFN), with layer normalization (LN) and residual connections:
$$ Z'_l = \text{MSA}(\text{LN}(Z_{l-1})) + Z_{l-1} $$
$$ Z_l = \text{FFN}(\text{LN}(Z'_l)) + Z'_l $$
The FFN is a two-layer perceptron with GELU activation: \( \text{FFN}(Z) = \text{GELU}(ZW_1 + b_1)W_2 + b_2 \), where \( W_1 \in \mathbb{R}^{d \times d_{ff}} \), \( W_2 \in \mathbb{R}^{d_{ff} \times d} \), and \( d_{ff} \) is the hidden dimension. The output of the last encoder layer, corresponding to the [CLS] token, is used as the global image representation for fault classification.
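A compact PyTorch sketch of the feature extraction network described above is given below, covering patch embedding, the [CLS] token, the sinusoidal positional encoding of the previous section, and encoder blocks with layer normalization and residual connections. Default hyperparameters mirror the experimental settings reported later; the implementation details (e.g., the use of `nn.MultiheadAttention` and a 4× FFN expansion) are illustrative rather than the exact training code.

```python
import torch
import torch.nn as nn

def sinusoidal_positions(n_pos, d_model):
    """PE_(pos, i): sin for even i, cos for odd i, as in the formula above."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)    # (n_pos, 1)
    i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)    # (1, d_model)
    angle = pos / torch.pow(10000.0, (2 * (i // 2)) / d_model)
    return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))

class ViTFeatureExtractor(nn.Module):
    """Sketch of the feature extraction network: patch embedding, [CLS] token,
    positional encoding, and L Transformer encoder layers with residuals."""
    def __init__(self, img_size=32, patch_size=4, in_chans=1,
                 d_model=192, depth=12, num_heads=12, d_ff=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        self.embed = nn.Linear(patch_size * patch_size * in_chans, d_model)  # E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))            # [CLS]
        self.register_buffer(
            "pos", sinusoidal_positions(n_patches + 1, d_model).unsqueeze(0))  # P
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "ln1": nn.LayerNorm(d_model),
                "msa": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "ln2": nn.LayerNorm(d_model),
                "ffn": nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                     nn.Linear(d_ff, d_model)),
            }) for _ in range(depth)
        ])

    def forward(self, x):                       # x: (batch, channels, h, w)
        b, c, h, w = x.shape
        p = self.patch_size
        # Split the image into p x p patches and flatten them: (b, N, p*p*c).
        patches = (x.unfold(2, p, p).unfold(3, p, p)
                     .permute(0, 2, 3, 1, 4, 5).reshape(b, -1, p * p * c))
        z = self.embed(patches)
        z = torch.cat([self.cls_token.expand(b, -1, -1), z], dim=1) + self.pos
        for blk in self.layers:
            y = blk["ln1"](z)
            z = blk["msa"](y, y, y, need_weights=False)[0] + z   # Z'_l
            z = blk["ffn"](blk["ln2"](z)) + z                    # Z_l
        return z[:, 0]                           # [CLS] representation z_L^0
```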
For fault classification, the feature vector \( z_L^0 \) (from the [CLS] token) is passed through a linear layer and softmax to produce class probabilities:
$$ y = \text{softmax}(z_L^0 W_c + b_c) $$
where \( W_c \in \mathbb{R}^{d \times C} \) and \( b_c \in \mathbb{R}^C \) are learnable parameters, and \( C \) is the number of fault classes (e.g., 4 for multi-class diagnosis). The model is trained using cross-entropy loss:
$$ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log(\hat{y}_{i,c}) $$
where \( y_{i,c} \) is the true label and \( \hat{y}_{i,c} \) is the predicted probability for sample \( i \) and class \( c \). To address data scarcity, we pre-train the ViT model on ImageNet-21K using a masked autoencoding objective, where random patches are masked and the model is trained to reconstruct them. This pre-training allows the model to learn general visual features, which are fine-tuned on the smaller EV charging station dataset.
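A minimal sketch of the fault classification network and the fine-tuning stage is shown below. `FaultClassifier` and `fine_tune` are illustrative names, the backbone is assumed to expose the [CLS] feature as in the earlier sketch, and the masked-autoencoding pre-training stage itself is not reproduced here.

```python
import torch
import torch.nn as nn

class FaultClassifier(nn.Module):
    """Fault classification network: linear layer over C fault classes."""
    def __init__(self, feature_extractor, d_model=192, num_classes=4):
        super().__init__()
        self.backbone = feature_extractor            # e.g., a pre-trained ViT encoder
        self.head = nn.Linear(d_model, num_classes)  # W_c, b_c

    def forward(self, x):
        return self.head(self.backbone(x))           # logits; softmax applied in the loss

def fine_tune(model, loader, epochs=200, lr=1e-4, device="cpu"):
    """Minimal fine-tuning loop with the cross-entropy loss given above."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()                # combines log-softmax and NLL
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```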
Experimental Setup
We evaluated our method on real-world data collected from multiple EV charging stations over six months. The dataset consists of voltage and current measurements sampled at 10-minute intervals, totaling 73,711 data points. These were segmented into 527 time-series images, each representing a charging cycle. The images were resized to 32×32 pixels to balance detail and computational efficiency. The fault classes include constant current-constant voltage (CC-CV) normal charging, CC-CV charging fault, multi-stage constant current (MSCC) normal charging, and MSCC charging fault. The data was split into training (70%), validation (10%), and test (20%) sets. Pre-training was performed on ImageNet-21K, and the ViT model was fine-tuned with the following hyperparameters, determined through grid search:
| Hyperparameter | Value |
|---|---|
| Input image size | 32×32 |
| Patch size | 4×4 |
| Embedding dimension | 192 |
| Number of heads | 12 |
| Number of layers | 12 |
| Learning rate | 0.0001 |
| Batch size | 32 |
| Epochs | 200 |
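For reference, these hyperparameters map onto the earlier sketches as follows, assuming the illustrative `ViTFeatureExtractor`, `FaultClassifier`, and `fine_tune` definitions above are in scope:

```python
# Assemble the sketched model with the tuned hyperparameters from the table.
backbone = ViTFeatureExtractor(img_size=32, patch_size=4, in_chans=1,
                               d_model=192, depth=12, num_heads=12)
model = FaultClassifier(backbone, d_model=192, num_classes=4)

# Fine-tune for 200 epochs at learning rate 1e-4 with batch size 32;
# `train_loader` is assumed to yield (image, label) batches from the station dataset.
# model = fine_tune(model, train_loader, epochs=200, lr=1e-4)
```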
We compared our pre-trained ViT model against several baselines: CNN, RNN, CNN-SVM, Attention-LeNet5, and ViT without pre-training. The CNN model used convolutional layers for feature extraction, while the RNN processed raw time-series data with LSTM units. CNN-SVM combined CNN features with a support vector machine classifier, and Attention-LeNet5 integrated attention mechanisms into a LeNet-5 architecture. All models were trained and tested on the same EV charging station dataset to ensure fair comparison.
Results and Discussion
Our pre-trained ViT model achieved an average accuracy of 92.2% in binary fault diagnosis (fault vs. no-fault) and 88% in multi-class fault diagnosis across the four categories. The confusion matrix for multi-class diagnosis is shown below, normalized per actual class (each column sums to 1), where S1 to S4 represent CC-CV normal, CC-CV fault, MSCC normal, and MSCC fault, respectively:
| Predicted/Actual | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| S1 | 0.87 | 0.00 | 0.19 | 0.00 |
| S2 | 0.00 | 0.93 | 0.00 | 0.00 |
| S3 | 0.00 | 0.00 | 0.81 | 0.00 |
| S4 | 0.13 | 0.07 | 0.00 | 1.00 |
The results indicate high recall and precision for most classes, with MSCC fault detection achieving perfect recall. The training curves demonstrated that our pre-trained ViT converged faster and with lower loss compared to other models, as illustrated in the loss and accuracy plots over 200 epochs. The superiority of our method is attributed to the self-attention mechanism in ViT, which captures global dependencies in the time-series images, and the pre-training strategy, which reduces overfitting and enhances feature representation for EV charging station data.
We further analyzed the performance metrics for multi-class diagnosis, as shown in the table below:
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| CNN | 0.82 | 0.75 | 0.78 |
| RNN | 0.85 | 0.72 | 0.78 |
| CNN-SVM | 0.84 | 0.80 | 0.82 |
| Attention-LeNet5 | 0.87 | 0.81 | 0.78 |
| ViT (no pre-training) | 0.83 | 0.71 | 0.76 |
| Pre-trained ViT (Ours) | 0.88 | 0.86 | 0.87 |
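For completeness, the macro-averaged metrics in the table and the normalized confusion matrix can be computed with scikit-learn; the sketch below uses small placeholder label arrays in place of the real test-set predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Placeholder labels standing in for real test-set outputs (0..3 correspond to S1..S4).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 2, 1, 1, 2, 0, 3, 3])

# Macro-averaged precision, recall, and F1, as reported in the table above.
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")

# Confusion matrix with each row normalized over the actual class, so the
# diagonal gives per-class recall (the table above presents the transpose).
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(precision, recall, f1)
print(cm)
```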
Our pre-trained ViT model outperformed all baselines in precision, recall, and F1 score, confirming its effectiveness for fault diagnosis in EV charging stations. The ablation study on pre-training showed that without it, ViT’s performance dropped significantly, highlighting the importance of transfer learning for small datasets. Additionally, we tested the model’s robustness to noise by adding Gaussian noise to the time-series images, and it maintained an accuracy above 85%, demonstrating its suitability for real-world EV charging station environments.
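The noise-robustness check can be reproduced by perturbing the test images before inference, as in the sketch below; the noise level is an illustrative assumption, since the exact standard deviation used is not specified above.

```python
import torch

def add_gaussian_noise(images, sigma=0.05):
    """Corrupt time-series images with zero-mean Gaussian noise for a robustness check.
    sigma is an illustrative choice; accuracy above 85% is reported under the noise
    levels tested in our experiments."""
    noisy = images + sigma * torch.randn_like(images)
    return noisy.clamp(0.0, 1.0)    # keep pixel intensities in the valid range
```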
Conclusion
In this paper, we presented a Vision Transformer-based fault diagnosis method for EV charging stations that utilizes voltage and current time-series images. By leveraging pre-training and the self-attention mechanism, our model achieves high accuracy without relying on specialized equipment or extensive labeled data. Experimental results on real-world EV charging station data validate the method’s superiority over existing approaches, with an average accuracy of 92.2% in binary classification and 88% in multi-class diagnosis. Future work will focus on extending this method to other types of EV charging station faults and integrating it with real-time monitoring systems for proactive maintenance. The proposed approach offers a scalable and efficient solution for enhancing the reliability and operational efficiency of EV charging stations.
