Visual 6D Pose Estimation of Battery Package Locking Mechanism in Intelligent Swapping Stations

Faced with the growing need for efficient energy replenishment for electric vehicles (EVs), battery swapping has emerged as a compelling alternative to conventional charging, offering the significant advantage of completing the process within minutes. The core of a successful automated battery swap lies in the precise alignment and docking between the robotic swapping mechanism and the EV battery pack secured underneath the vehicle chassis. Accurate positioning of the battery’s locking mechanism is paramount for this operation. While mechanically guided alignment exists, vision-based systems provide superior flexibility, compatibility across different vehicle models, and the ability to handle unexpected positional variances. This article details our research into a robust 3D vision-based method for estimating the full six-degree-of-freedom (6D) pose of an EV battery pack locking mechanism.

The development of reliable 3D vision positioning systems is crucial for the scalability of battery swapping infrastructure. Our goal was to create a system capable of functioning reliably in real-world conditions, where factors like sensor noise, dust, and mud adhesion on the EV battery pack can severely degrade the performance of conventional algorithms. The locking mechanism itself is a relatively small component, typically around 40 mm x 40 mm x 30 mm, making its precise localization amidst noise a non-trivial challenge. The image below illustrates the context of an EV battery pack within a vehicle’s undercarriage.

Principles of Locking Mechanism Pose Estimation for EV Battery Pack

The automated battery swapping process involves a robot navigating beneath a stationed vehicle to locate, unlock, and exchange the EV battery pack. The robot’s end-effector, equipped with a locking/unlocking tool, must align perfectly with the locking studs on the battery pack. Visual servoing provides the guiding coordinates. In our system, a high-precision binocular structured-light 3D camera is fixed in an eye-to-hand configuration on the swapping platform, pointing upwards towards the vehicle’s underbody where the EV battery pack is located.

The camera is pre-calibrated to obtain its intrinsic matrix $$K$$, the alignment function between color and depth images, and its extrinsic pose $$T_c$$ relative to the swapping station’s world coordinate frame. During operation, the camera captures an RGB-D image of the scene, which includes parts of the vehicle chassis, the station structure, and the target locking mechanism on the EV battery pack. The depth map is then projected into a 3D point cloud $$P_{scene}$$ using the intrinsic parameters:

$$
\begin{aligned}
x &= \frac{u - c_x}{f_x} z, \\
y &= \frac{v - c_y}{f_y} z, \\
z &= d,
\end{aligned}
$$

where $$(u, v)$$ are pixel coordinates, $$d$$ is the depth value, and $$(x, y, z)$$ are the resulting 3D coordinates of a point.
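This back-projection can be vectorized over the whole depth map; the following NumPy sketch illustrates it (the toy depth values and intrinsics are made up for the example):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into an N x 3 point cloud using the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Toy example: a 2 x 2 depth map with illustrative intrinsics
depth = np.array([[1.0, 1.0],
                  [0.0, 2.0]])
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=1.0, cy=1.0)
```

The pixel with zero depth is discarded, so three valid 3D points remain.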

The core task of our algorithm is to process this scene point cloud to estimate the pose $$T_r$$ of the locking mechanism relative to the camera. The final absolute pose $$T_o$$ of the mechanism in the station’s world frame is then computed as:

$$
T_o = T_c \cdot T_r.
$$

This transformation $$T_o$$ is sent to the robotic controller to guide the precise docking operation. The primary challenge is to estimate $$T_r$$ accurately and robustly despite the presence of sensor noise (approx. ±1–3 mm) and surface contamination on the EV battery pack (approx. ±2–4 mm), which can corrupt the local geometry crucial for matching.
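In homogeneous coordinates, the composition $$T_o = T_c \cdot T_r$$ is a single 4x4 matrix product. A minimal sketch (the numeric extrinsics are purely illustrative):

```python
import numpy as np

def make_pose(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative extrinsics: camera rotated 90 degrees about z, offset in the station frame
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T_c = make_pose(Rz, [0.1, 0.0, 0.5])           # station <- camera
T_r = make_pose(np.eye(3), [0.0, 0.2, 0.4])    # camera  <- lock (estimated pose)
T_o = T_c @ T_r                                # station <- lock (absolute pose)
```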

Methodology: A Segmentation-Enhanced 6D Pose Estimation Pipeline

To address the stability and accuracy requirements, we developed a novel 6D pose estimation pipeline that synergistically combines deep learning-based semantic understanding with traditional feature-based point cloud registration. Our method, termed Deep Semantic-Assisted Sample Consensus Initial Alignment (D-SAC-IA), consists of four main stages: 2D object detection and segmentation, point cloud preprocessing, 3D semantic part segmentation, and enhanced feature-based registration.

Stage 1: RGB Image-Based Target Detection and Segmentation

The initial step isolates the region of interest from the cluttered background. We employ a YOLOv5 instance segmentation network, chosen for its strong performance on small objects and fast inference speed. The network takes the RGB image and outputs both a bounding box and a segmentation mask for the locking mechanism. The mask provides a per-pixel probability $$p$$ of belonging to the target; a binary mask $$M$$ is formed from the pixels with $$p > 0.5$$. This mask is used to filter the corresponding depth pixels, which are then projected to form the initial target point cloud $$P_0$$, containing only points belonging to the EV battery pack's locking mechanism.
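The mask-based depth filtering can be sketched as follows, assuming `prob` holds the network's per-pixel mask probabilities and `depth` the aligned depth map (both array names and the toy values are illustrative):

```python
import numpy as np

def masked_target_pixels(prob, depth, thresh=0.5):
    """Keep only depth pixels whose mask probability exceeds the threshold."""
    M = prob > thresh                    # binary mask of the locking mechanism
    v, u = np.nonzero(M & (depth > 0))   # valid masked pixel coordinates
    return u, v, depth[v, u]

# Toy 2 x 3 example: one masked pixel has invalid (zero) depth and is dropped
prob  = np.array([[0.9, 0.2, 0.7],
                  [0.1, 0.8, 0.6]])
depth = np.array([[1.0, 1.0, 0.0],
                  [1.0, 2.0, 3.0]])
u, v, d = masked_target_pixels(prob, depth)
```

The surviving `(u, v, d)` triples are then fed to the projection equations above to form $$P_0$$.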

Stage 2: Point Cloud Preprocessing for the EV Battery Pack Component

The raw point cloud $$P_0$$ is dense and noisy. We apply a voxel grid filter with a leaf size of 2.0 mm to downsample the cloud, replacing all points within a voxel by their centroid. This yields a sparser point cloud $$P_1$$ that retains the structural shape while reducing the computational burden. We then apply Moving Least Squares (MLS) surface smoothing to suppress high-frequency noise along the surface normal direction, which is critical for reliable local feature computation. MLS fits a local polynomial surface to the neighborhood of each point by weighted least squares, with a weight function $$w(||x - x_i||)$$:

$$
f(x) = \mathbf{p}^T(x) \mathbf{a}(x),
$$

where $$\mathbf{p}(x) = [1, x, y]^T$$ is the basis function and $$\mathbf{a}(x)$$ are the coefficients minimizing:

$$
J = \sum_{i=1}^{m} w(||x - x_i||) ||\mathbf{p}^T(x_i) \mathbf{a}(x) - y_i||^2.
$$

The smoothed point cloud $$P_2$$ exhibits a cleaner, more regularized surface, as shown in the following comparison of processing stages.

| Processing Stage | Key Operation | Primary Purpose |
| --- | --- | --- |
| Raw Point Cloud (P₀) | Projection from masked depth | Initial data acquisition |
| After Voxel Filtering (P₁) | Downsampling via voxel centroid | Reduce density, preserve structure |
| After MLS Smoothing (P₂) | Local polynomial surface fitting | Suppress noise, smooth surface |
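Both preprocessing steps have standard library implementations (e.g. PCL's VoxelGrid and MovingLeastSquares filters). As an illustration, the voxel-centroid downsampling can be sketched in pure NumPy:

```python
import numpy as np

def voxel_downsample(points, leaf=2.0):
    """Replace all points in each leaf x leaf x leaf voxel by their centroid."""
    keys = np.floor(points / leaf).astype(np.int64)   # integer voxel index per point
    # Group points by voxel key and average each group
    _, inv, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inv, points)
    return sums / counts[:, None]

# Toy cloud in mm: first two points share a voxel at leaf = 2 mm
pts = np.array([[0.1, 0.1, 0.1],
                [0.3, 0.5, 0.7],
                [2.5, 0.0, 0.0]])
down = voxel_downsample(pts, leaf=2.0)
```

Three input points collapse to two: one centroid per occupied voxel.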

Stage 3: 3D Semantic Part Segmentation of the Locking Mechanism

Traditional feature descriptors like the Fast Point Feature Histogram (FPFH) rely solely on local geometric properties, making them susceptible to noise and ambiguous matches on symmetric or similar-looking parts of the EV battery pack lock. To inject higher-level, more robust information, we introduce a 3D semantic segmentation step. We designed a deep neural network based on an encoder-decoder architecture (similar to PointNet++) to assign one of four semantic part labels to each point in $$P_2$$: Lock Top (Label 1), Lock Middle (Label 2), Lock Bottom (Label 3), and Lock Support Surface (Label 4).

The network takes a normalized, fixed-size subset of $$P_2$$ (4,096 points) as input. The encoder progressively down-samples the point cloud and aggregates features using PointNet layers, capturing contextual information at multiple scales. The decoder then up-samples the feature maps and propagates the semantic labels back to the original points. The network was pre-trained on a general dataset (ShapeNet) and fine-tuned on a custom synthetic dataset generated from the CAD model of the locking mechanism, augmented with random noise, rotations, and occlusions to simulate real-world conditions. This network achieves a segmentation accuracy of over 95% and an IoU of 90%, providing reliable semantic labels $$L(p) \in \{1,2,3,4\}$$ for each point $$p$$.
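The fixed-size, normalized input that the network expects can be produced by random subsampling and unit-sphere normalization; the sketch below assumes that convention (function and array names are illustrative):

```python
import numpy as np

def prepare_input(points, n=4096, seed=0):
    """Sample exactly n points (with replacement if the cloud is smaller)
    and normalize them into the unit sphere centered at the origin."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    sample = points[idx]
    sample = sample - sample.mean(axis=0)                    # center at origin
    sample = sample / np.linalg.norm(sample, axis=1).max()   # scale to unit sphere
    return sample

# Stand-in for the smoothed cloud P2 (coordinates in mm)
cloud = np.random.default_rng(1).uniform(-20, 20, size=(10000, 3))
x = prepare_input(cloud)
```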

Stage 4: Enhanced Feature-Based Point Cloud Registration

With the semantically labeled target point cloud $$P_s$$ (derived from $$P_2$$) and a template point cloud $$Q_m$$ sampled from the pristine CAD model of the locking mechanism, we perform registration to find the transformation $$T_{sm}$$ aligning $$Q_m$$ to $$P_s$$. The key innovation is augmenting the FPFH descriptor with the semantic label.

First, the standard FPFH is computed for both clouds. For a point $$p$$, its Simplified Point Feature Histogram (SPFH) is calculated in its local neighborhood, defined by the angles $$(\alpha, \phi, \theta)$$ between the point’s normal and its neighbors. The FPFH is a weighted sum of the SPFH of $$p$$ and the SPFH of its neighbors:

$$
\text{FPFH}(p) = \text{SPFH}(p) + \frac{1}{k} \sum_{i=1}^{k} \frac{1}{w_i} \cdot \text{SPFH}(p_i),
$$

where $$w_i$$ is the distance to neighbor $$p_i$$. This results in a 33-dimensional histogram descriptor. We then create an enhanced descriptor $$\text{FPFH}^*$$ by concatenating the semantic label encoded as a one-hot vector, scaled by a hyperparameter $$\mu$$:

$$
\text{FPFH}^*(p) = \langle \text{FPFH}(p), \mu \cdot \text{OneHot}(L(p)) \rangle.
$$

This augmented descriptor ensures that point correspondence search during matching considers not only local geometric similarity but also global semantic consistency. Correspondence pairs $$(p_s, q_m)$$ are found by a nearest-neighbor search in the feature space using a KD-tree, followed by reciprocity and tuple tests for filtering.
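The descriptor augmentation and the reciprocity test can be sketched as follows; the toy descriptors are 2-D rather than 33-D, and the scale $$\mu = 5$$ is an illustrative choice:

```python
import numpy as np
from scipy.spatial import cKDTree

def augment(fpfh, labels, mu=5.0, n_classes=4):
    """Concatenate a scaled one-hot semantic label onto each FPFH descriptor."""
    onehot = np.eye(n_classes)[labels - 1]   # labels are 1..n_classes
    return np.hstack([fpfh, mu * onehot])

def reciprocal_matches(desc_s, desc_m):
    """Keep only correspondences that are mutual nearest neighbors in feature space."""
    nn_sm = cKDTree(desc_m).query(desc_s)[1]   # scene -> model
    nn_ms = cKDTree(desc_s).query(desc_m)[1]   # model -> scene
    return [(i, j) for i, j in enumerate(nn_sm) if nn_ms[j] == i]

# Two geometrically identical points, disambiguated only by their semantic labels
fpfh_s = np.array([[1.0, 0.0], [1.0, 0.0]])
fpfh_m = np.array([[1.0, 0.0], [1.0, 0.0]])
ds = augment(fpfh_s, np.array([1, 2]))
dm = augment(fpfh_m, np.array([2, 1]))
pairs = reciprocal_matches(ds, dm)
```

Purely geometric matching could pair either point with either; the semantic channel forces scene point 0 (label 1) to match model point 1 (label 1), and vice versa.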

The initial coarse transformation $$T_{sm}^{coarse}$$ is estimated using a RANSAC (Random Sample Consensus) scheme on these correspondences:

  1. Randomly select 3 correspondence pairs.
  2. Compute a candidate transformation $$T$$ using Singular Value Decomposition (SVD).
  3. Evaluate the candidate using the Huber loss: $$ l = \sum_{i=1}^{n} h(||p_i - T q_i||) $$, where $$h$$ is the Huber penalty function.
  4. Repeat for a fixed number of iterations (e.g., 100) or until the loss falls below a threshold $$\xi$$.
  5. Select the transformation with the minimal loss as $$T_{sm}^{coarse}$$.

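The core of steps 2 and 3 can be sketched as an SVD-based rigid fit (the Kabsch method) plus a Huber evaluation; the sampled correspondences below are synthetic:

```python
import numpy as np

def rigid_transform_svd(P, Q):
    """Best rigid transform (R, t) mapping points Q onto P, via SVD (Kabsch)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - cq).T @ (P - cp)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cp - R @ cq

def huber_loss(residuals, delta=1.0):
    """Huber penalty: quadratic near zero, linear in the tails."""
    r = np.abs(residuals)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)).sum()

# Recover a known pose from 3 sampled correspondences
rng = np.random.default_rng(0)
Q = rng.uniform(-1, 1, size=(3, 3))
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
t_true = np.array([0.5, -0.2, 0.3])
P = Q @ R_true.T + t_true
R, t = rigid_transform_svd(P, Q)
loss = huber_loss(np.linalg.norm(P - (Q @ R.T + t), axis=1))
```

With exact (outlier-free) correspondences the fit is exact; in the RANSAC loop, the candidate with the minimal Huber loss over all correspondences is retained.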
Finally, the Iterative Closest Point (ICP) algorithm is applied, using $$T_{sm}^{coarse}$$ as the initial guess, to refine the alignment and obtain the final precise pose estimate $$T_{sm}$$. The complete pipeline’s effectiveness is validated through the following experiments.

Experimental Results and Analysis

We conducted extensive experiments to validate our proposed D-SAC-IA method. A SURFACE HD50 binocular structured-light camera was used to capture data. For controlled experiments, a 3D-printed model of the EV battery pack locking mechanism was used, placed at varying distances (0.4–0.7 m) and orientations relative to the camera. For real-world validation, tests were performed on an actual vehicle’s EV battery pack.

Comparative Performance Analysis

We compared D-SAC-IA against several classic and state-of-the-art registration algorithms: ICP (Iterative Closest Point), NDT (Normal Distributions Transform), FGR (Fast Global Registration), and the standard SAC-IA (without semantic augmentation). The evaluation metrics were angular error $$\Delta \theta$$, translation error $$\Delta t$$, and point-to-point Root Mean Square Error (RMSE). Ground truth was obtained via meticulous manual registration. The results, averaged over 10 trials, are summarized below.

| Algorithm | Avg. Angular Error $$\Delta \theta$$ (°) | Avg. Translation Error $$\Delta t$$ (mm) | Avg. RMSE (mm) |
| --- | --- | --- | --- |
| ICP | 40.23 | 2.8 | 5.5 |
| NDT | 45.54 | 2.5 | 6.1 |
| FGR | 16.59 | 2.1 | 2.2 |
| SAC-IA (Baseline) | 5.67 | 1.5 | 1.8 |
| D-SAC-IA (Ours) | 2.86 | 1.4 | 1.6 |

The data clearly shows our method outperforming all others in terms of both angular and translational accuracy on the nominal EV battery pack lock point cloud.
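The metrics used throughout these experiments can be computed directly from an estimated and a ground-truth 4x4 pose; a sketch (the 10°/5 mm test pose is illustrative):

```python
import numpy as np

def pose_errors(T_est, T_gt, model_pts):
    """Angular error (deg), translation error, and point-to-point RMSE between two poses."""
    dR = T_gt[:3, :3].T @ T_est[:3, :3]                       # relative rotation
    cos_a = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
    ang = np.degrees(np.arccos(cos_a))
    trans = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    pe = model_pts @ T_est[:3, :3].T + T_est[:3, 3]           # model under each pose
    pg = model_pts @ T_gt[:3, :3].T + T_gt[:3, 3]
    rmse = np.sqrt(np.mean(np.sum((pe - pg) ** 2, axis=1)))
    return ang, trans, rmse

# Ground truth = identity; estimate rotated 10 degrees about z, shifted by (3, 4, 0) mm
a = np.radians(10.0)
T_est = np.eye(4)
T_est[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a),  np.cos(a), 0.0],
                 [0.0,        0.0,       1.0]]
T_est[:3, 3] = [3.0, 4.0, 0.0]
pts = np.random.default_rng(0).uniform(-20, 20, size=(500, 3))
ang, trans, rmse = pose_errors(T_est, np.eye(4), pts)
```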

Robustness Test Under Severe Noise

To simulate harsh conditions like heavy mud adhesion on the EV battery pack, we added significant Gaussian noise ($$\sigma = 4$$ mm) and surface morphology noise (±2 mm) to the test point clouds. The performance comparison under this noisy scenario is critical.

| Algorithm | Avg. Angular Error $$\Delta \theta$$ (°) | Avg. Translation Error $$\Delta t$$ (mm) | Avg. RMSE (mm) |
| --- | --- | --- | --- |
| ICP | 49.24 | 4.6 | 6.2 |
| NDT | 44.85 | 5.5 | 5.9 |
| FGR | 25.50 | 4.7 | 3.1 |
| SAC-IA (Baseline) | 10.67 | 2.5 | 2.3 |
| D-SAC-IA (Ours) | 5.51 | 1.9 | 1.8 |

Under severe noise, the advantage of our semantic-augmented approach becomes even more pronounced. While all algorithms degrade, D-SAC-IA maintains significantly higher accuracy, demonstrating the stabilizing effect of the global semantic features extracted from the EV battery pack lock structure.

Full Pipeline Evaluation on 3D-Printed and Real EV Battery Packs

We evaluated the complete pipeline (including ICP refinement) on the 3D-printed model at 10 distinct poses, with 5 trials per pose. The results confirm the system’s consistency.

| Pose Set | Mean Angular Error $$\Delta \theta$$ (°) | Mean Translation Error $$\Delta t$$ (mm) | Mean RMSE (mm) |
| --- | --- | --- | --- |
| 1-10 (Average) | 1.30 | 1.2 | 1.3 |

Finally, the most important validation was performed on a real vehicle’s EV battery pack. Images were captured from five different viewing angles, with five estimations per angle.

| View Angle Set | Mean Angular Error $$\Delta \theta$$ (°) | Mean Translation Error $$\Delta t$$ (mm) | Mean RMSE (mm) |
| --- | --- | --- | --- |
| 1-5 (Real Battery Pack) | 1.90 | 1.4 | 1.5 |

Conclusion

This work presented a novel and robust vision-based pipeline for the 6D pose estimation of the locking mechanism on an EV battery pack, a critical enabler for automated battery swapping services. The core contribution is the D-SAC-IA method, which intelligently fuses deep learning-based 3D semantic part segmentation with traditional local feature matching. By augmenting the FPFH descriptor with semantic labels, we effectively mitigate the ambiguities and noise sensitivity inherent in purely geometric matching, leading to more accurate and stable initial alignment. The subsequent ICP refinement further hones the precision. Our comprehensive experiments demonstrate that the proposed method achieves an average pose estimation accuracy of 1.90° in angle and 1.4 mm in translation on real EV battery pack components, with an RMSE of 1.5 mm. This performance not only surpasses conventional registration algorithms but, crucially, meets the stringent precision requirement (typically ≤3 mm alignment error) for reliable robotic docking in battery swapping stations. The method provides an effective and practical solution for the vision-guided positioning challenge in the rapidly evolving domain of electric vehicle energy replenishment.
