We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron from DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM's superior performance in dynamic environments compared to state-of-the-art methods.
WildGS-SLAM takes a sequence of RGB images as input and simultaneously estimates the camera poses while building a 3D Gaussian map of the static scene. Our method is more robust to dynamic environments thanks to the uncertainty estimation module, where a pretrained DINOv2 model is first used to extract image features. An uncertainty MLP then utilizes the extracted features to predict per-pixel uncertainty. During tracking, we leverage the predicted uncertainty as the weight in the dense bundle adjustment (DBA) layer to mitigate the impact of dynamic distractors. We further use monocular metric depth to facilitate pose estimation. In the mapping module, the predicted uncertainty is incorporated into the rendering loss to update the 3D Gaussian map. Moreover, the uncertainty loss is computed in parallel to train the uncertainty MLP. Note that the uncertainty MLP and the 3D Gaussian map are optimized independently, as illustrated by the gradient flow shown with gray dashed lines.
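To make the uncertainty weighting concrete, the following is a minimal sketch (not the authors' code) of how a shallow MLP could map DINOv2 features to a per-pixel uncertainty map and how that map could down-weight residuals in the DBA or rendering loss. The class name `UncertaintyMLP`, the layer sizes, the `1/beta^2` weighting, and the log regularizer are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class UncertaintyMLP(nn.Module):
    """Shallow MLP mapping per-pixel DINOv2 features to an uncertainty value (assumed architecture)."""

    def __init__(self, feat_dim: int = 384, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus(),  # keep uncertainty beta > 0
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (H, W, C) upsampled DINOv2 features -> (H, W) uncertainty map
        return self.net(feats).squeeze(-1) + 1e-3


def uncertainty_weighted(residual: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Down-weight residuals (reprojection or rendering) in high-uncertainty regions.

    Large beta (likely a dynamic pixel) -> small weight, so moving objects
    contribute little to the DBA layer or the Gaussian map optimization.
    """
    return residual ** 2 / (beta ** 2)


if __name__ == "__main__":
    H, W, C = 60, 80, 384
    feats = torch.randn(H, W, C)      # stand-in for upsampled DINOv2 features
    rendered = torch.rand(H, W, 3)    # stand-in for a rendered RGB frame
    observed = torch.rand(H, W, 3)    # stand-in for the input frame

    mlp = UncertaintyMLP(feat_dim=C)
    beta = mlp(feats)                 # (H, W) per-pixel uncertainty

    photo_residual = (rendered - observed).abs().mean(dim=-1)       # (H, W)
    mapping_loss = uncertainty_weighted(photo_residual, beta).mean()

    # A log(beta) term (as in heteroscedastic uncertainty formulations) keeps
    # beta from growing unboundedly; whether this matches the paper's exact
    # uncertainty loss is an assumption of this sketch.
    uncertainty_loss = mapping_loss + torch.log(beta).mean()
    print(float(mapping_loss), float(uncertainty_loss))
```

In the actual system, gradients from the map optimization and from the uncertainty loss are kept separate (the two modules are optimized independently), whereas this toy example simply shows the weighting itself.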
To further assess performance in unconstrained, real-world settings, we introduce the Wild-SLAM Dataset, comprising two subsets: Wild-SLAM MoCap and Wild-SLAM iPhone.
The Wild-SLAM MoCap Dataset includes 10 RGB-D sequences featuring various moving objects as distractors, specifically designed for dynamic SLAM benchmarking. The sequences are recorded with an Intel RealSense D455 camera with a fixed exposure time. Although WildGS-SLAM works with only monocular inputs, depth images are also included in the dataset to support the evaluation of RGB-D baselines and future research. The recording room is equipped with an OptiTrack motion capture system, which provides ground-truth camera trajectories.
To assess performance in even less constrained, real-world scenarios, we collected the Wild-SLAM iPhone Dataset, which comprises 8 non-staged RGB sequences recorded with an iPhone 14 Pro. These sequences cover 4 outdoor and 4 indoor scenes, capturing a variety of daily-life activities such as strolling along streets, shopping, navigating a parking garage, and exploring an art museum. Since ground-truth trajectories are not available for this dataset, it is used solely for qualitative experiments.
BibTeX
@inproceedings{Zheng2025WildGS,
author = {Zheng, Jianhao and Zhu, Zihan and Bieri, Valentin and Pollefeys, Marc and Peng, Songyou and Armeni, Iro},
title = {WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025}
}