We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron from DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM's superior performance in dynamic environments compared to state-of-the-art methods.
WildGS-SLAM takes a sequence of RGB images as input and simultaneously estimates the camera poses while building a 3D Gaussian map of the static scene. Our method is more robust to dynamic environments thanks to the uncertainty estimation module, where a pretrained DINOv2 model is first used to extract image features. An uncertainty MLP then utilizes the extracted features to predict per-pixel uncertainty. During tracking, we leverage the predicted uncertainty as the weight in the dense bundle adjustment (DBA) layer to mitigate the impact of dynamic distractors. We further use monocular metric depth to facilitate pose estimation. In the mapping module, the predicted uncertainty is incorporated into the rendering loss to update the 3D Gaussian map. Moreover, the uncertainty loss is computed in parallel to train the uncertainty MLP. Note that the uncertainty MLP and the 3D Gaussian map are optimized independently, as illustrated by the gradient flow shown with gray dashed lines.
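To make the uncertainty weighting concrete, the following is a minimal sketch (not the authors' code) of how a shallow MLP could map DINOv2 features to a per-pixel uncertainty map and how that map could down-weight residuals in the DBA or rendering loss. The class name `UncertaintyMLP`, the layer sizes, the `1/beta^2` weighting, and the log regularizer are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class UncertaintyMLP(nn.Module):
    """Shallow MLP mapping per-pixel DINOv2 features to an uncertainty value (assumed architecture)."""

    def __init__(self, feat_dim: int = 384, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus(),  # keep uncertainty beta > 0
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (H, W, C) upsampled DINOv2 features -> (H, W) uncertainty map
        return self.net(feats).squeeze(-1) + 1e-3


def uncertainty_weighted(residual: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Down-weight residuals (reprojection or rendering) in high-uncertainty regions.

    Large beta (likely a dynamic pixel) -> small weight, so moving objects
    contribute little to the DBA layer or the Gaussian map optimization.
    """
    return residual ** 2 / (beta ** 2)


if __name__ == "__main__":
    H, W, C = 60, 80, 384
    feats = torch.randn(H, W, C)      # stand-in for upsampled DINOv2 features
    rendered = torch.rand(H, W, 3)    # stand-in for a rendered RGB frame
    observed = torch.rand(H, W, 3)    # stand-in for the input frame

    mlp = UncertaintyMLP(feat_dim=C)
    beta = mlp(feats)                 # (H, W) per-pixel uncertainty

    photo_residual = (rendered - observed).abs().mean(dim=-1)       # (H, W)
    mapping_loss = uncertainty_weighted(photo_residual, beta).mean()

    # A log(beta) term (as in heteroscedastic uncertainty formulations) keeps
    # beta from growing unboundedly; whether this matches the paper's exact
    # uncertainty loss is an assumption of this sketch.
    uncertainty_loss = mapping_loss + torch.log(beta).mean()
    print(float(mapping_loss), float(uncertainty_loss))
```

In the actual system, gradients from the map optimization and from the uncertainty loss are kept separate (the two modules are optimized independently), whereas this toy example simply shows the weighting itself.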
To further assess performance in unconstrained, real-world settings, we introduce the Wild-SLAM Dataset, comprising two subsets: Wild-SLAM MoCap and Wild-SLAM iPhone.
The Wild-SLAM MoCap Dataset includes 10 RGB-D sequences featuring various moving objects as distractors, specifically designed for dynamic SLAM benchmarking. The sequences are recorded with an Intel RealSense D455 camera with a fixed exposure time. Although WildGS-SLAM works with only monocular inputs, depth images are also included in the dataset to support the evaluation of RGB-D baselines and future research. The recording room is equipped with an OptiTrack motion capture system, which provides ground-truth camera trajectories.
To assess performance in even less constrained, real-world scenarios, we collected the Wild-SLAM iPhone Dataset, which comprises 8 non-staged RGB sequences recorded with an iPhone 14 Pro. These sequences cover 4 outdoor and 4 indoor scenes, capturing a variety of daily-life activities such as strolling along streets, shopping, navigating a parking garage, and exploring an art museum. Since ground-truth trajectories are not available for this dataset, it is used solely for qualitative experiments.
BibTeX
@inproceedings{Zheng2025WildGS,
author = {Zheng, Jianhao and Zhu, Zihan and Bieri, Valentin and Pollefeys, Marc and Peng, Songyou and Armeni, Iro},
title = {WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025}
}