I am currently a Senior Researcher at ByteDance. My mission is to capture and understand human-centric scenes in the real world, and to digitize humans, objects, and events for immersive VR/AR applications. Prior to this, I joined Tencent as a Senior Researcher through the Special Recruitment Talents Program (技术大咖, "Tech Expert"). Before entering industry, I graduated from the Department of Automation at Tsinghua University, where I had the honor of being supervised by Qionghai Dai and Lu Fang, and collaborated closely with Yebin Liu and Lan Xu.
SEGA: Drivable 3D Gaussian Head Avatar from a Single Image
We propose SEGA, a novel approach for Single-imagE-based 3D drivable Gaussian head Avatar creation that combines generalized prior models with a new hierarchical UV-space Gaussian Splatting framework.
EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling
We propose EnvPoser, a two-stage method that uses sparse tracking signals and pre-scanned environments from VR devices to estimate full-body motion, handling the multi-hypothesis nature of the task by integrating uncertainty modeling with environmental constraints.
RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance
We present RePerformer, a Gaussian-based representation for high-fidelity volumetric video playback and re-performance. A Morton-based parameterization enables efficient rendering, while a semantic-aware alignment module and deformation transfer enable realistic motion re-performance.
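The Morton-based parameterization above refers to Z-order (Morton) codes, which interleave the bits of quantized 3D coordinates so that sorting by the code preserves spatial locality. The snippet below is only a minimal NumPy sketch of that general idea; the quantization resolution and function names are illustrative assumptions, not RePerformer's actual implementation.

```python
import numpy as np

def part1by2(x: np.ndarray) -> np.ndarray:
    """Spread the lower 10 bits of each integer so that two zero bits
    separate consecutive bits (helper for 3D Morton encoding)."""
    x = x.astype(np.uint32) & 0x000003FF   # keep 10 bits per axis
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton_encode(points: np.ndarray, bits: int = 10) -> np.ndarray:
    """Quantize 3D points onto a 2^bits grid and interleave the grid
    coordinates' bits into Morton (Z-order) codes, so that sorting by
    the code keeps spatially nearby points close together in memory."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    grid = ((points - mins) / (maxs - mins + 1e-8) * (2**bits - 1)).astype(np.uint32)
    return part1by2(grid[:, 0]) | (part1by2(grid[:, 1]) << 1) | (part1by2(grid[:, 2]) << 2)

# Example: sort a point cloud (e.g., 3D Gaussian centers) along the Z-order curve.
pts = np.random.rand(1000, 3).astype(np.float32)
order = np.argsort(morton_encode(pts))
pts_sorted = pts[order]
```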
HeadGAP: Few-shot 3D Head Avatar via Generalizable Gaussian Priors
Xiaozheng Zheng, Chao Wen, Zhaohu Li, Weiyi Zhang, Zhuo Su, Xu Chang, Yang Zhao, Zheng Lv, Xiaoyuan Zhang, Yongjie Zhang, Guidong Wang, Lan Xu. 3DV 2025.
We propose a 3D head avatar creation method that generalizes from few-shot in-the-wild data. By using 3D head priors from a large-scale dataset and a Gaussian Splatting-based network, our approach achieves high-fidelity rendering and robust animation.
HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors
Panwang Pan*, Zhuo Su*† (Project Lead), Chenguo Lin*, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, Yebin Liu‡. NeurIPS 2024.
We propose HumanSplat, a method that predicts the 3D Gaussian Splatting properties of a human from a single input image in a generalizable way. It utilizes a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors to effectively integrate geometric priors and semantic features.
SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models
We introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion.
HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, Lan Xu. CVPR 2024.
We present an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage, in which our core intuition is to marry the 3D Gaussian representation with non-rigid tracking.
OHTA: One-shot Hand Avatar via Data-driven Implicit Priors
Xiaozheng Zheng, Chao Wen, Zhuo Su, Zeran Xu, Zhaohu Li, Yang Zhao, Zhou Xue. CVPR 2024.
OHTA is a novel approach capable of creating implicit animatable hand avatars using just a single image. It facilitates 1) text-to-avatar conversion, 2) hand texture and geometry editing, and 3) interpolation and sampling within the latent space.
Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling
Xiaozheng Zheng*, Zhuo Su*, Chao Wen, Zhou Xue, Xiaojie Jin. ICCV 2023.
We propose a two-stage framework that obtains accurate and smooth full-body motions from only the three tracking signals of the head and hands: we first explicitly model joint-level features and then use them as spatiotemporal transformer tokens to capture joint-level correlations.
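To make the idea of joint-level tokens concrete, the sketch below shows one way to lift sparse head/hand signals into per-joint transformer tokens in PyTorch. It is a minimal illustration under assumed dimensions (22 body joints, 6D rotation outputs, three tracked devices), not the paper's actual two-stage architecture or training setup.

```python
import torch
import torch.nn as nn

class SparseToFullBody(nn.Module):
    """Lift sparse head/hand tracking signals to per-joint tokens and model
    joint-level correlations with a transformer (illustrative sketch)."""

    def __init__(self, num_joints: int = 22, in_dim: int = 18, dim: int = 256):
        super().__init__()
        # in_dim: 3 tracked devices (head + two hands) x (3D position + 3D rotation)
        self.embed = nn.Linear(in_dim, dim)
        self.joint_tokens = nn.Parameter(torch.randn(num_joints, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, 6)   # 6D rotation per body joint

    def forward(self, sparse: torch.Tensor) -> torch.Tensor:
        # sparse: (B, T, in_dim) per-frame signals from the head and hands
        B, T, _ = sparse.shape
        feat = self.embed(sparse)                                   # (B, T, dim)
        tokens = self.joint_tokens[None, None] + feat[:, :, None]   # (B, T, J, dim)
        tokens = tokens.flatten(1, 2)                               # spatiotemporal tokens
        out = self.encoder(tokens)                                  # joint-level attention
        return self.head(out).view(B, T, -1, 6)                     # (B, T, J, 6)

model = SparseToFullBody()
pred = model(torch.randn(2, 16, 18))    # 2 sequences, 16 frames of sparse input
print(pred.shape)                       # torch.Size([2, 16, 22, 6])
```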
Instant-NVR: Instant Neural Volumetric Rendering for Human-object Interactions from Monocular RGBD Stream
Yuheng Jiang, Kaixin Yao, Zhuo Su, Zhehao Shen, Haimin Luo, Lan Xu. CVPR 2023.
We propose a neural approach for instant volumetric human-object tracking and rendering using a single RGBD camera. It bridges traditional non-rigid tracking with recent instant radiance field techniques via a multi-thread tracking-rendering mechanism.
Robust Volumetric Performance Reconstruction under Human-object Interactions from Monocular RGBD Stream
Zhuo Su, Lan Xu, Dawei Zhong, Zhong Li, Fan Deng, Shuxue Quan, Lu Fang. TPAMI 2022.
We propose a robust volumetric performance reconstruction system for human-object interaction scenarios using only a single RGBD sensor, which combines various data-driven visual and interaction cues to handle the complex interaction patterns and severe occlusions.
NeuralHOFusion: Neural Volumetric Rendering Under Human-Object Interactions
Yuheng Jiang, Suyi Jiang, Guoxing Sun, Zhuo Su, Kaiwen Guo, Minye Wu, Jingyi Yu, Lan Xu. CVPR 2022.
We propose a robust neural volumetric rendering method for human-object interaction scenarios using 6 RGBD cameras, achieving layer-wise, photorealistic reconstruction of human performances in novel views.
Learning Variational Motion Prior for Video-based Motion Capture
Xin Chen*, Zhuo Su*, Lingbo Yang*, Pei Cheng, Lan Xu, Gang Yu. arXiv 2022.
We propose a novel variational motion prior (VMP) learning approach for video-based motion capture. Specifically, VMP is implemented as a transformer-based variational autoencoder pretrained on large-scale 3D motion data, providing an expressive latent space for human motion at the sequence level.
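As a rough illustration of what a sequence-level motion prior of this kind can look like, the sketch below implements a small transformer-based VAE over pose sequences in PyTorch. Layer sizes, the pooling scheme, and the pose parameterization are assumptions made for the sketch, not VMP's actual design.

```python
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    """Minimal transformer-based VAE over motion sequences (illustrative only)."""

    def __init__(self, pose_dim: int = 69, dim: int = 256,
                 latent_dim: int = 128, max_len: int = 120):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.to_mu = nn.Linear(dim, latent_dim)
        self.to_logvar = nn.Linear(dim, latent_dim)
        # The decoder reconstructs the sequence from the latent code alone,
        # using learnable per-frame queries as a simple positional signal.
        self.time_query = nn.Parameter(torch.randn(max_len, dim) * 0.02)
        self.from_z = nn.Linear(latent_dim, dim)
        dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)
        self.out_proj = nn.Linear(dim, pose_dim)

    def forward(self, motion: torch.Tensor):
        # motion: (B, T, pose_dim) sequence of body pose parameters
        B, T, _ = motion.shape
        h = self.encoder(self.in_proj(motion))                    # (B, T, dim)
        pooled = h.mean(dim=1)                                    # sequence-level summary
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        queries = self.from_z(z)[:, None] + self.time_query[:T][None]  # (B, T, dim)
        recon = self.out_proj(self.decoder(queries))              # reconstructed poses
        return recon, mu, logvar

# Sketch of the training objective: reconstruction + KL regularization.
vae = MotionVAE()
motion = torch.randn(4, 60, 69)          # 4 clips, 60 frames, 69-D poses (assumed)
recon, mu, logvar = vae(motion)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, motion) + 1e-3 * kl
```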
RobustFusion: Human Volumetric Capture with Data-driven Visual Cues using a RGBD Camera
Zhuo Su, Lan Xu, Zerong Zheng, Tao Yu, Yebin Liu, Lu Fang. ECCV 2020 (Spotlight).
We introduce a robust human volumetric capture approach that combines various data-driven visual cues using a Kinect, significantly outperforming existing state-of-the-art approaches.
UnstructuredFusion: Realtime 4D Geometry and Texture Reconstruction using Commercial RGBD Cameras
Lan Xu, Zhuo Su, Lei Han, Tao Yu, Yebin Liu, Lu Fang. TPAMI 2019.
We propose UnstructuredFusion, which enables real-time, high-quality, and complete reconstruction of 4D textured models of human performance using only three commercial RGBD cameras.
Human Motion Capture and Avatar Creation for XR Scenarios using Sparse Observations
The development of high-fidelity 3D human motion capture and avatar reconstruction is essential for immersive experiences in XR applications. This line of work explores modeling under sparse observation settings, tackling the challenges posed by limited sensors and minimal image inputs. The research spans from multi-modal motion understanding to generalizable avatar generation, and I share this exploration in the hope of contributing to XR technologies.
Apr 11, 2025, China3DV, Beijing | Young Scholar Forum
Human 3D Reconstruction and Generation
The construction of realistic 3D human avatars is crucial in VR/AR applications. This talk focuses on 3D human modeling, covering topics from traditional volumetric capture to neural rendering, from per-scene optimization to generalizable prior model training and generative methods. I shared my exploration in this field, hoping to inspire related research.
Dec 26, 2024, ByteDance, Online Live Stream | ByteTech Technical Sharing Seminar