Embodied instruction following in unknown environments

Embodied instruction following (EIF) requires intelligent systems to understand and execute human instructions in physical environments. It remains challenging because the system must 1) accurately parse and interpret complex natural language instructions; 2) robustly perceive and understand dynamic, unknown environmental contexts; and 3) seamlessly integrate this understanding to plan and execute appropriate actions. This project addresses these three issues by leveraging recent advances in 1) multi-modal sensory fusion for enhanced environmental perception; 2) grounding foundation models in realistic scenes; and 3) generative 3D scene representation learning for adaptive and efficient action planning. Our system can complete 204 complex human instructions (e.g., making breakfast, tidying rooms) in large house-level scenes.
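
The overall control flow can be summarized as a perceive-ground-plan-act loop. The sketch below illustrates this pattern only; the class and method names (SceneMemory, perceiver, grounder, planner, etc.) are hypothetical placeholders, not the project's actual API.

```python
# Minimal sketch of the perceive-ground-plan-act loop described above.
# All component names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    """Accumulates grounded objects observed so far in the unknown environment."""
    objects: dict = field(default_factory=dict)  # object name -> estimated 3D pose

    def update(self, detections: dict) -> None:
        self.objects.update(detections)

def follow_instruction(instruction: str, robot, perceiver, grounder, planner, max_steps=50):
    """Iteratively perceive, ground the instruction in the scene, plan, and act."""
    memory = SceneMemory()
    for _ in range(max_steps):
        rgbd = robot.observe()                     # multi-modal sensory input
        memory.update(perceiver.detect(rgbd))      # fuse new observations into scene memory
        subgoal = grounder.ground(instruction, memory.objects)  # map language to scene entities
        if subgoal is None:                        # instruction fully satisfied
            return True
        action = planner.next_action(subgoal, memory)
        robot.execute(action)
    return False
```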


Generative models for general robotic manipulation

Developing versatile robotic manipulation agents for our unstructured 3D world has been a long-standing pursuit of the community, with applications ranging from home assistance to industrial assembly. Drawing on the success of generative models in natural language processing (NLP) and computer vision (CV), this project aims to construct a generative foundation model for a broad spectrum of everyday robotic manipulation tasks, such as opening doors, toggling faucets, and cleaning dishes. To achieve this goal, we explore the components that contribute to the final model, including how to build a scalable data engine, how to accurately perceive the 3D world, how to plan effectively, how to develop high-capacity policies, and how to bridge the gap between simulation and real-world application. Our robots can complete diverse manipulation tasks, such as closing jars and opening drawers, with high generalization ability.
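
One way a generative policy can be used at inference time is to sample an action chunk by iterative denoising, as in diffusion-style policies. The sketch below shows that pattern under simplified assumptions; the `denoiser` network, its conditioning interface, and the toy noise schedule are illustrative, not the project's actual model.

```python
# Hedged sketch of diffusion-style action sampling for a manipulation policy.
# The denoiser signature and the schedule are assumptions for illustration.
import torch

@torch.no_grad()
def sample_action_chunk(denoiser, obs_tokens, horizon=16, action_dim=7, steps=20):
    """Start from Gaussian noise and iteratively refine it into an action sequence."""
    actions = torch.randn(1, horizon, action_dim)              # noisy action trajectory
    for t in reversed(range(steps)):
        timestep = torch.full((1,), t, dtype=torch.long)
        noise_pred = denoiser(actions, timestep, obs_tokens)   # predict the injected noise
        alpha = 1.0 - t / steps                                # toy schedule for illustration
        actions = actions - (1.0 - alpha) * noise_pred         # simplified denoising update
    return actions.squeeze(0)                                  # (horizon, action_dim) actions
```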


General robotic packing system

Robotic packing plays an important role in many industrial applications such as warehouse management, cargo transportation, and robotic assembly, reducing human labor cost while increasing throughput and lowering accident rates. An autonomous packing system faces several challenges, including 1) accurately estimating the geometry of the cluttered objects to be packed, and 2) planning packing locations and orientations of irregular objects with a high space utilization ratio. We developed an interactive visual perception framework for cluttered objects, covering recognition, segmentation, and shape estimation, and built a reinforcement learning-based packing plan generation pipeline to fully utilize the packing boxes. Our robotic packing system packs 12 categories of everyday objects with an 86.7% success rate.
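
To make the planning problem concrete, the sketch below shows a single placement decision on a heightmap representation of the bin, which is a common state encoding for packing planners. The greedy lowest-top scoring here is a simplified stand-in for illustration, not the deployed RL policy.

```python
# Minimal sketch of a bin-packing placement step on a heightmap state.
# The greedy scoring rule is an assumption; the project uses an RL planner.
import numpy as np

def place_box(heightmap: np.ndarray, box_wlh: tuple):
    """Greedily choose the (x, y) placement that keeps the packed surface lowest."""
    w, l, h = box_wlh
    H, W = heightmap.shape
    best_xy, best_top = None, np.inf
    for y in range(H - l + 1):
        for x in range(W - w + 1):
            top = heightmap[y:y + l, x:x + w].max() + h   # resulting top height of the box
            if top < best_top:
                best_xy, best_top = (x, y), top
    if best_xy is None:
        return None                                        # box does not fit in the bin footprint
    x, y = best_xy
    heightmap[y:y + l, x:x + w] = best_top                 # update the support surface
    return best_xy, best_top

bin_heightmap = np.zeros((10, 10), dtype=float)            # 10x10 grid cells, empty bin
print(place_box(bin_heightmap, (3, 2, 4)))                 # -> ((0, 0), 4.0)
```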


Foundation model compression

Deploying powerful deep neural networks, especially large foundation models (e.g., GPT-4/Gemini), on robots is usually prohibitive due to strict computational resource limits. To address this, we propose: 1) fundamental network compression techniques that reduce model complexity without performance degradation; 2) an automatic model compression framework that selects the optimal compression policy within hardware resource constraints; 3) a hardware-friendly compilation engine that achieves actual speedup and memory savings on robot computation platforms. With our framework, we can deploy large vision transformers on an STM32F4 (256 KB memory, 5 USD) for a wide variety of tasks, including object detection and instance segmentation.
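
As one example of the basic compression primitives involved, the sketch below shows symmetric post-training int8 quantization of a weight tensor. Calibration, per-channel scaling, the automatic policy search, and the compilation stack are out of scope here; this is an illustrative sketch, not the framework itself.

```python
# Hedged sketch of per-tensor symmetric int8 weight quantization.
import torch

def quantize_int8(weight: torch.Tensor):
    """Map float weights to int8 with a single per-tensor scale."""
    scale = weight.abs().max().clamp(min=1e-8) / 127.0      # largest magnitude maps to 127
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor, e.g. for accuracy checks."""
    return q.float() * scale

w = torch.randn(64, 64)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())                   # error bounded by about scale / 2
```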


Real-time online 3D scene perception

3D scene perception methods, including 3D semantic/instance segmentation, 3D object detection, and 3D representation extraction, are widely used as the foundation of robotic planning and interaction. Although much research has been conducted in this field, perceiving a 3D scene in an embodied manner is still very challenging: real-time, online, and efficient scene perception remains an open problem. This project establishes a general online 3D perception framework that equips existing offline 3D perception models with online ability, without model- or task-specific design. With this framework, we construct a VFM-assisted 3D segment-anything model that processes streaming RGB-D videos online and outputs real-time 3D reconstruction and fine-grained segmentation, with leading performance on datasets such as ScanNet200 and SceneNN.
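
The core online pattern is to wrap a per-frame (offline) model so that each incoming RGB-D frame's predictions are incrementally fused into a global map. The sketch below illustrates this under strong simplifications; the wrapper name and the centroid-distance merging rule are assumptions for illustration, not the framework's actual fusion mechanism.

```python
# Hedged sketch of converting a per-frame 3D segmentation model to online use
# by fusing streaming predictions into a global instance map.
import numpy as np

class OnlineWrapper:
    def __init__(self, offline_model, merge_dist=0.2):
        self.model = offline_model          # any per-frame 3D segmentation model
        self.instances = []                 # list of (centroid, list of point arrays) in world frame
        self.merge_dist = merge_dist

    def update(self, rgbd_frame, camera_pose):
        """Run the offline model on one frame and fuse its masks into the global map."""
        for points_cam in self.model.segment(rgbd_frame):           # per-instance 3D points
            points_w = points_cam @ camera_pose[:3, :3].T + camera_pose[:3, 3]
            centroid = points_w.mean(axis=0)
            for inst in self.instances:                             # merge with a nearby existing instance
                if np.linalg.norm(inst[0] - centroid) < self.merge_dist:
                    inst[1].append(points_w)
                    break
            else:
                self.instances.append((centroid, [points_w]))       # otherwise start a new instance
        return self.instances
```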