X-IL: Exploring the Design Space of Imitation Learning Policies

Xiaogang Jia1,2, Atalay Donat1,2, Xi Huang1,2, Xuan Zhao1,2, Denis Blessing1,2, Hongyi Zhou1,2, Hanyi Zhang1,2, Han Wang3, Qian Wang1,2, Rudolf Lioutikov1,2, Gerhard Neumann1
1Autonomous Learning Robots (ALR), Karlsruhe Institute of Technology; 2Intuitive Robots Lab (IRL), Karlsruhe Institute of Technology; 3Meta Reality Labs, USA

Abstract

Designing modern imitation learning (IL) policies requires numerous design decisions, including the choice of feature encoder, architecture, and policy representation. As the field rapidly advances, the range of available options continues to grow, creating a vast and largely unexplored design space for IL policies.

In this work, we present X-IL, an accessible open-source framework designed to systematically explore this design space. The framework's modular design enables seamless swapping of policy components, such as backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques (e.g., Score-matching, Flow-matching). This flexibility facilitates comprehensive experimentation and has led to the discovery of novel policy configurations that outperform existing methods on recent robot learning benchmarks.

Our experiments not only demonstrate significant performance gains but also provide valuable insights into the strengths and weaknesses of various design choices. This study serves both as a practical reference for practitioners and as a foundation for guiding future research in imitation learning. Code is available at this anonymous link.

Framework Overview

Overview of the X-IL framework. X-IL supports multi-modal inputs (language, RGB, and point cloud) and two architectures: Decoder-Only and Encoder-Decoder. Inside each architecture, the backbone serves as the core computational unit, with support for Transformer, Mamba, and xLSTM. For policy representations, X-IL supports Behavior Cloning (BC), diffusion-based, and flow-based policies, enabling diverse learning paradigms for imitation learning. Notably, each component (input modality, architecture, backbone, and policy) can be easily swapped to efficiently explore different model configurations.
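
To make the modularity concrete, the following is a minimal, hypothetical sketch in PyTorch of how such a composition could be wired together; the class and registry names (ModularPolicy, BACKBONES) are illustrative and do not correspond to the actual X-IL API.

import torch
import torch.nn as nn

# Hypothetical backbone registry; Mamba and xLSTM layers would be registered the same way.
BACKBONES = {
    "transformer": lambda d: nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=4
    ),
}

class ModularPolicy(nn.Module):
    """Composes an observation encoder, a sequence backbone, and an action head."""
    def __init__(self, obs_encoder, backbone, action_head):
        super().__init__()
        self.obs_encoder = obs_encoder   # e.g., ResNet for RGB or a point-cloud encoder
        self.backbone = backbone         # Transformer / Mamba / xLSTM
        self.action_head = action_head   # BC regression, diffusion, or flow-matching head

    def forward(self, obs):
        tokens = self.obs_encoder(obs)          # (B, T, D) observation tokens
        feats = self.backbone(tokens)           # contextualized sequence features
        return self.action_head(feats[:, -1])   # predict actions from the last token

# Example assembly: any component can be swapped without touching the others.
d_model, act_dim = 256, 7
policy = ModularPolicy(
    obs_encoder=nn.Sequential(nn.Linear(64, d_model), nn.GELU()),  # stand-in for an image encoder
    backbone=BACKBONES["transformer"](d_model),
    action_head=nn.Linear(d_model, act_dim),
)
actions = policy(torch.randn(2, 10, 64))  # (B=2, T=10, obs_dim=64) -> (2, 7)

In this sketch, swapping the Transformer for a Mamba or xLSTM backbone, or replacing the linear action head with a diffusion or flow-matching head, only changes the corresponding constructor argument.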

Backbones: X-Block

Network details of the X-Block. The X-Layer is the core component that processes sequence tokens, while AdaLN conditioning injects context information.
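
The following is a rough sketch of such a block, assuming a standard AdaLN-style formulation as popularized by diffusion transformers; the class and variable names are hypothetical and this is not the exact X-Layer implementation. Per-layer scale, shift, and gate parameters are regressed from a context embedding and modulate the normalized sequence tokens.

import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Sketch of an AdaLN-conditioned sequence block (illustrative only)."""
    def __init__(self, d_model, seq_layer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.seq_layer = seq_layer  # any token mixer: Transformer, Mamba, or xLSTM layer
        # Regress scale, shift, and gate from the context (e.g., a language or time embedding).
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 3 * d_model))

    def forward(self, x, context):
        # x: (B, T, D) sequence tokens; context: (B, D) conditioning embedding
        scale, shift, gate = self.ada(context).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.seq_layer(h)

block = AdaLNBlock(256, nn.Linear(256, 256))                # toy stand-in for a sequence layer
out = block(torch.randn(2, 10, 256), torch.randn(2, 256))   # -> (2, 10, 256)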

Experimental Results

Simulation Benchmark: LIBERO vs RoboCasa

Illustration of LIBERO and RoboCasa. While LIBERO exhibits only minimal variation within a task (e.g., LIBERO-Spatial), RoboCasa introduces diversity across multiple aspects. The CoffeeServeMug task is shown in the figure.

Results on the LIBERO benchmark with 20% and 100% of the demonstrations, averaged across three seeds. The best overall results are highlighted in bold, with category-specific best results underlined. DEC refers to the Decoder-only architecture.

Evaluation on Visual Inputs

Results for RoboCasa using different input types with 50 human demonstrations, averaged across three seeds. The best overall results are highlighted in bold, with category-specific best results underlined. DEC refers to the Decoder-only architecture.

Architecture Comparison

Comparison of different architectures: (a) Mamba and (b) xLSTM. Dec refers to the Decoder-only model, while Enc-Dec refers to the Encoder-Decoder model.
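As a generic illustration of the two variants (a sketch built from standard PyTorch modules, not the exact X-IL implementation), a decoder-only model runs a single self-attention stack over the concatenated observation and action tokens, while the encoder-decoder variant first encodes the observations and lets the action queries cross-attend to them.

import torch
import torch.nn as nn

d, B, T_obs, T_act = 256, 2, 10, 8
obs_tok = torch.randn(B, T_obs, d)   # encoded observation tokens
act_tok = torch.randn(B, T_act, d)   # action (or noise) query tokens

# Decoder-only: one self-attention stack over the concatenated observation + action sequence.
dec_only = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=4
)
out_dec_only = dec_only(torch.cat([obs_tok, act_tok], dim=1))[:, -T_act:]  # (B, T_act, d)

# Encoder-decoder: encode observations, then cross-attend from the action queries.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2
)
out_enc_dec = decoder(tgt=act_tok, memory=encoder(obs_tok))  # (B, T_act, d)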

Encoder Comparison

Comparison of (a) different image encoders and (b) Point Cloud (PC) encoders.