Designing modern imitation learning (IL) policies requires many design decisions, including the choice of feature encoding, architecture, policy representation, and more. As the field rapidly advances, the range of available options continues to grow, creating a vast and largely unexplored design space for IL policies.
In this work, we present X-IL, an accessible open-source framework designed to systematically explore this design space. The framework's modular design enables seamless swapping of policy components, such as backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques (e.g., score matching, flow matching). This flexibility facilitates comprehensive experimentation and has led to the discovery of novel policy configurations that outperform existing methods on recent robot learning benchmarks.
Our experiments not only demonstrate significant performance gains but also provide valuable insights into the strengths and weaknesses of various design choices. This study serves as both a practical reference for practitioners and a foundation for guiding future research in imitation learning. Code is available at this anonymous link.
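As a concrete illustration of the flow-matching policy representation mentioned above, the following is a minimal sketch of a rectified-flow-style action head. The module, method, and variable names are illustrative assumptions for exposition and do not reflect the actual X-IL implementation.

```python
# Minimal flow-matching action head (illustrative sketch, not the X-IL code).
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    def __init__(self, action_dim: int, cond_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Predicts a velocity field v(x_t, t, context).
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

    def loss(self, actions, cond):
        # Linear interpolation path between Gaussian noise and expert actions.
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], 1, device=actions.device)
        x_t = (1.0 - t) * noise + t * actions
        target_velocity = actions - noise
        pred_velocity = self.forward(x_t, t, cond)
        return ((pred_velocity - target_velocity) ** 2).mean()

    @torch.no_grad()
    def sample(self, cond, steps: int = 10):
        # Integrate the learned ODE from noise to an action with Euler steps.
        x = torch.randn(cond.shape[0], self.net[-1].out_features, device=cond.device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((cond.shape[0], 1), i * dt, device=cond.device)
            x = x + dt * self.forward(x, t, cond)
        return x
```

A score-matching (diffusion) head can be obtained from the same interface by swapping the loss for noise prediction and the sampler for an iterative denoising loop.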
Overview of the X-IL framework. X-IL supports multi-modal inputs (language, RGB, and point cloud) and two architectures: Decoder-Only and Encoder-Decoder. Inside each architecture, the backbone serves as the core computational unit, with support for Transformer, Mamba, and xLSTM. For policy representations, X-IL supports Behavior Cloning (BC), diffusion-based, and flow-based policies, enabling diverse learning paradigms for imitation learning. Notably, each component (input modality, architecture, backbone, and policy) can be easily swapped to efficiently explore various model configurations.
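To make the swappable-component idea concrete, the following hypothetical configuration sketch shows how such choices might be expressed. The class, field, and option names are assumptions for illustration and are not the actual X-IL API.

```python
# Hypothetical configuration sketch for component swapping (illustrative only).
from dataclasses import dataclass

@dataclass
class PolicyConfig:
    modalities: tuple = ("language", "rgb")   # optionally add "point_cloud"
    architecture: str = "decoder_only"        # or "encoder_decoder"
    backbone: str = "xlstm"                   # or "transformer", "mamba"
    policy: str = "flow_matching"             # or "bc", "score_matching"

def build_policy(cfg: PolicyConfig):
    # Each field selects one interchangeable component; changing the backbone or
    # policy representation replaces only the corresponding sub-module.
    ...

# Example: swap the backbone and policy head without touching anything else.
cfg = PolicyConfig(backbone="mamba", policy="score_matching")
policy = build_policy(cfg)
```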
Network details of the X-Block. The X-Layer is the core component that processes sequence tokens; AdaLN conditioning is used to inject the context information.
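For readers unfamiliar with AdaLN conditioning, the following is a minimal sketch in the style of diffusion-transformer blocks, where the context predicts per-channel shift, scale, and gate parameters. The exact X-Block wiring may differ, and the mixer argument here is a placeholder for whichever backbone layer (Transformer, Mamba, or xLSTM) processes the tokens.

```python
# Minimal AdaLN-conditioned block (generic sketch; not the exact X-Block code).
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int, mixer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mixer = mixer  # e.g., self-attention, Mamba, or xLSTM layer
        # Context is mapped to per-channel shift, scale, and gate parameters.
        self.to_mod = nn.Linear(cond_dim, 3 * dim)

    def forward(self, tokens, context):
        # tokens: (batch, seq_len, dim); context: (batch, cond_dim)
        shift, scale, gate = self.to_mod(context).chunk(3, dim=-1)
        # Modulate the normalized tokens with the context, then mix the sequence.
        h = self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return tokens + gate.unsqueeze(1) * self.mixer(h)
```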
Illustration of LIBERO and RoboCasa. While LIBERO exhibits only minimal variation within the same task (e.g., LIBERO-Spatial), RoboCasa provides diversity along multiple aspects. The CoffeeServeMug task is shown in the figure.
Results on the LIBERO benchmark with 20% and 100% of the demonstrations, averaged across three seeds. The best overall results are highlighted in bold, with category-specific best results underlined. DEC refers to the Decoder-only architecture.
Results for RoboCasa using different input types with 50 human demonstrations, averaged across three seeds. The best overall results are highlighted in bold, with category-specific best results underlined. DEC refers to the Decoder-only architecture.
Comparison of different architectures: (a) Mamba, (b) xLSTM. Dec refers to the Decoder-only model, while Enc-Dec refers to the Encoder-Decoder model.
Comparison of different image encoders and point cloud (PC) encoders: (a) image encoders, (b) PC encoders.