Behaviour Discovery and Attribution for Explainable Reinforcement Learning

1Mila - Quebec AI Institute 2University of Calgary 3McGill University 4University of Montreal 5CIFAR AI Chair

A transformer-based VQ-VAE is used for behavior discovery, where state-action sequences are encoded, discretized via a codebook, and decoded to predict future states. The resulting latent codes are used to construct a graph, and the graph clustering module partitions the graph into subgraphs, each representing a "behavior". A causal mask is applied to both the decoder and the encoder to restrict access to future information.
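The pipeline above can be sketched in a few lines of PyTorch. This is an illustrative toy model, not the authors' implementation: all dimensions, layer counts, and the codebook size are assumptions, and the straight-through quantizer stands in for the full VQ-VAE training objective.

```python
# Minimal sketch (assumed architecture, not the paper's code): a causal
# transformer encoder maps state-action sequences to discrete codebook
# indices, and a causal decoder predicts future states from the codes.
import torch
import torch.nn as nn

class BehaviourVQVAE(nn.Module):
    def __init__(self, state_dim=4, action_dim=2, d_model=32,
                 codebook_size=16):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.codebook = nn.Embedding(codebook_size, d_model)  # discrete latents
        self.head = nn.Linear(d_model, state_dim)             # next-state prediction

    def quantize(self, z):
        # snap each timestep's latent to its nearest codebook vector,
        # with a straight-through estimator for the encoder gradient
        dists = torch.cdist(z, self.codebook.weight)          # (B, T, K)
        idx = dists.argmin(-1)                                # (B, T)
        zq = self.codebook(idx)
        return z + (zq - z).detach(), idx

    def forward(self, states, actions):
        x = self.embed(torch.cat([states, actions], -1))      # (B, T, d_model)
        # causal mask: neither encoder nor decoder sees future timesteps
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        z = self.encoder(x, mask=mask)
        zq, idx = self.quantize(z)
        pred = self.head(self.decoder(zq, mask=mask))         # predicted future states
        return pred, idx

model = BehaviourVQVAE()
states = torch.randn(2, 8, 4)    # batch of 2 trajectories, 8 steps, 4-dim states
actions = torch.randn(2, 8, 2)   # matching 2-dim actions
pred, codes = model(states, actions)
```

The resulting `codes` tensor holds one discrete latent index per timestep; runs of indices are what the downstream graph-clustering step groups into behaviors.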

Abstract

Explaining the decisions made by reinforcement learning (RL) agents is critical for building trust and ensuring reliability in real-world applications. Traditional approaches to explainability often rely on saliency analysis, which can be limited in providing actionable insights. Recently, there has been growing interest in attributing RL decisions to specific trajectories within a dataset. However, these methods often generalize explanations to long trajectories, potentially involving multiple distinct behaviors. In many cases, multiple finer-grained explanations would improve clarity. In this work, we propose a framework for behavior discovery and action attribution to behaviors in offline RL trajectories. Our method identifies meaningful behavioral segments, enabling more precise and granular explanations associated with high-level agent behaviors. This approach is adaptable across diverse environments with minimal modifications, offering a scalable and versatile solution for behavior discovery and attribution for explainable RL.

Contributions

Our main contributions include:
  • A novel framework for behavior discovery and action attribution in offline RL trajectories.
  • A transformer-based VQ-VAE for behavior discovery, which encodes state-action sequences, discretizes them via a codebook, and decodes them to predict future states.
  • A graph clustering module that partitions a graph built from the learnt codebook vectors into subgraphs, each representing a "behavior".
  • An attribution module that assigns actions taken by a policy trained on the entire dataset to the discovered behaviors.
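The graph-clustering step can be illustrated with a small example. This is a hypothetical stand-in, not the paper's module: the code sequence is made up, nodes are codebook indices, edges are weighted by transition frequency, and modularity-based community detection (via networkx) is an assumed substitute for the actual clustering procedure.

```python
# Illustrative sketch (assumed details): codebook indices from a trajectory
# become graph nodes, consecutive distinct codes are connected by weighted
# edges, and community detection partitions the graph into "behaviours".
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Example per-timestep code sequence for one trajectory (made up).
codes = [0, 0, 1, 1, 0, 2, 3, 2, 3, 2, 4, 4, 3]

G = nx.Graph()
for a, b in zip(codes, codes[1:]):
    if a != b:
        # edge weight counts how often the two codes follow each other
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

# each community of codebook indices is one candidate behaviour
behaviours = greedy_modularity_communities(G, weight="weight")
for i, nodes in enumerate(behaviours):
    print(f"behaviour {i}: codes {sorted(nodes)}")
```

Segments of a trajectory whose codes fall inside one community are then treated as one behavioral segment, giving the units to which the attribution module assigns the policy's actions.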

BibTeX

@misc{rishav2025behaviourdiscoveryattributionexplainable,
  title={Behaviour Discovery and Attribution for Explainable Reinforcement Learning},
  author={Rishav Rishav and Somjit Nath and Vincent Michalski and Samira Ebrahimi Kahou},
  year={2025},
  eprint={2503.14973},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2503.14973},
}