Read to Play (R2-Play):
Decision Transformer with
Multimodal Game Instruction

Yonggang Jin2*, Ge Zhang1,4,5*^, Hao Zhao2*, Tianyu Zheng2, Jiawei Guo2,
Liuyu Xiang2, Shawn Yue6, Stephen W. Huang6, Wenhu Chen1,4,5, Zhaofeng He2^, Jie Fu1,3^
1Multimodal Art Projection Research Community;
2Beijing University of Posts and Telecommunications;
3HKUST; 4University of Waterloo; 5Vector Institute; 6Harmony.AI;

*Indicates Equal Contribution

^Indicates Corresponding Authors

Abstract

Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts that utilize extensive offline datasets from various tasks demonstrate remarkable multitasking performance in reinforcement learning, but they struggle to extend their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectories into decision networks to provide task-specific contextual cues, a promising direction. However, relying solely on textual guidance or visual trajectories is insufficient for accurately conveying the context of a task. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions and thereby acquire a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.

Introduction

Creating a generalist agent that can accomplish diverse tasks is an enduring goal in artificial intelligence. Recent advances in integrating textual guidance or visual trajectories into a single decision-making agent present a potential solution: this line of research provides task-specific context to guide the agent. Although textual guidance and visual trajectories each offer advantages, they also have distinct limitations: (1) textual guidance lacks visually grounded information, which diminishes its expressiveness for decision-making tasks based on visual observations; (2) without explicit task instructions, deriving an effective strategy from a visual trajectory alone is extremely difficult, much as viewers struggle to infer player intentions from gameplay videos without commentary. The complementary relationship between textual guidance and visual trajectories suggests that combining them enhances guidance effectiveness, as illustrated in Figure 1. This paper therefore aims to develop an agent capable of adapting to new tasks through multimodal guidance.
Similar endeavors have been undertaken in the field of multimodal models. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and bring instruction tuning into the RL field. We construct a set of Multimodal Game Instructions (MGI) to provide multimodal guidance for agents. The MGI set comprises thousands of game instructions sourced from approximately 50 diverse Atari games, designed to provide detailed and thorough context. Each instruction pairs a 20-step trajectory with corresponding textual guidance. This multimodal game instruction set is built to empower agents to read game instructions, play various games, and adapt to new ones.

Figure 1: Imagine an agent learning to play Palworld (a Pokémon-like game). (1) The agent exhibits confusion when only relying on textual guidance. (2) The agent is confused when presented with images of a Pal sphere and a Pal. (3) The agent understands how to catch a pet through multimodal guidance, which combines textual guidance with images of the Pal sphere and Pal.

Game Instruction Construction

In the multimodal community, considerable attention has been devoted to instruction tuning within multimodal models. These efforts enhance a model's performance on specific visual tasks by incorporating language instructions. Drawing on these insights, we explore whether multimodal game instructions can likewise augment RL agents, treating visual-based RL tasks as long-horizon vision tasks. We construct a set of Multimodal Game Instructions (MGI) to bring the benefits of instruction tuning to the Decision Transformer (DT). A sketch of the record format follows; an illustrative example of a multimodal game instruction is presented in Figure 2.
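To make the format concrete, below is a minimal sketch of what a single MGI record could look like in Python. The GameInstruction class, its field names, and the return-to-go labels are illustrative assumptions for exposition, not the released schema; only the pairing of a 20-step trajectory with textual guidance comes from the construction described above.

from dataclasses import dataclass
from typing import List

import numpy as np

# Minimal sketch of one MGI record. Field names and the return-to-go
# labels are illustrative assumptions, not the released schema.
@dataclass
class GameInstruction:
    game: str                       # source Atari game, e.g. "Breakout"
    text_guidance: str              # natural-language description of the strategy
    observations: List[np.ndarray]  # 20 frames forming the visual trajectory
    actions: List[int]              # discrete action taken at each step
    returns_to_go: List[float]      # assumed per-step DT-style return labels

    def __post_init__(self):
        # Each instruction covers a 20-step trajectory, per the construction above.
        assert len(self.observations) == len(self.actions) == 20

Keeping the trajectory length fixed at 20 steps makes instructions directly stackable into batches when they are later encoded for the decision transformer.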

Figure 2: Examples of game instructions.

Decision Transformer with Game Instruction

This section introduces the Decision Transformer with Game Instruction (DTGI), a DT model that integrates multimodal game instructions, as depicted in Figure 3. First, we represent the multimodal instructions. Second, we compute an importance score for each instruction in the instruction set. Finally, we propose a novel design named SHyperGenerator to integrate game instructions into the DT: each of the n instructions generates a set of module parameters through a hypernetwork, and these parameter sets are weighted by the instructions' importance scores before being used as adapter parameters. A sketch of this mechanism follows.
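To make the mechanism concrete, the following is a minimal PyTorch sketch of the importance-weighted hypernetwork idea, assuming a bottleneck-style adapter and dot-product importance scores. The class name reuses SHyperGenerator for readability, but the dimensions, the scoring function, and the adapter shape are assumptions rather than the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SHyperGenerator(nn.Module):
    """Importance-weighted hypernetwork adapter (illustrative sketch)."""

    def __init__(self, instr_dim: int, hidden_dim: int, bottleneck: int):
        super().__init__()
        # Hypernetworks mapping an instruction embedding to adapter weights.
        self.down_gen = nn.Linear(instr_dim, hidden_dim * bottleneck)
        self.up_gen = nn.Linear(instr_dim, bottleneck * hidden_dim)
        self.hidden_dim, self.bottleneck = hidden_dim, bottleneck

    def forward(self, h, instr_emb, ctx_emb):
        # h:         (batch, hidden_dim) DT hidden states entering the adapter
        # instr_emb: (n, instr_dim) embeddings of the n instructions
        # ctx_emb:   (batch, instr_dim) embedding of the current task context
        scores = F.softmax(ctx_emb @ instr_emb.T, dim=-1)  # (batch, n) importance
        n = instr_emb.shape[0]
        w_down = self.down_gen(instr_emb).view(n, self.hidden_dim, self.bottleneck)
        w_up = self.up_gen(instr_emb).view(n, self.bottleneck, self.hidden_dim)
        # Mix the n generated parameter sets with the importance scores.
        w_down = torch.einsum("bn,nhk->bhk", scores, w_down)
        w_up = torch.einsum("bn,nkh->bkh", scores, w_up)
        # Apply the resulting bottleneck adapter with a residual connection.
        z = F.relu(torch.einsum("bh,bhk->bk", h, w_down))
        return h + torch.einsum("bk,bkh->bh", z, w_up)

The softmax keeps the mixture of generated parameter sets normalized across the n instructions, so instructions judged more relevant to the current context contribute more to the adapter.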

Figure 3: Model architecture of the Decision Transformer with Game Instruction.

BibTeX

@article{Jin2024,
  title={Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction},
  author={Jin, Yonggang and Zhang, Ge and Zhao, Hao and Zheng, Tianyu and Guo, Jiawei and Xiang, Liuyu and Yue, Shawn and Huang, Stephen W. and Chen, Wenhu and He, Zhaofeng and Fu, Jie},
  journal={arXiv preprint arXiv:2402.04154},
  year={2024}
}