MIDI-GPT
MIDI-GPT is a generative system developed by the Metacreation Lab for computer-assisted music composition workflows. It is based on the Transformer architecture and is designed to be style-agnostic, expressive, and steerable.
Key Features and Capabilities:
Multi-Track Support: MIDI-GPT uses an alternative representation for multi-track musical material that decouples track information from note tokens. This lets the system accommodate all 128 General MIDI instruments and generate more than 10 tracks simultaneously, depending on their content.
Flexible Input/Output: The system uses the General MIDI format as its input and output. It does not require a fixed instrument schema and can accommodate a variety of user workflows.
Controllable Generation: MIDI-GPT supports various generation tasks, including unconditional generation, continuation, infilling, and attribute control. It allows for the infilling of musical material at the track and bar level.
Attribute Controls: The system allows for control of musical attributes such as instrument type, musical style, note density, polyphony level, and note duration. It uses categorical, value, and range controls to achieve this. For example, instrument control is a categorical control, note density is a value control, and note duration and polyphony are range controls.
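As a rough illustration of how these three control types could be expressed as tokens prefixed to a track, consider the sketch below. The token spellings, value ranges, and the helper function are illustrative assumptions, not MIDI-GPT's actual vocabulary.

```python
# Illustrative sketch only: token names and value ranges are assumptions,
# not the exact control vocabulary used by MIDI-GPT.

def track_control_tokens(instrument, note_density, polyphony_range, duration_range):
    """Build the control tokens that would prefix one track's note tokens."""
    return [
        f"INSTRUMENT={instrument}",             # categorical control: one of 128 GM programs
        f"NOTE_DENSITY={note_density}",         # value control: a single target level
        f"POLYPHONY_MIN={polyphony_range[0]}",  # range control: lower bound
        f"POLYPHONY_MAX={polyphony_range[1]}",  # range control: upper bound
        f"DURATION_MIN={duration_range[0]}",    # range control: lower bound
        f"DURATION_MAX={duration_range[1]}",    # range control: upper bound
    ]

print(track_control_tokens(instrument=0, note_density=5,
                           polyphony_range=(1, 3), duration_range=(2, 8)))
```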
Expressiveness: MIDI-GPT can generate expressive music by including velocity and microtiming tokens in its representation. It uses DELTA tokens to encode the time difference between the original MIDI note onset and the quantized token onset.
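A minimal sketch of this microtiming scheme follows, assuming a tick-based quantization grid; the grid resolution and token spellings are assumptions made for illustration.

```python
# Hedged sketch of microtiming encoding: the grid resolution and DELTA token
# format are assumptions; MIDI-GPT's actual vocabulary may differ.

TICKS_PER_QUARTER = 480
GRID_DIVISIONS_PER_QUARTER = 12  # quantization grid (assumed)
GRID_TICKS = TICKS_PER_QUARTER // GRID_DIVISIONS_PER_QUARTER

def microtiming_tokens(onset_ticks):
    """Quantize an onset to the grid and encode the residual as a DELTA token."""
    quantized = round(onset_ticks / GRID_TICKS) * GRID_TICKS
    delta = onset_ticks - quantized  # signed offset from the grid, in ticks
    return [f"TIME_POSITION={quantized // GRID_TICKS}", f"DELTA={delta}"]

# A note played 7 ticks ahead of the third grid step:
print(microtiming_tokens(onset_ticks=3 * GRID_TICKS - 7))
# -> ['TIME_POSITION=3', 'DELTA=-7']
```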
Tokenization:
MIDI-GPT uses two main tokenizations: the Multi-Track representation and the Bar-Fill representation.
The Multi-Track representation represents each bar of music as a sequence of tokens, including NOTE ON, TIME POSITION, and DURATION tokens. Bars are delimited by BAR START and BAR END tokens, tracks are delimited by TRACK START and TRACK END tokens, and the entire piece begins with a START token.
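For concreteness, a hand-written token sequence for a one-track, one-bar piece might look like the following; the exact token spellings and duration units are assumptions, but the delimiter scheme follows the description above.

```python
# A tiny Multi-Track sequence: one track, one bar, two notes.
tokens = [
    "START",
    "TRACK_START",
    "BAR_START",
    "TIME_POSITION=0", "NOTE_ON=60", "DURATION=4",  # C4 (assumed units)
    "TIME_POSITION=4", "NOTE_ON=64", "DURATION=4",  # E4
    "BAR_END",
    "TRACK_END",
]
```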
The Bar-Fill representation allows for bar-level infilling by replacing bars to be predicted with a FILL IN token. These bars are then placed at the end of the piece, delimited by FILL START and FILL END tokens.
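A small sketch of how a piece could be rewritten into the Bar-Fill form; the helper function and token spellings are illustrative assumptions rather than MIDI-GPT's implementation.

```python
# Sketch of the Bar-Fill transformation: the bar to be infilled is replaced by
# FILL_IN, and its content is moved to the end between FILL_START / FILL_END.
# Token spellings and the helper below are illustrative assumptions.

def to_bar_fill(bars, fill_index):
    """bars: list of token lists, one per bar; fill_index: bar to infill."""
    body = []
    for i, bar in enumerate(bars):
        if i == fill_index:
            body.append("FILL_IN")  # placeholder where the bar was
        else:
            body.extend(["BAR_START", *bar, "BAR_END"])
    # At training time the target bar follows the piece; at inference time the
    # model generates the tokens after FILL_START itself.
    return body + ["FILL_START", *bars[fill_index], "FILL_END"]

bar_a = ["TIME_POSITION=0", "NOTE_ON=60", "DURATION=4"]
bar_b = ["TIME_POSITION=0", "NOTE_ON=67", "DURATION=4"]
print(to_bar_fill([bar_a, bar_b], fill_index=1))
```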
Training and Evaluation:
The system was trained using the GigaMIDI dataset.
MIDI-GPT was evaluated for originality, stylistic similarity, and effectiveness of attribute controls.
The results showed that MIDI-GPT can generate original variations and maintain stylistic similarity to the training data.
The attribute controls for note density and note duration were found to be effective, while the polyphony-level control was less effective.
Real-world Applications:
MIDI-GPT is being integrated into synthesizers, game music composition software, and digital audio workstations.
It has been used in various artistic projects, including entries to AI song contests and the creation of music albums.
The system has also been used to compose adaptive music for games and has been the subject of an artistic residency.
Ethical Considerations:
MIDI-GPT inherits the biases present in the training dataset, which may underrepresent certain musical styles.
The legal status of such a model is unclear, but its use has been restricted to non-commercial purposes so far.
Future Work:
Future work includes optimizing the model for real-time generation, training larger models, expanding the set of attribute controls, and continuing the integration of MIDI-GPT into real-world products and practices.
MIDI-GPT is available as an Open RAIL-M licensed MMM (Multi-Track Music Machine) model.