MusicLM: Generating Music From Text


Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

Google Research

Abstract

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

Audio Generation From Rich Captions

Caption Generated audio
The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls.
A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable.
A rising synth is playing an arpeggio with a lot of reverb. It is backed by pads, sub bass line and soft drums. This song is full of synth sounds creating a soothing and adventurous atmosphere. It may be playing at a festival during two songs for a buildup.
Slow tempo, bass-and-drums-led reggae song. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back feel, very expressive.

Long Generation

Text prompt Generated audio
melodic techno
swing
relaxing jazz

Story Mode

The audio is generated by providing a sequence of text prompts. These influence how the model continues the semantic tokens derived from the previous caption.
Text prompts Generated audio
time to meditate (0:00-0:15)
time to wake up (0:15-0:30)
time to run (0:30-0:45)
time to give 100% (0:45-1:00)
electronic song played in a videogame (0:00-0:15)
meditation song played next to a river (0:15-0:30)
fire (0:30-0:45)
fireworks (0:45-1:00)
jazz song (0:00-0:15)
pop song (0:15-0:30)
rock song (0:30-0:45)
death metal song (0:45-1:00)
rap song (1:00-1:15)
string quartet with violins (1:15-1:30)
epic movie soundtrack with drums (1:30-1:45)
scottish folk song with traditional instruments (1:45-2:00)
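
The prompt lists above switch to a new text prompt every 15 seconds. As a minimal sketch of how such a schedule could be laid out, the helper below assigns each prompt a fixed-length window and formats the boundaries as minutes and seconds. The function names and the `segment_seconds` parameter are hypothetical and used only for illustration; they are not part of MusicLM.

```python
def format_timestamp(seconds):
    """Format a number of seconds as m:ss, e.g. 75 -> '1:15'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


def build_schedule(prompts, segment_seconds=15):
    """Give each prompt a consecutive window of `segment_seconds` seconds.

    Returns a list of (prompt, start, end) tuples with formatted timestamps.
    This is an illustrative assumption about the layout, not the model's API.
    """
    schedule = []
    for index, prompt in enumerate(prompts):
        start = index * segment_seconds
        end = start + segment_seconds
        schedule.append((prompt, format_timestamp(start), format_timestamp(end)))
    return schedule


story = ["time to meditate", "time to wake up", "time to run", "time to give 100%"]
for prompt, start, end in build_schedule(story):
    print(f"{prompt} ({start}-{end})")
```

Using `divmod` keeps the seconds field below 60, so a window ending at 60 seconds is written as a full minute.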

Text and Melody Conditioning

By adding melody embeddings to the conditioning, we can generate music that respects the text prompt while following the provided melody.
Melody prompts (columns →): bella ciao - humming; bella ciao - jingle bells - whistling; mozart symphony25 - whistling; ode to joy - humming fingerstyle guitar; jingle bells - marimba; twinkle twinkle little star - piano; when the saints go marching in - strings
Text prompts (rows ↓):
a cappella chorus
electronic synth lead
guitar solo
jazz with saxophone
opera singer
piano solo
string quartet
tribal drums and flute

Painting Caption Conditioning

Painting title and author Painting image (from Wikipedia) Painting description Generated audio
The Persistence of Memory - Salvador Dalí
"His melting-clock imagery mocks the rigidity of chronometric time. The watches themselves look like soft cheese—indeed, by Dali's own account they were inspired by hallucinations after eating Camembert cheese. In the center of the picture, under one of the watches, is a distorted human face in profile. The ants on the plate represent decay." By Gromley, Jessica. "The Persistence of Memory". Encyclopedia Britannica, 14 Apr. 2022.
Napoleon Crossing the Alps - Jacques-Louis David
"The composition shows a strongly idealized view of the real crossing that Napoleon and his army made across the Alps through the Great St Bernard Pass in May 1800." By Wikipedia
Dance - Henri Matisse
"Made early in his career, Matisse's Dance, 1910, shows a group of red dancers caught in a collective moment of innocent freedom and joy, holding hands as they whirl around in space. Simple and direct, the painting speaks volumes about our deep-rooted, primal human desire for connection, movement, rhythm and music." By thecollector.com
The Scream - Edvard Munch
"Inspired by a hallucinatory experience in which Munch felt and heard a scream throughout nature, it depicts a panic-stricken creature, simultaneously corpse like and reminiscent of a sperm or fetus, whose contours are echoed in the swirling lines of the blood-red sky." By Zaczek, Iain. "The Scream". Encyclopedia Britannica, 14 Apr. 2022.

10s Audio Generation From Text

Instruments
Caption Generated audio
acoustic guitar
cello
electric guitar
flute
Genres
Caption Generated audio
8 bit
ambient
berlin 90s house
big beat
Musician Experience Level
Caption Generated audio
beginner piano player
intermediate piano player
professional piano player
crazy fast professional piano player
Places
Caption Generated audio
beach in the caribbeans
escaping prison
gym
opera
Epochs
Caption Generated audio
club in the 50s
club in the 60s
club in the 70s
club in the 80s
Accordion Solos
Caption Generated audio
accordion death metal
accordion edm
accordion piano
accordion rap

Generation Diversity

We test the diversity of the generated samples while keeping the conditioning and/or the semantic tokens constant.
Same Text Prompt
Text prompt: Motivational music for sports
Same Text Prompt and Same Semantic Tokens
Text prompt: Motivational music for sports
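
The two settings above can be pictured with a toy two-stage sampler. This is only a sketch under loud assumptions: `sample_tokens`, `VOCAB`, the token lengths, and the seeds are all hypothetical, and the random weights stand in for a real model. It illustrates the idea that the conditioning fixes a distribution while each sample is a fresh draw from it, so holding the text prompt (and optionally the semantic tokens) constant can still yield different outputs.

```python
import random

VOCAB = 8  # hypothetical toy vocabulary size, not MusicLM's


def sample_tokens(conditioning, length, seed):
    """Draw `length` toy tokens from a distribution fixed by `conditioning`.

    The same conditioning always gives the same weights; only `seed`
    changes which tokens are actually drawn.
    """
    weight_rng = random.Random(repr(conditioning))  # stand-in for a model
    weights = [weight_rng.random() + 0.1 for _ in range(VOCAB)]
    draw_rng = random.Random(seed)
    return draw_rng.choices(range(VOCAB), weights=weights, k=length)


prompt = "Motivational music for sports"

# Same text prompt: each seed is a separate draw of semantic tokens.
semantic_a = sample_tokens(prompt, 6, seed=0)
semantic_b = sample_tokens(prompt, 6, seed=1)

# Same text prompt and same semantic tokens: only the second stage is redrawn.
second_stage_conditioning = (prompt, tuple(semantic_a))
acoustic_1 = sample_tokens(second_stage_conditioning, 12, seed=10)
acoustic_2 = sample_tokens(second_stage_conditioning, 12, seed=11)

print(semantic_a, semantic_b)
print(acoustic_1, acoustic_2)
```

In this toy setup, fixing the semantic tokens narrows what can vary to the second stage, which mirrors the distinction between the two cases listed above.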