JEN-1 Music AI Research Paper*
*IN SIMPLE WORDS

THIS IS NOT THE RESEARCH PAPER

Simply put, AI tech has made real progress in text and image generation. But music is much harder to tackle, so the approaches are few and far between.

We’ve been working on this for years. Not only regarding how to do it, but why to do it. We just published the first of many research papers to give you a window into our progress in the music AI space. The paper is dense. It’s supposed to be. We’re doing something quite complicated.

At Futureverse, our focus is to make our technology invisible – so that the user experience is seamless and simple. We also aim to make complicated topics easier to understand. So, we thought we’d give you a digestible overview of our JEN-1 Music AI Research Paper. If you already attempted to read the full version (futureverse.com/research/jen), good for you. If you understood it all, you get a gold star. If at any part of it (including simply opening it) you thought “WTF is JEN-1: Text-guided universal music generation with omnidirectional diffusion models…” then read on.

The JEN-1 Gist

JEN-1 is the first text-to-music AI model we built at Futureverse. It’s very easy to use. You type the music you want to generate into a prompt; for example, “pop dance song with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach” and Jen will whip it right up for you.

Now, if you want, you can stop reading after the following statement, because here’s the quick gist of what you need to know: we believe Jen produces higher-quality audio with more accuracy than the other models out there. And Meta and Google are playing ball with their own text-to-music models. So while the playing field is small, the teams are mighty.

Our paper is called “JEN-1: TEXT-GUIDED UNIVERSAL MUSIC GENERATION WITH OMNIDIRECTIONAL DIFFUSION MODELS” and it was written by our brilliant team of PhDs and researchers. The title means we’ve built a mathematical model (a diffusion model, guided by text) for how sound is generated.

Sound is intricate. Anything you hear, be it a clang, the wind, or something as detailed as a song, is made up of frequencies. Those frequencies span an expansive range, from low to high pitches: higher notes have higher frequencies, lower notes have lower frequencies. A soundwave is like the ripple created when you throw a rock into water. The frequency, in this metaphor, is how quickly the ripples repeat. The chirp of a bird has high frequencies with rapid, closely sequenced ripples, while a bass drum creates low frequencies with ripples that are more spread out.
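If you want to see that idea in a few lines of code, here’s a tiny sketch (our illustration, not something from the paper) that generates a high-pitched tone and a low-pitched one. The only difference between them is how fast the ripples repeat.

```python
import numpy as np

SAMPLE_RATE = 48_000  # samples per second, the same rate JEN-1 outputs at

def sine_wave(frequency_hz: float, duration_s: float = 1.0) -> np.ndarray:
    """A pure tone: one ripple pattern repeating `frequency_hz` times per second."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return np.sin(2 * np.pi * frequency_hz * t)

bird_chirp = sine_wave(4000.0)  # high pitch: the ripple repeats 4,000 times per second
bass_drum = sine_wave(60.0)     # low pitch: the ripple repeats only 60 times per second
```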

Another challenge lies in the fact that a single song can have multiple instruments and unique melodic arrangements, resulting in a vast range of sonic styles and intricacies. This necessitates a layered training approach that is as sophisticated as the music itself. We’ve trained Jen to become a musicologist. In other words, she’s adept at analyzing the composition of a track in numerous ways at once. She’s an excellent multi-tasker. We’ve taught her to find holes in the composition of an output and to be smart enough to fill each hole with an element of music that makes sense melodically and harmonically. Jen can also extend or continue a song, which is super cool and useful. Imagine you created a ’70s rock track but wanted to hear where Jen would take it if she jammed out. Where would the guitar solo or drum beat go if it kept playing? Ask Jen. She’ll show you.
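Here’s a rough sketch of what “find the hole and fill it” looks like in code. Everything below is illustrative only: `jen_inpaint` is a made-up stand-in for a trained model, not a real API from the paper.

```python
import numpy as np

def make_inpainting_mask(num_samples: int, hole_start: int, hole_end: int) -> np.ndarray:
    """1 = keep the original audio, 0 = the 'hole' the model must fill in."""
    mask = np.ones(num_samples)
    mask[hole_start:hole_end] = 0.0
    return mask

audio = np.random.randn(48_000 * 10)  # stand-in for a 10-second track at 48 kHz
mask = make_inpainting_mask(len(audio), 48_000 * 4, 48_000 * 6)  # blank out seconds 4-6

# The model only sees the kept audio and is asked to generate the missing two seconds
# so the result flows melodically and harmonically with what surrounds it.
# filled = jen_inpaint(audio * mask, mask)  # hypothetical call, for illustration only
```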

How Does JEN-1 Stack Up?

We analyze Jen daily and compare her to others. She’s special. We have a board of well-respected music industry A&Rs. They agree. Our testing has shown that her output quality is higher than that of other models.

This is primarily because we use highly efficient training methods. We work with raw WAV files (actual music) rather than spectrograms (which convert audio into visual representations of its frequency content). The reason we prefer this approach is that things can get lost in translation during the spectrogram conversion. As the music file is compressed, the more granular details of the original audio can be lost, ultimately eroding fidelity.
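To see what “lost in translation” means, here’s a small illustrative Python example (not from the paper) using the librosa library. Converting audio into a magnitude spectrogram throws away the phase information, so converting back is only ever an approximation.

```python
import numpy as np
import librosa

# Round-tripping audio through a magnitude spectrogram loses the phase information,
# so the reconstruction is only an approximation of the original waveform.
y, sr = librosa.load(librosa.example("trumpet"), sr=None)  # example clip bundled with librosa

stft = librosa.stft(y)    # complex time-frequency representation
magnitude = np.abs(stft)  # the "picture" a spectrogram keeps; the phase is discarded here

y_reconstructed = librosa.griffinlim(magnitude)  # best-effort guess at the lost phase

# y_reconstructed != y: the fine detail discarded in the conversion can't be perfectly
# recovered, which is why training directly on the raw waveform preserves more fidelity.
```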

At this point, it should be no surprise that we think Jen is a GigaBrain (an ASM GigaBrain, that is). Her neural network is thoughtfully designed to compress and decompress audio efficiently, using a statistical model (the diffusion part of her name) that spreads the sound out so it’s easier to process. This method enables Jen to iterate faster. And at 48kHz, the audio quality of our music output exceeds what you typically hear on Spotify and Apple Music (44.1kHz).
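Some quick back-of-the-envelope math on those sample rates, just to make the comparison concrete:

```python
# Simple arithmetic showing what the sample-rate difference means in practice.
STREAMING_RATE = 44_100  # samples per second (CD quality / typical streaming)
JEN_RATE = 48_000        # samples per second (JEN-1 output)

seconds = 60 * 3  # a three-minute track
print(f"Streaming-quality samples: {STREAMING_RATE * seconds:,}")  # 7,938,000
print(f"JEN-1 samples:             {JEN_RATE * seconds:,}")        # 8,640,000

# Roughly 9% more samples per second, which lets the waveform represent frequencies
# up to 24 kHz instead of 22.05 kHz (half the sample rate, the Nyquist limit).
```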

The intersection of text and music, known as text-to-music generation, is going to change the world as we know it. Being a pioneer in this space comes with significant responsibility that we don’t take lightly. You can be sure we’re in our lab grinding to set the tone. More to come soon.

Oh, and if you’re no longer saying WTF, give our full academic paper a read. Probably twice.