Over the summer, the art world was rocked after a “painting” created by an AI won a state fair art contest. Although image generators have been around for decades, they’ve mostly been treated as a novelty – a fun quirk of the internet for people to play around with. But the AI art that won the competition crossed a kind of Rubicon. In doing so, it made something very clear to the world: the AI art wars have begun.
With bots like Midjourney and DALL-E, not to mention Google’s Imagen, developers and engineers are stepping up to create ever more sophisticated text-to-image generators.
Now Meta (Facebook’s parent company) has thrown its hat into the ring – but instead of a bot that turns a text prompt into an image, it has created one that can turn your words into videos.
The company announced in a blog post last week that it has developed a text-to-video generator called Make-A-Video. The AI-driven system appears to be Meta’s answer to DALL-E and Imagen – only instead of just creating a still image, it spits out full videos.
“Generative AI research advances creative expression by giving people tools to create new content quickly and easily,” Meta’s announcement reads. “With just a few words or lines of text, Make-A-Video can bring imaginations to life and create unique videos full of vibrant colors, characters, and scenery.”
In addition to text inputs, the bot “can also create videos from images or take existing videos and create new, similar ones,” the company added. In other words, you can feed it existing images and footage, and it will generate new video based on them.
Although not the first of its kind (researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence unveiled a similar text-to-video generator, dubbed CogVideo, in May), Make-A-Video is remarkable for the way it was trained – training being a big hurdle when it comes to bots that make videos.
That’s because a typical text-to-image generator is usually trained on massive datasets consisting of billions of images paired with alt-text descriptions of their contents. These text-image pairs allow the AI to learn what image to produce when you type in a prompt.
However, in a paper published by Meta AI researchers on September 29, the authors wrote that video generators don’t have this advantage. Such programs can only rely on datasets containing a few million captioned videos (which is how CogVideo was trained).
To get around this problem and make use of existing image datasets, Meta’s researchers used text-to-image models to train their bot to recognize the connection “between text and the visual world.” They then trained Make-A-Video on video datasets to teach it realistic motion.
So if you enter a prompt like “a horse drinking water,” the bot can draw on its image training to figure out what a horse drinking water looks like. It then relies on its video training to work out that large, four-legged creatures near watering holes typically lap up water with their mouths – and produces a video of just that.
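The two-stage idea can be sketched, very loosely, in a toy snippet. To be clear, this is purely illustrative: the names and data below are invented, and Meta’s actual system is a deep neural network, not a lookup table.

```python
# Toy sketch of the two-stage training idea behind Make-A-Video (illustrative only).
# Stage 1: "appearance" knowledge, learned from captioned text-image pairs.
appearance = {
    "horse": "frame of a horse",
    "water": "frame of water",
}

# Stage 2: "motion" knowledge, learned from uncaptioned video (no text labels needed).
motion_for = {
    "horse": "laps up water with its mouth",
}

def make_a_video(prompt: str) -> dict:
    """Combine learned appearance with learned motion into a 'video' description."""
    subjects = [word for word in prompt.split() if word in appearance]
    frames = [appearance[s] for s in subjects]
    motions = [motion_for[s] for s in subjects if s in motion_for]
    return {"frames": frames, "motions": motions}

clip = make_a_video("horse drinking water")
print(clip)
```

The key point the sketch captures is the division of labor: what things look like comes from image data with text attached, while how they move comes from video data that needs no text at all.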
The results are amazing, but come with some obvious limitations. Although the generator can sometimes create realistic and impressive videos, they’re not perfect. The subjects of some clips can appear just as malformed and off-putting as the output of the most rudimentary text-to-image generators. The videos aren’t long either – less than five seconds – so no feature films yet. There’s also no sound. But it’s clear that Make-A-Video is nonetheless a giant, transformative leap forward for AI and AI-based art.
However, the same data that created these videos will likely produce a familiar problem that has dogged so many AI bots before it: bias. After all, these generators are only as good as the data they’re trained on. Train one on biased data, and you’ll get biased results.
There’s no end of real-life examples of the damage caused by biased AI – from racist chatbots, to mortgage-lending algorithms that reject Black and brown applicants, to highly sophisticated image generators like DALL-E mimicking our biased tendencies.
Meta’s Make-A-Video is no exception. Although the paper’s authors note that they attempted to filter NSFW content and toxic language out of its datasets, they admit that “our models have learned and likely exaggerated social biases, including harmful biases.” However, they add that all of their data is publicly available, in an effort to add a “layer of transparency” to the models.
So the bias we saw with text-to-image generators still exists – only now it comes with full videos. In the age of deepfakes and rampant misinformation, that’s a chilling prospect. Although Make-A-Video still only produces rudimentary clips, it’s easy to imagine the technology inevitably becoming so refined that its videos are indistinguishable from reality.
What Meta is doing shouldn’t come as much of a surprise to anyone who’s watched the rise of bot-generated images over the past few years. And it’s no coincidence that these companies are investing more and more in these systems. With the exponential explosion of research and development of these models, we’ve seen text-to-image generators transform from producing malformed images you had to squint at to vaguely recognize, into full-fledged works of art – jumping from the realm of the strange directly into hyper-realism.
This flood of newer, more sophisticated bot-created artwork is fueling what has become a growing arms race in AI image generation – and it’s a race that’s heating up in a big way. It’s telling that the company formerly known as Facebook chose to announce its bot on the same day Google unveiled DreamFusion, its text-to-3D generator.
While many are outraged by AI winning contests and being used for things like book covers, this is really just the beginning. If Make-A-Video lives up to its promise, we’re not far off from seeing fully AI-generated feature films complete with sound, characters, and story. This isn’t hyperbole. It’s inevitable. You can’t put this knowledge back in the box.
In other words: we have come a long way from telling stories and making shadow puppets around a fire in a cave. Now it’s only a matter of time before the fire we’ve created starts casting its own shadows against the cave walls.