
Meta’s revolutionary AI filmmaker: Make-A-Video

Meta AI’s new video generation model is out, and in one sentence: it generates videos from text. Not only is it capable of generating videos, but it is also the new state-of-the-art method, producing higher-quality and more consistent videos than ever before!

You can think of this model as a Stable Diffusion model for videos: surely the next step after being able to generate images. This is all information that you must have seen before on a news site or just by reading the title of the article, but what you don’t know yet is what exactly it is and how it works.

Here’s how…

References

► Read the full article: https://www.louisbouchard.ai/make-a-video/
► Meta’s blog post: https://ai.facebook.com/blog/generative-ai-text-to-video/
► Singer et al. (Meta AI), 2022, “Make-A-Video: Text-to-Video Generation without Text-Video Data”, https://makeavideo.studio/Make-A-Video.pdf
► Make-A-Video (official page): https://makeavideo.studio/?fbclid=IwAR0tuL9Uc6kjZaMoJHCngAMUNp9bZbyhLmdOUveJ9leyyfL9awRy4seQGW4
► PyTorch implementation: https://github.com/lucidrains/make-a-video-pytorch
► My Newsletter (a new AI application explained every week in your inbox!): https://www.louisbouchard.ai/newsletter/

Video transcript

0:00

Meta AI’s new model, Make-A-Video, is out,

0:03

and in a single sentence: it generates

0:05

videos from text. Not only is it able to

0:07

generate videos, but it is also the new

0:09

state-of-the-art method, producing higher

0:11

quality and more consistent videos than

0:14

ever. You can see this model as a Stable

0:16

Diffusion model for videos, surely the

0:19

next step after being able to generate

0:21

images. This is all information you must have

0:23

already seen on a news site or

0:26

just by reading the title of the video,

0:28

but what you don’t know yet is what

0:30

it is exactly and how it works.

0:33

Make-A-Video is the most recent publication by Meta

0:35

AI, and it allows you to generate a

0:37

short video from text inputs alone,

0:40

like this one. This adds complexity

0:42

to the image generation task: not only

0:45

does it have to generate several frames of

0:47

the same subject and the same scene, but they

0:49

also must be consistent over time. You can’t

0:51

simply generate 60 images using DALL·E

0:53

and make a video out of them; it will just look

0:56

bad and nothing near realistic. You need a

0:58

model that understands the world in a

1:00

better way and takes advantage of this level of

1:02

understanding to generate a coherent

1:04

series of images that blend well

1:06

together. You basically want to simulate

1:08

a world and then simulate recordings of it.

1:11

But how can you do that? Usually, you

1:14

would need tons of text-video pairs to

1:16

train your model to generate such videos

1:18

from text inputs, but not in this case.

1:21

Since this kind of data is really

1:23

difficult to obtain and the training costs

1:25

are super expensive, they approached the

1:27

problem differently. Another way is to

1:30

take the best text-to-image model and

1:32

adapt it to videos, and that’s what they

1:35

did in a research paper they just

1:38

published. In their case, the text-to-image

1:40

model is another model by Meta called

1:43

Make-A-Scene, which I covered in a previous

1:45

video if you want to know more about it.

1:47

But how do you adapt such a model to

1:50

take time into consideration? You add a

1:53

spatiotemporal pipeline for your model

1:55

to be able to process videos. This means

1:58

that the model will not only generate one

2:00

image but, in this case, 16 of them at low

2:03

resolution, to create a short, coherent

2:06

video, in a similar way as a text-to-image

2:08

model, but adding a one-dimensional

2:11

convolution along with the regular

2:13

two-dimensional ones. This simple addition

2:15

allows them to keep the pre-trained

2:17

two-dimensional convolutions unchanged

2:19

and add a temporal dimension that they

2:22

will train from scratch, reusing most of

2:25

the code and model parameters of the

2:27

image model they started from.
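To make that idea concrete, here is a minimal PyTorch sketch of such a factorized space-time convolution: a per-frame 2D convolution (standing in for the pre-trained spatial layer) followed by a new 1D convolution along the frame axis. The class name `Pseudo3DConv`, the shapes, and the identity initialization are illustrative assumptions, not code from the paper or from the lucidrains repository.

```python
# Minimal sketch of a factorized ("pseudo-3D") convolution block: a 2D
# spatial convolution (reused from the pre-trained text-to-image model)
# followed by a new 1D temporal convolution, applied to a batch of
# videos shaped (B, C, T, H, W). Names and details are illustrative.
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Spatial convolution: kept identical to the pre-trained 2D layer.
        self.spatial = nn.Conv2d(in_channels, out_channels,
                                 kernel_size, padding=kernel_size // 2)
        # Temporal convolution: new, trained from scratch. Initialized as
        # an identity so the block starts out behaving like the image model
        # (assumption for this sketch; valid when in/out channels match).
        self.temporal = nn.Conv1d(out_channels, out_channels,
                                  kernel_size, padding=kernel_size // 2)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):
        b, c, t, h, w = x.shape
        # 2D convolution applied to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        c = x.shape[1]
        # 1D convolution along the time axis for every spatial position.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1)
        x = x.reshape(b * h * w, c, t)
        x = self.temporal(x)
        # Back to the (B, C, T, H, W) layout.
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


frames = torch.randn(1, 64, 16, 32, 32)        # 16 low-resolution frames
print(Pseudo3DConv(64, 64)(frames).shape)      # torch.Size([1, 64, 16, 32, 32])
```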

2:30

We also want to guide our generations with text

2:32

input, which will be very similar to

2:34

image models, using CLIP embeddings, a

2:37

process I go into detail on in my Stable

2:39

Diffusion video, if you are not familiar

2:41

with it.
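For readers who have not seen that video, the snippet below shows roughly what “using CLIP embeddings” means in practice. It assumes OpenAI’s `clip` package; the prompt, the model choice, and how the embedding would actually be injected into Make-A-Video’s generator are illustrative, not taken from the paper.

```python
# Minimal sketch of text conditioning with CLIP embeddings, assuming
# OpenAI's clip package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

prompt = "a dog wearing a superhero cape flying through the sky"
tokens = clip.tokenize([prompt]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # shape (1, 768) for ViT-L/14

# The generator then mixes this embedding with the image features
# (e.g. through cross-attention) at every step, for every frame.
```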

2:43

But they also add the temporal dimension when

2:45

merging the text features with the

2:47

image features, doing the same thing:

2:49

keeping the attention module I described

2:52

in my Make-A-Scene video and adding a

2:55

one-dimensional attention module for

2:57

temporal considerations, copy-pasting the

3:00

image generator model and duplicating

3:02

the generation modules for one more

3:04

dimension to have all our 16 initial frames.
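A rough sketch of what such a one-dimensional temporal attention module could look like in PyTorch is shown below: the pre-trained spatial attention still runs per frame (not shown), while this new layer lets the 16 frames attend to each other at every spatial position. The class name, shapes, and residual wiring are my own assumptions for illustration.

```python
# Rough sketch of a 1D temporal attention layer: for each spatial position,
# the T frames attend to each other along the time axis.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) feature maps for T frames.
        b, t, h, w, c = x.shape
        # One sequence of T tokens per spatial position.
        tokens = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended                 # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)


feats = torch.randn(1, 16, 8, 8, 320)              # 16 frames of 8x8 features
print(TemporalAttention(320)(feats).shape)         # torch.Size([1, 16, 8, 8, 320])
```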

3:07

But what can you do with 16

3:10

frames? Nothing really interesting.

3:13

We need to make a high-definition video

3:16

out of these frames. The model will do

3:19

that by having access to previous and

3:21

future frames and iteratively

3:23

interpolating from them, both in terms of

3:27

the temporal and spatial dimensions at

3:30

the same time, essentially generating new

3:33

and larger frames in between those

3:35

16 initial images, based on the frames

3:38

before and after them, which will

3:40

make the movement coherent

3:43

and the overall video smooth. This is done

3:45

using a frame interpolation network,

3:47

which I have also described in other videos,

3:50

that will basically take the images we

3:52

have and fill in the gaps, generating the

3:54

in-between information. It will do the same

3:57

thing for the spatial component: enlarging

3:59

the image and filling in the pixel gaps to

4:02

make it more high-definition.
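As a stand-in, the snippet below only illustrates what “filling the temporal and spatial gaps” means, using plain trilinear interpolation. Make-A-Video uses learned frame-interpolation and super-resolution networks for this step, and the target frame count and resolution here are arbitrary.

```python
# Naive stand-in for the interpolation step, only to show the idea of going
# from 16 small frames to more, larger frames. The target sizes are arbitrary.
import torch
import torch.nn.functional as F

video = torch.rand(1, 3, 16, 64, 64)              # (B, C, T, H, W)

# Interpolate along time, height and width in one trilinear pass.
upsampled = F.interpolate(video, size=(61, 256, 256),
                          mode="trilinear", align_corners=False)
print(upsampled.shape)                             # torch.Size([1, 3, 61, 256, 256])
```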

4:04

So, to summarize: they fine-tune a

4:07

text-to-image model for video generation. This

4:09

means they take an already powerful pre-trained

4:12

model, adapt it, and train it a

4:14

bit more to get used to videos. This

4:16

retraining is done with unlabelled

4:19

videos, just to teach the model how to

4:21

understand videos and frame-to-frame

4:23

consistency, which makes the dataset

4:25

building process much simpler. Then, they

4:27

use an image-optimized model once again

4:30

to improve the spatial resolution, and a

4:32

last frame interpolation component to

4:35

add more frames to make the video smooth.
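Put together, the whole pipeline can be summarized in a few lines of pseudocode. Every function below is a hypothetical placeholder returning dummy tensors; it only mirrors the stages listed above, not a real API.

```python
# Pseudocode-style recap of the stages described above, wired together with
# placeholder tensors. Every function name is hypothetical.
import torch
import torch.nn.functional as F


def encode_text_with_clip(prompt: str) -> torch.Tensor:
    return torch.randn(1, 768)                     # placeholder text embedding


def spatiotemporal_decoder(text_emb: torch.Tensor) -> torch.Tensor:
    return torch.rand(1, 3, 16, 64, 64)            # 16 low-resolution frames


def frame_interpolation_network(frames: torch.Tensor) -> torch.Tensor:
    t = frames.shape[2] * 4 - 3                    # add frames in between
    return F.interpolate(frames, size=(t, frames.shape[3], frames.shape[4]),
                         mode="trilinear", align_corners=False)


def spatial_super_resolution(frames: torch.Tensor) -> torch.Tensor:
    t, h, w = frames.shape[2:]
    return F.interpolate(frames, size=(t, h * 4, w * 4),
                         mode="trilinear", align_corners=False)


video = spatial_super_resolution(
    frame_interpolation_network(
        spatiotemporal_decoder(
            encode_text_with_clip("a teddy bear painting a portrait"))))
print(video.shape)                                 # torch.Size([1, 3, 61, 256, 256])
```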

4:38

Of course, the results are not perfect yet,

4:40

just like text-to-image models, but we

4:43

know how fast progress goes. This was

4:45

just an overview of how Meta

4:47

managed to approach the text-to-video

4:49

task in this great paper. All the links

4:52

are in the description below if you

4:53

would like to know more about their approach.

4:55

A PyTorch implementation is also

4:57

already being developed by the community,

4:59

so stay tuned for that if you

5:02

would like to implement it yourself. Thank you

5:04

for watching the whole video, and I will

5:06

see you next time with another amazing

5:08

paper!
