Paper Review - WAN

Virtual Paper Review – Diffusion Transformers & Flow Matching

Transcription provided by Huntsville AI Transcribe

All right, well, let’s go ahead and get started. So the topic for tonight is video diffusion. We did talk a little bit about this at one of our sessions earlier this month, where Jay went over a lot of the different models at a top level: here’s what they can do, here’s what this stuff looks like. But tonight we’re going to go a little bit more in detail on one of those specific models that we have a lot of information on, which is this WAN 2.1 model and its technical report. And so, like our last few of these, we’re going to have this split into two major sections. In the front half, we’re going to talk a lot about the fundamentals and the current state of the art, and what the building blocks are that are part of this model’s solution.

And in the second half, we’ll talk about the specific things that WAN 2.1 does that are really, really cool.

And it does a lot of things that are really, really cool.

This is kind of the Stable Diffusion 1.5 moment for video, where everything kind of became accessible. The quality passed a certain bar, and most people can run it on the consumer-level GPUs they already have. And so there’s lots of really interesting things that they’ve done here, and I’m excited to share it with you guys. And so, just like any of the others, I’m watching chat over here, so feel free to stop and ask questions as you need to. I have tried to pack a little bit less into this one than the last ones, just so it’s not so rushed. But there’s still a lot to go through, just because the one paper is like 45 pages. So the topics that we’re going to hit at the high level are these seven.

So first, we’re going to kind of do a general context of what’s going on with generative models.

We’re going to talk about… the concept of the latent space.

What is that?

And what are variational auto encoders?

Then we’re just going to do a basic 101 on how does the score-based diffusion work?

And that’s kind of like the diffusion models that we’ve had for a very long time.

Then we’re going to go into what the current state of the art is for generative media, which is this flow matching with rectified flow, and then also the diffusion transformer.

And so this is kind of our, okay, what are we even talking about? And then we’re going to go into a bunch of different things that WAN did for each one of these, and go a little bit into their data and how they train different stuff. They have like five models that they slapped into this thing, and each one of them has a really, really cool way of going about stuff. And then there’s also a bunch of extra stuff they have for image-to-video, something called VACE, which is a way of editing video, extending it, and doing a lot of interesting stuff with that. Things with camera motion.

And they also, this hasn’t, I don’t think they’ve released this model yet, but they also are kind of hinting at some audio generation stuff too, where they’re generating the audio along with the video. All right. So just to get that started. We’ll talk first about just kind of the general idea of why video is hard. You know, we’ve had image models.

They’re pretty good at this point.

I think Midjourney, you know, Flux, all those have kind of hit that “it’s really hard to tell if it’s real or not” bar.

And we do not really have that problem with video, even some of the ones that we saw before, just because there’s lots of things that can go wrong whenever you’re starting to try and do video generation.

The main problem with this is that videos change over time. So the model doesn’t just have to correctly predict one static frame.

It has to keep it consistent over time.

It has to look like it kind of tracks physics without having an actual physics engine inside of it.

It’s a very difficult problem.

It also has the problem that it has to track all those things and get them into some sort of shared space that it can reason over. A 1080p video has around two million pixels per frame, and at 24 frames per second across a six-second video, that’s a lot to deal with. Now take that to a minute, take it to an hour, and it becomes a very intractable problem if you don’t have the right handling of it.
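Just to put rough numbers on that, here’s a quick back-of-the-envelope sketch (my own arithmetic, not figures from the paper):

```python
# Rough math for why raw video is hard to model directly.
height, width = 1080, 1920          # one 1080p frame
channels = 3                        # RGB
fps, seconds = 24, 6                # a short six-second clip

pixels_per_frame = height * width                   # ~2.07 million pixels
frames = fps * seconds                               # 144 frames
raw_values = pixels_per_frame * channels * frames    # ~900 million numbers

print(f"{pixels_per_frame:,} pixels per frame, {frames} frames, {raw_values:,} raw values")
# Scale that to a minute or an hour and it explodes, which is why everything
# gets compressed into a latent space before the diffusion model ever sees it.
```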

The other issue is that it’s hard to train because of all those problems, but it’s also hard to generate. So if I want to generate out a stable diffusion image, you know, I can do four of those at a time and, you know, do them quick, quick, quick.

And I get an understanding of whether that thing was good or not.

The larger the thing I have to generate, the slower the training process is.

And so because of this, there’s lots of problems with generation.

It’s kind of lagged behind.

And just imagine this video that I’m showing here, you know, all the things that the model is having to correctly predict at this time and how difficult it would be to try and approach each one of those things individually. But we now have models that can do this.

And this is a generated video from Sora.

You see all the stuff it’s keeping track of: the reflection, the refraction and all that sort of stuff, the fishes. It’s kind of nuts where this thing is now.

All right, so we do have some limitations here that have started to get solved. The first major round of these, where we really saw them get resolved in the generative models, was with the Flux architecture.

And this rectified flow architecture was actually introduced with Stable Diffusion 3, but Flux was the one that really got it out of the gate.

But this did not initially get transferred over for the video sort of model.

So the video models always lag a little bit behind. We got these cool architectures first with Flux, which generates with this more globally aware, flow-based methodology.

And the existing approaches built on image models, like AnimateDiff and all them, instead of predicting out the whole sequence,

are really just doing image to image to image to image, which has lots of issues with a very noisy diffusion process.

And so what we’re looking at with these flux models is this concept of rectified flow and the diffusion transformers, where we’re not just doing diffusion, we’re also taking the attention mechanism that we’ve kind of been talking about with the other models and injecting that in instead of a convolution inside of the diffusion process. Talk a little bit about how that works later on.

And so, just for general context, what WAN 2.1 does for all of this is introduce latent space compression along a spatio-temporal axis,

which basically means a space-time.

It’s able to condense spatial and time-based information together into a latent space, making it easy to train.

It does this rectified flow matching and brings it over into the video generation models. It utilizes the diffusion transformer, and it also adds the ability to do multimodal conditioning, where we’re able to, out of the gate, condition with text, image, video, and audio inputs, all into the diffusion process. And, you know, some of these have been done before by some of the other models.

Hunyuan, I think, has done a bunch of these as well, and LTX-Video. But WAN 2.1 kind of did them all in a way that was very efficient and easy to train and do stuff with on consumer hardware. And so they’re the best one to look at whenever we want to look at all those things together to get an idea of what the current state of the art is.

All right, so that’s the front matter.

A lot of that probably doesn’t make sense. The goal of this talk is for that all to make more sense by the end of it. So there’s lots of jargon here. The video diffusion pipeline in general, it’s complex.

There’s no way around it. There’s lots of different moving parts that make this thing work.

But I think even getting a general intuition of kind of what’s going on underneath the covers is going to make it a lot easier to understand what these things can do, how they’re doing, where the limits are going to be, and kind of sniffing out, you know, how is it that I can troubleshoot my video generation process if you’re interested in this sort of thing. So before we start digging into the details, are there any questions out the gate? I think I’m tracking so far.

All right, cool. That was a piano in the video with the fish, though, right? It was, yes. Okay.

I thought I was saying something else, but okay. Good job. I was trying to give it something complex so it would mess up, but it seemed to get it. Yeah, that was a good complex prompt. All right, so first we’re going to talk about the variational autoencoder. And I think of all the things that we’re talking about here, this is kind of the most important one for just getting a general understanding of how all of these models sort of work.

And so the concept of variational autoencoder is that I have some sort of thing that’s out in the real world, and I need to compress it and make it smaller and condense in a way that makes it easier to work with.

And so the idea of them is that there’s some sort of input sequence that I have that gets compressed.

And then after I’m done messing around with it, doing stuff with it, I need to decompress it out basically into whatever my final output is. And so the idea here is that we’re compressing these things into what is called a latent space. And an understanding of this, it’s a highly abstract sort of concept.

But the general idea of the latent space is that it is a lower dimensional representation of what it means.

you know, say for instance, to be a cat. So I have all these concepts of different things that map to a cat or a pet or a dog or all these different things, and they’re living out in this fantasy distribution of data points about times that I’ve seen a cat, times that I’ve seen a dog. And the latent space is basically a mapping of that that generally looks like some sort of a normal distribution or something like that. And the idea is that it’s not a direct mirror, but it’s a highly abstract sort of place where we can do easy manipulations on the items.

It’s very hard for me to describe what latent space is.

And this is something that a lot of people have mentioned.

There’s not really good words to talk about this. But when I think of latent space, I almost think of whenever you’re thinking about a problem, I’m thinking… And a lot of times I’ll be thinking in words, but then sometimes I’m not. You know, there’s that space in between words in my mind where I’m processing stuff and stuff is happening, but I couldn’t possibly decode that into something to describe it to somebody else. And to me, that’s kind of what the latent space is for these models.

I don’t know if that’s helpful, but the general idea is that it’s just something that’s much more condensed and abstract, but very detailed in what it can do.

And so what we’re doing with these models is taking them from this high level representation and just generally, you know, some sort of pictures or text or different things like that.

And then we’re taking it and we’re basically condensing things down into smaller dimensions so that they’re really small.

And then once we’ve done all of our permutations on it, we’re extending it up into a larger area.

And so.

One of the areas here that I think can help with this is this concept of the convolution.

And so here’s an example where we’re looking at, say, for instance, it seems like some sort of a medical image here.

And we take it at this high level, and then we take it into this 224 by 224 by 64 size matrix here.

And the goal of this process is to gradually condense it into a different size matrix.

So here we’ve got an actual image, and here we’ve got a matrix, and then we’re going to halve it to 112 by 112 but increase the channel count to 128, then 56 by 56 by 256, and so on, into an area that’s easier for us to work with.
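To make those shapes concrete, here is a minimal sketch of that kind of strided convolutional encoder. The shapes match the slide; the exact kernel sizes and activations are my assumption:

```python
import torch
import torch.nn as nn

# Each stride-2 convolution halves the spatial size while increasing the channel count.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),     # -> 64 x 224 x 224
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # -> 128 x 112 x 112
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # -> 256 x 56 x 56
    nn.ReLU(),
)

x = torch.randn(1, 3, 224, 224)
print(encoder(x).shape)  # torch.Size([1, 256, 56, 56])
```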

And so generally, there are lots of different ways that this sort of condensation happens.

And there are a lot of different types of variational autoencoders as well.

You’ll see things like this is a general VAE.

A lot of what people use nowadays is the VQVAE.

There’s also the VQGAN.

So different things like that that get used in these processes.

There’s always this general idea of condensing things down. What it might look like is that we’re going through this image, and there’s generally this concept of a kernel that’s examining it line by line and pooling features into that smaller map. And so it might say, hey, maybe I don’t need all of these pixels here; I can say that this patch right here is a boat. And it’s going to say, okay, this right here is ground, and so instead of storing all of this as ground and keeping each pixel, it’s just trying to condense it down and down and down. And then at the end, we’ve got this very large matrix and we’re trying to get it into this smaller level of classification. So this is a very simple VAE, and what we’ll see with the WAN model is that they have a much more complex one that’s dealing with some of that information we were talking about before. And so, as an idea of the sort of features that might be getting extracted from these things, they’re not always things that make sense to us. What it learns to extract just happens through the deep learning process, where we don’t really have lots of insight into all of it. Say, for instance, there’s this MNIST dataset, which is all these different handwritten digits, and there’s been lots of research done on it because it’s very small.

And so we have a better idea of some of the features that get extracted from it.

And it’s not always things that would make a lot of sense.

Here it is a little bit, but… it’s pulling out the bottom of the nine here as one of its classification patterns, the side here, the top part here.

It’s just kind of selecting random different things inside of here that it’s condensing together because it was useful to store them.

Sometimes it stores multiple things on the same values, but only alongside certain patterns, kind of like control bits. The general idea is that you’re using the neural network to basically train a very complex, specialized compression network.
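Here is a minimal VAE sketch to make the “specialized compression network” idea concrete. It is a toy fully-connected version (e.g. for flattened MNIST digits), not the convolutional VAEs these video models actually use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Encode an input to a small latent, sample it, and decode back out."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)       # mean of the latent Gaussian
        self.logvar = nn.Linear(128, latent_dim)   # log-variance of the latent Gaussian
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.dec(z)
        # Reconstruction term plus a KL term that pulls the latent toward a unit Gaussian.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = F.mse_loss(recon, x) + 1e-3 * kl
        return recon, loss

x = torch.rand(8, 784)                # a batch of flattened 28x28 digits
recon, loss = TinyVAE()(x)
print(recon.shape, loss.item())
```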

Hey, Josh. Back on the MNIST, the features and things like that, if you were to take the same VAE, and basically try to run it. Is it going to pull the same features every time, or is it based on what random data you might have in there? Yeah, no, not at all. It’s based off of a randomly initialized network generally, so it might be similar, you’d hope, but you don’t know.

Okay, yeah, so it’s not something you could depend on next time going, oh, okay, yeah, we’ve seen this before, so it’s going to pick up the circle at the top of the nine.

Yeah.

Yeah, OK. Got it. Deep learning voodoo. So yeah, that’s the bottleneck layer. And so this is the area where we’re looking for that compression, where all the activity is done.

And the output here, this is the easiest part: once we’ve done all that, how do we decompress it back out into our output stream?

And that might be the video.

It might be an image. It might be. There’s now diffusion text models where they’ll, you know, do different patches of stuff.

It might be, you know, if you look at something like GPT-4o, whenever they’re doing their generation, their diffusion is very likely lots of little diffusion parts put together, like sequence tokens.

And so they might be just putting out a sequence token, essentially, as what they’re decoding into instead of a full image where that stands for a patch, essentially.

So lots of interest.

Sorry to interject and backtrack a little bit, but for the nine, slide 14 again, it took me a second to put together my question. So I’ve run into some real interesting problems with segmented digits, where you have the seven segment digital displays and image recognition. And I’m wondering if you’ve looked into that at all for when these models try to recognize individual digits made up of separate parts originally rather than one solid part that it breaks down.

So seven segment.

OK, gotcha.

So like a digital clock sort of thing.

Yeah, yeah. Like a standard alarm clock. I have not.

OK. Yeah, I was just curious. They seem to run into an interesting problem where, since they’re already broken into segments, they try to look at each segment as a separate character following sort of a pattern like this. So convincing it that it’s already segmented is an interesting twist.

Yeah.

So this is with like a transformer, like a GPT or something like that?

Yeah.

Most transformers, and a lot of our software, yeah. I’m not sure if the transformer… I don’t think the vision transformers generally use a VAE. I’d have to double-check on that. Yeah, starting with o1 it was decent with the seven segment, but even if you look into a lot of just straight-up OCR models, they have a really hard time with seven-segment anyways. Sorry for the aside, I was just… seeing this image of the nine really made me think of another take on that. Not sure. Yeah, no problem with questions.

All right, so yeah, that’s the general idea here. And here is generally how we think about this, too. So this is looking at one segment

of one image that I’m getting down into this latent space.

And so there’s a different area here that we want to think about too, in the concept of latent space, which is the overall, you know, latent space of our entire training data.

So say, for instance, I’m training Stable Diffusion and I have a bunch of images, you know, I’m training on the LAION dataset and all the things that they trained on. And those have the actual images and things that exist inside of it. And so in that case, that’s this real-world data.

It’s messy.

You know, I don’t have a good perfect distribution.

There are large amounts of things that I don’t have in here.

You know, some things are more filled out.

But this is the data set that I have.

And my goal here is to be able to map with this VAE, this encoder and decoder, some sort of function that takes me from this.

into this compressed latent space, which is generally a normal distribution that we show with a Gaussian.

And I want it to be cleanly able to always be able to encode to here and always be able to decode into here.

And so the goal here is that I’m never doing my manipulations over here in the raw data.

I just want the perfect encoder/decoder sort of round trip.

And so that’s the goal with the VAE.

And so we need to move along here.

There’s lots of stuff on here. VAE is actually a really good entry point for diffusion because they’re actually fairly easy to train.

They’re usually generally small.

And they’re easy to kind of detect if you’ve done the right thing or not because you are really just doing the transformation in and transformation out. So there’s less things that can go wrong theoretically. But I have added some stuff here. And I’ll include the slide deck somewhere for the links out to some good resources for this.

All right.

And I apparently have added back in latent spacing here again. All right.

Any questions about VAEs?

Well, I’ve got one.

Okay. Josh, what if you took and compressed images that were in two dimensions?

Can you uncompress them back into a 3D structure?

So generally, I would… There are probably ways to do that, but you would generally not be doing that transformation with the VAE itself. But you could have a VAE that expects something to transform it in the middle,

and then it expects to decode into the 3D. So it could be that it’s text and image in, and then video out, but the VQ-VAE is not what does that transformation into video. Okay, thank you. Yeah. All right, so we’re going to talk about the diffusion process generally. We’re going to start with the score-based diffusion model. And the general idea here, and what a lot of these things do, is this process where you take an image, and same thing, you’re trying to get it to give you back the same image that you gave it. But in this case, we’re doing it by gradually adding noise to the image, so kind of corrupting the image, and then working it backwards, trying to get the model to correctly remove the noise that we added in.

Since we added in the noise ourselves, we know kind of how that noise functions.

And so it gives us a way that’s actually learnable to do what is really kind of an insane thing. Score matching, it was the easiest one to get off the ground, but it’s actually a pretty weird process when you actually start thinking about it.

But that’s generally what we’re doing.

So we’re adding noise in and removing the noise.

And so what that generally looks like here.

It’s kind of very spiky.

And so we have the thing that we want to predict. And it’s always going over this time step.

And you’ll see this constant of steps inside of all these things.

So I run it for this many steps.

And I almost think about the steps.

It’s not that I’m running it for longer.

It’s almost like if you think about a 3D modeling program, whenever you’re doing some sort of a complex shape and you break it into more segments. I think that’s a better analogy of what time steps are than “I’m doing it for longer, I’m letting it cook.” It generally has the same sort of increasing quality, where as you add the first few, you’ll see a very large increase in the believability of these 3D objects that we’re adding segments to.

But, you know, what’s the difference on a 3D toothbrush between 100 and 100,000 vertices?

Visually, not that much.

So it’s a little bit like that.

You can see here that we’re generally adding the noise.

And this time step here is the noise that’s at that step.

And the noise is weird.

There’s this thing called a noise schedule that we won’t get into super deep here.

But there’s always a different level of noise that corresponds to the time step that we’re in with a lot of these models.

And so you can see as we go along, we’re starting to kind of move into a more defined blob, but it’s still a blob.

But we’re going generally in the direction of the image that we want to predict as the time step goes along, generally with big movements happening at the beginning and the more fine details happening as we go along.

So that’s what the diffusion model is.
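As a rough sketch of that forward “add noise in” process (with an illustrative linear noise schedule, not the one any particular model uses):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # per-step noise amounts (the noise schedule)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # how much clean signal survives by step t

def add_noise(x0, t):
    """Jump straight to timestep t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    xt = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps
    return xt, eps   # the model is trained to predict eps, the noise we added ourselves

x0 = torch.randn(1, 3, 64, 64)   # stand-in for a clean latent
xt, eps = add_noise(x0, t=500)
```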

This stuff has worked for a long time, but the problem with score-based diffusion models is it’s… It’s really noisy.

It’s really spiky.

And since you’re doing each time step individually, it’s basically learning, if we go back here, I’m adding noise, I’m adding noise, I’m adding different noise each time, and then removing that different noise each time. So it could be that you have very large movements that happen because you’re really just predicting one at a time over and over and over again. And so you might have a direction that’s this way. My magnitude is this, this, this, this, this, this.

That’s very easy for, you know, if you think about the sort of the manifold that these things are learning, it’s very easy for it to go down and overshoot and kind of go off into nowhere and, you know, mess up completely and create some AI looking slop with nine fingers and all that sort of stuff.

And so.

The nice thing is that the actual data distribution here is this center line.

There’s generally a straight linear interpolation between my X0 data and the X1 noise.

And so we’ll see this X0, X1.

And so just think about this as this is your X0 and this is your X1. So my actual data and this latent data that exists, my distribution that I’m looking to emulate.

And our goal is how do we get out of this sort of paradigm of doing this step by step by step by step, working really, really hard to get this thing to move there and just follow the path.

And so that is where we get into this concept of flow matching and the concept of rectified flow.

And so these are two different thoughts.

And so the general idea of flow matching is that instead of learning each point individually, I want to learn the trajectory of all these points together, of how do these things move to get me to this line up here.

So instead of learning to predict, you know, I’m here, my distribution’s there, from here, where do I need to move?

I’m instead learning to predict where is the next step going to be?

you know, instead of looking all the way at the end, I’m learning the path instead of the destination or kind of the vibe of how to detect that path. And so that’s flow matching, which is by itself very strong.

There’s also the concept of this rectified flow, which is how do I, instead of, you know, making that super complex, do the really, really simple thing.

And what if that path was just a straight line?

How can I make that happen?

And it sounds pretty easy. Wouldn’t that be great? And the truth is that you can learn that if you are able to kind of move this thing into an ODE.

And so our goal is that, at each time step, we’re just looking for the delta between X1 and X0: the straight-line path from A to B at some sort of a constant speed.

And this constant speed thing is related to this straight line here.

I want to always just knowing from here, this is the direction that I need to go.
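A minimal sketch of what the rectified flow training objective looks like, with `model` as a placeholder for whatever network predicts the velocity:

```python
import torch

def rectified_flow_loss(model, x0, x1):
    """Sample a point on the straight line between data x0 and noise x1,
    and train the model to predict the constant velocity x1 - x0."""
    t = torch.rand(x0.shape[0], 1, 1, 1)   # random time in [0, 1] for each sample
    x_t = (1 - t) * x0 + t * x1            # straight-line interpolation
    target_velocity = x1 - x0              # constant along the whole path
    pred_velocity = model(x_t, t)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```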

And so this flow matching and rectified flow stuff, you know, we’re talking before about the VQVAE.

This is how we’re actually doing the prediction of the video itself.

And so this is how we’re learning all those complex sort of temporal activities that go on.

I’ve made this little idea here of a, this is actually a Python notebook that I have where we have this random distribution and everything’s just kind of moving, you know, gradually and at a constant rate towards its final distribution where my green dots here are my, you know, initial noise.

And my red dots here are my actual data distribution.

So the x0 points are the red ones, my data, and x1 is my latent noise that is being moved through the flow matching activity.
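And the sampling side of that animation is basically a few Euler steps along the learned velocity field; a sketch, assuming the same placeholder velocity model as above:

```python
import torch

@torch.no_grad()
def sample_euler(model, x1, steps=20):
    """Start at noise x1 (t = 1) and step back toward the data distribution at t = 0."""
    x = x1
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1, 1, 1), 1.0 - i * dt)
        x = x - model(x, t) * dt   # step against the x0 -> x1 velocity
    return x
```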

This has lots of uses outside of generative models, by the way.

Flow matching with rectified flow is heavily used with things like AlphaFold, protein folding, anything where you care about trajectories.

I’ll point out where we are.

Trajectories are a very important thing.

And also things like robotics. There are lots of areas where this concept of mapping this data distribution and figuring out how to correctly map to an unseen distribution has lots of values out in the field for practical use cases.

Any questions about flow matching?

It’s a whole thing that we could get into.

And I’m really just trying to give you an intuitive understanding of what this thing is trying to do. Right, does that generally make sense? I’m probably going to have to go back and look again at the video. I will circle back again. It’s different, so it’s kind of hard to, you know what I mean, wrap my head around it the first time through. Yeah, yeah. All right, so, a general idea of why we want to do this flow matching thing instead of classical diffusion.

This is a look at some actual data showing the jumpiness in the training signal.

With the classical diffusion, we’re jumping all around with what we’re representing.

So our learning signal is very jumpy.

It’s easy for us to overshoot and kind of get into a failure mode.

With flow matching, we can just do a constant sort of training path.

And obviously there’s still things that can go wrong, but we want this part to be more predictable.

And so we’re removing some of the chaos out of what’s happening.

The nice thing is that it’s just very easy in actual practice to use the rectified flow for training.

That’s why you basically see nobody using the old score-based methods now for their diffusion training, for at least the models in the media generation space.

I don’t see anybody doing that outside of folks looking to write a paper about something.

All right. So we’re going to move on now into diffusion transformers.

So this is kind of the next step.

This is replacing UNETs and CNNs and all that sort of stuff in the diffusion process.

And so traditionally, those UNETs, and if you don’t know what a UNET is, it kind of looks like this.

It’s basically lots of the same concepts and feels and vibes in a lot of these things.

A U-Net uses a convolution to take some sort of an input tile

and take it down into this more easy-to-manipulate latent distribution, pooled down here.

And then it will go up into that higher level.

And there are skip connections, some residuals that are carried over from the earlier layers.

But you generally see this sort of, it’s easy to see why it’s called a UNET here.

The idea here is that they take that noisy latent and they chop it into patches like a vision transformer.

And so this is a little bit different than what we were talking about before.

And those patches are included with the time step information as pooled inputs into the transformer’s layers.

And we actually use the transformer layers then to figure out how things relate to each other.

as time goes along.

So starting to use some of that sequential information and most importantly, using the attention mechanism to enable some really complex cross attention. So the ability to do attention on both text and images inside of the network itself without doing clever things to kind of like fake it. We can actually include both of those things at the same time into that network.

which is really the big benefit to me with DIT.

We also generally use this AdaLN, the adaptive layer norm, as a way of starting to pool those different modalities into some sort of a shared matrix to manipulate the outputs.
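To make the AdaLN idea concrete, here is a rough sketch of a DiT-style block where one small MLP predicts the scale, shift, and gate values from the conditioning embedding. This is the generic DiT pattern, not the exact WAN 2.1 block:

```python
import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # AdaLN: predict shift/scale/gate for the attention and MLP paths from the conditioning.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, tokens, cond):
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = self.ada_ln(cond).chunk(6, dim=-1)
        h = self.norm1(tokens) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        tokens = tokens + gate_a.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(tokens) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        return tokens + gate_m.unsqueeze(1) * self.mlp(h)

tokens = torch.randn(2, 256, 512)   # 256 patch tokens per item
cond = torch.randn(2, 512)          # pooled timestep / conditioning embedding
print(DiTBlockSketch()(tokens, cond).shape)
```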

All right.

So that’s the concept of diffusion transformers. We’ll talk about this a few… We’ll actually talk a lot more about the diffusion transformers in the actual WAN specifics.

But the general idea here, just kind of get an idea that we are replacing this U-net with this DIT block.

It’s just a little block that’s inside of the process.

And there, we’re just using your general transformer-style block with multi-head self-attention, with a few additional pieces. You’ll see the scale and shift, that’s a big one inside of here, where instead of trying to predict the next token, we’re doing stuff in a pixel space and doing some sort of transformations there, lots of affine-transformation sort of stuff going on. All right, so that is the general vibe section of this, a little bit more technical than the ones we’ve done in the past. But does that all kind of track? Any questions before we get into the meat? Wait, now it gets hard?

No, that’s cool stuff. Yeah. Well, it’s very difficult.

So, I mean, just think about what this is all doing.

It’s a pretty complex thing.

And the fact that we’re here and it’s able to run on our hardware, it’s just a miracle to me. Yeah. hey i want a video of this this this i was like okay it’s like um that that’s all you need you know yeah and the vay especially um we’ll get into that this in a second uh i really think this vay is just so interesting how they’re doing so i i feel like i need another you know three months to read about it um but we’ll try and cover it here with what time we have So we’re going to go into an overview of the architecture generally in the pipeline. Then we’re going to go into all the different little pieces, parts of the paper. I’ve tried to pull it all out into here, but we might pop over as we go along. I’m going to see if we can get a little bit of time to look at the paper itself too.

But I’m trying to at least get the overview.

So with WAN, we’re really looking at three major components here. The first is the UMT5, which is your T5-style input text processor.

So before, you know, we were using lots of CLIP sort of stuff. I think CLIP is still part of this, as one of the additional processors, but it’s really using T5 for a lot of this. And the UM means it’s a multilingual T5, because this does both Chinese and English text.

Then we go into the WAN VAE, which is a spatio-temporal VAE, which is very cool to say.

And then we’ll go into the DIT with flow matching.

And those are the three main pieces.

There are some additional modules that exist as well.

And we’ll go into those afterwards.

But those are the three main ones.

All right.

So the first thing we’re going to talk about is the data set that they use for this.

And I’ve tried to include some simpler slides and then the actual details with all these.

But the general vibe of what they’re doing with the dataset here is that they’ve got 3 billion or so candidate examples that they’ve scraped from the internet. And they’ve got to figure out which of those are crap, which of those videos have bad quality, are AI generated, or are just not interesting. Maybe it’s just a dude sitting at his computer. I don’t necessarily want my model to learn that. Or they have things that they maybe don’t want to have in their dataset, legal reasons, things like that.

Moral reasons, things like that. So they scrape all of these videos and they throw away about half of them: anything that’s blurry, watermarked, offensive, anything like that. Because their goal is to refine this and give it good input data, which it’s very clear they did a very good job with. Pages six to ten or so of the paper go into this. They ended up with 1.5 billion videos and trained on 10 billion images. They actually started this model off as an image model and then added in the video capability afterwards. They initialized it as an image model because it’s easier to do that initial convergence there. And they refined the pipeline through a multi-stage process. You can see this kind of ribbon chart here showing how they’re condensing stuff down from their initial 3 billion or so to 1.5 billion videos.

And they’re going in, they’re doing things like NSFW sort of filtering, getting some of that stuff out, getting out things that have bad quality, aesthetic scoring, had low levels of motion, didn’t have useful topics, things like that.

Then they also added in OCR style data so that the… model is able to effectively render out text. That’s a big focus that they talk about a lot in the paper.

And as somebody who’s used it, it is very good at text. If you tell it, say, like, this sign says, eat at Joe’s, it’s going to give that thing effectively traveling through the scene if you give it some sort of motion as well.

Another thing they did is that they added in these dense captioning models using sort of a scene graph style dense captioning.

over a video, and I think that they trained a fine-tuned version of Qwen2 7B, that’s my guess here. And they’ve open sourced everything but the dataset. They’ve been pretty open about the fact that they didn’t really respect any sort of copyright whatsoever on this. It’s a Chinese company that built this model, so it’s got everything and the kitchen sink in it. Yeah, it is what it is. They put it out on the internet, so here it is. And here, I thought this was interesting. This is where they’re talking about their F1 score, which is basically an accuracy-style score, for Gemini 1.5 Pro versus their little tiny fine-tuned Qwen 7B VL model. And you can see, number one, they’ve got really good performance. I mean, 1.5 Pro is a very good model.

for vision tasks.

Gemini, in general, I think, is the best at these sort of vision tasks from the paid models for captioning specifically.

Definitely does not have anything to do with the fact that Google owns YouTube, I’m sure. But you can see here, it’s very interesting that you see the sorts of tasks that they were training on. And the report is wonderful in going into all of these as far as how they’re doing this stuff.

but they’re looking for the ability to caption the event data, the action, camera angle, motion, if there’s text in it, what the style is, what the scene is, color, category, and counting. And so what you can assess from this is that the model is going to be really good at doing prompt adherence to any of these things, which that’s great.

I want it to be able to do all these things.

And these things are things that the other models, quite frankly, suck at a lot of the time. And the reason they got a lot of the good performance on this is all the time and innovation that went into this dataset captioning pipeline, which cannot be overstated.

So it’s not necessarily the coolest thing from the technical perspective, but from an infrastructure perspective, all the dataset stuff in the paper is very cool. All right, any questions about the dataset? Did they mention anywhere where they had to go find additional data to augment what they had, or was it mostly just removal? I know you mentioned the OCR stuff they had to add in. I’m just wondering what else they may have had to pull. Yes, so the OCR, they definitely did mention that they did augment the data sets with additional synthetic text examples. Oh, I was just there. Yeah, synthesize hundreds of millions of text-containing images.

So they definitely did that with characters for OCR and stuff like that. As far as synthetic videos, I don’t think that they did a lot of those. Actually, I know they didn’t, because they mentioned that they noticed a severe degradation when the dataset had more than 10% synthetic examples. Cool.

I will say that’s probably the case for the base models.

I do know that for smaller in-post training, LORAs and stuff like that, synthetic examples are still effective if you have a fairly good curation pipeline.

But at the base level, I could see that not being the case. All right, so we’re going to move now into the VAE. So like we’ve been talking about before, it’s essentially training a zip file that shrinks these input videos into the latent space and can then effectively output videos that are very similar to the input videos. And it’s very interesting.

So this is a 3D, spatio-temporal variational autoencoder.

It’s not the 3D that we generally think of, though.

It is not width times height times depth sort of stuff. It is height times width times time. That’s the 3D we’re always thinking about in the video generation space. And our compression factor here for time is 4,

height is 8, and width is 8. So our compression factors here are 4, 8, 8, which is nuts; they’re able to get it down into that size. All right, so the idea here with the VAE, for the architecture, is that they’re doing this 3D causal VAE, where the causal part is talking about where things are in height and width and at what time they are, and the fact that things that happen earlier cause the things that happen later. So it is unidirectional, always going forward in time, causal along those three dimensions, is really what this is trying to say here. And they do lots of interesting things.
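Here is what that 4x8x8 compression works out to for a typical 81-frame 720p clip. The “first frame keeps its own latent slot” handling and the 16 latent channels are my assumptions based on the usual causal video VAE convention, not numbers pulled from the paper:

```python
frames, height, width, channels = 81, 720, 1280, 3

latent_t = 1 + (frames - 1) // 4   # 21 latent "frames" (first frame gets its own slot)
latent_h = height // 8             # 90
latent_w = width // 8              # 160

raw = frames * height * width * channels
latent = latent_t * latent_h * latent_w * 16       # assuming 16 latent channels
print(latent_t, latent_h, latent_w)
print(f"~{raw / latent:.0f}x fewer values for the diffusion transformer to chew on")
```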

So this idea of a feature cache where they’re kind of smartly understanding that some part of the video of that three dimension isn’t changing.

so that they’re able to cache it and they don’t have to recompute that stuff. They’ve got lots of interesting stuff happening in their architecture there that makes it very effective for the long clips that they’re encoding and decoding. It’s a little bit different in the actual transformation, but encoding and decoding, they’re able to do that here. And their process here goes through a whole bunch of stages: they start with a 2D VAE.

Then they inflate it to 3D. So they take the initial 2D VAE and start adding in three-dimensional data, where the three dimensions are time, height, and width. And then they gradually go from 128 resolution up and up and up to 720, which is kind of their general target. And because of that, they’re able to get this crazy small 127 million parameter model that works at 720p and is just fast as heck.

And it’s a lot better than the Hunyuan one.

So yeah, their VAE is really, really cool.

All right. Any questions about the VAE?

OK.

So next up, we’ll talk about the diffusion transformer.

So this is what we’re talking about for.

This is kind of the brains.

This is the thing that actually imagines the new videos that kind of tracks out that causality, how things are going to move, how we’re going to move that sort of vector field sort of thing around.

And it takes that compressed data from the latent space and gradually turns it into the finished clip.

It’s very interesting if you watch these things move.

I think, Jay, you got the preview working at some point.

Or you can just kind of see it. It’s just kind of like slowly moving the thing over and over and diffusing the thing in those little segments, trying to kind of figure out like, OK, if I go here, it’s all working in that space in between words inside of the model’s brain. And so that’s what the diffusion transformer is doing here.

And this is kind of how they did the training for it.

So the idea here was that they turn the latent videos into patches, doing the same thing we were talking about before with that boat,

where it looked at the ground and decided this whole thing is ground. It’s turning that into a sequence so that our transformer model can treat it as, you know, it’s still a sequence-to-sequence model, so that it can do attention sort of processes over it.

It’s split up over 40 different transformer blocks.

I didn’t really add anything about the blocks themselves, but think of them almost like each of the blocks are kind of good at a different thing. It’s almost like a layer, a layer inside of a transformer network. That’s kind of what a DIT block adheres to.

A very interesting thing about these DiT blocks is that they all share the same MLP, I believe, which saves a lot of the activations.

They do a lot of interesting stuff where they’re dropping out and pooling attention for local and global across different layers. So they might go local, local, global, local, local, global. Lots of little clever things like that here that we kind of saw in some of the DeepSeek stuff where these guys are also constrained by the embargoes.

So a lot of those sorts of clever tricks that were happening over there out of necessity are showing up here as well.

It uses cross-attention to inject the UM-T5 embeddings.

And then also, this is where we’re getting that flow matching objective, where we’re trying to predict the velocity of the noise, and they do very interesting things with perturbing the noise in order to improve the quality of the video, which seems kind of weird.

Really, one of the ways to increase the quality of these things is optimizing how it’s injecting and removing that noise, even at inference time. So for the DiT, they did talk about this curriculum learning schedule. And the idea of curriculum learning is the same as with kids: I don’t teach a kid calculus on day one.

Same way, I’m not going to teach the model how to generate out a 720p video for 81 frames on the first training run. And so they do this curriculum learning schedule to warm it up, starting with the image only warm up at 256 by 256 pixels.

This is a very normal starting point for image models.

I think over at OMI, we’re actually starting at 256 by 256 as well.

And the idea here is that you’re learning those abstract shapes.

at that level and not the fine details.

You want it to learn the high level abstractions that are important for early convergence, things like color, things like high level composition.

That’s the sort of thing that you want to start it off with.

And then they do go into joint image and video training. So they’re still adding in that image prediction stuff, which is basically just taking one frame; think of it that way. And then they do it in three stages, the first of which is 256 images and five-second clips.

And then they go up to a 480 level. And they actually released this 480 model.

So they released this level because it’s a little bit less demanding to run. It’s one of the sizes, but then they also went another level up to the 720p and then left it there. And so that’s your general base.
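Written out as a simple config sketch, the curriculum described here looks roughly like this (my paraphrase of the stages, not the paper's exact numbers):

```python
curriculum = [
    {"stage": "image warm-up", "resolution": 256, "data": "images only"},
    {"stage": "joint stage 1", "resolution": 256, "data": "images + ~5 second clips"},
    {"stage": "joint stage 2", "resolution": 480, "data": "images + video"},  # released as the 480p model
    {"stage": "joint stage 3", "resolution": 720, "data": "images + video"},  # the 720p base
]

for s in curriculum:
    print(f'{s["stage"]:>13}: {s["resolution"]}p, {s["data"]}')
```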

And then do some post-training as well in order to kind of get some high fidelity data.

let it extend out to 81 frames, different stuff like that, to add in capabilities afterwards. But this is the main idea: we’re just stepping stuff up, generally increasing scale, to allow for early wins so the model is actually able to train. All right, so the concept here is that, along with the DiT curriculum learning schedule and the generally increasing size, they also increase the sort of prompts that are going in. Prompt adherence is a big problem with all of these generation models, because you’ve got to find a way to let the model explore the space effectively but also do what the user asked.

And the longer the prompt is, the more things that the user is asking and the more complicated it is.

And so they’re also doing that sort of thing in their curriculum learning.

And they actually have this prompt rewriter, which uses Qwen 2.5, that takes in the user prompt and rewrites it to help the model learn how to handle longer prompts.

So they use this both in training, and they also provide it inside of their paid service.

I think they provide this. So sometimes you’ll write it, and it’s an enhanced prompt or whatever it is. That’s available at test time.
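A minimal sketch of what a prompt rewriter like that amounts to; `call_llm` is a hypothetical placeholder for whatever chat-completion function you have, not WAN's actual code:

```python
def rewrite_prompt(user_prompt: str, call_llm) -> str:
    """Ask an LLM (Qwen 2.5 in WAN's case) to expand a short user prompt into the kind
    of long, dense caption the video model was trained on."""
    instruction = (
        "Rewrite the following video prompt as a detailed caption. Describe the subject, "
        "action, camera motion, scene, lighting, and style in one paragraph. "
        "Do not add subjects the user did not ask for.\n\n"
        f"Prompt: {user_prompt}"
    )
    return call_llm(instruction)  # hypothetical helper, e.g. an OpenAI-compatible client call
```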

They also open sourced the code, so I just wrote my own plugin to do the thing that they did, and it’s open source, so it was easy. So, yeah, I think when we did the other video series, most of the offerings that we looked at had some way to enhance the prompt you gave it.

Yeah, absolutely.

You know, cause sometimes it was helpful.

Other times it did some weird stuff, but yeah. So, yeah. The other thing was the quality. So doing the scaling lets it learn the high-level stuff; basically, think about this as you’re feeding it a ton of stuff early on, so it learns all the abstract ways that humans and animals and spaceships and all these weird things move, in so many weird different ways. It’s learning that abstraction early.

And then we can do a very, very small fine tune at the end with extremely high quality data to learn those fine details.

And the curriculum learning allows us to do that very easily.

And we see here with the experiments, we’re getting a log-linear improvement in Fréchet video distance (FVD). And this metric, it’s okay, I don’t necessarily like this one. It’s basically asking, what’s the distance between the real videos and the generated videos; basically, can I tell that the video that was generated is fake, is what it’s really trying to detect, which is a weird metric. So I wouldn’t put any weight on someone saying they’ve got the best FVD, but seeing it go log-linear along with the scaling is probably a good sign. Especially going from this size of a model jump. All right.

They also did some interesting tricks with the scaling of parallelism. And I will be perfectly honest, all this sounds super cool. I don’t run a data center, so I don’t have a rack of H100s or H20s or whatever they are.

So this is all academic for me, but I’m going to put it in here because it’s in the paper and they mentioned it. And a lot of the things that they’re doing here are lots of tricks to offload different parts of the network whenever they’re not in use.

So the interesting thing here is that they’re doing this 32-way sharding over a cluster and then also using this ring attention sort of thing.

Ring attention, this actually came out around the Sora time period.

I think that there was a big sort of hullabaloo.

Because the guy who did Sora, who like led Sora, came up with this concept of ring attention.

And that was like the last thing he wrote. And then he went to go do Sora. And so I was trying to figure out, you know, how does Sora work? And this is one of the things that kind of came out was his ability to kind of do continuous infinite attention, which is very important when you’re doing sequencing for videos. So that was kind of what people were thinking about. So obviously, lots of stuff happened around using that for video. And that’s what they’re using here.

The other interesting thing is… Actually, it’s not in this one.

It’s in the other one. So I’ll talk about that. Yes.

All right. So this is very interesting. So here they’re doing a diffusion cache.

And a lot of the times that… what these things are doing is that they’re detecting the distance.

So you’re talking about these trajectories that they’re learning earlier on.

Let me find.

Here we go.

So the trajectory where it’s detecting that a certain part of the image is not changing over and over and over again.

So let’s say, for instance, let me actually go back to my boat.

I think that kind of shows it good.

So imagine this boat here, and what the video is going to be is this boat backing out and then driving off or whatever it is. It should know that for this sort of feature down here, it doesn’t have to do a lot of computation. And so there’s a lot of stuff cooked in that does basically a distance measure, I think it’s an L1 sort of distance, to detect that it does not need to do any computation here, and it might be able to skip that part, or sometimes even skip the whole step entirely, which I’m not going to pretend to understand how that works. I know that there’s a certain value where, if I set the parameter that controls that, my video suddenly starts getting bad because it thinks everything looks the same. But it does improve inference, and I’m assuming training time, quite a lot. So yeah, there’s lots of very cool things that they’re doing for inference and training speedups. All right. Before we get in, we’re going to talk now about a lot of the supporting tools and things like that.
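As an aside before the supporting tools, here is a rough sketch of that caching idea: compare how much a block's input changed since the last denoising step and reuse the cached output when the change is tiny. The threshold and the exact distance measure are illustrative, not WAN's actual implementation:

```python
import torch

class StepCache:
    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None

    def __call__(self, block, x):
        if self.prev_input is not None:
            # Relative L1 change since the previous denoising step.
            rel_change = (x - self.prev_input).abs().mean() / (self.prev_input.abs().mean() + 1e-8)
            if rel_change < self.threshold:
                return self.prev_output        # reuse the cached result, skip the compute
        out = block(x)
        self.prev_input, self.prev_output = x.detach(), out.detach()
        return out

# Usage sketch: wrap an expensive block and call it once per denoising step.
cache = StepCache(threshold=0.05)
block = torch.nn.Linear(512, 512)              # stand-in for a heavy transformer block
for step in range(10):
    x = torch.randn(1, 512)                    # in reality this evolves smoothly across steps
    y = cache(block, x)
```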

So we had the base model itself.

which is this text-to-video model.

There’s another set of models they released at the very beginning called this image-to-video models.

And that’s kind of part of the core as well, but it’s really an add-on.

And then we’ll have a few other ones that we go into as well.

These are kind of the extras for this.

So we’re gonna start with the image-to-video.

And so for this one, what we’re looking at is basically feeding in an initial image. And then potentially also adding in one at the end.

So saying, I want to go from here to somewhere or saying, I want to go from here to over here.

And I think you can also do only feeding in the last image as well. I haven’t tried that myself though. So I can’t speak to that.

But the idea here is that we’re feeding this into the encoder. It creates a mask, and the mask is basically something that gives it a very strong hint so that it doesn’t just go, OK, here’s the start image, and now I’m going to turn the camera around and show whatever the heck I want. So the mask is how it’s controlling that and forcing it to keep focus and not do Sora-like stuff, where Sora is just going to do whatever it wants to, depending on what you give it. And so it’s feeding that in. It does an image encoding here with CLIP that is then fed in, along with the T5,

into a cross attention for the text embedding, and then there are also these additional image embeddings that go in. If you look in ComfyUI, there’s this image embed that goes in extra to this; that’s pretty new, I haven’t seen that in any other models. And then it goes into your normal DiT decoder, blah blah blah. And so yeah, this mask mechanism is really the cool thing right here, where we’re making two copies of every pixel.

And there’s this sort of idea of the keeping pixels, things that we want to freeze, and things that we want to change as the video goes along.

That’s really the big focus here.
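Here is a sketch of how that kind of frame-mask conditioning could be packed together; the shapes and the exact channel layout are my assumptions, not WAN's actual format:

```python
import torch

def build_i2v_conditioning(latent_frames, first_latent, last_latent=None):
    """Mark which latent frames are given (freeze these) versus to be generated."""
    c, t, h, w = 16, latent_frames, 90, 160      # illustrative latent shape
    cond = torch.zeros(c, t, h, w)               # conditioning frames, zeros where unknown
    mask = torch.zeros(1, t, h, w)               # 1 = known/frozen, 0 = generate

    cond[:, 0] = first_latent                    # the first frame is always given
    mask[:, 0] = 1.0
    if last_latent is not None:                  # optional first-to-last interpolation
        cond[:, -1] = last_latent
        mask[:, -1] = 1.0

    # The diffusion transformer then sees [noisy latent, cond, mask] stacked on the channel axis.
    return torch.cat([cond, mask], dim=0)

first = torch.randn(16, 90, 160)
packed = build_i2v_conditioning(latent_frames=21, first_latent=first)
print(packed.shape)   # torch.Size([17, 21, 90, 160])
```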

All right, so with the image-to-video training, we’re still using the same data set as before with the joint pre-training.

And then also, only using clips whose first frame already looks like the whole video.

This is a very important bit here, where they’re filtering down the main dataset.

And this is to prevent the case where we’re teaching the model that it’s okay to ignore the first frame, which obviously is deadly to an image-to-video thing. And so they had to have an additional dataset pipeline for that. And then they add in three specific heads that are focusing, one, on image-to-video task, One that’s taking basically a video continuation, which is taking the last frame of a video and then treating it as an image to video. And then another one that’s doing that first to last frame, more interpolation, whatever you want to call it. And I don’t have any examples of that one, but you can kind of imagine.

Actually, I do have examples of that one.

Let me pull it up.

Right. Let’s see.

So I think I had a little thread somewhere.

Where did I put it?

I scroll up till I see videos.

Oh, here we go.

So yeah, this is using the WAN image-to-video sort of thing, where we started with Jay’s video here. And it holds things in place, it’s doing that here, where it says certain things change and certain things don’t. And this dude walks through and does stuff and things.

So it works pretty well.

All right.

So any questions on image to video?

Now, I thought the whole don’t ignore the first frame kind of thing, I didn’t even think of that. But that’s pretty cool. Yeah. You can definitely tell when somebody has not done that work for their model.

Looking at you, Sora.

All right.

So here, we’re looking at VACE. OK, so VACE is very cool. And this actually started with ACE, which I think is like an adaptive content editor or something like that. And so they’ve,

as part of this paper, included the concept of VACE, which is a video version of that adaptive content editor, and this thing is nuts. So essentially, and I’m going to show it down here real quick, the idea is that I can take a whole bunch of different kinds of edits I want to do on a video and have that all be done with just one model, some sort of a unified model. So here, you know, I’ve got no input images, and I’m just saying I want this dude and this duck with whatever my text prompt there is, which is obviously something with Superman or something like that. And so from no image, it’s able to create a video with both of these things in it. In here, they’ve got a video of this dude on a horse and then this weird elf guy. And they’ve masked out the dude and it’s put the elf guy in his place. And then we have here Audrey Hepburn showing off this cool bag. And we can see that here it’s working with that.

And so the idea here is that we’re kind of replacing the zoo. For those that don’t do a lot of diffusion stuff, a lot of the time there’ll be these different models, this concept of ControlNets and inpaint models and different stuff like that, and it ends up that you have like 20 different models to do a workflow.

It’s just a pain in the butt.

And so they’re trying to resolve that with this VACE framework, pull them in, and be able to pack them all into a single model. And it works really well. It also does a lot of ControlNet stuff, too, if you do preprocessors.

And so they take in a text prompt, a sequence of context frames, and then the binary masks that mark where the pixels have changed.

A big part of this is that it's got an awareness of which pixels it's touching and how those map into the latent space.

Being able to very specifically edit certain parts of the video is really the key idea here. And I've got this beautiful example from Banodoco where we have our famous Indiana Jones scene. He's got to get the MacGuffin.

And here the MacGuffin is a nice cold beer at the end of the day. I feel you. All right. So the Vase pipeline here: data collection.

So beyond our general text-video pairs, we need some instance-level masks.

So we care a lot more about mask-type data. You know, things like Segment Anything, Grounding DINO, bounding boxes.

All of your segmentation models have become very important right here, because I need very high-quality masks so I can remove an object, put another object in, and then also detect whether that… actually happened correctly, at scale, over a dataset that starts at 1.5 billion samples.

So you've got to automate some of this stuff. So SAM, I'm sure, got used a whole bunch here; they talk about it. They also talk about preprocessors, the ControlNet-style inputs. So here you're looking at things like normal maps, pose maps, things like DWPose, Canny edges, all the scribble-line sort of stuff. I don't have any examples of those.

I didn't think to pull those in here. Actually, hey, I've got the paper; see how long this thing is. Let me get to it. I'm going to look for like five more seconds. OK, I know I have a Canny example here somewhere, but we're not going to find it.
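Since I couldn't find the Canny example, here's roughly what that preprocessor step looks like in code, using OpenCV; the per-frame edge maps are the kind of control signal a ControlNet-style branch would consume (the thresholds are just typical defaults, not anything from the paper):

```python
import cv2
import numpy as np

def canny_control_frames(frames, low=100, high=200):
    """Build a Canny edge control signal for each frame.
    frames: list of (H, W, 3) uint8 BGR frames."""
    edges = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges.append(cv2.Canny(gray, low, high))   # (H, W) uint8 edge map
    return np.stack(edges)                         # (F, H, W) control stack
```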

All right.

The general idea, though, is that there are lots of different modalities that you can put in here.

And they map very closely to the general media toolkit.

So this thing is very cool.

Yeah, that's Vase. They also have something very interesting here: they've isolated their camera model into its own module, so it's easy to train and prompt camera-motion effects.

It's very easy to do LoRAs for cameras, including multi-camera LoRAs. So you can have different movement styles that you train; you don't have to train one LoRA for each camera style. You can do multi-style ones because they've pulled the camera control off into its own module, so you can freeze everything else but the camera adapter inside of the attention network. And remember, the attention network is what we're using to decide what needs to change and what doesn't need to change. That's how it's doing that pixel-freezing sort of concept.
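The freezing trick is about as simple as it sounds. A minimal sketch, assuming the camera adapter's parameters are named so you can pick them out (the real module layout will differ):

```python
import torch.nn as nn

def freeze_all_but_camera(model: nn.Module) -> int:
    """Illustrative: train only the camera-control adapter (e.g. a LoRA) by
    freezing every parameter whose name doesn't mark it as camera-related.
    Assumes the adapter parameters have "camera" in their names."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = "camera" in name
        if param.requires_grad:
            trainable += param.numel()
    return trainable  # how many parameters are actually being trained
```

Everything else stays frozen, which is why you can train several small camera LoRAs without ever touching the base model.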

And the major component here that we care about is this camera encoder.

So you have the encoder part that takes in, you know, the pose of the camera as a Plücker embedding, which you can think of as a way of positioning each viewing ray in space.

I think this is actually this idea here, where there's a trajectory of where the camera's moving; I think that's kind of your Plücker embedding. I might be remembering that wrong, but I'm fairly sure that's what it is.
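For what it's worth, the usual recipe for Plücker embeddings in camera-control work is to turn the camera pose into a per-pixel ray, which is where that trajectory-in-space intuition comes from. A sketch under standard pinhole-camera assumptions, not necessarily WAN's exact formulation:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker ray embeddings for one frame (illustrative).

    K: (3, 3) intrinsics; R: (3, 3) world-to-camera rotation; t: (3,) translation.
    Returns an (H, W, 6) map of (moment, direction) per pixel.
    """
    cam_center = -R.T @ t                              # camera origin in world coords
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                # back-project pixels to rays
    dirs_world = dirs_cam @ R                          # rotate rays into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(cam_center, dirs_world.shape), dirs_world)
    return np.concatenate([moment, dirs_world], axis=-1)   # (H, W, 6)
```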

And we also have this concept here that the text encoder can also carry camera information.

And the encoder that is taking in the camera information from the source and the text encoder work together to build out the embedding here.

where we're producing the scale and shift values.
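Here's a minimal sketch of that scale-shift modulation, in the adaLN style that DiT blocks use; I'm assuming the camera-plus-text conditioning has already been pooled into a single embedding, which is my own simplification:

```python
import torch
import torch.nn as nn

class CameraModulation(nn.Module):
    """Illustrative adaLN-style block: a conditioning embedding (camera + text)
    predicts per-channel scale and shift applied to the normalized hidden states."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, hidden, cond):
        # hidden: (B, T, dim) tokens; cond: (B, cond_dim) pooled conditioning
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # Broadcast the (B, dim) scale/shift over the token dimension.
        return self.norm(hidden) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```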

And that scale-shift is the transformation that allows the camera to do things like move forward and around at the same time, the sort of complex motions that just happen with real cameras. So that's what's happening here. And what that allows us to do is take this base image here in the middle.

And I have four commands here, pan left, translate up, pan right, and translate down.

And WAN does not have a single problem with this. Of all the models that I've ever played with, it has the finest control, because they did the work to expose it and make it available for people to edit and modify. So if you're actually doing content creation and need to be able to rely on this sort of stuff in your daily work, I'm telling you, it's very good. The thing that got me especially, and I wish Landon were in here because he does this for a living, editing videos for commercials and other things, is the ability from earlier where you want to take a thing and generate other stuff around it, but not change the thing I'm giving you. In other words, if I had a branded t-shirt or something, I could generate all kinds of stuff, but no, this thing needs to stay as it is.

Being able to mix those is pretty interesting.

And in ways that, I mean, you might not have ever seen before too.

I mean, whatever this guy on the horse is, he looks like he should be on a spaceship. Yeah. All right. So yeah, that's Vase. Vase is very cool, like a lot. Then there's another concept: WAN audio.

So they talk about this, and we’ve talked about this before when we were doing the speech-to-speech sort of talk, where you got to think about audio.

Audio is really the same diffusion-on-an-image process, where you have this autoencoder.

So we talked about it before.

We didn't really talk about the audio VAE specifically, but we're able to take in the waveform, and we're decoding here not into an image, but into a spectrogram.

And then that spectrogram can be converted back into an actual MP3 or WAV file, whatever it is that you have. But what's diffusing is that spectrogram, treated as an image.
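If you've never looked at that pipeline, here's the gist with librosa; the filename is made up, and a real system would decode the spectrogram with a learned vocoder rather than Griffin-Lim, but it shows the waveform-to-image-and-back shape of the problem:

```python
import librosa

# Waveform -> mel spectrogram "image" that a diffusion model could operate on,
# and a rough inverse back to audio.
y, sr = librosa.load("clip.wav", sr=16000)                    # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # (128, T) "image"
mel_db = librosa.power_to_db(mel)                             # log-scaled view you'd actually diffuse

# ...diffusion would happen on mel_db (or on a latent of it)...

y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr)      # crude Griffin-Lim reconstruction
```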

And so it’s able to do that on the input.

And I actually have seen this.

They do have something where you can feed in audio and it will change how it renders the video based on that audio. If you feed in music, the motion will move along with the music you're adding in. Or say you have a scene and there's a big explosion: it can move things in that direction and have things react to it, if you combine that with the text prompt and the other embeddings that you have.

The interesting thing here is that apparently they also have something to output the audio.

I haven’t seen this yet.

I haven’t looked really hard.

So they might have actually released this.

But basically having it also output the audio that goes with the video that it’s generating, which if it’s kind of doing it all, everything in here, it makes sense to do that. So I think that’s very cool. And they’re not the only ones that’s doing this. I actually looked at the onion paper as well.

And they also had some sort of audio output.

There's something about this architecture with the DiT that makes this very, very tractable.

And here you can see a view of a dedicated, highly specialized MMAudio-style generation model, where they're likewise generating out the spectrogram.

And that’s the end of slide deck.

So, open floor.

I had an interesting kind of a connection I didn’t expect, but I’m a fan of early Disney as well as early George Lucas, Industrial Light and Magic, and some of the things that they would do. like to be able to splice together multiple feeds, and some of the tricks they use, like, hey, we’ll put a flash here, and it’ll take your eye a certain amount of time to adjust back to what it was looking at. And they’ll take that time to actually be able to shift something, and you won’t notice that it happened.

And that, plus some of the masking pieces they would do.

As in, hey, we don't actually have a spaceship, so we'll mask out that part, kind of like the part you showed where things aren't changing. You know, it's very similar to how they would just have a painted piece of glass, right before you had green screens and all that kind of stuff. So some of this is kind of interesting; it's like we're doing those things again, but digitally, if you will. I think it's just kind of neat looking at this one. I don't know how good the resolution is, but you can actually see the beer sloshing.

Obviously this would fall out, but it's moving around in the glass even here, while everything else is holding. Right, and this is why you do the DiT. Yeah. I'm trying to remember the scene, the thing we did last time with that same take-this-image-and-make-a-video-of-it.

where you had a person or two walking in front of a mirror, and it was smart enough to keep the other stuff static but move the person, and also know that it had to, you know, mirror the person for the reflection. Yeah. I mean, it was just really interesting stuff; I had no idea it would be possible to do things like that. Look at the reflection here.

I think in general, Sora, I’m the least impressed with Sora.

So you can see it’s trying to do the reflection here, but it’s got it wrong.

It should be upside down. Yeah, it knows there's a reflection; it just doesn't quite get it right. Yeah, and I would be curious if you… I didn't run this in WAN, but I think it would be more likely to get that right. It's hard to tell if that's wrong or not. I mean, what would you… It's not upside down.

Maybe I’m wrong.

I’ve never looked at a piano underwater, so maybe my frame of reference is not to be trusted. Well, where’s the keyboard compared to… I can’t tell.

It almost looks like a couple of pieces of piano.

Right. It looks like it’s just translated this. I would imagine that it would be flipped, right?

Am I wrong there?

No, I think it should be. If it should be flipped, I think that it is wrong. Maybe?

Yeah, because it's still on the top. It's trying. Yeah, the fact that it knows that the top of the water should reflect, that's just interesting stuff. It's trying to split it up too. Yeah. The thing that I'm still not quite sure on: right now, most of the models that we ran do like 84 frames, and then you can stack them, taking the last frame of one video and using it as the input frame of the next, keeping a prompt consistent or something. Where are we as far as getting, you know, I need at least something that's 20, 30 seconds, probably? Here you go. So yeah, I didn't include this, but this came out in the past two weeks. Oh, okay. I actually put it in the same thing, but I wanted to not end at 7:30, so I didn't include this one. But this is what I would have included, which is that they are doing some very interesting stuff with the attention so they can do essentially infinite generation.

Okay. And there’s this concept of diffusion forcing, which is very interesting.

where they're basically finding a way to keep extending the video out, letting the model stop focusing on frames that are already settled while still keeping some sort of global attention over long periods of time.

And I've seen this work very effectively up to a minute. Theoretically, you can go longer, but then you're going to get into weird stuff where your text embedding kind of starts to fall off. But yeah, it works; it's really good. And you can combine it... so this actually functions the same way as the WAN-based stuff.
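My reading of the diffusion-forcing trick, sketched loosely (an illustration of the general idea, not the actual implementation): each frame carries its own noise level, so frames you've already generated sit in the context at near-zero noise while the new frames get denoised, and you keep sliding that window forward.

```python
import torch

def per_frame_noise_levels(num_frames, num_context, device="cpu"):
    """Sketch of diffusion-forcing-style scheduling: already-generated context
    frames sit at (almost) zero noise, the frames being generated get full noise,
    and everything is handled in one denoising pass."""
    t = torch.ones(num_frames, device=device)   # 1.0 = fully noised
    t[:num_context] = 0.0                       # context frames are clean
    return t                                    # shape (F,), one level per frame

def noise_video(latents, t):
    """Apply per-frame noise: latents (C, F, H, W), t (F,) in [0, 1]."""
    eps = torch.randn_like(latents)
    t = t.view(1, -1, 1, 1)
    # Rectified-flow-style interpolation between data and noise, per frame.
    return (1 - t) * latents + t * eps
```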

Okay. It's basically hijacking the attention. Do you know if it's… Are they moving to one prompt for the whole minute, or is it more along the lines of, I've got a prompt that applies to the first five seconds, and then we shift to prompt number two for the next five seconds? Can I storyboard this thing?

Yeah, people try both.

I think they’ll try to figure out.

You want to be able to do both, I think.

But yeah, prompt travel is the name of what you're talking about, where you're kind of shifting the prompt over time.

Okay. Sorry, go ahead. Go ahead. Sora has a sort of storyboard prompt format like that.

But like we said, Sora has some issues.

I've also seen some pushback from people in charge of video models saying, hey, the average Hollywood movie shot is only a few seconds long or less, so you don't need more. But I think that… The average YouTube video is much longer. Yeah, yeah. Once you look into what you actually want to create through this format, you don't want to do 500 two-and-a-half-second shots tied together.

Yeah, they definitely optimized for what they were thinking of instead of what… If their market was Hollywood, then they correctly optimized that model.

If it was the general public, they did not. Lauren’s got a question in the chat. Yeah. Hey, Josh. Thanks so much for the overview on diffusion models. Yes.

I had a question on the Frechet distance.

On the concept of Frechet distance, usually it’s related to walking a dog and minimizing the area under a curve of a dog walking a trajectory. I was just curious if I just think of the trajectories as points over time and then I consider pixels over time, are we still minimizing the area under the curve?

In that concept?

So generally, the Fréchet video distance is specifically talking about the distance between my real videos, from my actual input distribution, and my generated videos.
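For reference, FVD is usually computed as the Fréchet (2-Wasserstein) distance between two Gaussians fit to features from a pretrained video network such as I3D, rather than an area under a curve; a minimal sketch:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets, which is how
    FID/FVD are computed. For FVD the (N, D) features come from a video network
    such as I3D, so temporal dynamics are part of the comparison."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical noise can add tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```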

So am I writing out?

So you’re trying to reduce that for that area. Okay, so but it’s still some sort of area like sequence of events over time.

It’s still the same kind of thing where you’re just trying to minimize the area. I’m not going… I’d have to look that up, basically. Okay, cool. No, I appreciate it. Thank you so much for going through everything. I think I kept up with pretty much most of it. Getting my head wrapped around the latent space is interesting.

You know, the diffusion transformer, I think I’ve got that in my head.

I need to go through that architecture and flip from the normal text-based and just kind of, I may go back and hit that again.

This is really cool stuff.

Yeah, and I mean, the flow matching and rectified flow to me was the weirdest one. I mean, I tell you, I had to sit there and stare at it for a long time.

That one, on the other hand, is one of those things where I just need to stop thinking so hard about it.

But it's weird, for sure. Anyway, I don't want to block anybody.

Go ahead. For a lot of these problems, I think, getting a really, really solid intuition around this concept is worth it. It's really this right here: taking some sort of known distribution and taking it into this latent space and then out of it.

And the process of learning that function, to me, getting that solid intuition helps everywhere whenever you’re starting to think about these models.

It helps with the transformer stuff.

It helps if you start thinking about what RAG is, where you're injecting those things in as additional input embeddings and helping the model map to actual… ideas inside of the latent space of a transformer. So many things come back to this concept of mapping the input distribution into a manipulable place and then out to what you want at the other end.

It also makes the areas where the models start failing make sense.

So it’s a good one to focus on.

Sorry, what were you trying to say? Oh, I wasn’t. No.

I’m trying to wait the required amount of uncomfortable silence before closing out. Well, I’m sitting here thinking about the back door of the houses that I used to live in, okay? And I’m trying to figure out how this flow matching would be going to the neurons in my brain. You understand what I’m saying? What’s the process to getting stuff into some video memory of the human brain? Anyway, I’ll stop. Yeah, so actually think about that as more towards the autoencoder. And they actually do have people that are doing this sort of thing. I don’t have time to pull it here.

But where they’re able to take in the brainwaves as input into some sort of latent space and then decode it out as a video on the other end, where there is some level of actual ability to do that, which is nuts.

But it seems to be a learnable thing.

And that might be an interesting topic for one day, I think.

There’s a whole body of knowledge that I’d look at that and I’m like, okay, that’s sci-fi weird stuff. But it’s happening and it’s definitely getting better.

I read the headlines and the papers enough to know that that's a thing that is happening. And just ignore how close it is to Minority Report and not worry about it. Yeah. Dropping into how some of the LoRAs are done, that might be a thing for a future session. How you can style things specifically just by tuning one very small piece of this, that might be fun. We are right at 7:30.

I guess thanks for putting all of this together.

I know this is not a small amount of work, and we certainly appreciate it. Yes, thank you, Josh. I think we had 13 on at one point, so there are at least 10 people who are smarter now than they were an hour and a half ago.

Thank you for letting me go.

Cool. Well, nothing else, and thank you, guys. All right. You want to stop recording? Thanks. Thanks, Josh. Thanks, Josh. Appreciate it.