Transcription provided by Huntsville AI Transcribe
And we'll go ahead and get started. All right. So tonight's topic is unified multimodal embeddings. We'll be talking about multimodal fusion and how embeddings have progressed in the past year or so. We had a few of these talks early to mid 2024, so this is a good time to catch up generally on what's happening in embedding space, but also on some very specific, cool things happening in the multimodal space specifically. And so this will really be two parts, where one is that we're going to be going over the paper for Jina V4, which is one of the more popular embedding models out there.
And then we’ll also be looking at a paper called the rankability of visual embeddings.
Super riveting title there.
But kind of going through those, but more importantly, we also have lots of little demo playground stuff to look at, where I've built a little multimodal world builder based off about 5,000 sci-fi images that I scraped over this weekend. This is something I just pulled together starting Saturday: I grabbed a whole bunch of images, threw some captions on them, threw some embeddings on them, and built some visualizations, but also implemented a lot of the things we'll be talking about in the paper. And so we'll be able to do stuff like searching here for "spaceship" and see where those pop up inside that big principal component analysis, re-rank by this sort of stuff, and go from there. So we'll look at this a little bit towards the back half once we've talked through some of this and can contextualize it. And in the back half we can talk technical stuff, whether that's the pipeline that's building it or the core principles. So anything's kind of fair game once we get through the meat of it.
All right, but we’re going to start talking about multimodal fusion, which is the name of the game.
Super fancy sounding word, but basically just means that we’re jamming the different modalities together.
And there are different places where we could do that jamming together. And those have different effects on the final outcome.
And so generally, when we’re talking about multimodal fusion, especially right now, the most common thing that people talk about is in terms of vision.
So that’s where we’re taking textual inputs and image based inputs and turning them into some sort of unified vector and using that to perform some sort of joint inference.
And it basically allows the AI to bridge the gap of not having eyes by taking visual information and encoding it into something that it can understand.
And it sounds kind of wonky, but if you know a little bit about kind of like how human sight works and how, you know, the eye works and how it encodes and warps and transforms incoming data into our brains, it’s not that dissimilar from how we work. But the idea of fusion is the process in which we take those things and turn them into something meaningful. And so let’s see. Yeah, I skipped some stuff here.
Here we go.
Early fusion.
So there are three major types of fusion that we care about.
One of which is early fusion, which is basically we combine the input signal before we feed it into the model.
So generally we're talking about an LLM, but it could be any other type of model; this could be a diffusion model, for instance.
So if you know how Flux or Stable Diffusion works, they generally have two text encoders these days, one of which is a T5 or some variant like UMT5 that does the text encoding. And then there's CLIP, which is still used a lot of the time too.
And so they’ll take those things together, fuse them at the beginning and then feed it into the model. So that’s the idea of early fusion. Intermediate fusion is where we basically have a pooling layer that happens sometime in the middle, sometimes it happens a little bit after, sometimes it’s like in between passes.
But there’s some sort of pooling layer where we’re trying to fuse in the middle and get a little bit more aligned. And this is helpful because it’s easier to kind of mix and match stuff as you go along, you don’t have to train the models with each other. And so that’s one option. I see that done the least.
And then there's late fusion, which is basically after the fact: I've got this image embedding, I've got this text embedding, I'm just going to average them out and hope for the best. This is the easiest, but it's not super useful. And if you see people saying, hey, I do fusion or multimodal fusion, they're probably doing something like this with LlamaIndex or something like that, which has some problems with it.
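To make that concrete, here is a minimal sketch of what late fusion often boils down to in practice: embed each modality separately, then average the vectors after the fact. The `embed_text` and `embed_image` calls in the usage comment are hypothetical stand-ins for whatever encoders you already have.

```python
import numpy as np

def late_fusion(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Naive late fusion: average two independently produced embeddings.

    The two vectors never interact inside a model; they only meet here,
    which is why concepts like "dog" and "party hat" stay loosely coupled.
    """
    fused = (text_vec + image_vec) / 2.0
    return fused / np.linalg.norm(fused)  # re-normalize for cosine search

# Hypothetical usage with unit vectors from two separate encoders:
# text_vec = embed_text("a dog wearing a party hat")
# image_vec = embed_image("dog.jpg")
# query_vec = late_fusion(text_vec, image_vec)
```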
And so what we'll be talking about today is really advancements in early fusion for use cases that are not diffusion, things that we want to do especially with RAG and document analysis, but I'm also going to show some non-RAG uses. I think we can all imagine what the RAG use case is for these, but there are some more creative use cases too that these things might be useful for. Have you guys heard of multimodal fusion?
Is this a fairly new term for folks? I think it’s fairly intuitive.
I’ve heard a bit mostly related to rag of, hey, I want to search my images like they’re, you know, with text and stuff like that.
Yeah. Yeah, that was my assumption. Yeah. Okay.
So we'll talk a little bit about what those mean and try to make them a little more meaningful as we go along. Here's the best way I can describe the difference between late and early fusion and how it ends up: say I have a fusion where I'm putting in some sort of input image of a dog and I'm adding a text modifier of "wearing a party hat."
A late fusion output is going to have trouble actually combining these things together.
You're more likely to get pictures of dogs and hats separately, which, I mean, is better than not getting those things, but you're going to have to do some additional processing to make it into something that's useful. Sometimes it's going to be able to do it, sometimes it's not. An early or intermediate fusion output is more likely to give you the actual contextual binding, because instead of taking these two concepts separately and jamming them together at the very end,
It’s letting them live in that shared latent space while they’re doing the inference, whether that’s outputting an image or, you know, going to find the proper cosine distance for something.
And so what we're going to be talking about tonight is the Jina V4 model architecture.
There are so many cool models out there in this space, but the reason I went for this one is that it came out, I think, maybe late last month, sometime around there.
And it’s just so cool because it does everything. It has kind of like all the little bits and bobs. It’s right there.
It’s the perfect size.
It’s not the perfect license.
We'll talk about that later. But it's a perfectly fine license if you're doing hobby stuff, and it's got all of the cool stuff to build out your pipelines for whenever the more permissive models come along. And it's a great topic for us tonight because we can go through their paper and talk through all their stuff. And so the big idea here is that they're training this little Qwen 2.5 3-billion-parameter model.
And it’s very large for an embedding model.
Usually embedding models are very, very tiny, but this one's fairly large in terms of size, though still very fast. It's a 3-billion-parameter model.
And what they do is different from before, where they would generally train the base encoder, then have a step afterwards where they trained the vision encoder, and then train the text encoder against the vision encoder, trying to do something like what they do with CLIP, where they run contrastive loss between the two to make that shared backbone. They do this a little bit differently: they just train the thing together, because it's a language model that already has these things together.
And to get it to do the retrieval tasks and to focus on generating useful embeddings for retrieval, they do that through LoRAs that are focused on the different elements.
So that’s one cool aspect that we’ll talk about.
It’s able to take an image or text into the same model.
So you’re not having to do encoding with different models.
So I'm not doing T5 and CLIP.
I just have one model that has those things shared together, which that’s your early fusion.
It generates out the token-level embeddings, and then you can do some sort of pooling over those features.
And then the really cool thing is the multiple kinds of output it can do. One is the dense single vector; pretty much everything we've talked about in this group up to this point has been these dense vectors, where you're looking at the whole thing and trying to turn it into one blob, saying this whole thing means these vector values. That's one way of doing it.
The other way of doing it is this thing called ColBERT, which is late interaction, the multi-vector approach, where we're basically slicing up the query tokens and slicing up the document. It's really simple when you think about it; it's kind of going backwards. We're going back to token-by-token analysis, but it's doing the cosine similarity across the segments of the two resources and then aggregating them for your true similarity.
And so it's more costly, but as things get more efficient, that cost is going to be negligible.
It’s the difference between having 20 megabytes of RAM in 1990 and having it in 2021.
So that’s what we’ll be talking about.
And so just as a reminder of how this works, this is CLIP. And with CLIP, you can see the difference between this one and that one: here we have the image going into the vision encoder once. I think Jay did a talk on Pixtral. It's very much like that, where it goes into the vision encoder, then into the shared LLM backbone, and then it goes through. Here, with CLIP, it's very, very different, where basically the text encoder looks at some sort of text input and generates out its vectors, and the image encoder looks at some sort of image and generates out its vectors, and then we're trying to train these things against one another.
And through that difference, you're trying to create a situation where my image encoder can predict what image matches some sort of text and vice versa. You're really just going back and forth between these two things, but it's kind of like if you had a blind man and a deaf man trying to travel through the jungle. Sure, they might be able to get it done if they have really good teamwork, but it would sure be nice if you could just do both. It's a game of vector telephone; I think that's the best way to say it.
All right, this is just another way of visualizing it, because I'm trying to present this a few times just to get an idea of what we're looking at here.
So with Jina V4, we're looking at text and image, one fusion layer goes into the unified transformer with cross attention for both the image and textual encoding.
And that generates and works inside of the unified embedding space, where we have to do that final similarity calculation between the dual head encoders.
All right, where is my thing?
Here we go.
All right, so next we're going to look at the Matryoshka... no, this is not the Matryoshka yet, because this is the late interaction.
And so this is the one thing from the paper I actually did not do a demo of, just because it's very hard to visualize what's happening with this one. But we will talk through it and get an idea for it. The idea here is that the single vector takes the entire document and condenses it into one thing, while the multi-vector is going to split that document into pieces and then individually rate those elements. And the nice thing about this is that it's just a flip of an input, and then it'll generate out the different sort of embedding.
And so there are three adapters that they included with their release of this model, one of which is a retrieval adapter, which is for a short-query, long-document sort of search.
There’s something that’s looking at similarity, so trying to find things that are similar within some sort of corpus, which has a different sort of metric that it looks at.
But another one that's very interesting, and I don't have an example of this because I couldn't figure out how to visualize it, though I did prototype it and get it to work, is the code search adapter. And so even though natural language and code are both obviously using text-based encoding, the structure and the meaning of them are really quite different whenever you start looking at things like distance. And so this adapter is a way to help with that. For instance, I wrote some little toy Python code; I had it generate out a whole bunch of things for elevators and different space things, and then I fed it an image and it was able to pop out the code that was mocking an elevator. So it's able to cross modalities in that way.
Oh, which is very cool.
Oh, and I left this thing here for me to pay attention to and go through as I went.
And I didn’t do that. So we’re going to move on. All right. So that’s the first little jumble.
And that’s really the high level of what we’re going to talk about.
We’re going to go into some details on some of the little tidbits around.
But I guess any thoughts up front about this?
Not yet. I’m really interested to see the visualization and kind of what you’ve done with it. Yeah. I agree.
We will race towards there.
All right. So one thing is the really cool model, but they also released a really good benchmark with it, which, if you care about training and stuff, is good for evaluations. If you just want something useful to test against yourself, you can also train on it and do your own stuff with it; it's a nice curated dataset too. And so this is the Jina VDR benchmark.
This is really meant to replace the ViDoRe benchmark, which has been kind of the leading one up to this point, but it's starting to get saturated.
It’s starting to get so, you know, everyone’s getting so good at it that we need a new benchmark. So they’re helping us make one.
And the idea here is to counteract the task saturation as we expand out into more domains, like things like the code retrieval based off multimodal or audio input, stuff like that, expand the scope of those things to different sort of types of retrieval and improve the data diversity for kind of what’s feeding into it.
And so as far as how they curated the data, most of it was academic datasets, things like VQA and OCR datasets; sure, they have all the stuff from ViDoRe and all that in here already, but they're extending it with some additional manual annotations on those datasets for different tasks, essentially. So they're using the same core data but adding different task activities to them, or augmenting them with synthetic generation with LLMs.
And this is pretty useful.
I would say that this is a very big issue if it’s all you’re doing. It can go wrong really quickly. But if you’re using it to fill out data where you have existing data that is real, and you’re kind of fleshing it out, this can be very useful.
And that’s kind of what they did in the paper.
So as far as, let me find something, here we go.
So the distribution of tasks that they have in here: a lot of them were mixed, lots of things related to history, science, government, finance. So it's going across all the disciplines and trying to flesh out some of the disciplines they didn't have before, but also looking at document diversity. They're pulling in things like digital documents and charts, and some of the things they added this time were lots of stuff around markdown, slides, and maps. They already had scans and handwriting, but the markdown, slides, and maps, they had some really, really interesting data there. And as somebody who uses this stuff sometimes for things like narrative building and tabletop gaming, I love the fact that they have maps now, very useful for me. That's one of the reasons I thought this was really cool. They took a cross section of Shanghai, apparently, and they have lots of tasks where it's trying to infer information from technical drawings, which obviously has a lot of use for a lot of people.

So as far as the language breakdown for this thing... You need to do a talk on AI for gaming too. Oh yes, probably, that'd be pretty fun. But yeah, as far as the languages, it's still very, very English based. That's where a lot of the writing that's out there is, things like arXiv and a lot of the training sets, but there's also lots of content for Japan, France, and a whole bunch of other places. So yeah, it's a very, very good multilingual sort of benchmark.
Some of the things that they care about, I thought this was interesting. I had not heard of cross-modal alignment previously; this is kind of an interesting new metric for me.
And basically, the big thing here, think of this almost like confidence that an answer is right or wrong. The issue with CLIP is that if it's not sure about something, say I'm feeding it a picture of a cat and asking if it's a dog, it's going off of what I'm asking it textually and what it's seeing in the image. It can kind of understand it, but it doesn't understand how to really marry those things in an instruction-tuned sort of way. And so what you'll generally see is that it's unsure.
That’s what you see.
You’re looking to maximize here, the distance between your correct value and your incorrect value.
You want it to be much more confident in the correct samples and obviously much, much less confident in the negative samples. So if it was a wrong sample, you want to see it in the other way.
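One way to think about that gap is as a margin between the score of the correct pairing and the best-scoring wrong pairing. Here's a rough sketch of that kind of measurement (not the paper's exact metric), assuming you already have embeddings for the query, the positive, and some negatives:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_margin(query_vec, positive_vec, negative_vecs) -> float:
    """Difference between similarity to the correct item and the hardest
    (most similar) incorrect item. Bigger is better: a confident model
    separates right from wrong answers by a wide margin."""
    pos = cosine(query_vec, positive_vec)
    hardest_neg = max(cosine(query_vec, n) for n in negative_vecs)
    return pos - hardest_neg
```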
And so what we're seeing with Jina Embeddings is that because it's doing that early interaction, it's able to confidently suss out what is being asked of it and what the correct answer is. So it's not just a matter of getting it right or wrong; it's also showing its confidence level. And so some of the performance metrics that we care about here.
So one of them is the benchmark difficulty comparison.
So this is ViDoRe. You can see here, it's getting up to that 90% mark. That's not great for a benchmark; it's saturating, so we need to start beefing up our evaluations. And you can see here, the same model does 90 on ViDoRe. And I think ViDoRe stands for visual document retrieval or something like that; I'm assuming Jina VDR is also visual document retrieval. But, you know, who can say much for creativity in benchmark names.
But you can see here, there is a drop in performance there, which is good.
That’s what we want. It means we can keep growing. This is also a very interesting one.
And I think this is going to be one that's more relevant as time goes on and the infrastructure around these things matures: the single vector, which is our normal dense vector, has performance up to about 73, 74%, but with the multi-vector, this late interaction, we see a jump from there up to 80%. And right now, depending on your use case, that's probably not enough to justify the extra cost. However, if the absolute numbers keep moving up and that kind of jump, say between 90 and 99%, stays the same, this becomes much more interesting. And I think it will probably get there just based off what it is. And we'll talk a little bit about why Yann LeCun agrees later on.
All right.
There's another one that's interesting: semantic textual similarity, where we're basically asking how much the model's understanding of a cross-modal problem matches what a human judge would say. It doesn't have to be cross-modal, but in this case it's cross-modal problems. And you can see here that Jina Embeddings is coming out on top.
They did something similar with their last one, but they have lots of more features on this one.
But we also see here some of the other contenders.
So Cohere is another very large one, and Voyage. I think you don't get another one until you get down to BGE-M3, which is your next major open source one.
And this one has a good license.
So this is kind of what you’d use.
One that I'm also watching very closely in this space, just because of what they've done in the past, is the Nomic embeddings.
So they are very likely to release something pretty interesting in this space, too, which will probably be around this area, too.
But their licenses are always very, very permissive. All right. So we’re going to talk to dense versus multi-vector.
I think we can go through this pretty quickly because we’ve kind of covered it, but I couldn’t help but yapping about it before.
But the idea here is that dense is the global snapshot where it’s looking at the whole thing. It has really no clever things.
It’s a naive embedding.
And this is the fine-grained view.
I talked about it already, so I'm going to move on. But what I haven't talked about is a little bit of how it does this aggregation, where it's splitting things out into individual tokens and computing a similarity between each query token and every document token: Q1 versus D1, D2, D3, all the way to DN, then Q2 versus D1, D2, and so on. So it's just a matrix multiplication, essentially. It's obviously not super efficient, but it has benefits as far as being able to understand those segments that we lose with dense retrieval.
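The ColBERT-style scoring being described is usually called MaxSim: for each query token, take its best match among the document tokens, then sum those maxima. A minimal sketch over pre-computed, unit-normalized token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction relevance score.

    query_tokens: (Q, D) matrix of unit-normalized query token embeddings
    doc_tokens:   (N, D) matrix of unit-normalized document token embeddings
    """
    sims = query_tokens @ doc_tokens.T      # (Q, N) all pairwise cosine similarities
    return float(sims.max(axis=1).sum())    # best doc token per query token, summed

# Ranking a small corpus is then just a loop over per-document token matrices:
# scores = [maxsim_score(q_toks, d_toks) for d_toks in corpus_token_matrices]
```

The per-token comparison is what lets it latch onto individual regions of a page or phrases of a query that a single pooled vector would average away.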
As I've seen it, it's also very interesting with things like visual tasks. That's actually where I saw this come up the most, with ColPali, which I'll talk about later. But there are also things like ColQwen, where being very good at segmenting PDF pages has been a very interesting use case for this that can't really be matched in other ways.
All right.
And so, yeah, here's the original architecture for that late interaction thing. This is from the same guy who did DSPy.
I’ve kind of talked around that.
Definitely something to look out for in the retrieval space. All right.
Any questions about any of that?
No, I just want to know how you say this word.
So now we'll talk about Matryoshka embeddings. Okay. These are named after the Russian nesting dolls. And these are super cool. I've actually been playing around with these for some tasks and have found them to be pretty effective.
So the idea with these is that we basically… We have these large, large embeddings.
And so the ones that we’ve been playing with in the past have been like 512 to 768.
That’s really kind of the normal size for these things, at least when we’ve been talking about them.
They can obviously go much, much bigger. And for these, 2048 is the native size of Jina V4.
I've also been playing with Qwen as an embedding model, and it's 4096. I actually have to use special vector types in pgvector, like halfvec and sparsevec, to even fit it inside of the table.
And so it’d be really nice if there was a way to effectively truncate those embeddings, quantize them in a way that didn’t lose me a lot of performance.
And that is what Matryoshka embeddings do. These are really cool because of how they function: they essentially order the dimensions by level of importance, so that the most important, highest-impact components are naturally towards the front.
And what this means is that we can progressively truncate the embeddings the easiest way possible.
We just slice off the array.
I just kick out the last half of it and keep stair-stepping down. And my performance loss is not linear as far as how that translates; it's actually pretty graceful. And so this is obviously the Russian nesting doll idea: you're taking your data distribution and you're decreasing and decreasing and decreasing it, but your highest quality data comes pre-quantized, in a nice little package where all you have to do is slice it off. And that's literally what I do in my code, just slice off the thing and call it a day.
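That "just slice off the array" part really is that simple; the only thing worth remembering is to re-normalize after truncating if you're doing cosine similarity downstream. A minimal sketch:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` components of a Matryoshka-trained embedding.

    Because the model was trained so the most informative components come
    first, this is a cheap, surprisingly graceful form of quantization.
    """
    small = vec[:dim]
    return small / np.linalg.norm(small)

# e.g. shrink a 2048-d vector down to 512 dims for fast CPU re-ranking:
# fast_vec = truncate_matryoshka(full_vec, 512)
```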
It's awesome. All right. And so what they do to achieve this is they basically train the full vector with four different losses, at 128 dimensions, 512, 1024, and 2048, and they use that combined loss to update the model as they go along.
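Here's a rough sketch of that training idea (not Jina's exact recipe), assuming a standard in-batch contrastive setup in PyTorch: compute the same contrastive loss at each truncation size and sum them, so the leading dimensions are forced to carry most of the signal.

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [128, 512, 1024, 2048]

def matryoshka_contrastive_loss(query_emb: torch.Tensor,
                                doc_emb: torch.Tensor,
                                temperature: float = 0.05) -> torch.Tensor:
    """Average of in-batch contrastive (InfoNCE) losses at several truncation sizes.

    query_emb, doc_emb: (B, 2048) paired embeddings; row i of each is a positive pair.
    """
    total = 0.0
    for dim in MATRYOSHKA_DIMS:
        q = F.normalize(query_emb[:, :dim], dim=-1)   # truncate, then renormalize
        d = F.normalize(doc_emb[:, :dim], dim=-1)
        logits = q @ d.T / temperature                # (B, B) similarity matrix
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(MATRYOSHKA_DIMS)
```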
And what that looks like as far as precision is here.
At 2048, I'm paying the full cost, but I get the full accuracy. If I cut it in half, I just lose a little bit, and then I'm at 1024, which is very, very accessible. But I can also cut it down to one of the normal traditional sizes, your 2022 numbers, 512, and I'm still getting into that 90% range for performance a lot of the time.
My cost is way, way down.
But I can also go all the way down to almost nothing.
I can run this thing on a CPU and not blink. And so that is the power of Matryoshka embeddings.
Sometimes you don’t need all that performance. One place that I found a lot of use in this kind of how I’m using it in my pipeline is re-ranking. So re-ranking, I’m kind of just trying to order non-linear data along a linear axis.
And it's already not super accurate to begin with, so I'm really just looking for a vibe. And if I can do that vibe super quickly and try a whole bunch of rankings, bully for me. So yeah, that is Matryoshka embeddings. All right.
Now we’ll talk about the rankability of visual embeddings.
This is another paper that came out.
And just to talk a little bit about ranking. The idea of ranking is basically that we want to take some set of inputs and then sort them.
I mean, that’s the easy way of saying it, obviously.
So if I sort something by date, that’s ranking it by date, essentially.
But the key, and why we really care about this, is that we're trying to find a way to convert some non-linear aspect of the data into a linear, rankable axis. And so this paper is asking: are visual embeddings rankable?
Because it seems like it might be something where it would not be rankable.
It’d be very hard to take something as data rich and as kind of noisy as visual data and kind of convert that raw visual embedding into a rankable axis that you could do things like sorting on.
But what they find is that you can, in fact, do that.
And so the idea here... question: would that be like saying this dog is more dog-like than the other image of a dog?
It could be that, too.
It could also be like this dog is older than that dog. Or this dog and this human, here’s a picture of a whole bunch of mammals. How old was each mammal?
It could be stuff like that, too. That last one is less likely, but dogs and humans it might be able to do. And so what they found is that vision embeddings are generally rankable.
So obviously it's not going to be absolutely perfect, but they tested a bunch of attributes that can be converted into a meaningful linear axis. So things like age, things like the size of a crowd, how many people are inside of a crowd, pitch, yaw, and roll, and aesthetics. Aesthetics is one that's popped up a lot if you think about diffusion models, sort of LLM-based adversarial training for diffusion models. And they find that if we train on these things, we can effectively get rankable embeddings out of them, which is super neat.
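The test amounts to fitting a simple linear probe on frozen embeddings and checking whether its predictions preserve the attribute's ordering. Here's a hedged sketch of that kind of check (not the authors' exact protocol), assuming you already have an (N, D) embedding matrix and an attribute value such as age per image:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

def rankability_check(embeddings: np.ndarray, attribute: np.ndarray) -> float:
    """Fit a linear probe on frozen embeddings and report the Spearman rank
    correlation between predicted and true attribute values on held-out data."""
    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, attribute,
                                              test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    rho, _ = spearmanr(probe.predict(X_te), y_te)
    return float(rho)  # close to 1.0 means the attribute is linearly rankable
```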
I think the pitch yaw roll would have application in this town.
Very likely. Possibly. And so, yeah, this is their output. I usually don't do this, but I did put in their final statement, which was basically: well, we didn't really think this would work, but it does, and that's great.
We’re happy about that.
So it seems like they kind of came in skeptical.
We’re just kind of testing it out.
But it does seem that visual embeddings can be trained to capture ordinal information. That's one of the things I tested in a very naive way; I'm sure I have it completely wrong.
But even I was able to get some interesting stuff out of this in my little toy application. All right. We’re almost towards the very end where we can go poke around at some stuff.
We'll talk a little bit about dimensionality reduction for visualization, just to get a little bit of insight into what we're looking at and what the big picture is about. So there are three main types, and I say main, these are the ones relevant to what we're talking about, for basically taking high-dimensional data, 2048-dimensional vectors, and trying to get it down into some sort of two- or three-dimensional viewing angle, which is obviously quite hard to do.
And it’s not perfect.
And so we’ve looked at a few of these things in the past when we were kind of looking at a lot of the interpretability stuff that Anthropic does.
They have like their Golden Gate Bridge sort of thing that they were doing this sort of dimensionality reduction activity.
And we’ll be doing one today. The one that is the most famous, most people know is this principal component analysis, which is basically trying to find the most important axes of variation.
The problem with principal component analysis is that it's generally expecting the data to have some sort of linear aspect to it, which really doesn't work with data that is very messy and very dirty, things like visual data.
And so it doesn’t really work for our use case. It is very fast though, which is nice.
There's another algorithm called SNE, or t-SNE. This is, I think, Geoffrey Hinton. I can't remember exactly.
One of the Godfathers. He came up with this. This one’s really focused on kind of nearest neighbor sort of aspects where it’s trying to look at things that are close to it and suss out where there are connections that are meaningful that continue to be meaningful as you reduce it down the dimensionality scale.
The main hyperparameter for t-SNE is perplexity, which is quite perplexing when you try to figure out what the heck it's doing. So it's not always very clear how to optimize it to make things work better. And it's also quite slow. Those are some of the problems with t-SNE.
What we'll be using today is UMAP, which is a fairly new take on something close to t-SNE, where it's still doing a lot of the adjacency stuff. We still care about things that are nearby and making those connections. In fact, we care a whole bunch about it, because the main hyperparameter is your number of neighbors.
So how many sort of neighbors do I reach out to to assess what the clusters in this collection are and what things are meaningful as we reduce dimensions?
And so that’s what we’ll be looking at tonight.
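For reference, the whole reduction step behind the playground only needs a few lines with the umap-learn package; `n_neighbors` is the knob the talk keeps coming back to. The other parameter values here are just reasonable defaults, not the ones actually used in the demo.

```python
import numpy as np
import umap  # pip install umap-learn

def project_embeddings(vectors: np.ndarray, n_neighbors: int = 15) -> np.ndarray:
    """Reduce (N, 2048) embeddings to (N, 2) coordinates for plotting.

    n_neighbors controls how many nearby points UMAP consults when deciding
    which structure to preserve; small values favor tight local clusters,
    large values favor the global layout.
    """
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.1,
                        n_components=2, metric="cosine", random_state=42)
    return reducer.fit_transform(vectors)
```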
So these are some techniques that are fairly old; have you guys worked with PCA or t-SNE? I think t-SNE is from '08. Yeah, I'm not fast enough to find it, but some of our work, I mean, this was five-plus years ago.
We were doing a lot of things with textual analysis and topic modeling and things like that.
We were using t-SNE. PCA wasn't quite good enough for what we were looking at, but we were doing some graphs with this kind of stuff.
It was fun.
Again, perplexity, I don’t know what it means still, but I know how to find it. Perplexity.com.
All right, so last little bit.
We’re going to go into tidbits.
These are little things, rules for the road that I didn’t really want to get into.
There's actually one paper that I kicked out just because I wanted to get to the other stuff and have a little bit more free time, but I do want to mention it, which is VLM2Vec. This one has a very nice license.

This is Apache 2, and the main reason the other one is not Apache is because it's based off of Qwen 2.5 3B, which has a research license, the CC 4.0 one or whatever that is, whereas this one is using a 7B and does not have that issue; it is released under Apache. The nice thing about this one is that it's really, really trained on video. We can talk about text, we can talk about images, but we can also add in the cross-modal aspect: how can I query with a video to get text or images, maybe code, back, or with audio, talk into my phone and have it understand not just the words that I'm asking but the semantic tone, and retrieve me something back from any modality. Go find me a movie to watch based on how pissed off I am. Little use cases like that, starting to think about those sorts of things, is what has me very excited about VLM2Vec. This is what I'm looking out for.
I got this working. The video side is very demanding, because you've got to do it at six frames per second over however long the video is, so it has some additional difficulties with it, but I think this is going to be a very interesting place to watch over the next few years. Another one to look out for in this space is ColPali, or PaliGemma.
It’s very likely that Gemma will come out with some additional interesting encoding stuff.
I think they actually did the original release of SigLIP, which is what almost everything that came after CLIP has been based off of. I think Pixtral is a modified SigLIP adapter, and I think Qwen is as well. They're obviously one to watch out for in this space. They've obviously got this down if you look at the power of things like Gemini Flash. If I was able to get something that's kind of like a Gemini Flash through the Gemma series, that'd be super neat. Obviously it's an American company, so that's neat too. Another one to look out for, and this one's kind of the most on par with Jina on the market right now, is Cohere. They keep their embedding model closed source, so that's less interesting, but their Embed v4 kind of does the same thing that Jina does, but Jina does it better, which is kind of fun. They also let you do text and images with early fusion. The big thing with them, these are the guys who do Command R with that crazy long context length that they released open source, is that they have a 128K context length, which I don't know why that's useful, because this would be for your dense embedding, but they have it, so that's neat.
Good for them, I guess.
I will note, this is the last little bit, one of the things that makes me the most excited and that I think makes it the most worthwhile to spend time in this space. I don't know if you guys know who Yann LeCun is. He is the head of AI at FAIR, which is Facebook AI Research. This is different from all the Llama stuff, so FAIR is kind of its own special little fiefdom where Yann gets to play with whatever his toys are, and he's really all in on this V-JEPA thing. Yann LeCun is one of the three godfathers of AI, between him and Yoshua Bengio and Geoffrey Hinton. I think that, I can't remember, did he do the multi-layer perceptron? He was at Bell Labs back at the first convolutional neural network, I believe. I don't know if it was called AlexNet or what it was. Yeah, AlexNet was, I think, Yoshua Bengio and Ilya Sutskever. I'll find it real quick.
It was, I think, 96.
Yeah, yeah, and he had some stuff in the 80s, so he has some fundamental thing that he invented.
I can’t remember what it is now.
Yeah, the thing he did was the CNN. It was initially to figure out, kind of like for the post office, how to automatically read zip codes from handwritten digits, you know what I mean. Gosh, it bugs me.
Anyway, continue.
I’ll drop a link when I find the paper.
So if you watch Yann LeCun at all nowadays, he's always out there kind of poo-pooing all the transformer stuff, saying language models are a dead end.
A lot of people misunderstand what he's saying. He's basically saying that if we strictly stick to language space, which is just one small little system, one small little algorithm, the algorithm of how language works, you're not going to get to AGI. And people don't like that, because they're like, what do you mean? ChatGPT is awesome. And I think that he is probably right, and that we do need some additional things. And his big idea here is this concept of JEPA, a joint embedding predictive architecture, something like that. But basically the idea is that we need to have things like what we're talking about right now, where we fuse, early on in the process, multiple embeddings from multiple modalities into some sort of shared space, and that that's the only way you're going to effectively get a world model. So you have to have these joint embeddings that live early on in the process.
And what he's really looking at here is video as a big means for this, especially for things like effective robotics and other systems that need a true base understanding of the world.
And we can see, through what we're looking at just with the fusion of the vision and text modalities, that by properly fusing those things, we do see those sorts of bumps in true understanding of these topics, which kind of lends to what he's saying: okay, so we want this to be a world model, well, you've got to basically encode the world. So one last tidbit to think about. All right. Any questions on those topics before we kind of poke around?
One thing I was going to mention, it just felt familiar. The part where you were looking at rankability of embeddings and stuff like that, that feels like when I first came across word mover's distance, and somebody figured out, wait a minute, if I move in the same direction from France to Paris, and I go find the United States and move in the same direction, I wind up at Washington, DC. It's kind of like what you talked about.
It’s like, well, let’s let’s try it.
We’re not sure if it’s going to work or not.
But oh, this is interesting. Yeah, it seems like the general trend is that more things that we think are magic are less magic than we think. There’s more algorithmic basis for more things than we would expect.
And it’s just that we don’t understand it, basically at all.
It’s okay. There’s other things we think are really simple that are still kind of magic. It’s like walking upstairs. Any other sort of stuff on all that? We can pop back and forth too.
No, but I found it. So 1986 is when back propagation paper dropped.
That was it. That was Geoff Hinton. And then LeCun was actually '98 with Gradient-Based Learning Applied to Document Recognition. I'll drop the link to that in a second.
All right. So, you know, I don't have a huge way of introducing this; it's just kind of a big blob of stuff. But what we have here, I've got sort of two major areas, and the other one I can go into is looking at the actual database, if we ever want to look at that.
But we have a whole bunch of points here.
And it's got a server sitting on the back end that is doing the dimensionality reduction, and then something else doing the embedding and ranking service. And so I can take this down to a much smaller area. We've got this big large chunk here, but I could pop it down into something that's like 500 points.
This is going to be a subset of the data that we have.
It’s a little bit easier to kind of track what’s going on.
And so what it’s doing here is it’s grabbing all these pieces and it’s trying to cluster them into some sort of meaningful groups. And so here, you know, it’s always curious on how it does it.
Sometimes it’s very obvious how these things are clustered.
I’ve noticed that the larger things get and, you know, the different parameters that I put in, it could be much more obvious.
But like here, it seems like square-ish things in the sky is what it's keying on.
And it’s kind of got a loose idea of where things are.
And so over here, we've got a cluster of human-ish sort of things. And the idea here was a low Earth orbit sort of setting, I call it The Frontier Above, sci-fi cities and low Earth orbit, starting to poke over at Mars, things we might be doing in 2125. So I'm starting to think of a campaign around that to run with my friends.
I’m just kind of grabbing some stuff to think about.
That's kind of the idea here. Got some random stuff too, like Magic: The Gathering cards. And so I'm going to poke this: what the heck's this doing here? I've got descriptions. I ran Qwen 32B on these to give each one a description and pull out some tags, based off of that, things like setting and style. I have a Midjourney sort of pipeline.
And so this is kind of something that is using that with some additional data along with it.
And you can see it’s kind of got the cluster confidence stuff like that.
Also, what we have though here is the ability to do search.
And so I can take that image right there and send it down here.
And I can go find things that are like that guy, which there might not be many.
But here I have a certain threshold.
And it’s returning a few of those.
And we can see here kind of what the similarity is.
And so you can see the first result, it says it’s really, really similar to itself.
You’ll notice it’s not 100% similar to itself though, which is interesting.
But then you got some other stuff.
So we've got these Magic: The Gathering cards.
And what’s the second thing here?
What’s the next most similar thing is?
Oh, that’s searching the whole database.
Okay. I’m going to want to go out to the bigger one.
Let’s see if we can find that card again, though.
All right.
So I'm going to search for Magic: The Gathering... there we go.
All right. And so the second most similar thing we find is another card that somehow got in here.
And what I did here is I just asked ChatGPT for some sci-fi-ish words to put into Bing, and I grabbed all the stuff that came back. And so I got some chaff in there, and this is one of the chaff things that got through. And so that's fine. Not super interesting. Why don't we pop up to this bigger one here?
And let’s look at something like how about space stations?
So where inside of this big blob of stuff do we think space stations are?
Anybody have a guess?
I’m going to go to the thing on the far right. Thing on the far right. It’s like in its own space.
All right. Let’s see green blob.
Green blob. It was right there.
Oh, not what I picked. So we have found station.
There’s probably more space stations here.
So we can see here that this is kind of the area of things that seem the most like space stations, which yeah, I mean, so satellites, space station in orbit, sure.
And so you see here, actually, I’ve got this actually doing something interesting where it’s doing 30% keyword analysis on the captions and 70% semantic similarity right now.
And so if I flip that, this would change a little bit. And so some of these are capturing actually the space station stuff in the keywords.
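That 30/70 split is just a weighted blend of a keyword score over the captions and a cosine score over the embeddings. Here's a minimal sketch of that blending, with a deliberately crude keyword score standing in for whatever BM25-style scorer the actual back end uses:

```python
import numpy as np

def keyword_score(query: str, caption: str) -> float:
    """Crude keyword overlap: fraction of query terms present in the caption."""
    terms = query.lower().split()
    hits = sum(t in caption.lower() for t in terms)
    return hits / max(len(terms), 1)

def hybrid_score(query: str, query_vec: np.ndarray,
                 caption: str, caption_vec: np.ndarray,
                 keyword_weight: float = 0.3) -> float:
    """Blend of keyword and semantic similarity; keyword_weight=0.3 is the
    30% keyword / 70% semantic split mentioned in the demo."""
    semantic = float(query_vec @ caption_vec /
                     (np.linalg.norm(query_vec) * np.linalg.norm(caption_vec)))
    return keyword_weight * keyword_score(query, caption) + (1 - keyword_weight) * semantic

# Flipping the slider in the UI is effectively just changing keyword_weight.
```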
But I can look at here and sure, yeah, that’s a space station.
That’s fine.
But what if I now take... so here I've got a re-rank thing, and now I can put some additional criteria in here.
And I just I just did a simple one with text.
It’s not super optimized.
But maybe I go in here and put something like Mars. And so then I can sort of pop up things that look like Mars instead. And so it might be the case that you could do this a few different ways.
But you’ll kind of it gives you more capability to do interesting ways of analyzing the data.
So I could just do space station Mars. I’ll probably get a lot more things or yeah, so I get more red things here. So I could kind of get this sort of item here with space station Mars.
But I'll probably get things that narrow up the area a lot more.
Or I might find that there’s something down here. What’s this?
Okay, this is interesting.
So what’s going on down here?
I’ve got this cluster.
It’s got some sort of vehicle on a red planet.
And a watermark, which is fine.
I’m not training anything.
And so yeah, no, so now I found I found the Mars cluster by kind of navigating through this sort of thing.
So what about a search for something that you wouldn’t expect to be in there?
Like leprechaun or rainbow?
Rainbow?
So what it does, it seems like it is grabbing the color palette.
Cool. Because the nice thing here too is that it's got the joint embedding. So right now, the thing doing this is actually doing the similarity on the image embedding. I also have the text, the caption, and that is also inside of an embedding. So if we go here, I can actually show... so, Frontier Above.
Here’s what the raw thing looks like.
This is kind of what it’s looking at.
You have got 514 points in here.
It’s using default, the 2048.
So that kind of max thing, HNSW, it's kind of your standard stuff. I can go in here and look at a point.
So I can go maybe in here.
So this is kind of what’s actually inside of the database that I have.
And so I have a default vector specified, which is this 2048 with this.
But I also have some additional information that I’m using to kind of gather stuff.
So you have the image embedding. There’s probably some interesting stuff too, because I took this caption field, and actually I have a caption embedding too. And so it’d be very interesting to kind of do some dual sort of blendy stuff with this that I want to play with.
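For anyone curious what that point layout looks like in code, here's a hedged sketch of a Qdrant collection with separate named vectors for the image and caption embeddings, roughly matching what's on screen. The collection name, field names, and random placeholder vectors are illustrative, not the actual ones.

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Two named 2048-d vectors per point: one for the image, one for the caption.
client.create_collection(
    collection_name="frontier_above",
    vectors_config={
        "image": VectorParams(size=2048, distance=Distance.COSINE),
        "caption": VectorParams(size=2048, distance=Distance.COSINE),
    },
)

image_vec = np.random.rand(2048).astype(np.float32)    # stand-ins for real embeddings
caption_vec = np.random.rand(2048).astype(np.float32)

client.upsert(
    collection_name="frontier_above",
    points=[PointStruct(
        id=1,
        vector={"image": image_vec.tolist(), "caption": caption_vec.tolist()},
        payload={"caption": "A rusty rover on a red plain", "tags": ["mars", "vehicle"]},
    )],
)

# Search against whichever named vector you want to lean on, e.g.:
# client.search(collection_name="frontier_above",
#               query_vector=("caption", query_vec.tolist()), limit=10)
```

Having both vectors on the same point is what makes the "dual blendy stuff" possible: you can score against the image embedding, the caption embedding, or a weighted mix of the two.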
And then here’s kind of your structured output, I think.
See, oh no, he’s not here. There’s somebody who’s saying they’re having trouble with structured outputs. And this is one where I was getting very good effects with this, basically kind of getting it to put out, you know, tags and stuff like that very easily. So that’s some of the stuff that’s sitting behind it and why it has the ability to grab some of that data. So let’s say, okay, let’s do this by Mars.
This is ranking "rainbow" by "Mars."
And so right now it’s kind of, to me, this is interesting because it gives me something to kind of play with from a creative aspect.
I will note that this data set is super tiny.
5,000 images is nothing.
So imagine the ability of this where you actually have, you know, 50 images of your rainbow, 50 images of Mars, when you have 500 images of those things, and kind of the fidelity that you can get to if you invest in this past a weekend. Yeah, so there’s more fun things. You can tick in, look at whatever this thing is, kind of see where it pops out. This one’s interesting.
It’s kind of got a different shape.
So you know, sometimes there’s clusters that are very tight.
Sometimes they’re more diffuse.
It seems to be the ominous thing in the sky cluster. It’s fun just to navigate, look around. Very interesting. Yeah.
You can also go a different direction.
So we’ve been kind of navigating through search, but we can look through here too. So like we were looking at one area. Let me see if I can find the drifters.
There’s like this, I think there’s like a cyberpunk thing that got in here where there’s like a subclass called the drifter. And so I picked up a whole bunch of things like that. I can always find them out around the edge.
So we’ll see if we can find them.
Let’s see.
So let’s see what these little clusters are that have kind of formed.
Here we’ve got like some sort of exhibition space, something like that. And so halfway between these normal convention spaces, and what is this?
A weird alien art deco is whatever this thing is.
Seems to be even weirder alien art deco.
It seems like it would be really useful if you had a ton of images and we’re looking for things that were just kind of outliers that weren’t related.
When you don’t know exactly what you’re looking for.
Yeah. And a lot of the, the reason I was able to kind of put this, this is a, I was helping with training a image model. And so this sort of thing is very useful if you’re looking at all of your training data.
And so my question is, what’s right here and do I want that in my training set?
Why, why, why do I not have a more example of this?
Looking at this, I’m pretty happy it’s not in my training set.
That’s a pretty crappy image.
Why do I have so many things like that in my training set? It's interesting you mentioned Midjourney as part of your flow earlier.
You know, their Explorer used to have kind of a semantic search like this. You'd always see similar... yeah, they cut it out about six months ago, but whenever you used Explorer to search, any image you clicked on would then show you images that were similar, regardless of whether the prompts matched.
Yeah. So this is interesting.
So up here, we’ve got kind of like these like character sheet looking things.
Honestly, it looks like AI-generated slop is what this looks like to me.
But it’s got these these panels.
And so you’ve got the over here, so you seem to be like more cyborg focused.
But you got the same ish thing up here. These are a little bit different.
They're cyborg butlers. Cyborg butler is its own cluster.
Oh my gosh.
Oh, very good.
Okay.
Oh, there you go.
Very good.
You never know when politics is going to happen. Yeah. There you go.
Oh, there’s.
All right, cool.
More. I guess, scientifically speaking, these little nine-panel things apparently are up in this area.
So collages.
So this is kind of a collage area of how they collapse this time. So we do this one more time.
You know, this will be a different area entirely.
There’s not really a good way of collapsing these things down from 2048 dimensions.
Oh, I found the drifters.
I found the drifter.
Oh, these aren’t quite the drifters.
Drifters must be close by. It’s a whole gaggle. So it’s just a really tight tight little ball of stuff. There’s lots of stuff down here. So yeah. All right. Got the new grok right here. There you go.
All right. So this seems more character.
So it’s kind of like a human character sort of things. Cyberpunk-ish stuff. Character portraits.
Some guy who definitely works in Huntsville. Yeah. That's kind of what's here. I can keep poking at it, but I'm just now curious if Yann LeCun is in your dataset.
I doubt he would match. You were looking for like sci-fi stuff, right? Initially, Topola.
Yeah, yeah.
I don’t know. I don’t think he’d qualify, but I was wondering if you could pull that a net research. The alarm’s strong. Pick an astronaut. I don’t know. Oh, I did find a Superman.
I think Superman was in here.
Yeah.
So you got the Superman cluster over here.
Okay.
So I wonder what that looks like.
So yeah. So here’s all your human characters.
So we say Superman’s right there.
Yeah.
Yeah. Bugs Bunny. Where’s Bugs Bunny?
Oh, that’s very interesting.
Yeah.
Sometimes I get stuff like this.
I think it’s just because there’s two bugs in here.
I know it's got... Did you find Marvin the Martian? Yeah.
So we can kind of see where things pop up.
So the things that are most like Bugs Bunny, how far apart are those things?
You can see here, too, that the most similar thing is only about 0.3 similar.
Almost might be interesting to do a further study where you’re like, okay, if the results are scattered like this, it’s not a good match.
But if you do get a clump.
Yeah.
Another thing that's interesting is that there are actually negative similarities, so you can go down to negative one. That's something I didn't add a good filter for.
But find things that are the least- What’s the thing that’s the least like Bugs Bunny in this dataset?
Yeah.
I think that would also be interesting. I think in general, it probably is going to be tighter since it is clustered on image.
And it’s likely going to be better clustered for image search.
That seems to be some sort of cyborg stuff.
This is a very interesting one, this variant. So I'm actually going to turn something off: right now I have it on hybrid, and I'm going to key it over to semantic. That way it doesn't catch any of the caption, so we can see if the caption is cheating for us.
Here's one of the interesting things about this model that you get with the joint embeddings, and that you would not get with CLIP: I said "terraforming" and it pops out Terraforming Mars. That's a board game, but it's able to encode the actual text inside of its embedding for retrieval with a pure text input, which, if you're doing RAG, is obviously very useful, because what it means is that you don't have to chunk your papers.
This is a game, but it’s able to encode this actual text inside of its embedding for retrieval with a pure text input, which if you’re doing rag is obviously very useful because what this means is that you don’t have to chunk your papers.
You just have to split the pages.
You have to chunk them in the sense that you split them into pages, but you're not having to go in and do OCR and crap like that. That's pretty cool. Could you use this as some kind of... it'd be interesting to see this as some kind of quality metric on whether captions were off or not.
You just take for granted that what the caption is actually matches the image.
Yeah, I'm sure there's some quality control, but not on the internet. Sure, you could probably do it for a quick classifier sort of thing.
It probably wouldn’t be good enough to be like the hardcore thing, but it could probably tell you that the vibes are off. Yeah, a smell test. Yeah, there’s maybe a cluster of things that are wonky here.
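That smell test is straightforward with a joint embedding space: embed the image and its caption with the same model and flag pairs whose similarity falls below some threshold. A rough sketch, assuming you already have both embeddings per item; the threshold value is a made-up starting point you'd tune on your own data.

```python
import numpy as np

def flag_suspect_captions(image_vecs: np.ndarray, caption_vecs: np.ndarray,
                          threshold: float = 0.2) -> np.ndarray:
    """Return indices of items whose image and caption embeddings disagree.

    image_vecs, caption_vecs: (N, D) arrays of unit-normalized embeddings
    from the same multimodal model, row-aligned (row i is one image/caption pair).
    """
    sims = np.sum(image_vecs * caption_vecs, axis=1)   # row-wise cosine similarity
    return np.where(sims < threshold)[0]               # "the vibes are off" indices
```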
Yeah, so let’s see.
So let’s look at this. That’s terraforming.
That’s how that pops out there. What happens if I flip over to just keyword?
It doesn’t change at all.
Not really. Maybe I should try one. I should try one that doesn’t have words in it. Let’s do… Oh, go ahead.
One I always found interesting to see it define is emotion.
So try joy. Is that joy? No joy here. Yeah, any of them. Feature, happy, any of those. Here we go.
Let’s do hybrid, how is that possible?
Okay. It’s giving me… Okay, that’s a joyous elevator.
Let’s do happy.
Okay. Oh, we’ve got a big cluster of happy up here.
Oh, these seem quite happy.
Oh, there’s a little… What’s that sad heart on the side? Oh, a little scientist.
Yeah, that’s pretty happy.
Okay, it’s something. Got a cluster.
I’d say that’s pretty happy.
That’s a good hit. It seems to be… It’s probably likely the case that I didn’t search many happy things, I will say.
It is possible that the campaign I had in mind was less than happy.
Let’s see. Oh, we did get whatever this is.
Sure.
Whatever that is.
So yeah, I don’t know. It’s just going through a fever dream. It’s like, what’s going on? All right. Like some sci-fi character sort of stuff. Okay. Oh, I see.
It’s kind of got that mobile game ad sort of thing going on with it.
What’s over here?
Oh, that’s pretty.
Let’s see.
What if I… Can I… Oh, I don’t think I can… I wasn’t smart and I didn’t put something to change the amount of results. I want to see how far you can open it out.
Yeah, it pops through. Yeah, this is really fun. I really like doing this sort of stuff with like this kind of how I do mid-journey and stuff like that. So I just kind of like look through.
I’ve got these captions so I can feed in this stuff.
Maybe I don't want the whole vibe, so I always have like a style sort of thing. But maybe I copy over an image here and look through.
Just kind of put through you know, blend stuff together.
Maybe I put this together with, you know, our weird, totally not occult guys up here, you know, just see kind of what pops out. I think somebody wants to call it like surfing, just surfing the latent waves, man.
I think it’s pretty fun. But obviously, so imagine, you know, stepping back, you know, imagine that this is all documents.
So this is right now, I’ve just got kind of got images and stuff like that. But imagine that some of these are diagrams, you know, of technical specifications.
Some of them are, you know, things you have that are like news releases.
Some of them are videos. Some of them are, you know, you can have raw text. So maybe it’s just like documentation on, you know, a library or something like that.
And videos of people at conferences.
So there’s lots of interesting stuff you can do here, you know, with this, that’s outside of this little toy problem.
Okay, what is this big area?
Is it elevators?
Yes. So I wanted to do a space elevator.
And I kind of let Bing do what Bing was going to do and just let it sit. So I got lots of architectural weird elevators in this area, which I’m actually kind of okay with. This can be fun. You know, imagine you take this, and you combine, you know, this weird area right here with, I think we had like weird architectural stuff, or like cities.
You get something like, okay, so I’ve got, you know, Death Star Hunger Games.
You just start and go from there and figure out why that exists.
So yeah. From there, are there any areas of interest as far as stuff to talk about?
Otherwise I can just keep poking at dots.
Hey, Josh.
Hey, I was just curious what your tech stack is. I see the Qdrant for the vector database, but how did you put all this together?
Yeah, so it has quite a bit of stuff in it in the end.
So Qdrant is the vector database.
I went with it because it has the ability to do the multi-vector stuff.
You know, pgvector is something I’d also look at. And so I have Qdrant as the vector database.
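For what it’s worth, the Qdrant side of a multi-vector setup looks roughly like the sketch below with a recent Python client (multivector support landed in newer qdrant-client versions). The collection name and dimensions are made up for illustration; the demo’s actual schema may differ.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Hypothetical collection: one plain dense vector plus a multi-vector field
# scored with MaxSim (late interaction), which is Qdrant's "multi-vector stuff".
client.create_collection(
    collection_name="scifi_images",  # placeholder name
    vectors_config={
        "dense": models.VectorParams(
            size=2048,  # full single-vector embedding width
            distance=models.Distance.COSINE,
        ),
        "late_interaction": models.VectorParams(
            size=128,  # per-token vector width (assumed)
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        ),
    },
)
```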
This front end is just a little React front end.
I have a server behind it that’s kind of serving as the back end here, made with Bun and Hono, just kind of a little REST server serving as an API gateway.
And there’s a Redis cache in there.
So whenever I run these queries, I can do the search again.
And you know, it’s just cached.
The first time I go over this, it’s going to pull it from the database, but the next time I go over it, it’s cached, stuff like that. It also caches the actual embeddings and turns them into those Matryoshka embeddings. So when I go over here and do a search on, whatever this is.
And so it’s doing the search here with the full 2048 dimensions.
But whenever I do this re-ranking, it’s not using a GPU.
It’s just using the Hono server.
And it’s using those Matryoshka embeddings, so I can do it super quick on the CPU.
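The actual re-ranking lives in the TypeScript/Hono server, but the idea is simple enough to sketch in Python: cache the full query embedding in Redis, keep only a Matryoshka prefix, and re-rank candidates with plain dot products on the CPU. The cache key scheme, prefix size, and `embed_fn` below are assumptions, not the demo’s real code.

```python
import json
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def matryoshka(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the leading `dims` components and re-normalize. Matryoshka-trained
    models pack most of their signal into the prefix, so this stays useful."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

def cached_query_embedding(query: str, embed_fn) -> np.ndarray:
    """Embed a query once, then serve repeats from Redis.
    `embed_fn` is a placeholder for the call into the embedding service."""
    key = f"emb:{query}"  # hypothetical cache key scheme
    hit = r.get(key)
    if hit is not None:
        return np.asarray(json.loads(hit), dtype=np.float32)
    vec = embed_fn(query)  # full 2048-d vector
    r.set(key, json.dumps(vec.tolist()))
    return vec

def rerank(query_vec: np.ndarray, candidates: list[np.ndarray], dims: int = 256):
    """CPU-only re-rank: truncate everything to the same Matryoshka prefix
    and sort by dot product (equivalent to cosine after re-normalizing)."""
    q = matryoshka(query_vec, dims)
    scored = [(float(q @ matryoshka(c, dims)), i) for i, c in enumerate(candidates)]
    return sorted(scored, reverse=True)
```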
So that’s there.
Going further back, there’s a whole cluster of like Python related stuff.
None of this stuff works out of the box right now. With Jina V4, you can’t run Jina V4 on vLLM effectively yet. So I had to write some custom Python, a little FastAPI server. And I have that with a Celery worker that’s doing the queue worker sort of stuff, so it kind of takes a queue of tasks and returns the results.
So if I go here and say I want to increase my number of neighbors to 50, something that’s going to take a little bit of time, in the back end right now it’s setting up a task in the queue, the front end is polling, and they’re talking to one another. Whenever it finishes that out, it sends it here.
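A bare-bones sketch of that pattern: FastAPI enqueues a Celery job for the heavy neighbor computation, and the front end polls a status endpoint until it finishes. The broker URLs, route names, and compute function are all placeholders, not the demo’s actual API.

```python
from celery import Celery
from fastapi import FastAPI

# Hypothetical broker/backend URLs; the real deployment details will differ.
celery_app = Celery(
    "pipeline",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@celery_app.task
def compute_neighbors(point_id: str, k: int) -> list[str]:
    # Stand-in for the expensive part (embedding lookups, kNN, etc.).
    return [f"neighbor-{i}" for i in range(k)]

app = FastAPI()

@app.post("/neighbors/{point_id}")
def start_job(point_id: str, k: int = 50):
    """Enqueue the job and hand back a task id the front end can poll."""
    task = compute_neighbors.delay(point_id, k)
    return {"task_id": task.id}

@app.get("/neighbors/status/{task_id}")
def poll_job(task_id: str):
    """Polling endpoint: return the result once the worker finishes."""
    result = celery_app.AsyncResult(task_id)
    if result.ready():
        return {"status": "done", "neighbors": result.get()}
    return {"status": "pending"}
```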
And now if I hit that again, so say for instance, I refresh.
Well, it’s going to pull me the last one that I was on.
But you can see that one was much larger and it just went super fast too.
So that’s that.
Then I’ve got a few pipelines that are using some DAG tooling to do the mass computation of stuff.
And it’s just kind of various things in the back end from there on. Very cool.
Thank you for the talk. Yeah. What’s this?
This kind of follows my point from earlier that I’ll use AI to help me do something faster.
And then it just gets me sidetracked into, well, hey, what’s in this cluster?
You know what I mean?
And then I look up and I’ve lost two hours learning about why these things are similar to these things.
And it’s part of what makes it fun, though. AI is the journey; actual code is the destination.
Yeah, I definitely want to, a lot of times I’ll throw these sort of things away, but I’ll probably take this one and jam it with like a million images and see how interesting that gets and stuff like that.
But I think it’s going to be a fun little thing to play with.
It’ll be nice whenever this sort of thing is a lot more accessible, which I imagine it will be at some point. Yeah, it’s just interesting to see the visual representation of the layout. You know, I’d love to see actual connectors between, oh yeah, like on search terms.
Yeah. Yeah, actual connectors, maybe a 3D version, you know, easy stuff. Yeah, I’ll actually, I’ll pull this one up. This is kind of a different realm or the same realm-ish.
I can pull up my Neo4j.
This is a different sort of thing that I was doing in that area.
You could get this and then throw it over to Tony to put it in VR. Yeah, so I am going to do this in 3D too. I just didn’t want to try and bite that off yet, but you can kind of tell that there is dimensionality it’s trying to hit here. Yeah. And so that’s why it’s like 2.5D to a certain extent right now.
I’m going to connect this thing.
Oh, I have to be, okay. Well, that’s too much work. Yeah, or even organize it based on, okay, this is terrain, this is building, this is sky, and then have it present in mosaic form and just see what combinations are rated as similar to each other. Can you get a whole environment based off of what it puts next to each other?
Yeah.
And I mean, just imagine this sort of thing too. One of the other things I’m really interested in, obviously, talking about diffusion, is the ability to take something like this and, you know, point something at this image and just have it move. You know, just have it kind of move around. That’s not super accessible yet, but I think in five years it’s going to be doable on local hardware.
And approaching that time, it’s just going to get easier and easier.
But you’re starting to see people do like live diffusion of stuff. I know you go to a lot of those Midjourney sessions and, I mean, that’s one of the things they’re talking about, but it’s not just Midjourney. Oh yeah, no, they’re just making it more accessible for a price. Right. Yeah, I think stuff like that’s very, very interesting. Very cool. Very cool to play with. Obviously, if you are interested in, you know, creativity and sort of self-exploration, like tabletop sort of stuff, narrative sort of stuff, it’s very cool for that. You know, we always think about those little holograms on the table, like in Blade Runner and all that sort of stuff.
And I think to me, that seems like a given at this point is that there’s going to be media where you can kind of curate your own little thing, you know, that you might share with like your family or stuff like that, you know, little things like that, that could be fun, shared creation events.
Makes me kind of excited for that sort of stuff. I don’t have anything else. How long do you think it’ll be… you mentioned you had to throw some Python together to actually, you know, hit this thing. Do you think it’s going to wind up getting shoved somewhere that’s easy to work with soon? Oh, like this code base?
Well, not your code base necessarily, but you had to hop through some hoops to get, you know, the model to work.
You know, the paper just didn’t come with code you can run.
No. How long do you think it’ll be before it hits something like Hugging Face where you can just pull it and run? Oh, like the model itself? Oh, it’s going to be this year.
Yeah.
Okay.
I think you’re going to start seeing it. Number one, Jina V4: I think Jina V4 will get vLLM support within the next month or two. Okay. V3 is supported.
Yeah.
Yeah. That’ll get supported pretty quickly. It’ll be a subset, just because anything with multimodality is trickier, especially the embedding stuff.
So you’ll probably see vLLM, SGLang.
If you can do transformers, you’re good.
So you can do it with transformers. Okay. The only thing I had to go down to Torch for was some stuff with the re-ranking. Okay. Got it. No, I mean, this is super cool stuff. I feel smart again. Well, cool. I won’t hold us any longer.
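For anyone who wants to try the transformers route, loading looks roughly like the sketch below. The `encode_text` / `encode_image` helpers and the `task` argument come from the model’s custom remote code as I understand it, so treat those names as assumptions and check the current Hugging Face model card before relying on them.

```python
from PIL import Image
from transformers import AutoModel

# trust_remote_code pulls in the model's custom encoding helpers.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,
)

# Helper names and arguments below are assumptions based on the model card --
# verify against the card for the current interface.
text_vecs = model.encode_text(
    texts=["a derelict space elevator at dusk"],
    task="retrieval",
)
image_vecs = model.encode_image(
    images=[Image.open("space_elevator.png")],  # placeholder file
    task="retrieval",
)
```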
I guess last call: any comments, any discussions, musings? And this was great. Thank you, Josh. Very good. Thank you, folks. Thanks for listening to me yap, and I’ll see y’all at the next one.