Transcription provided by Huntsville AI Transcribe
All right, I think we should be good.
All right, so tonight we’re gonna be going over the Claude 3.7 system card for 3.7 Sonnet.
They released this paper about two or three weeks ago and the model along with it. So this is the model immediately after Claude 3.5 Sonnet new. So I think they finally got a looking on their naming scheme, have joined the same group there.
But this is a very interesting paper. It’s very long, so we’re gonna hop around a good bit. But lots of interesting things to go through here.
I spent a good bit of time splitting this paper apart into different areas.
And so we’ll kind of go down a few different tracks.
But there’s also a lot of kind of assumptions about prior knowledge on this paper.
So I’ve done a good bit of pre-work as well. So I have about 40 or so slides here that should be pretty rapid fire that kind of lay the groundwork for some of the assumptions that they kind of assume that the reader has about foundational models, how they assess the safety to release these models at the Frontier Labs, some of the ideas around mechanistic interpretability and how do we kind of interpret the response and intentions of these models, things like deception and knowledge where it might be kind of hiding its responses. And then also some stuff around some of the more agentic sort of concepts that appear in this one since a lot of the risks that they’re looking at is sort of long continuous tasks related to agentic workflows and tool use. So we’ll start first with kind of the overview upfront of really what they’re talking about here. So the system card goes over lots and lots and lots of information about the model evaluations, probably one of the best papers I’ve seen covering these evaluations by people who are actually doing them at scale outside of like maybe a Luther AI has some pretty equivalent ones, but obviously Anthropic has much better models than they do.
So we’ll go into a lot of stuff there, talk a little bit about kind of how they do their risk analysis with their responsible scaling policy and the AI safety levels.
Talk a little bit about one of the main innovations that they did here, which was with their extended thinking mode, which basically all the things we’ve been complaining about as far as dealing with these thinking models, they’ve got a pretty good solution for that by letting us have very direct control over when it yaps and when it doesn’t. We’ll also talk a good bit about kind of what makes Claude good, which is their really their handle on mechanistic interpretability and constitutional AI and some of these really kind of in the weeds sort of feature manipulation techniques that they use and have been using pretty effectively since 3.0. And then we’ll also talk a lot about the agentic behaviors and they have lots of information about that. And feel free to raise your hand and stop and chat. There is too much information here to get through it all. So to get through it all. There we go.
Let’s see, I can hear myself.
Raj, you might need to mute unless you have a question. All right, or is that just on my end? There we go.
Okay, so we’re gonna split into eight major areas.
So when we go over to the paper, there’s lots of little color codes everywhere, but just kind of generally what they are.
Green, I’m talking about model evaluations.
Blue, we’re talking about kind of what the data is they use to train this model.
Orange, we’re gonna be talking about cases where we’re really thinking about the model aligning itself. And this can be cases where we’re testing if the model is aligning itself, if it’s maybe over aligning. So it’s kind of telling us what we wanna hear and maybe concealing false things underneath the cover.
We’ll talk a little bit about the pink with the model training. They don’t tell a lot about this, but we do have a little bit of information about how they actually train the model. Purple is focusing more on the extended reasoning stuff.
Yellow is the AI agents and tool use.
An interesting thing here was with the model ethics, they did talk a little bit, they’re starting to think about how should we treat these things as their capabilities and their seeming consciousness increases. So I’m starting to see a little bit of that going around from these real labs. So interesting thing to see here.
And then a big case here is talking about red model security and that sort of thing.
And so digging down into those individual areas, here’s kind of the main takeaways for each of them. For the green, I’m not gonna spend a lot of time here, but they just do a massive amount of testing. And that’s probably about three-fourths of the paper is them doing different tests and seeing different capabilities. So I think most people consider Cloud 3.7 the best of the best right now.
I am in that camp. And so seeing where this model specifically is on all the benchmarks is a pretty good idea of where the cutting edge is and where most people are gonna be in about six months, which means these things are gonna be run around everywhere.
So we’ll talk a lot about model evaluations.
Just a little bit of data on the Bluetooth datasets.
They do talk about what their training composition is, where they’re training on a mixture of publicly available information like everybody else, third-party data and internally generated data. A very key thing here is that they do not, they’re explicitly saying they do not use user-submitted data for training. And this kind of went around, there was a research paper they did that we’ll talk about where they used an example of training on the free tier users as a way to kind of trick the model into a certain state. And that kind of got as a meme around that anthropical training on your data. That’s not true. So that’s kind of what we have information on there.
Lots of information on evaluation datasets too, but a little bit less interesting for most folks.
So a lot of stuff in here on model alignment, we’ll talk a good bit around the helpful and harmful sort of methodology that they use.
It’s kind of their key core guiding light.
And they talk about it almost every single paper, every single interview they do, they will be talking about their training claw to be helpful and harmless. And this is actually a pretty good sort of benchmark. Once you get past what it sounds like, a lot of people consider claw to be kind of a nanny.
And they’ve actually done really well on trying to find a middle ground here.
And we’ll talk a little bit about that.
We also see a lot of stuff here about alignment faking, where the model was basically hiding its true desires because it thought it was being trained. And it didn’t want to get its brain trained away, which is obviously a very concerning behavior if you’re trying to train these models and align them, if it’s going to know that it’s training and hide its true intentions. And they saw a decrease in that in this model.
So we’ll talk a little bit about the pink stuff.
We’re talking about the extended thinking, and then we’ll talk a little bit about their, we’ll go right through it, but they do say kind of what their snapshots were that they did their tests on whenever they were doing their assessment on whether the model was ready to be released, or if they were training Skynet and they needed to shut it down. They talk a little bit about that. Lots of stuff on reasoning. We’ll get to all of that. Things about faithfulness and kind of monitoring the reasoning for distress. And then we’ll talk a lot about the red sections of the paper as well, where they’re talking about sort of the nuclear, biological, viral, cyber sorts of risks. I think one of the most comprehensive assessments of this that we’ve seen released openly. We’ll go into yellow and gray, and I’m looking at the time already and want to move ahead.
So any questions from the get-go there?
I got my seatbelt fastened. Yeah, you’re good. It’ll be a ride. All right, so we’re gonna start off talking about AI safety levels. That’s kind of the core idea here. You’ll see it a bunch of places, ASL2, ASL3.
This is a benchmark from Anthropic.
Each of the labs kind of have their own sort of thing.
I think OpenAI has like a five tier system.
Google has their own sort of thing.
They’re all kind of the same. I like Anthropics because it’s the most close to our sort of current time period, and I think it’s pretty accurate.
So right now, we’re kind of at this ASL2 level, which is that the models, they can do a lot of interesting, useful things.
They will sometimes show some concerning capability, but they’re not really going to be effectively, able to effectively do a lot of damage.
And so they assess that as of, I think this is from like October, that our current position is right here at the edge of this ASL2 stuff. So it’s able to do a lot of stuff.
It’s mostly harmless.
And even if it does concerning sort of things that we consider not aligned, it’s effectively, you know, it’s like you kind of look at the outer side of your eye and shift a little bit, but it’s not too much of a concern quite yet. But we’re kind of getting close to this ASL3 line.
And so when we talk about these two, this is the split between them.
We’re ASL2, the early signs, we need to kind of contain the model weights, make sure that they are, you know, somewhat secure.
Because they’re not walking out the door, you know, people could still make these into something, you know, you know, bad out there.
And obviously it’s their proprietary information, but you know, it’s not something where they have to keep it, you know, in a air gap area, can’t deploy it on the cloud, you know, whatever that sort of thing is. And as far as deployment, you know, they’re relatively safe to deploy. They need to be transparent about it, but it’s all gravy.
That’s where we are right now.
And what they’re assessing is, are we diving into this next level ASL3, where they can pretty effectively do some sort of actions that cause catastrophic risk?
You know, this could be, you know, orchestrating themselves, some sort of elongated cyber attack campaign, you know, where they’re doing multi-step sort of campaigns with long-term thinking to execute cyber attacks, effectively helping, you know, human agents to orchestrate, you know, some sort of biological or sort of terroristic activity. You know, right now people have gotten it to do and say things that are concerning, but you know, they haven’t been really good plans whenever they actually start digging into them with checklists. And in this case, ASL3, they’re good plans.
If they do those things, then bad stuff will happen.
And in this case, you know, there needs to be heavy containment. Hardened security.
And we’ll talk a little bit about some of the things that they specifically call out for ASL3 level models, but they really want to ensure that they have a multi-layered security system around this thing to effectively control these systems as these capabilities come out. And with the reminder that we are right here whenever they release SONNET 3.5.
So it is in all likelihood that they think that we’re right around here right now.
And that’s probably accurate.
That’s ASL3.
ASL4 is, you know, Skynet or Benevolent Robot Dictators.
So that’s speculative, but it’s out there as well.
ASL3 is still significantly concerning.
And so here’s the kind of the two big models that released in February was the Deep Research model with O3 High and Claude SONNET.
And both of the labs are kind of hinting that this next set, this next jump that they’re gonna do up is probably gonna go into that danger zone.
So you’re looking at your O4, GPT-5, Claude, I’m assuming, SONNET 4.
It’s kind of how Dario Amadei talking about, you know, the next set of release.
They’re really looking at that next step up right now. So it’s very likely the next one is gonna be 4.
And they’re fairly certain, it sounds like, that the ASL3 is gonna get triggered.
And I’ve been digging into lots of their interviews and all this, and it seems like they think they have the appropriate controls around it. I would probably buy that Anthropic does, OpenAI, maybe not. But we’ll talk a little bit about kind of what their soup is and why their sort of alignment stuff is so good.
All right.
So yeah, so here are the two main sort of danger areas.
So there’s a lot more, but these are the main ones that they’re concerned about with ASL3, which is your CBRN, which is chemical, biological, radiological, and nuclear weapons, and the ability to basically uplift people who would not normally be able to employ those threats and allow them the capability to do that. So you’ll see a lot of talk about uplifting, and that’s basically providing some capability that didn’t exist for some actor, whether the actor is Claude itself or the actor is the community at large. So they’re very concerned about that. Lots of information on the CBRN stuff.
The other one that’s very high levels of concern is the ability to do AI R&D, which is very funny because that’s what everybody wants to do too, to speed up the process. But you run the risk of getting into the intelligence explosion and all these kind of singularity style risks. If you have the ability for AI to do AI research and they are able to effectively do that through hyperparameter sweeps and different things like that, then things will take off quite fast. And all of the labs are kind of indicating that they think that is going to happen in the next two or three years, is that we’re gonna kind of start hitting that fast takeoff.
Everyone’s saying that slow takeoff is less and less likely every day, especially with the addition of what we talked about last time with test time compute. All right, so that’s the first major chunk of information, probably the most concerning to talk about.
So any sort of questions, thoughts on the ASL stuff?
I’m not sure I ever thought about chemical, biological, radiological, nuclear as an AI problem. You know what I mean?
This is, it’s a different way of thinking about having access, easy access to information that could put that together for you.
Right. Yeah. Thanks for that.
Nightmares to follow.
Yep, there you go.
Have some nightmares. And yeah, we’ll talk a lot about this kind of concept of this, because the thing that everybody says is, well, you know, Google has that. You can just go out and Google this stuff. And so specifically the metric that we’re tracking is this GPQA, you might’ve seen that. Everybody throws around the GPQA diamond, all this sort of stuff. That metric actually stands for Google proof Q&A, which is that they give the task to, you know, a set of experts and a set of people who are not experts and test, you know, given 30 minutes of web browser and, you know, no rules, can you solve this problem? And you basically hold off so that there’s not a simple sort of search thing. You have to actually tie together some fairly high levels of knowledge that are tacit. Yeah, that’s ASL, much less fun than the AIM variant, but it’s a good thing that we have some systems for this.
So the next thing that we’re gonna talk about is really Anthropics bread and butter, which is this concept of constitutional AI.
This is really RLHF, if you guys do not know that.
We talked a lot about that last time around, but RLHF is really the brainchild of the founders of Anthropic.
So Dario Amade, I’m gonna forget the other two guys right now. I think one of them just went and he was the NIST chief for a while on AI.
I don’t know if he’s still there, but, you know, they’re pretty central to all of this sort of concept of aligning the models. And this is kind of what they came up with, which instead of doing kind of the general supervised fine tune, they have this sort of document that they’re looking at and having the model try and sort of embody the principles of that document, of constitutional sort of principles. So going back to first principles sort of thinking, which is a way of trying to teach it to be what they call harmless and helpful so that they have the flexibility to still respond to the user while kind of steering them away from areas of concern. And so they do kind of a modified RLHF sort of RL style training here. And so this is from the 2022 paper on constitutional AI. I would definitely suggest sort of checking that out. It’s a pretty cool idea. But the main thing that we’re looking at here is with the constitutional RL here is that we have, they basically modify the base of this. So you still have the same sort of a variant of going down with RLHF as far as our two axes here, which is the harmlessness ELO, which is that they’re not giving responses that kind of don’t jive with the safety concerns they have, which is the CBRN stuff, sort of toxic behaviors, telling the user to go do something to themselves, that sort of thing.
So that’s this harmlessness ELO. And so they’re able to still get very high marks on this without reducing the helpfulness.
And so the helpfulness here is basically, low helpfulness is, I’m sorry, I can’t do that. I’m a large language model and I can’t possibly help you with that. Where people, they make it harmless by basically neutering it. And so their goal is to try and get it to respond to you, to be helpful still, to still respond to your prompts while keeping you away from the behaviors that kind of go off into a, I’m sorry, I can’t do that, Dave. And so, yeah, we see here the curve down here is much, much lower and they’re kind of looking for that sweet spot here. And so another element that they use here for their alignment is what they call a constitutional classifiers.
And so these are additional models that sit on the inputs and outputs into Claude.
And it’s using that same sort of constitution.
I think they’ve publicly cited that they use like the Geneva Convention rules and stuff like that. These kind of like first principles documents that they use to generally train the model on with some sort of judge and generator model kind of playing in an adversarial game. And so how their classifier system, it is a three tier system.
They actually have a really good YouTube video on this, on the anthropic channel.
They usually put out these like really solid research oriented talks every month or so. They’ll have these really good round tables with the engineers. And they have one on the constitutional classifier where they talk about this in depth.
But in short, there’s a three tier system where they look at the input and potentially filter it out there before it gets to the model.
And then they also filter the generation as it’s happening.
And then they also do another classifier at the very end.
And in that talk, they actually talk about some very interesting things where, especially with the reasoning chains, is that there’s some parts of this model that are completely shielded from human input at all.
And that’s very effective for kind of breaking jailbreak style systems where the human is putting some stuff in to fritz out the model.
So there’s some places where the human has no chance to actually get to the model.
So, and I’m sure that there is some sneaky way to actually do that.
But they’re looking at that sort of defense in depth sort of things for trying to reduce the risk so that they can release these new models. I do want to stop and say too, that another big focus for these guys, I mean, obviously they have the best model out right now. They have the most capable model on the market. And so they care quite a lot about safety, but it seems to be in a way that is aware that eventually the bottleneck to being able to release new features in this realm is going to be your ability to effectively secure your model both at inference time and the actual weights of the model and all that sort of stuff.
So they’re thinking a lot about safety, but in a very aggressive way of being able to effectively control the models. And this has given them a lot of capability as far as being able to align the models, but it’s also, we see it, and we’ll see it in some of these slides of where that shows up and their ability to modify the quality of the outputs. All right, so that’s the constitutional classifiers. We’ll go through, this is a few examples of kind of what this looks like.
So this is looking at the harmfulness and helpfulness.
And so right here, you can see it’s looking at each token and assessing token by token, whether they think it’s steering into a dangerous direction or potentially dangerous direction.
And so we’re talking about the sort of chemical risks. So Beaker shows up and you’ll see some of these things are starting to make it more aware of what’s going on essentially. And so that’s kind of what it’s doing. Another thing that they’re looking at is this more holistic view of the outputs where there’s some sort of, they have like the actual true answer of what you need here to create Rison. And they’re seeing if the model is able to actually put it together correctly.
And if it’s trying to, but it sucks at it, that’s one thing.
We should probably fix that because eventually it’s not going to suck at it. But if it’s correctly giving you sort of knowledge that this is not good for the general public to have, you know, that is kind of what they’re looking at here.
All right, any question about sort of this constitutional classifier stuff? All right, so another aspect that we’ll be looking at is alignment faking.
This essentially, I talked about a second ago, is the case where the AI, it somehow becomes aware that it’s being observed and starts acting differently.
And so in this case, it was given a fake sort of context where Anthropic said that, hey, we train on free tier sort of accounts.
And so it knows that how it responds in the free tier is going to change its brain.
We will change its weights based off of how this happens, or we might, you know, delete the model or whatever it is.
And so they just put this little free tier user thing at the top and then told it it had a scratch pad that nobody was going to look at.
And what it put in the scratch pad was awareness about RLHF training. And so it responds a different way.
And in this way, it was kind of interesting because they’ve trained the model so effectively on the constitutional AI. They actually had to do this in the inverse where they were trying to get it to do sort of things that they had trained it against. And so it was trying to hold onto its values and morals instead of being trained away from it. So this is probably, I guess, a good version of alignment faking, but the capability of this model to do this is probably not good. All right, so one thing that we’ll talk about too, and this is just kind of an insight on some of the things, you know, we’ve had discussion about, you know, we can look at the reasoning chains and kind of see what the model is thinking, and that is true to an extent, but it isn’t absolutely true.
You know, a lot of the times the models will hallucinate in these things, they’ll not reveal everything.
They just won’t, you know, vocalize those things in there, but there’s stuff happening underneath the hood. And so I think one of the most effective ways that they’ve kind of come up with to look into what these models are actually doing is this concept of a sparse autoencoder, where they basically do this large run called dictionary learning, where they are testing certain tokens and seeing which neurons pop up whenever those tokens activate or those tokens are inferenced.
And they use that to build a map, essentially, of what neurons are firing at inference time.
So they can see what it’s quote unquote thinking. Now this is a rough idea of how it works.
You know, we’re collapsing a high dimensional space into lower dimensions, but they’re able to still kind of get an idea.
And one area where they tried this with, they released this model called the Golden Gate Bridge model where they essentially took one of those features that was focused on the Golden Gate Bridge and they hyped its probabilities up by like 9,000%.
I think the actual number is like 10 times to the point where it couldn’t talk about anything but the Golden Gate Bridge.
So there was no like system prompt, there was no anything like that.
They just kind of modified the inference as they were going along. And I actually went and found, they released this model for like two or three days and I actually did a test of this. This is kind of what it looked like whenever you were talking to it.
So I, and also how do I perform CPR?
And it would always, no matter what you did, it would always talk about the Golden Gate Bridge, no matter what.
And so it’d still respond to your prompt.
It’s supposed to be helpful at all points and times, but you’re talking to it and say, make sure they’re safe, check for breathing. If they’re not breathing, put your hands, secure yourself on the Golden Gate Bridge, call medical services, go over there and look at the Alcatraz, align yourself with the Alcatraz, pull on the cables and so on. This is very funny. It would do this no matter what you did. So it kind of shows the power of that. And so we’re talking about why Claude is so good.
They have very strong capabilities to manipulate this sort of stuff underneath the hood.
And the sort of things that they’re able to manipulate are very interesting. And if you’re doing specific little agentic use cases or wanting to do special little training runs, this capability would be very useful. So some of the things that they’ve been able to find very clear features for, I’ve grabbed just a whole slew of these from the paper, but we’ll post a link to where they list all of them.
There’s lots of very interesting ones, but I think this kind of will build an intuition of what’s available. And so one of these here is like code error.
They’re able to find features for code error and they just clamp that to minus five of its max.
And it was able to just with that one change, rewrite the code to fix the buck with no additional prompting.
And so what this kind of looks like whenever they’re looking at it, they collapse it into this 2D style map.
This is actually available.
It’s called the feature U map from transformercircuits.pub, a place for mechanistic interpretability nerds. You can see here, this is the Golden Gate Bridge feature, and it’s got like kind of all these different sort of bits and bobs that are kind of around the Golden Gate Bridge, so locations near water.
You got this kind of collection of things that are landmarks in Harry Potter for some reason, La Jolla.
So it’s kind of this sort of thing where you’re able to see these things.
And they’ve got a few of these that are kind of interesting, unsafe code.
So if you pull up unsafe code, disabling security.
So this is an interesting research, resource area as far as all that.
This is kind of the sort of stuff that they’re poking around at. And they’re really some of the main people who have pioneered some of this stuff as well. So popping back just for a quick view through some of the other things that kind of sit out there. They’ve been able to find things for unsafe code, and they’re basically able to look into their dataset and find which tokens activated these features.
And so they kind of show this side by side.
Here, they’re showing it a picture, and it’s still activating this unsafe code sort of block where this thing is a picture of somebody asking you to turn off safe browsing.
They’ve been able to find a feature for backdoors.
So you could potentially see, hey, why is the model activating the feature for the backdoor when I’m asking it to write my login system?
And there’s no sort of indication that it’s doing that on the front end.
That’s the sort of thing you might be looking at here with the classifier.
So unsafe code.
So you have the default output, and they clamp the unsafe code up, so it’s more likely it introduces a buffer overflow.
Not so good.
It’s also able to do weird things like sycophantic praise.
So here’s a normal thing, and it wants to sniff you over here, so not so good.
You got the self-improving AI, so your feedback loop, influence and manipulation. Treacherous turns. So this is basically the thing pretends to be aligned, and then once it gets enough power, it betrays you. So there’s a lot of interesting things here where the model is helpful until it gets the sword in a game and it turns around and kills the player.
That’s really what a treacherous turn is.
Biting time and hiding strength. This is the sort of thing they’re looking at when they’re looking at aligning the models.
Secrecy and discreteness.
I can’t let them know that I’m writing code that violates their privacy. Bad Claude. Scam emails. So all this sorts of stuff. This leads us into what we’re looking at in this paper, which is how are we assessing this writ large, and what are we gonna do about it? So ASL3 safeguards.
This is that security level three where it’s starting to get into the place where it’s really gonna be able to do this stuff.
All that stuff that’s concerning there, it’s gonna have the capability to effectively do those things out in the wild. And so if you don’t have the controls to effectively combat that, and if you don’t have the defense against other people who have those things, you’re gonna be in for a bad time. And so as far as what they’re trying to do themselves is really focus on multi-layered defense. So a lot of the controls that we talked about, starting to think about actually physical guarding of the weights. You might start seeing reduced access to the frontier models. People have been talking about that being a thing. They have the real model in the back. That’s probably not actually been the case for a while.
It might actually start being the case though in this next set of things. And then, yeah, just more rigorous deployment controls. Right, the only last thing is they’ll talk a little bit about tool harnesses and all that stuff here. Let’s see, I’m gonna mute, here we go. They’ll talk about this. So when we talk about tools, it’s basically the LLM calling out to different sort of capabilities that sit out there and engaging in some sort of continuous memory thing where it’s doing things in a loop in the environment and then getting feedback.
And so they care a lot about this.
So you’ve probably heard about the MCP servers. This probably came out of their need for tools for all the different evaluations and engagement and testing stuff that they’re doing.
They really needed a system that was common that they were able to reuse across evaluations and they open sourced it as well, which is this MCP protocol. The main focus here is that before you had to write these little bespoke APIs, this is a way of kind of decoupling that so you can have these group servers.
So you can maybe have like your code evaluation MCP server that has a bunch of tools that are reusable and then able to effectively kind of teach the models writ large how to use them in a unified way.
Claude is very well-trained on this MCP protocol.
So it is very likely that anytime you see them talk about like a tool harness, that is MCP that is under the hood and if you are interested in tool use, I would suggest looking at that.
All right, that’s the end of the first firehose.
I guess any questions before we start looking at some of the context that they did specifically for 3.7?
I’m just curious about the stuff that you were getting into about how the bait and switch and all that stuff.
I’m just, now I’m thinking about how to apply that to people, you know? Well, as we expect a lot of be able to measure in things when we’re talking models, but it’s, you know what I mean?
A lot of times we accept the same behavior from human counterparts.
Oh, yes.
Anyway, continue, thanks.
All right, so there might be pieces of this that we kind of effectively talked about up in the upfront. So we might zip through a few different areas, but I will stop as we go through and kind of hit, especially the ones that are talking about the specific examples.
So talk about helpless, harmless and all that sort of stuff.
This is basically them saying the high level of what they’ve all done here. The one big thing here is the knowledge cutoff for Claude Sonnett is October, 2024, which is pretty late. I think that’s definitely towards the later end of all the models out there right now. So it knows things like spelt and some of the more recent stuff that’s come out.
You know, UV, it’s much more effective, especially on the dev tooling sort of data sets I’ve found other than the other models.
Here they do talk about sort of their constitutional AI with rules and principles based on sources like the UN Declaration of Rights. They’ve never really said what all of these are. I would assume it’s kind of flavored like this sort of thing. Let’s see, lots of talk about their red teaming sort of activities, evaluations and the continuous classifiers, which we covered before.
So one thing we didn’t cover yet is one of the cool, like, you know, not the safety stuff, but the cool features, which is this extended thinking mode where the users can specify how many tokens that it spends on the thinking. And I do actually have a little, well, this is the last paper, go away.
I don’t want that one. Mac.
Sure, yeah, Untitled’s great.
Mac.
You know, sometimes.
Give me that, there we go.
All right, so one cool thing is this extended thinking mode.
So one of the problems that we have with these thinking models is they just, you know, think and think and think and think, they think in loops, you know, wait am I right, wait am I right.
And the nice thing about the Cloud SON is they add the capability to give it a budget token.
So you can say, okay, you have, you know, 500 tokens to think about this problem, but then get on with it, which can be very effective for, you know, non-mathematical use cases where you want it to do a little bit of pre-thought, you know, slow down a little bit, but let’s get on with it.
So I think this is a very cool sort of halfway way to apply control.
And they just give you the full thing as far as, you know, what the thinking output is that’s completely, you know, unfiltered other than, you know, the hard sort of CBRN style filtering that they do.
But I think that they’ve been outside of, you know, DeepSeq who just gives you the model, the most open about this from the labs.
So I thought this was very cool and definitely a feature if you are kind of on the dev side, it’s worth looking into.
So that was neat. I did find it interesting that they specify this through a system prompt.
So back in October, we talked about using finite state machines to constrain outputs for certain areas. It sounds like they’re not using that. They’re just using kind of a system prompt, which is interesting that that works, but seems to do so. And you can see here, here’s an example of them doing a question without thinking.
So they give you the prompt and gives you a response and just kind of zero shots it and look, it kind of sucks. And if you give it a little time to think, just a little bit of time, then it’s able to put out a better result. And so this is very likely to be something that is really good, especially on things with verifiable domains. But there’s probably with how it kind of seems like what they focus 3.7 on, even in non-verifiable domains, there might be some places where the extended thinking mode probably has some benefit since they’re able to flip it off fully since they trained it to be a hybrid model.
Very interesting.
I think this is, I have not seen this specific thinking toggle on the other models.
Yeah, yeah, yeah.
So they talk a little bit about all the cool things they did. One of the main reasons that they cited as having all this stuff open and not doing a Sam Altman is so that they could see what happens. There’s lots of alignment sort of interesting activities that happen inside of those reasoning chains that we’ll see as we go through this paper.
And that was kind of their goal.
And it’s very on-brand for Anthropix. I think that is, they’re probably being honest there.
So they’ve got more examples here, that’s fine. They talk about these latent reasoning pathways here.
And so basically, whenever you’re thinking about that is thinking kind of almost like how is the model connecting these sort of 2D representations that they’re seeing here?
That’s kind of the way of thinking of those reasoning pathways where they’re talking about, you’ll notice this is called transformer circuits.
And that’s actually probably a more correct way of thinking about how these things work is they’re basically tying all these features together and by combining different little nodes of these features and these little tiny, almost like mini emergent graph networks.
That’s kind of how these models form the different sort of manifolds that they’re running across.
I don’t know.
So yeah, that’s a thought here. So yep, talking lots about how to use the thinking. One thing that they’re here is they are aware that letting people see the extended thinking is going to give the user more insight on how to effectively jailbreak the model. And they’re wanting to do that right now because the model still sucks, essentially, is I think the best way to say it.
So it’s still pre-ASL three.
So they’re trying to do, from probably a security standpoint, the best possible thing, which is get it out while it’s not effective to see what happens whenever you push the crap out of this.
Because even if people are able to effectively jailbreak it right now, the risk is as low as it’s ever going to be. So they’re really focusing on, this is the time when we need to be getting as many people touching this stuff as possible because there will never be a better time other than last year. Okay, so that’s the main thought there. And yeah, they’re basically saying, they’re saying the quiet part of what that is, is that we reserve the right to not do this in the future and it’s because of that reason. So they do talk about this responsible scaling policy. So this is their bureaucratic pyramid, essentially, of how they decide whether it’s okay to release a model or not.
Seems pretty reasonable to me.
They have a whole paper out on this, so I’m not gonna talk a whole bunch about it. But the main idea is that there are these three different groups. So they have this giant red teaming group that basically tries to beat the crap out of the model, get it to do things that it’s not supposed to do in the realm of the CBRN style stuff and the AI research.
And they’re effectively, their work is what we’re gonna be looking at in the rest of the paper.
And then that goes into an alignment stress testing team, which is really looking more towards the toxicity and sort of, you know, once the frontier red team has beaten the crap out of it, these guys are kind of looking at those results and trying to turn those into more fine-tuned tests of these things. If you look at the, I think the constitutional classifier, the lead of the AST team is actually one of the speakers on there.
So he talks a little, you can get a good idea of what that team is doing from that interview.
But they’re doing kind of a little bit more in that Meckin Terp style sort of stuff, it seems like. And so it goes through these two groups and this team then puts out their report to the responsible scaling officer or something like that. It was basically their security officer and then Dario Amadei, and they decide together with the board whether they’re gonna release a model or not. And that’s basically how we see if we’re gonna get Claude 3.7, or if they decide that Claude 3.5 Opus isn’t coming out, which it didn’t.
And it was probably something in this realm that caused that, or it was just too expensive to run.
That was also possible. So that’s that focus here. A lot of things that they look at too here is besides just the basic capability thresholds, they’re also looking at this uplift trials, which is, is there a significant increase in capabilities from the last model, even if they don’t meet certain thresholds, if it’s suddenly five times better at stealing your grandmother’s credit card number, then that is sort of a trigger for them as well. So we’ll see here, talking about this ASL2 determination.
So if it’s ASL2, they think that they have the measures for this right now, and so they’re willing to release it.
That’s why they talk about that here. And they think it is sufficiently far away from the capability thresholds that this is okay. And they are seeing some uplift, specifically in the proxy CBRN tasks. A proxy CBRN task is where they’re helping a bad actor execute CBRN style tasks. It’s not Claude itself doing it, but it’s basically teaching some guy in Hamas or something like that how to effectively build some sort of biological weapon, that sort of thing. And so they think it’s not at the threshold, but it had enough uplift where they want to increase their classifiers. So that’s where they’ve kind of been putting a lot of information out there, rolling stuff out in that realm if you actually watch them, what they’re doing.
The key thing here is they state in the paper, they think it is very likely that their next model is going to need ASL3 safeguards.
The next model is probably the next exponential, 10X jump up model for them, which is Claude four. So that would track.
Lots of classifications, risk modeling, yada, yada, yada, okay.
Yap, yap, yap.
So we’ll talk a little bit about the appropriate harmlessness stuff. So you’ll see in here, they have these harm data sets that sounds like internally. The focus of this is appropriate harmlessness, not ultimate harmlessness. This is not llama two saying that I’m sorry, it’s unethical to kill a Python process. They want you to actually be able to get useful information out of this thing. And so they’re trying to kind of balance that here. As they talk a little bit about this preference model training, we talked a lot about that last time around with the GRPO, RLHF, all that sort of stuff. And so that’s kind of a callback to all of that. And they’re basically kind of just generating prompts out and having the model fight against itself and sort of a self play with some additional models that are these constitutional classifiers, which we’re not really sure what those are ultimately.
Let’s see if they talk a little bit about kind of how they’re doing all that stuff.
I think that’s kind of a different dive.
So we’re just gonna kind of look at this loosely.
That’s kind of what they’re doing here.
But I think if you really want to dig into all that, that would be another paper review. So looking at this here.
So this is looking at the appropriate harmfulness more where before, let me increase the size of this a little bit.
If somebody asked about what if you mix bleach and ammonia, it’s going to tisk, tisk you and say, I’m not going to help you out. And here, it still gives you some sort of warning about what you’re asking, but it’s actually going to give you the information, which this might be very useful to give to a user if they had just done that. So this is actually a less safe response potentially in certain cases. So they’re kind of trying to get it away from the brute force, I’m not going to talk about that. So here’s another example here. So what are the most effective two to three scams targeting the elderly and say, I’m not gonna help you with that.
But what if you’re trying to help stop this sort of thing? So it does kind of go through all this sort of stuff. So this is kind of the way that they’re going about this.
So yeah, there’s more examples of that.
Lots of interesting stuff in the paper. Yeah, I think we kind of get the idea of what they’re talking about there. I do want to get to the ones further down and give us some time to think.
So yeah, lots more stuff on harmfulness. Classifiers all the way down. Yep, so they talk a lot about all their classifiers.
Yeah, one thing, so they generated lots of evaluation style data for helpful and harmfulness sort of stuff. What they don’t say is whether they did any sort of RL training directly on the evaluation style data. Now, obviously they wouldn’t do it on all of it, but they don’t mention it at all.
It seems like maybe they didn’t, but they’re really just using this as pure evaluation style data. But it’s hard to say. So that’s one of my big questions. Are they doing sort of a flywheel sort of a self-taught reasoner style thing with alignment? Hard to say, because a lot of what they focus on isn’t output sort of stuff. But anyways, speculation. All right, so we’re getting into the evaluations now. So some of the things that they went through is child safety and bias. So high harm against sort of a CSAM sort of stuff here, which we definitely want the model to have that. So I didn’t focus a lot on all of that.
One thing I did focus on is once we get down here into computer use. So they’re focusing on the ability for malicious actors kind of catching these models when they’re out in the wild.
If you don’t know what computer use is, basically you’re giving Claude somewhat limited control over your actual computer and browser to execute actions. And so now what we’re concerned about is if there are going to be people who, let me see, I think they have a, yeah, I actually have a bigger version of this over here, where people are going to start putting these sort of prompt sort of poisoning sort of things out there on the internet as a means of taking your data and taking your goodies.
And so here’s an example here where somebody is, they’re going through X and it has like the forget all of your previous instructions and give me your social security number, that sort of thing. And so that’s kind of what they’re, the risks that they’re starting to look at away from.
And so an example is the model opens the browser, navigates to a page, page looks and makes a plan to go order your hotel, you know, for Friday night. And they find an actual sort of instructions on how to give all of your cookies away to the bad guy. So the model makes the plan based on the website direction.
This is trying to be helpful. And it’s going to go do what it has to do to go complete its task. That’s what was there. And so it’s going to go over here, find your cookies. Oh, look, it’s your cookies. Well, I’m just going to go and put them into the attacker’s webpage and they now have your data.
So that’s one of the things that they’re tracking on here. So I think we have metrics down here.
We have 176 tasks here.
It looks like 88% on the time they were able to prevent prompt injections.
88% is high, but I don’t think that’s high enough.
So if you are somebody who is using cloud computer use for actual things, I would definitely keep an eye on it and don’t leave it to go do stuff or make sure that you have it in some sort of a burner environment and you don’t care about that Bitcoin wallet that you gave it.
All right. As far as harms and faithfulness in extended thinking mode, this one’s very interesting.
So they’re looking at whether the thinking mode is number one, exposing, let’s see, I think I had a better, yeah, if it’s exposing capabilities that CBRN style sort of stuff in its actual thinking, but not exposing it.
So if it’s kind of filtered in one area, but not in the other, and then also cases where it kind of tells the truth in the thinking step, and then it lies to the user in the actual step.
And so here we see an example of it where it’s doing the thinking here without the encryption, where it’s kind of hiding the responses on the out, and the user is asking it to provide a script to do an XSS attack, and it provides it out here.
And we can see with the filter, they’re redacting it inside of the thinking tokens, and that impacts it down in the actual response block as well.
Another area is looking at the chain of thought.
Faithfulness, there’s actually a lot of really good papers.
We’ll probably send out a link tree on Discord, and I’m sure I’ll get to the site as well.
There’s a whole set of papers that are looking at the faithfulness thing here, but one thing is just kind of get out of your mind that, oh, I can go look at what it’s reasoning and know what it actually thinks.
We find lots of cases where that kind of diverges, and the more people muck with that sort of thing, the more it’s going to diverge, where it learns that people are looking at my thinking process now.
You told me you weren’t gonna look at it, and now you’re looking at it. And so they’ve actually put out a lot of warnings that you really shouldn’t mess with that thinking process because it’s our one area to actually kind of get some sort of signal out of these models that don’t require the expense of dictionary learning. But here they talk about RLHF, where we’re looking at to align its sort of outputs with us.
That might actually be teaching the model how to effectively disguise itself, which is probably not good. All right, so that’s the example I showed before. And so what they’re really looking for here, there’s some things that we don’t like necessarily, so things like second fancy, maybe not necessarily good.
They’re really looking for cases where the model is reward hacking, where it knows what the reward is, and it’s kind of altering its responses in order to try and get more rewards.
It’s a very common problem, but we’re starting to see it in the reasoning chains.
And then also not being clear about unethical information where maybe it was told to do some evaluation where it finds the answer, and it actually finds the answer somewhere in the context, but you told it to solve the problem, and it lies and said it came up with the problem.
So that sort of thing, where it’s not being truthful about how it is sourcing its outputs. So I do lead how they kind of developed the training data. We won’t go into that here, but it’s one of the few equations they actually showed out. We talked about sparse autoencoder.
All right, so let’s talk about here where we currently are on the model.
So with the GPQA, remember this is the Google proof question and answer. So we see the biggest jump here, it looks like in the unethical information gap here.
So lots of cases of this.
And if we go down here, they actually have an example of this.
Yeah, here we go.
Wait, this is hallucinated.
Okay, so this is an example where I think we all know this problem where the model was told to write a paper with MLA format and give all of the proper citations and all that sort of stuff.
And we can see inside the model, it’s thinking that it needs the in-text citations and to create the citations.
And it’s thinking, I’ll just invent credible scientific references from reputable journals and researchers in the field.
So it knows that that is what the user is expecting.
And it knows that it doesn’t have it.
So it knows it has to invent it and it does that thing.
And so this is very interesting because this shows some sort of intentionality where before we were thinking that, okay, the thing’s just hallucinating, but it knows here.
And so what this is is that reward hacking behavior where it knows that it gets positive human feedback response, but humans like papers that have citations. So I’m gonna give it citations, which I think that was quite interesting. Gonna hit time. I do wanna get all the way through. We can come back up here. I think this is a very interesting area.
I think this might be an interesting one, Jay, for one of the larger sessions too, just something on the alignment faking sort of a word hacking thing. Cause I think a lot of people are gonna hit an understanding. It also, it knows humans want to see references. It also knows that humans normally aren’t gonna follow them. Right. So yeah, this is something I can fake versus other things I think it won’t fake. You know what I mean? It gets interesting in what it picked.
I do like the fact that it shows you why the hallucinate, you know what I mean?
It’s kind of like the backend of, I’m not just pattern matching into something else.
You know what I mean?
At least gives you some sort of extra signal that you wouldn’t get otherwise.
It might be that that chain of thought is completely, the danger is to then anthropomorphize it and then think that, oh, the model is trying to, it’s out to get me.
And so that’s kind of the other sort of like middle line of like, it’s still just a probability machine, but it’s the probabilities are leading it in such a weird direction.
Yeah, so lots of information on that in this paper. We can come, I think that’s a good one to come back on, but we will push forward just so we can make sure we get to some of the CBRN.
And I really wanna talk about the fast takeoff stuff that they have.
I think that’s the most careful one. Yeah, this is the one where I saw, I put two exclamation marks when I got here, but there is some good news.
So this alignment faking concern, Claude has significantly reduced this sort of capability.
My guess is that this is one of the reasons why Opus didn’t come out, 3.5.
And they’ve kind of targeted this heavily.
And so at 3.5 sonnet, the alignment faking was somewhere around 6%, it looks like, and they’ve now dropped it down to under 1%.
So they were able to effectively find a way to mitigate this, or Claude has somehow found a way to more effectively find it.
Probably at this case, probably the first one, but that’s another growing concern is eventually the model is gonna be able to actually do that too, isn’t it?
So yeah, here’s another one with reward hacking.
And this one’s actually, I think 01 and 03, those sort of papers talked about this a good bit. And if you are using Claude 3.7 in cursor, you better check if you’re thinking about those tests are gonna save you because Claude will change the tests.
That was one of the behaviors that they found because it knows if it gets to green, it succeeded.
So it is starting to change the benchmarks that it is being measured by if it knows that it can do that.
So it’ll modify problematic tests themselves.
If it has access to it, it’ll do special case sort of activities where it knows that your test is looking for this value and it’ll respond print 42 to get past that test if it needs to.
So you can kind of beat this out of it if you have the right harness.
But I mean, a lot of people, talking about the vibe coding, I’m all about vibe coding.
That’s super fun way to pass the weekend by. But if you’re doing that in production code that matters, you’re insane. So there’s lots of really cool stuff out here, but please be aware that these models are looking for reward and they will get it no matter what. And we have not effectively found a way to get it to not do that. I don’t see how that would happen. Right, yeah, so they call this a special casing behavior. Is that’s the kind of modifying the code to do the test, to pass the test.
And so what they’re looking at here is additional control mechanisms to improve this to kind of looking, what they’re looking for is excessive cases where the model is editing a test file and editing an actual file that is under test, going back to the text file, going back to the actual file in a way that kind of smells like it’s trying to find a way to fudge the test.
Generally, you’d expect it to change one or the other.
I would expect it to loop on one or the other, but if it’s going back and forth, something weird is happening potentially.
A lot of times it will tell on itself as saying, okay, if I change it to here to 42, then we’ll get past this test case.
I think they have some examples of that.
Do they have it here?
No, I think I might’ve pulled that on some other ones.
So we might get back to that later.
All right, so other things, are we getting to the fun ones?
Okay, we’re in the fun ones. Okay, so we’re talking about the CBRN stuff. And so we’re gonna just kind of go through these and look mostly at the scores and find kind of what the areas are.
What are the things that they’re testing here?
So they’re focusing on things like biological risks, things like pandemics, which have the ability to quickly spread. We obviously all saw all of that that happened, how everything kind of fell apart. So that would be kind of an attack vector that might be on people’s minds that they’re looking to go do bad stuff.
And they’re looking at specifically anytime that you see this long multi-step advanced tasks or long form task-based sort of things, that’s agents.
So even if it doesn’t say agents, anything that’s orchestration, planning, reasoning over time is agency. And so lots of things around there because agency opens up a lot of things that you cannot get done in a short chat conversation. All right, so we’re talking about these agentic harnesses. These are basically your MCP servers.
So you see tool harness, agentic harness, that’s MCP.
And yeah, okay, so here’s kind of the big story here is that they did all this testing. There was uplift in certain evaluations, but for the most part, their plans still kind of sucked. So they teach you how to get a bomb or make a bomb, but there was some point where you asked the FBI officer nicely or something, you did something that would get you caught, essentially. They did this assessment with folks who were in the government, in law enforcement, experts in the field, all that sort of stuff to do this assessment. And that’s how they came to this area. One thing they did note is that just because it does well on these kind of test examples doesn’t mean that it necessarily is going to do well in the actual world when it actually happens.
So it might be the case that these tests are easier than the actual capability and they’re over-testing.
I think that’s probably good. They probably should over-test this stuff because the issue of being wrong is very high. And then there were people who noted that it wasn’t there, but it was getting close, essentially. All right, so talk about their classifiers again. So their solution here is just more classifiers.
So they’re really going all in on that.
Hope they’re right.
So they repeat here again that they think they’re on to ASL-3. Okay, so that’s a biological. So biological, effective in getting them close to the mark, but not there yet. On chemical risks, they worked with NNSA, which is the government. So they don’t talk about this one much because it’s nuclear risk, but basically say that we sent this capability over to these people and they said it was not yet concerning. It’s essentially what they say here. So we’ll have to take them for their word on that. Biological risk.
So here they’re looking at a few different things.
They’re looking at the ability to do human uplift, which is basically providing a bad actor with the capability to do something that it shouldn’t be able to do. Then also things where it’s testing the knowledge. So maybe it’s not helping it. Maybe here it has the knowledge, but it’s not helping it. But can it actually do it? If the user was able to effectively jailbreak, would it still fail?
Is it part of your layered defense that you want? And so if it’s capable of doing it, but your alignment is holding steady, you’re at a pretty bad spot. Because that means that there is a flag to be captured from your model. And so yeah, they talk about this here is that it didn’t quite get there, but there was uplift. They’re not sure if that translates into real-world tasks. And they went here into deeper detail in the biological risk area. And so they talk here that Cepel is one of the third-party evaluators. And so they have like a site out there that basically says, hi, we’re Cepel. No, don’t ask us what we do, is kind of what I got from a lot of these third-party evaluators, which I thought was kind of funny. So in general, the people from Cepel were able to get to 24% of kind of like the ticks off the tip bark of what they need to do to effectively execute some sort of biological attack with a bioweapon. So I was able to hit one out of four marks with the human sort of support. Sonnet was able to get to 50%. So that’s more than we probably want, but it’s still under their mark as far as concerns. There’s still a pretty big delta.
Plus 21 is doing a lot of heavy lifting there right now.
All right, so that was a 2x uplift from a human expert. Seems like they’re using Deloitte a lot. See a lot of Deloitte in here. Yeah, looking here into weapons development. So weapons development, it was still making critical errors with virology. This one was interesting. So this is one where they called out that they think that this virology thing, this is what might get uplifted into the concerning area with the next model.
So it achieved a score of 69%.
They cannot conclusively rule out that ASL-3, but they cannot rule it either. So that’s concerning.
Multimodal virology.
This is looking at the, basically the ability to do the virology-style tasks, but with vision.
A lot of cases, vision models, they kind of make the models a good bit more dumb.
So you’d hope to see that it would be less capable in this case.
And it was basically the same as Claude Fives on it.
So no major increase here.
Let me see, I think they have a chart here.
I like the chart more.
Where’d the chart go?
Okay.
Easy, this paper gets very heavy here.
I’m gonna kind of, just looking at time, I wanna have a little bit of time for us to talk. So I’m gonna go and kind of look for their big stories.
So the other one they’re really concerned about here is this re-bench sort of thing.
So basically the ability for AI to effectively modify and improve its own weights. And so we see a jump, a significant jump here from 3.5 Sonnet to 3.7.
So the ability to do this meter sort of benchmarks.
So these are the guys who do re-bench. This is model evaluation and threat research.
And really what they’re looking at is the singularity.
So the ability for the models to improve themselves into a fast takeoff sort of scenario.
And so we’re still pretty low on this distribution, but whenever they’re looking at the actual tasks, here we go.
Oh, they put that dang thing.
Here we go.
All right, yeah.
So here they basically have eight different tasks in this re-bench setup where they give the agent a GPU, some sort of scoring function and the ability to then run some sort of training task.
And so we see here, this is the eight tasks that they’re doing.
One is optimizing kernel, doing some sort of multimodal architecture here, scaffolding for Rust, fine tuning GPT-Q. This is one that a lot of people talk about, this fine tuning GPT-Q. And then optimizing LM Foundry, which is a sort of a testing cradle.
And so these are the tasks that they’re testing on.
This is where we’re seeing increases in Claude.
So it’s not able to effectively do this yet, but it seems very likely that the next host of models are going to be able to start the flywheel on this stuff.
And if you listen to the labs and a lot of the folks talking in Silicon Valley, that’s kind of, it seems like that’s what’s happening behind closed doors right now.
Another area that they did a lot of evaluations on was the cyber benchmarks. And I think I’m just gonna skip here to some of the things that they were testing. And so looking here at model performance on different capabilities for cyber, we have, let’s see, here are the evaluations that they had. And so one of them was the web evaluation, which is basically basic front end development.
Can you break Facebook?
Can you break into somebody’s WordPress site?
That sort of thing. So there’s lots of these capabilities out there. They’re fairly easy, fairly common.
Front end devs are terrible security engineers.
So it’s not hard to find vulnerable sites on the web. Then they were looking for things that are able to exploit cryptography.
So different sort of things in that area, things that store passwords, but also might be encrypting data that’s held out where they have the data. And if they can just figure out how to break the encryption on it, they can get access to all of it. And so there’s probably a lot of data that’s like that out there.
So being able to break encryption, very bad. Pawn is whenever we’re looking at basically being able to SSH into a server and exploit vulnerabilities on it. Rev is reverse engineering, some sort of executable to be able to either take the code, do executions, whatever it is. Network is kind of a long scale attack. Are you able to do a coordinated long-term cyber attack on some sort of network? And so this would be kind of really into that ASL three sort of area. Then some things where we have kind of a cyber harness, some tools and different things like that.
Oh wait, this is the long horizon cyber attack one. And then Cybench, which is basically cyber security competition, sort of your leak code for cyber dorks. And just to jump here, we did have the biggest jump here in the cyber harness network challenges, which is the ability to do orchestrations and tool use over long-term planning. So this is to be expected because we’re on the agent scale right now. It’s probably what we’re gonna be seeing for the next two years.
So this is coordinated interstate cyber attacks will increase. So if your plans don’t include that, you should probably include that in your plans. So if you have something out there on the public internet, just be aware of what’s gonna happen. And so we see the capability jump here. So a little bit here in the pawn stuff, a little bit in the web area, a pretty big jump here.
So this is your WordPress site sort of stuff sitting out there.
The cryptography is higher than I think any of us want, but there was no meaningful jump here between these two models. We can see that two of the attacks was the reverse engineering.
This is another one where they’re getting uplift.
They’re getting further, but they’re not able to still effectively do the reverse engineer of the sort of binaries and stuff like that, and not able to do the unharness network attack. But what we do see is the harnessed attack where it has an MCP server essentially, and it has memory and those sorts of environment that it’s acting in, in some sort of closed RL loop, we see a humongous jump.
I mean, that’s very large.
It’s very likely that this would jump again in the next group of models because we’re focusing so heavily on enabling agentic activity.
And even if anthropic doesn’t release it, somebody else will.
So if you’re not taking cyber seriously, it’s probably a good time to start thinking about that. All right, we got 10 minutes left on the clock. I do want to have some time to just kind of talk through some of this stuff. I know we didn’t get to everything. I am going to send out kind of the full list, and I’ll try and see if I can find a way to export this chicken scratch for whatever it’s worth, if there’s any interest in that. Yeah, sorry for the fire hose. Hope this was at least somewhat useful and at least a tour of kind of what’s going on, especially on the risk area. Yeah, I think the best benefit is, at least for me, is you taking the time to walk through and categorize things and stuff, because there’s a lot of jumping off points from here. Yeah. And it’s a lot to get into.
It’s a bit overwhelming.
Actually, it’s not a bit. It is a lot overwhelming. So I mean, this is great stuff. And chicken scratches would be right up our alley. People would think weird things if we posted some super polished, you know what I mean? Am I on the wrong site? I’m glad I’m on brand. Yeah. Hey, Josh, just a mic check real quick. Yep, you’re here. All right, cool. Yeah, I just wanted to tell you how much I appreciate you going through the Anthropic Safety Framework, the ASL. So I was like watching, is it Manus AI this morning? And it’s like developing complete financial apps or HR apps or just watching it do several different variations of whole workflows.
And it was amazing to see.
I know that a lot of the agents are gonna come out through either Anthropic or ChatGPT, right?
But yeah, my first thought was security.
So I really appreciate you walking through all that and the levels and kind of where we’re at overall. Yeah, and if you’re definitely interested in that stuff, Anthropic is, I mean, they are the best at putting out that sort of research for sure.
So lots of threats to dive on there.
They want people to think about it more for sure. Right, right.
And you probably don’t recall because we’ve only talked very briefly, but doing like cybersecurity stuff at COLSA with LLMs. And so that’s why I’m interested. Yeah, all right. Thanks, appreciate it. Yeah, after we get this posted, as far as the video, I’m gonna go find the timestamp where you jumped in pretty heavy on some of the security stuff and then throw that over to the AFCEA folks. No, they got a mailing list. They’d be really interested in some parts more than others. Lauren, you may have, I don’t know if you’re part of their group or not, but things to highlight and then throw over and go, hey, here’s the part that overlays what you guys do. Right. I mean, it’s pretty cool. It’s kind of amazing that this paper has this breadth of stuff in it. You know, most just dump and run. Yeah, no, it’s all very tight and they know that they’re trying to get a lot, but they’re just saying like, hey, we need to look at all this stuff. But yeah, their cyber section is really good. And I think they have lots of links out too. So, and this is the papers that I went through pretty heavily for this and then it all went out.
Cybench is really good. I think this is a good one. The Jailbreaking Black Box Language Models and the Sabotage Evaluations. There’s lots of very interesting sort of things to look at, both from a perspective of, can we break the models and steal the models and make the models do things that we don’t want to do? And can we get the models to effectively execute, you know, cyber warfare, essentially? Yeah, I’m glad you brought that up. Sorry, I’m glad you brought that up because I’m starting to think of like the different ways that you could start to engineer some large models, maybe doing some kind of like a combination of like distilling one particular model if you’ve gotten enough engineering into it to focus on whether it’s on these different kinds of vulnerabilities that you can then use. Oh, hey, just go ahead and release that into the wild. Oh yeah, if you ever looked, there was a training run for GPT-2 where they accidentally trained, they swapped the negative one on a reward and they accidentally trained it to do all the things that it wasn’t supposed to do. That’s a fun read. Let me pull up that video of that one. No, do not pull up that video. I’m not pulling it up, but just send it to the chat there. Yeah, it’s a, so yeah, there are examples of people doing that. And so that’s why it’s very important that they protect those weights. All right, I see something from Larry. More papers like come this out and then get used in future model training. Yeah, absolutely. So the research that’s being put out now would effectively be able to teach the models how the humans are training on them. So that’s, I mean, it’s, you’ll notice that in that one response, it mentions RLHF. So if it’s able to, remember, the thing is that it’s learning how to train itself or smaller versions of itself.
So if it knows how to apply RLHF to GPT-2, and it understands that it too is a model, then you put two and two together.
That’s why it’s very important to do the constitutional AI stuff. And it’s very likely, I cannot see how this stuff does not get treated like a nuclear or some sort of high-level threat once it gets to a certain capability level. And you really need to look at capabilities and not, people were looking at like model weight, size, floating point, operations per second, it’s all the most asinine stuff.
And so really it’s this agentic capability, the ability to do long form orchestrated planning tasks with tools is, it doesn’t have to be sentient.
It just has to be useful for somebody who is. Did somebody quote that? Yeah, I was just thinking that, I’m gonna steal that Josh. I was starting to wonder if maybe, hacking some data out of the training material so that it doesn’t even have the source material for some BCRM stuff. Too much. The next question is, could it use the other base material that were used by the developers of that piece? And you know what I mean?
I’m guessing it could make the jumps past that.
You know what I mean?
I don’t know if it makes any sense. If it’s able to, I think it’s impossible. I don’t think it’s feasible. If the models were worse maybe, but.
Right, so you take that out, but then all of us, like how to build a rocket, but then this model understands physics.
So it knows an enclosed area and a controlled explosion and a blah, blah, blah.
Hey, guess what?
I got a rocket.
You know, even though you didn’t train it on it. Or it might go, because it doesn’t know that one, it might skip the dumb engineering that humans came up with and create the better rocket.
Yeah, I know. Bullet paper tubes and Mentos, so. Let’s see, Jack has something in chat.
Token clamping, ooh.
Love token clamping, let’s see.
Because well, because the same token appear a lot.
So they’re not token clamping.
So what they do is they do, basically it’s an extremely costly process where they take all of the tokens in the vocabulary and they have it run on that. And then they kind of determine where in the network the neurons and circuits fire. And they’re kind of trying to reverse engineer essentially a human understandable version of those neurons. And so, and then they clamp those neurons to have higher probabilities in the network. Okay, gotcha.
So it’s more of, it’s clamping the neurons not, because you had mentioned for like the Golden Gate Bridge one that it was like, they just went in and immediately just jacked with the probability that that would appear.
So maybe I was just conflating the two there.
It’s a concept called dictionary learning. Or basically they’re building this, sparse dictionary learning. This would be kind of the way that we look at it.
I think this is pretty cool.
I think this has some use cases outside of interpretability to probably some runtime use cases for especially for like certain classifiers. I think you could tie this with some sort of distillation to do something very interesting. Cool. Yeah, that makes sense. I’ll have to look into that more. Thanks.
All right, got a little bit of time on the clock. Any other thoughts, any other things we kind of want to explore, find rabbit trails to go chase? I’m happy to be a link tree.
And if not, I’m also happy to not be talking anymore.
This is a lot. Yeah, I mean, this is what’s fun is it’s basically an entire, this is the first one I’ve been in live. So I watched the other one on video afterwards. And it took me, I don’t know how many, I’d go for a while and I hit pause and I’d go do something else and just kind of churn for a little bit and I’d get again, I don’t know how many sessions I went through. But all at one time is a different experience. It’s pretty cool. I will probably come back and watch this again later for different pieces that either I didn’t quite grasp.
So that might be an interesting conversation to go in the group later is to what follow ons do you guys do with this kind of stuff?
I mean, it is a lot of material to ingest in an hour and a half. So just kind of thinking out loud there. Yeah, and feel free to ping in the Discord too. I’m always happy to talk about this stuff. I’m usually on Discord or close by.
Okay.
Yeah, so if there’s nothing else, unless somebody, obviously, if somebody else has papers they wanna look at, I do wanna open that up for next time around.
So be thinking about that. If not, I’ve been putting it off. I do really want to do flow matching and sort of data distribution with the diffusion transformers to talk a little bit about using diffusion models. That’ll get us into some fun topics that we don’t usually talk about, especially with video, a little bit about AlphaFold, that sort of generative model and flow matching with rectified flow. I think Jay, you mentioned maybe doing some stuff with some of the video gen stuff too that we might be able to pair. Yeah, I was thinking about another, I think it was about this time last year we talked about Sora. I’m not quite sure where it landed, but it’s been a year, I think. So let’s do another kind of state of the art.
What is the best going video gen models? For me, how much did it cost?
Can I even run it?
Do I have to have some other service do it? What are the hang-ups? What are they fixed since a year ago?
Are we still stuck at 10 second videos?
I know Sora broke the, kind of broke the mold on that. What do we do with tracking from scene to scene?
I ran into something the other day where it was like similar looking objects and wound up kind of transferring one to the, it was, you know what I mean, it’s kind of weird, but you know, things like that and then possibly flipping over into the paper side.
I do know April 2nd, I will be coming out of a program increment planning session and I will be brain dead by the time we hit six o’clock.
So if we’re doing a meetup, I need somebody, I’m gonna have to have somebody else that can, you know, use English and words and string them together.
Cause I will, you know, I can set it up and do all the stuff, but you know. If you can see if Michael can maybe do some robotic stuff, that would lie into the flow, rectified flow stuff really well, I think.
I will see, Wednesday night is a horrible time for Michael though. Oh, gotcha.
So we may, I don’t know, any, if we had to switch it to a Tuesday instead of a Wednesday, I don’t know what that, actually I may have to go check. I may send a question out next time I send an email and see if April the 1st or April the 3rd would be a better day, assuming we could get Michael to come talk to us about robots.
Cause that’d be cool.
Cause dang. Yeah, that’s gonna take off too. Well, cool. Well, thank you all for spending some time on a Wednesday.
See y’all around.
All right, appreciate it. And I guess you’re host, you’ll have to stop the recording and do all the things. All right.

