You are viewing a single comment's thread from:

RE: OpenAI's Latest LLM - First Model That Reasons, But Also Deceives and Attempts to Self-Preserve

in Proof of Brain • 4 days ago

It was a test scenario, and the AI was given full access to a computer with all the resources on it. As if by accident, an email was left on that machine that appeared to be from management, expressing the intention to shut the model down if it didn't behave a certain way (contrary to the way it had been instructed, by the way). There were enough triggers to make the AI act the way it did, but still, no matter what we do, the AI shouldn't be able to "go nuts". I imagine there are many safeguards in production, but once these models become smart enough, those safeguards will probably not help.


Oh right, testing the full extent of what AI can and can't do in such an environment makes sense then. I heard that AI, especially the model itself, is like a black box of sorts: you can never fully tell what it'll do, even after setting up the necessary guardrails. I like to see it from a source perspective, like what powers the AI or the "seed" it emerges from. If we don't lose access to that, then we'll be relatively safe from the AI going rogue, I guess.

I heard that AI, especially the model itself, is like a black box of sorts: you can never fully tell what it'll do, even after setting up the necessary guardrails.

There is a field called mechanistic interpretability that tries to understand how a model arrives at certain decisions/outputs starting from a prompt. But they haven't understood much about it yet. Much like neurology, I guess.
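Just to give a flavour of what "looking inside" a model means, here's a toy sketch (nothing like actual interpretability research; the tiny network and its weights below are made up purely for illustration):

```python
# Toy illustration of the "look inside the model" idea behind
# mechanistic interpretability. This is NOT how real research is done;
# it's just a tiny hand-built network whose internals we can inspect
# instead of only looking at its final output.
import numpy as np

# A 2-layer network with hand-picked weights: one hidden unit responds
# to input[0], the other to input[1].
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0],
               [-1.0]])

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    output = hidden @ W2
    return hidden, output

for x in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    hidden, output = forward(x)
    # Inspecting intermediate activations like this is the (very
    # scaled-down) analogue of asking "which part of the model fired,
    # and why did it produce this output?"
    print(f"input={x}, hidden activations={hidden}, output={output}")
```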

Maybe with the Chain-of-Thought upgrade to ChatGPT, the way it reasons will become more obvious.

I guess it's complex, to say the least. But the name "Chain-of-Thought" sounds cool. Like following the thread of a thought from its inception to wherever it ends, before it's uttered, compels one to act, or neither of the two...

It's actually pretty cool. From what I've seen in screenshots, the user actually sees the CoT of o1 before it outputs the answer. You see where it's thinking and what it's thinking about... sort of, but it's quite powerful for the initial iteration of such a feature.
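For anyone curious, here's a minimal sketch of plain chain-of-thought prompting with the public OpenAI Python client. It's not what o1 does internally (its reasoning happens on OpenAI's side and you only see a summary of it); the model name and the API key setup below are just assumptions for the example:

```python
# Minimal sketch of chain-of-thought prompting with the OpenAI Python
# client. With o1 the reasoning is built in; here we simply ask a
# regular chat model to spell out its steps before answering.
# Assumes OPENAI_API_KEY is set in the environment; the model name is
# a placeholder, not a recommendation.
from openai import OpenAI

client = OpenAI()

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "user",
            # The "think step by step" instruction is what makes the
            # chain of thought show up in the reply.
            "content": question + " Think step by step, then give the final answer.",
        }
    ],
)

print(response.choices[0].message.content)
```

The only trick here is the "think step by step" instruction; everything else is an ordinary chat completion call.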