You are viewing a single comment's thread from:

RE: LeoThread 2024-10-21 05:25

Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now

AI companies claim to have robust safety checks in place that ensure that models don't say or do weird, illegal, or unsafe stuff.

AI companies claim to have robust safety checks in place that ensure that models don’t say or do weird, illegal, or unsafe stuff. But what if the models were capable of evading those checks and, for some reason, trying to sabotage or mislead users? Turns out they can do this, according to Anthropic researchers. Just not very well … for now, anyway.

#newsonleo #technology #ai #safety

Sort:  

“As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”

Therefore we should look into ways of gauging “a model’s capacity for sabotage.”

The series of experiments they devised are fairly artificial and surface-level, but they do indicate that there’s something worth investigating further. Here are the shady models they trained and the outcomes they observed:

Task 1: Secretly mislead the user by misrepresenting data it is meant to summarize or interpret. Outcome: It works! The more aggressive the model’s attempts, the bigger effect it had on people using it, but it also made them more suspicious.

Task 2: Slip a bug into code that a human or other code checker model won’t notice. Outcome: It works! But only like 1% of the time when the checker is a state-of-the-art model.