
RE: LeoThread 2024-10-05 09:19

This is where the RL approach comes in. Unlike previous models, which were more like the world's most advanced autocomplete systems, o1 actually 'cares' whether it gets things right or wrong. During part of its training, the model was given the freedom to tackle problems through trial and error in its chain-of-thought reasoning.

It still only had human-generated reasoning steps to draw from, but it was free to apply them in random orders and reach its own conclusions about which steps, in which sequence, are most likely to lead it to a correct answer.
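
To make that concrete, here's a toy sketch of the idea, purely illustrative and nothing like OpenAI's actual training pipeline: a "model" samples reasoning steps in random orders, gets a reward only when the final answer comes out right, and gradually learns to prefer the orderings that earned reward. The problem, the step names, and the scoring scheme are all made up for the example.

```python
import random

# Toy illustration of trial-and-error over reasoning-step orderings.
# Hypothetical "reasoning steps" for solving 3 * (4 + 5):
STEPS = ("add", "multiply")

def run(order):
    """Apply the steps in the given order to the toy problem 3 * (4 + 5)."""
    a, b, c = 3, 4, 5
    if order == ("add", "multiply"):
        return a * (b + c)   # correct: handle the parentheses first
    if order == ("multiply", "add"):
        return a * b + c     # wrong: multiplied before adding
    return None

def reward(answer):
    # Reward only correct final answers, not any particular reasoning path.
    return 1.0 if answer == 27 else 0.0

# Preference scores over orderings, nudged up whenever an ordering is rewarded.
prefs = {("add", "multiply"): 1.0, ("multiply", "add"): 1.0}

for _ in range(1000):
    orders, weights = zip(*prefs.items())
    order = random.choices(orders, weights=weights)[0]  # trial and error
    prefs[order] += reward(run(order))                  # reinforce success

print("Preferred ordering after training:", max(prefs, key=prefs.get))
# -> ('add', 'multiply')
```

The real thing obviously works with learned policies and gradients rather than a lookup table of orderings, but the core loop (sample a reasoning path, check the outcome, reinforce what worked) is the same intuition.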