RE: LeoThread 2025-01-04 00:42

in LeoFinance, 20 days ago

Part 5/8:

  1. Policy Initialization - The preparation done before the model ever receives a prompt. It covers pre-training data collection, instruction fine-tuning, and instilling human-like reasoning behaviors.

  2. Reward Design - How the model is told whether it is correct. Reward structures vary significantly: outcome rewards score only the final result, while process rewards give feedback on each intermediate step.

  3. Search - Finding the most effective solution by iteratively exploring different problem-solving routes. Search happens both at training time and at inference time.

  4. Learning - Specifically reinforcement learning, which lets the model improve from its own interactive experience rather than relying solely on human-labeled data.
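
To make the difference between outcome and process rewards (point 2) a bit more concrete, and to show how a reward can drive a simple search (point 3), here is a minimal Python sketch. Everything in it is a toy assumption for illustration: the function names (`outcome_reward`, `process_rewards`, `check_arithmetic_step`, `best_of_n`), the "a op b = c" step format, and the example candidates are all made up here and are not taken from the thread or from any real system.

```python
from typing import Callable, Dict, List

def outcome_reward(final_answer: str, expected: str) -> float:
    """Outcome reward: a single score for the whole attempt."""
    return 1.0 if final_answer.strip() == expected.strip() else 0.0

def check_arithmetic_step(step: str) -> bool:
    """Toy step checker (assumption): a step looks like '2 + 3 = 5'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    x, y = int(a), int(b)
    result = {"+": x + y, "-": x - y, "*": x * y}[op]
    return result == int(rhs)

def process_rewards(steps: List[str], checker: Callable[[str], bool]) -> List[float]:
    """Process reward: one score per intermediate step."""
    return [1.0 if checker(s) else 0.0 for s in steps]

def mean_process_score(candidate: Dict) -> float:
    """Average step score, used here as a simple ranking signal."""
    scores = process_rewards(candidate["steps"], check_arithmetic_step)
    return sum(scores) / len(scores)

def best_of_n(candidates: List[Dict], score: Callable[[Dict], float]) -> Dict:
    """Simplest inference-time search: sample N attempts, keep the top-scoring one."""
    return max(candidates, key=score)

# Hypothetical attempts at computing (2 + 3) * 4.
CANDIDATES = [
    {"name": "A", "steps": ["2 + 3 = 5", "5 * 4 = 20"], "answer": "20"},
    {"name": "B", "steps": ["2 + 3 = 6", "6 * 4 = 24"], "answer": "24"},
    # Right final answer, but the first step is wrong.
    {"name": "C", "steps": ["2 + 3 = 6", "5 * 4 = 20"], "answer": "20"},
]

if __name__ == "__main__":
    for c in CANDIDATES:
        print(c["name"],
              "outcome:", outcome_reward(c["answer"], "20"),
              "process:", process_rewards(c["steps"], check_arithmetic_step))
    print("best of n:", best_of_n(CANDIDATES, mean_process_score)["name"])
```

Candidate C is the interesting case: an outcome reward gives it full marks because the final answer is right, while the process reward flags its faulty first step. Real systems would typically use a learned reward model instead of a hand-written checker, and far richer search than best-of-N, so treat this only as a sketch of the idea.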