Part 5/8:
Policy Initialization - The preparation done before the model receives any prompt. It covers pre-training data collection, instruction fine-tuning, and instilling human-like reasoning behaviors.
Reward Design - How the model is told whether its output is correct. Reward structures vary: outcome rewards score only the final result, while process rewards give feedback on each intermediate step.
Search - The iterative exploration of candidate problem-solving routes to find the most effective solution. It takes place at both training time and inference time.
Learning - Reinforcement learning lets the model learn from its own interactive experience rather than relying solely on human-labeled data.
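
To make the difference between outcome and process rewards concrete, and to show how a reward can drive search, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the original work: the toy task (computing (2 + 3) * 4), the step format "a op b = c", and the names check_step, outcome_reward, process_reward, and best_of_n are all hypothetical. Best-of-N selection is used only because it is one of the simplest inference-time search strategies.

```python
from typing import Callable, List

Solution = List[str]  # a solution is a list of steps like "2 + 3 = 5"


def check_step(step: str) -> bool:
    """Toy step verifier (assumed format: 'a op b = c')."""
    lhs, rhs = step.split("=")
    a_str, op, b_str = lhs.split()
    a, b, c = int(a_str), int(b_str), int(rhs)
    results = {"+": a + b, "-": a - b, "*": a * b}
    return results[op] == c


def final_answer(solution: Solution) -> int:
    """The answer is whatever the last step claims as its result."""
    return int(solution[-1].split("=")[1])


def outcome_reward(solution: Solution, expected: int) -> float:
    """Outcome reward: one score, based only on the final answer."""
    return 1.0 if final_answer(solution) == expected else 0.0


def process_reward(solution: Solution) -> List[float]:
    """Process reward: one score per intermediate step."""
    return [1.0 if check_step(step) else 0.0 for step in solution]


def best_of_n(candidates: List[Solution],
              score: Callable[[Solution], float]) -> Solution:
    """Simplest inference-time search: keep the highest-scoring candidate."""
    return max(candidates, key=score)


def mean_process_score(solution: Solution) -> float:
    """Collapse per-step rewards into a single score for ranking."""
    rewards = process_reward(solution)
    return sum(rewards) / len(rewards)


# Three hypothetical candidate solutions for (2 + 3) * 4 = 20.
SOUND = ["2 + 3 = 5", "5 * 4 = 20"]    # correct steps, correct answer
WRONG = ["2 + 3 = 6", "6 * 4 = 24"]    # first step wrong, wrong answer
LUCKY = ["2 + 3 = 6", "6 * 4 = 20"]    # both steps wrong, answer happens to match

CANDIDATES = [LUCKY, WRONG, SOUND]
BEST = best_of_n(CANDIDATES, mean_process_score)

if __name__ == "__main__":
    for cand in CANDIDATES:
        print(cand, outcome_reward(cand, 20), process_reward(cand))
    print("selected:", BEST)
```

In this toy setup the LUCKY candidate earns a full outcome reward even though every step is wrong, while its process rewards are all zero. Ranking by the process score therefore selects SOUND. The sketch does not cover the Learning component; in a reinforcement-learning loop, scores like these would serve as the training signal instead of only ranking candidates.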