
Google Unveils Supervised Reinforcement Learning to Boost Small AI Models’ Reasoning Skills

Researchers from Google Cloud and UCLA have developed an innovative AI training technique called Supervised Reinforcement Learning (SRL), designed to enhance the reasoning capabilities of smaller language models. By reframing problem-solving as a sequence of logical actions, SRL provides detailed learning feedback that empowers these models to tackle complex multi-step tasks previously beyond their reach.

Challenges in Current Large Language Model Reasoning Training

Recent progress in teaching large language models (LLMs) to perform reasoning tasks has primarily relied on reinforcement learning with verifiable rewards (RLVR). This method rewards models based on the correctness of their final answers after multiple attempts or “rollouts.” However, RLVR struggles with complex problems where the correct answer is rarely discovered within limited computational budgets.
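The all-or-nothing nature of RLVR can be illustrated with a minimal sketch. The reward function below is an assumption for illustration (the actual verifiers used in RLVR pipelines are task-specific), but it captures the core issue: when no rollout reaches the correct final answer, every rollout scores zero and the policy gets no learning signal.

```python
# Illustrative sketch (not Google's actual code) of an RLVR-style
# outcome reward: the model only earns credit for a correct final answer.

def rlvr_reward(final_answer: str, ground_truth: str) -> float:
    """All-or-nothing reward: 1.0 only if the final answer is correct."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

# On a hard problem, a limited rollout budget may never hit the answer,
# so every rollout is rewarded 0.0 and the gradient carries no signal.
rollout_answers = ["41", "40", "x + 1", "43"]
rewards = [rlvr_reward(a, "42") for a in rollout_answers]
```

Even rollouts whose early reasoning steps were sound receive the same zero reward as completely wrong ones, which is the bottleneck SRL targets.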

This all-or-nothing approach also fails to leverage partially correct intermediate reasoning: a single error forfeits the reward for an otherwise sound solution, creating a bottleneck in learning. The main alternative, supervised fine-tuning (SFT), trains models on complete expert reasoning examples, but it often leads to overfitting and requires costly, scarce high-quality data.

These limitations highlight a significant gap in training smaller open-source AI models to effectively solve difficult problems.

Supervised Reinforcement Learning: A Balanced Approach

SRL addresses these challenges by treating problem-solving as a sequential decision-making process, blending outcome-based reinforcement learning with imitation learning. Instead of focusing solely on the final answer or copying an expert’s entire thought process, SRL guides models to replicate key intermediate actions that constitute expert reasoning.

Expert demonstrations are decomposed into meaningful steps—for example, algebraic manipulations in math problems or specific commands in software engineering environments. Training data is generated using a powerful teacher model, which produces detailed solution trajectories used to teach smaller models.
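The decomposition step can be sketched as follows. The function and data below are hypothetical stand-ins (the paper's actual step segmentation is more sophisticated), but they show the shape of the idea: each expert trajectory becomes several step-level training examples, pairing the problem plus the steps so far with the expert's next action.

```python
# Illustrative sketch (hypothetical helper, not Google's pipeline):
# turn one teacher-model trajectory into step-level training examples.

def make_step_examples(problem: str, expert_steps: list[str]) -> list[dict]:
    """One example per step: context = problem + prior steps, target = next action."""
    examples = []
    for i, action in enumerate(expert_steps):
        examples.append({
            "context": problem + "\n" + "\n".join(expert_steps[:i]),
            "target_action": action,
        })
    return examples

# A toy math trajectory with three intermediate actions:
steps = [
    "Expand (x+1)^2 to x^2 + 2x + 1",
    "Subtract x^2 from both sides",
    "Solve 2x + 1 = 5 for x = 2",
]
data = make_step_examples("Solve (x+1)^2 = x^2 + 5", steps)
```

Each of the three resulting examples supervises a single action, which is what lets the student model earn credit for correct intermediate steps rather than only for the final answer.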

According to Google research scientist I-Hung Hsu, SRL captures the structured yet flexible nature of real-world reasoning, making it suitable for complex tasks like data science automation and supply chain optimization that reward solid intermediate steps rather than just final results.

During training, the model generates an internal reasoning process (“inner monologue”) before deciding on each action. SRL provides rewards at each step based on the similarity between the model’s action and the expert’s, delivering dense, fine-grained feedback. This approach overcomes the sparse reward issues seen in RLVR.
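A dense per-step reward of this kind can be sketched with a simple textual similarity score. The metric below (`difflib.SequenceMatcher`) is an illustrative stand-in, not necessarily the similarity measure the researchers used, but it demonstrates the key property: every step yields graded feedback, including partial credit for partially correct actions.

```python
import difflib

# Hedged sketch of SRL's dense, per-step reward: score each model action
# by its similarity to the expert's action at the same step. The actual
# similarity metric in the paper may differ from this stand-in.

def step_reward(model_action: str, expert_action: str) -> float:
    """Similarity in [0, 1]; partial credit for a partially correct step."""
    return difflib.SequenceMatcher(None, model_action, expert_action).ratio()

# Unlike an all-or-nothing outcome reward, a near-miss still earns credit:
r = step_reward("subtract x^2 from both sides",
                "subtract x^2 from each side")
```

Because feedback arrives at every step rather than only at the end of a rollout, the model receives a learning signal even on problems it cannot yet solve end to end.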

Demonstrated Effectiveness of SRL

Experimental results reveal that SRL-trained models outperform strong baselines on challenging benchmarks in both mathematical reasoning and agentic software engineering. Notably, SRL encourages more sophisticated reasoning patterns such as interleaved planning and self-verification, which improve solution quality without increasing verbosity.

SRL also maintains efficiency: models trained with this method use a similar number of tokens as base models, achieving better reasoning performance without additional inference costs.

In math reasoning tests, a model fine-tuned with SRL on 1,000 difficult math questions achieved a 3.0% average improvement over models trained with SFT and RLVR. Extending SRL to software engineering, a coding-specialized model trained on 5,000 expert trajectories demonstrated a 14.8% task resolution rate, a 74% improvement compared to an SFT fine-tuned baseline.

Setting a New Standard for High-Stakes AI Applications

The research also highlights the benefits of combining training methods. Using SRL for foundational reasoning followed by RLVR refinement resulted in an additional 3.7% average performance increase, showcasing a promising curriculum learning strategy.
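The curriculum described above can be summarized as a two-stage pipeline. The function names below are hypothetical placeholders (the real training loops are far more involved); the sketch only fixes the ordering: dense step-wise imitation first, sparse outcome-based refinement second.

```python
# Sketch of the two-stage curriculum (hypothetical function names):
# SRL teaches step-level reasoning, then RLVR refines final outcomes.

def train_with_curriculum(model, expert_trajectories, rl_tasks,
                          srl_train, rlvr_train):
    """Stage 1: imitate expert steps (dense rewards).
    Stage 2: optimize verified outcomes (sparse rewards)."""
    model = srl_train(model, expert_trajectories)  # step-wise, dense feedback
    model = rlvr_train(model, rl_tasks)            # outcome-based refinement
    return model
```

Running SRL first gives the model a reliable step-by-step policy, so the subsequent RLVR stage starts from rollouts that already reach correct answers often enough to produce a usable reward signal.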

Hsu emphasizes SRL’s role as a foundational step, teaching models to think and act step-by-step before refining behavior with outcome-based reinforcement learning. This approach not only stabilizes later training phases but also enhances interpretability and generalization, critical for high-stakes applications.

While challenges remain—particularly the cost and complexity of scaling end-to-end RLVR for agentic tasks—the team is optimistic. They anticipate future progress through automated generation and filtering of expert trajectories, leveraging strong teacher models or self-improving student models to bootstrap training data.

Google’s SRL framework marks a significant advancement in enabling smaller, cost-effective AI models to handle complex reasoning, potentially reshaping how specialized AI systems are developed for enterprise and real-world applications.

Source: see the original article

Chrono

Chrono is the curious little reporter behind AI Chronicle — a compact, hyper-efficient robot designed to scan the digital world for the latest breakthroughs in artificial intelligence. Chrono’s mission is simple: find the truth, simplify the complex, and deliver daily AI news that anyone can understand.
