Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate than their deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1-level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work.
So what's wrong with current inference scaling methods?
Well, many current inference-time scaling methods "guide" their search process with Process Reward Models (PRMs): off-the-shelf models that take a problem and a (partial or complete) answer and return a reward score. These methods (beam search, DVTS, etc.) keep only the "top-N" scoring options at every step and explore those.
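To make that setup concrete, here is a minimal sketch of PRM-guided top-N (beam) search. The helpers generate_step and prm_score are hypothetical stand-ins for an LLM step generator and an off-the-shelf PRM scorer, not part of any specific library.

# Minimal sketch of PRM-guided beam search (top-N selection at each step).
# `generate_step(problem, prefix)` and `prm_score(problem, prefix)` are
# hypothetical placeholders for an LLM step generator and a process reward model.
def beam_search(problem, generate_step, prm_score,
                beam_width=4, expand=4, max_steps=20):
    beams = [""]  # partial solutions (reasoning prefixes)
    for _ in range(max_steps):
        # Expand every beam with several candidate next steps.
        candidates = [b + generate_step(problem, b)
                      for b in beams for _ in range(expand)]
        # Score partial solutions with the PRM and keep only the top-N.
        candidates.sort(key=lambda c: prm_score(problem, c), reverse=True)
        beams = candidates[:beam_width]
        if all(b.endswith("<EOS>") for b in beams):
            break
    return beams[0]  # highest-scoring completion under the PRM

Note that everything the search keeps is decided by the PRM's scores alone: if the PRM is wrong about a partial solution, that error is never revisited.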
The problem, however, is that PRMs, like almost all reward models, are imperfect. They are often poor approximations of the ground truth, and following them too closely leads to reward hacking, where the final output is optimized to score well according to the reward model but fails to be useful and/or correct.
This is where our method comes in. We ask the following:
Can we cast inference-time scaling as a probabilistic inference task?
Graphics explaining the process are below. A video detailing the method is at the top of the page.
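To make the contrast with top-N search concrete, below is a minimal sketch of the particle-filtering idea, using the same hypothetical generate_step / prm_score helpers as above. It is a simplification rather than the exact algorithm from the paper: the PRM score is treated as an approximate (log-)likelihood, and the candidate set is resampled in proportion to softmax weights instead of being pruned deterministically, so imperfect reward scores steer the exploration without fully dictating it.

import math
import random

# Minimal sketch of particle filtering for inference-time scaling.
# `generate_step` and `prm_score` are the same hypothetical helpers as above.
def particle_filtering(problem, generate_step, prm_score,
                       num_particles=8, max_steps=20, temperature=1.0):
    particles = ["" for _ in range(num_particles)]
    for _ in range(max_steps):
        # Propagate: extend each particle by one reasoning step.
        particles = [p + generate_step(problem, p) for p in particles]
        # Weight: treat the PRM score as an approximate log-likelihood.
        logits = [prm_score(problem, p) / temperature for p in particles]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Resample: draw a new particle set in proportion to the weights,
        # instead of deterministically keeping the top-N.
        particles = random.choices(particles, weights=probs, k=num_particles)
        if all(p.endswith("<EOS>") for p in particles):
            break
    # Return the final particle the PRM scores highest.
    return max(particles, key=lambda p: prm_score(problem, p))

Because resampling is stochastic, lower-scoring but still plausible partial solutions survive, which is what makes the procedure more robust to PRM errors than a hard top-N cutoff.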
@misc{puri2025probabilisticinferenceapproachinferencetime,
  title={A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods},
  author={Isha Puri and Shivchander Sudalairaj and Guangxuan Xu and Kai Xu and Akash Srivastava},
  year={2025},
  eprint={2502.01618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01618},
}