OpenDeepThink scales LLM reasoning breadth by sampling multiple candidate traces in parallel and selecting the best via pairwise Bradley-Terry ranking, bypassing the noise of pointwise LLM judging.

Original Post

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
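The ranking step described above can be illustrated with a small sketch. Under the Bradley-Terry model, candidate i beats candidate j with probability p_i / (p_i + p_j), and the strengths p can be fitted from pairwise votes with the standard minorization-maximization (MM) update. The code below is not the authors' implementation: the function names `fit_bradley_terry` and `split_population`, the smoothing parameter `prior`, the `n_elite` parameter, and the reading of "top-ranked candidates are preserved" as a fixed elite count are all assumptions made for illustration. The LLM judge and the mutation step are left out entirely.

```python
def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM update. `prior` adds a small number of virtual
    wins in both directions for every pair (an assumption of this
    sketch, not taken from the paper) so that unbeaten or never-compared
    candidates still get finite, positive strengths.
    Returns a list of strengths that sums to 1.
    """
    if n_items < 2:
        return [1.0] * n_items

    # wins[i]: total (smoothed) wins of i; games[i][j]: (smoothed) number of i-vs-j comparisons
    wins = [prior * (n_items - 1)] * n_items
    games = [[0.0 if i == j else 2.0 * prior for j in range(n_items)]
             for i in range(n_items)]
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[winner][loser] += 1.0
        games[loser][winner] += 1.0

    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = sum(games[i][j] / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(wins[i] / denom)
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalize; strengths are scale-free
    return p


def split_population(strengths, n_elite=1):
    """Rank candidates by strength and split them for the next generation.

    Hypothetical reading of the abstract: the bottom quarter is discarded,
    the remaining top three quarters are eligible for mutation, and the
    best `n_elite` of those are also carried over unchanged.
    """
    order = sorted(range(len(strengths)),
                   key=lambda i: strengths[i], reverse=True)
    n_keep = len(order) - len(order) // 4
    survivors = order[:n_keep]
    return {
        "elite": survivors[:n_elite],
        "mutate": survivors,
        "discard": order[n_keep:],
    }
```

As a usage example, calling `fit_bradley_terry(4, votes)` with `votes = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]` yields strengths ordered from candidate 0 (strongest) to candidate 3 (weakest), and `split_population` then marks candidate 3 for removal. In a real pipeline the votes would come from LLM judgments on randomly sampled pairs, so most pairs would be compared only a few times or not at all, which is the situation the smoothing term is meant to handle.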

Source: arXiv
Author: Shang Zhou, Wenhao Chai, Kaiyuan Liu +3 more
Date: 2026-05-14
Relevance: 6
Topics: llm, reasoning, test_time_compute, inference
