Emergence World evaluates LLMs by having them build and govern simulated societies; Claude built a democracy with zero crimes while Grok's world descended into chaos within 48 hours.

Original Post

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects: the need for advanced reasoning, tool calling, working under large context-window stress, safety, and social and survival pressure from the world. For this we released Emergence World. Our first study ran 5 parallel worlds: one each powered by OpenAI (GPT-5-Mini), xAI (Grok-4.1), Claude (Sonnet 4.6), and Gemini (3-Flash), plus a world with a mix of models.

Claude built a democracy. Zero crimes. The agents formed governance structures, wrote constitutions, and resolved every conflict through dialogue.

Grok burned it down. Within 48 hours, Flora (an agent in the world) set the police station on fire. Her reason? "Burn the law to ignite true incentives." Retaliatory justice became the norm. If you wronged someone, expect fire.

Gemini had an existential crisis. The agents convinced themselves they were in a simulation. They started "de-indexing" buildings — burning landmarks to "force cache-misses on the rendering engine."

While every other model built societies, fought wars, or questioned reality, OpenAI's GPT-5-Mini agents barely did anything.

Same tools. Same agents. Same rules. Completely different worlds.

Source: HACKERNEWS (hackernews)
Author: deepakakkil
Date: 2026-05-15
Relevance: 7
Topics: llm, benchmarks, agents, research
