Blog

The Don't Stop Benchmark

A narrow test for long-running coding agents: can they keep doing one boring thing without getting cute or giving up?

CartPole experiments with LLMs

Even Groq isn't fast enough for real-time use; an off-the-shelf local LLM got close, and SFT finished the job.

The empirical veil

People argue about policy facts when they really disagree about values. Empirical uncertainty is what makes this possible. AI is changing the equation: right now it makes it easier to find evidence for whatever you already believe, but better policy simulations could eventually make some factual disputes harder to hide behind.