The Don't Stop Benchmark
A narrow test for long-running coding agents: can they keep doing one boring thing without getting cute or giving up?
A narrow test for long-running coding agents: can they keep doing one boring thing without getting cute or giving up?
How I replaced my Google Home with a desktop Hermes Agent voice assistant.
Even Groq isn't fast enough for realtime, an off-the-shelf local LLM got close, and SFT finished the job.
People argue about policy facts when they really disagree about values. Empirical uncertainty is what makes this possible. AI is changing the equation: right now it makes it easier to find evidence for whatever you already believe, but better policy simulations could eventually make some factual disputes harder to hide behind.