CartPole experiments with LLMs

TLDR: I used real-time CartPole as a cheap way to test whether small LLMs can act as controllers. API inference was too slow, a local pretrained Gemma 3 1B got surprisingly close, and a short SFT run made the policy boringly reliable.

GitHub repo — code to reproduce the main results.

I recently got a DGX Spark and wanted an excuse to start playing with environments for LLM post-training.

I started with CartPole — not because it’s an especially natural LLM task, it obviously isn’t, and a tiny neural net can solve it, but because it was simple enough to isolate the parts of the problem I actually cared about:

given that a tiny ordinary neural net can solve CartPole, how big does a pretrained LLM need to be before its policy is strong enough?
for an off-the-shelf LLM that is smart enough, is it still fast enough, or does model size push it over the latency threshold for real-time?
if a local pretrained model is close but not perfect, how far can prompt engineering get it?
and if that still isn't enough, how easy is it to close the gap with SFT?

The environment setup

One detail mattered immediately: standard Gymnasium CartPole is synchronous. The world pauses while the agent decides what to do next, which makes it useless for the specific question I cared about here: whether model latency matters.

So I kept the standard CartPole physics, but wrapped it in a real-time loop that ticks at wall-clock 50 Hz. I chose 50 Hz simply because CartPole itself already advances in 20 ms physics steps (tau = 0.02), so matching the wall clock to the simulator’s native step rate was the cleanest way to make latency meaningful. I looked at rtgym as the general version of this idea, but for CartPole I just implemented the loop myself.

That also forced a choice about what happens when the next action is late:

hold-last-action: keep applying the previous force until a new action arrives
zero-force / no-op: apply zero force until a new action arrives

I started with hold-last-action because it felt closer to stale control in a real system, then later compared it against zero-force. The numbers moved a lot, but the main conclusion didn't. In the clean comparison runs:

no agent went from about 16 steps under hold-last-left to about 50 under zero-force
the Groq setup went from about 26 to about 46
the best pretrained local Gemma 3 1B stayed in the same rough range under both assumptions
the fine-tuned local Gemma 3 1B was strong under either assumption and actually cleaner under no-op, going from 3/5 perfect to 5/5 perfect in the comparison runs

So the semantics mattered, but not enough to change the overall story: API inference was still too slow, and local inference was still the only comfortable path.

First question: can an API model do this in real time?

The short answer was no.

I started with the fastest API option I could think of, which was Groq. On paper it looked like the best chance of making this work.

One interesting result right away was that Groq's 8B and 70B Llama models ended up with very similar end-to-end latency. That sounds strange until you remember what this task actually looks like: the input is a few numbers and the output is one token. The actual model work is tiny. Most of the time is everything around the model call.

As a sanity check, I tested my local connection from New York on Verizon Fios and got:

2371.54 Mbps down / 2354.83 Mbps up
6.81 ms latency
0.17 ms jitter
0% packet loss

So this wasn't some bad-home-network story. I did not try to optimize my connection, but for the question I actually cared about — is API inference viable for this from a strong normal setup? — that was enough.

What I found:

Groq was the fastest API option I tested
Groq's Llama models still landed around 129-178 ms median end-to-end latency, with ugly tails
in an early live run, under the original hold-last-action assumption, Groq's Llama 3.3 70B only kept the pole alive for 26 steps

That pushed me to the more useful question: how fast does a model actually need to be?

I built a latency-vs-survival curve by taking a perfect non-LLM policy and adding artificial delay to it. The result was harsher than I expected:

0-50 ms: basically perfect
75 ms: already falling off a cliff
100 ms+: solve rate is effectively dead

So the working rule became: for CartPole in real time, I wanted to be somewhere in the sub-50 ms range, ideally closer to 20-30 ms.

That made CartPole a good local-inference test.

Next question: can a pretrained local model do it?

Once I moved local, the question became:

Can a pretrained local model on the DGX Spark solve this without fine-tuning?

A useful framing here is that TTFT / TTFE and tokens-per-second only matter insofar as they become end-to-end milliseconds. For CartPole the model only emits one token, so raw generation speed contributes very little. Even 50 tokens/s only means about 20 ms for the token itself. The real issue is startup / prompt-processing latency.

That's why I mostly narrowed the search to models in roughly the 0.5B-4B range. Those were the models most likely to stay below the latency threshold where CartPole is still controllable.

I tried a bunch of Gemma and Qwen family models in that range. The best pretrained local result came from Gemma 3 1B.

That was one of the most surprising parts of the project. Bigger models did not win. In fact, Gemma 3 1B consistently beat larger local models like Gemma 4 E2B and Gemma 3 4B on this task.

My current guess is that the 1B model was acting more like a simple, stable pattern matcher, while the larger models were more brittle around near-zero boundary cases. I haven't really proven that yet, but that's what it looked like.

At first the model was only okay. Then prompt engineering started to matter a lot:

switching from system-prompt examples to multi-turn few-shot examples helped a lot
reducing the input to only the most relevant values helped even more, specifically just angle and angular velocity
changing numeric precision from 3 decimal places to 2 decimal places also mattered a lot

That feature-selection step mattered because the small model was getting distracted. Cart position and cart velocity are part of the state, but for balancing they are mostly secondary. Sending all four values often made the model latch onto irrelevant patterns instead of the two that mattered most.

So the best non-fine-tuned, non-cheating setup I found (meaning: no precomputed decision variable, no hardcoded policy logic in Python beyond formatting the state, and the model still had to infer the action from raw state values) was:

Gemma 3 1B
local inference with llama-cpp-python
input reduced to angle and angular velocity
2 decimal places
multi-turn few-shot prompting

That got me to 500 median steps and 8/10 perfect runs at around 20 ms latency.

That was already much better than I expected.

CartPole turned out to be more interesting than I thought

At that point CartPole had become more interesting than I expected, because it gave me a clean dividing line:

API models are too slow.
Local models are fast enough.
Pretrained local models can get close.
But if I want the result to be boringly reliable, some training still helps.

Now, to be clear, you absolutely do not need an LLM for CartPole.

You can just write the rule down directly. For CartPole, a simple hand-coded controller based on pole angle and angular velocity is enough to solve the task reliably. And if you don't want to hand-code it, you can train a tiny neural net instead.

For completeness, I added a tiny neural-network baseline:

architecture: 4 -> 16 -> 2
total trainable parameters: 114
result: 10/10 perfect runs

So yes, if your only goal is solving CartPole efficiently, a tiny neural net is the right tool.

I still wanted to do it with an LLM because CartPole is a cheap way to learn a few things that matter later:

whether API latency kills real-time control
how strong off-the-shelf small LLMs are at tasks like this without fine-tuning
what parts of this stack might transfer to environments where language models actually make more sense

So: wrong tool, useful testbed.

Why SFT, not RL?

This part was pretty straightforward.

For CartPole, I already had essentially unlimited perfect labels, because I could generate them directly from the known good controller. That makes SFT the obvious first move.

So I generated synthetic training data from the hand-coded anticipation policy:

50,000 examples
all 4 state values included
values formatted at 2 decimal places
labels generated programmatically from the controller

Then I trained Gemma 3 1B with a simple Unsloth + LoRA setup.

Training was almost comically easy:

about 3.5 minutes total
around 500 training steps
no preference data
no reward model
no RL loop

And the result was exactly what I wanted:

no few-shot examples needed at inference time
all 4 state values used, which also avoided the slow leftward drift I saw in the reduced-input version
500/500 steps on 10/10 runs
latency still around 27 ms

That was enough for me to stop here. I could have moved on to RL, but CartPole no longer felt like the right place to learn anything from it. Once SFT solves the task perfectly with cheap synthetic labels, GRPO becomes more of a detour than a discovery. RL gets interesting once I move to an environment where the policy is not already sitting there waiting to be generated.

What I learned

The main takeaways were:

If you want to use LLMs for real-time control tasks, you probably need small local models. API inference was just too slow, and even moderately larger local models quickly started eating into the latency budget.
At that small size, pretrained models may not solve even simple tasks cleanly out of the box. Prompt engineering helped a lot, and with enough time I might have found an even better pretrained setup, but it was not remotely automatic.
The encouraging part, and also the expected part, is that SFT is very easy once the task is simple and the labels are cheap. As soon as I generated synthetic supervision from the known controller, the remaining gap disappeared almost immediately.
Prompt engineering can go surprisingly far, but SFT is what made the result boringly reliable.

Where I'm stopping

I'm stopping here because this already feels like a complete story:

CartPole was a good first test.
API inference failed for principled latency reasons.
Local pretrained models got close.
SFT solved it fully.

The next step is to move to a harder environment and see where this stops being something SFT can trivially solve. Stay tuned!