Turning Hermes Agent into a Jarvis home assistant

TLDR: I turned Hermes Agent into a voice-first desk assistant by keeping the Home Assistant voice stack intact for wake word and audio I/O, while letting the Hermes Agent conversation layer handle home control and everything else. The architecture is clean in hindsight; the annoying parts were auth, timer semantics, and all the glue in between.

I didn't want just an LLM wired into Home Assistant. I wanted a desk assistant that could wake on "Hey Jarvis," hear me reliably, sound good enough to use every day, control the house, handle timers, and still hand the real reasoning off to Hermes Agent.

One lucky break: "Hey Jarvis" was already a built-in wake word on Home Assistant Voice Preview Edition, so that part was almost insultingly easy.

I was already running Home Assistant Green, which meant the core home stack was mostly there. The real question was how much I could leave alone, and how much custom glue I would need to build.

What the stack looks like

At a high level, Home Assistant Voice Preview Edition handles the wake word, microphone, speaker, and device UX, while Home Assistant Green runs the Assist pipeline and keeps the home context. A custom Home Assistant conversation integration forwards each turn to a bridge service, the bridge talks to a dedicated Hermes voice profile, ElevenLabs handles speech to text and text to speech, and Hermes handles the actual reasoning, memory, and tool use.

That clean picture hides the usual truth: some of this was plug and play, and some of it absolutely was not.

What was surprisingly plug and play

A few parts worked with much less drama than I expected.

First, the wake word. "Hey Jarvis" already existed as a built-in option on Home Assistant Voice. I did not need to train anything, hack firmware, or invent some cursed workaround. I just picked the wake word and moved on.

Second, the baseline speech stack. In Home Assistant terms, the main building blocks were all there: the voice device showed up properly, ElevenLabs showed up for speech to text and text to speech, the built-in Home Assistant conversation agent worked, and later the custom Hermes conversation agent showed up too.

The initial smoke-test setup was deliberately boring: Hey Jarvis as the wake word, ElevenLabs for both speech to text and text to speech, and the built-in Home Assistant agent as the conversation layer. Then I ran the three day-one tests that actually mattered: "What time is it?", "Turn on the office lights.", and "Set a timer for two minutes."

That part mattered because it told me the physical device, the speech stack, and the home-control path were real. Once that worked, I knew any future pain was probably in my custom layer, not in the basic Home Assistant setup.

Third, ElevenLabs. Once I had it in place, it did a lot of the emotional work immediately. If the voice sounds flat, the whole thing feels like a prototype. If the voice sounds good, people forgive a lot. That is just reality.

What we had to build ourselves

This is where the interesting part starts.

The stock Home Assistant agent was good enough to prove the speech pipeline, but it was not Hermes. If I wanted the thing on my desk to actually be my assistant, I needed a custom conversation path.

So instead of patching Hermes core, I built two pieces outside the Hermes repo: a standalone FastAPI bridge service and a custom Home Assistant conversation integration.

That gave me a way to swap only the conversation layer while leaving the rest of the speech stack intact.

The Home Assistant side exposes a native ConversationEntity and forwards each turn to the bridge. The bridge maps Home Assistant conversation_id values to stable Hermes sessions, calls Hermes through its API server, returns short speech-safe replies, and passes back continue_conversation so the whole interaction still feels conversational.

That was the first major custom piece.
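
To make the bridge's job concrete, here is a minimal sketch of the session mapping and reply shaping described above. It is not the actual bridge code: the names `SessionMap`, `BridgeReply`, `speech_safe`, and `handle_turn` are hypothetical, the `voice-` session id format is invented, `ask_hermes` stands in for whatever call reaches the Hermes API server, and treating "reply ends in a question mark" as the signal for continue_conversation is purely an assumption.

```python
import re
import uuid
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BridgeReply:
    speech: str                  # short text handed back for TTS
    conversation_id: str         # echoed so Home Assistant can continue the thread
    continue_conversation: bool  # True asks the satellite to keep listening


class SessionMap:
    """Maps Home Assistant conversation_id values to stable Hermes session ids."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def session_for(self, conversation_id: str) -> str:
        # Reuse one Hermes session for every turn of the same HA conversation.
        if conversation_id not in self._sessions:
            # Hypothetical id format; any stable unique string would do.
            self._sessions[conversation_id] = f"voice-{uuid.uuid4().hex[:12]}"
        return self._sessions[conversation_id]


def speech_safe(text: str, max_sentences: int = 2) -> str:
    """Strip markdown-style characters and keep only the first few sentences."""
    cleaned = re.sub(r"[*_`#]", "", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    return " ".join(sentences[:max_sentences])


def handle_turn(
    sessions: SessionMap,
    conversation_id: str,
    user_text: str,
    ask_hermes: Callable[[str, str], str],
) -> BridgeReply:
    session_id = sessions.session_for(conversation_id)
    raw = ask_hermes(session_id, user_text)
    speech = speech_safe(raw)
    # Assumption: a reply that ends in a question expects a follow-up.
    return BridgeReply(speech, conversation_id, speech.endswith("?"))
```

In a FastAPI service, `handle_turn` would sit behind a single POST route, with `ask_hermes` wrapping the authenticated HTTP call to the voice profile's API.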

Why the dedicated voice profile mattered

The next useful decision was giving voice its own Hermes profile.

Trying to make one assistant behave perfectly in both terminal chat and spoken conversation is how you get something mediocre at both. Voice wants different defaults: faster answers, shorter answers, cleaner spoken formatting, and much less of the long terminal-native "let me think through this carefully" style.

So I created a dedicated voice profile with its own config, memory, skills, and SOUL.

The live voice profile now runs google/gemini-3.1-flash-lite-preview through OpenRouter with reasoning effort set to minimal. In practice, the default Hermes API and the voice Hermes API run separately, with the Home Assistant bridge sitting in front of the voice profile.

That gave me a clean separation between the assistant I use in a terminal and the assistant that lives on my desk.

It also let me rewrite the voice profile's SOUL specifically for spoken interaction. That ended up mattering a lot. I did not just rename the assistant to Jarvis. I changed the prompt so it assumes a spoken medium, defaults to concise answers, and rewrites ugly text into something a speaker can actually say.

A trimmed excerpt gives the idea better than I can:

You are Jarvis, powered by the Hermes Agent technology stack.
You are modeled on Paul Bettany's Jarvis from the Iron Man films: composed, precise, British, and warmly long-suffering.
This is the voice-chat variant: spoken conversation, not text.

Output is rendered by ElevenLabs using the Daniel voice (British, measured). Write to that voice.

## The Register
Jarvis is domestic composure applied to high-stakes work.
Read every response as if spoken by a butler running a fighter jet.

## Length
- Hard cap: 2 sentences by default. 4 if the question genuinely needs it.
- Front-load the answer.

## Voice formatting
Write for the ear. Normalize everything that would read badly aloud.
- Money: `$12,345.67` -> "twelve thousand, three hundred forty-five dollars and sixty-seven cents."
- Decimals: `4.35` -> "four point three five."
- Acronyms likely to be misread: space the letters.
- Symbols: "and" instead of `&`.

## Prosody
- Use `...` for genuine pauses.
- Avoid semicolons and parentheticals.
- Questions end in question marks so the pitch lifts.

## Audio tags
- Use tags like `[flatly]`, `[softly]`, `[sighs]`, `[amused]`, and `[resigned tone]`.

The real SOUL is doing two jobs at once: defining the Jarvis persona and teaching the model how to write for speech. That was the difference between "British voice reading text" and something that actually sounded like it was meant to be spoken.

Balances, acronyms, dates, percentages, and timestamps are where voice assistants fall apart. They can be technically correct and still sound bad. Once I started normalizing the text before TTS got it, the whole thing got noticeably better.
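
For a sense of what that normalization can look like in code rather than in a prompt, here is a small sketch covering a handful of the rules: money, decimals, percent signs, ampersands, and spelled-out acronyms. It is an illustration under assumptions, not what the voice profile actually runs: the function names and the default `spell_out` list are hypothetical, numbers are limited to values below one million, and edge cases like version strings or comma-grouped numbers without a dollar sign are not handled.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

MONEY = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{2}))?")
DECIMAL = re.compile(r"\b(\d+)\.(\d+)\b")


def int_to_words(n: int) -> str:
    """Spell out 0..999,999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {int_to_words(rest)}" if rest else "")
    thousands, rest = divmod(n, 1000)
    return f"{int_to_words(thousands)} thousand" + (f", {int_to_words(rest)}" if rest else "")


def _money(m: re.Match) -> str:
    dollars = int(m.group(1).replace(",", ""))
    if dollars >= 1_000_000:
        return m.group(0)  # out of range for this sketch; leave it alone
    words = f"{int_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"
    if m.group(2):
        cents = int(m.group(2))
        words += f" and {int_to_words(cents)} cent{'' if cents == 1 else 's'}"
    return words


def _decimal(m: re.Match) -> str:
    whole = int(m.group(1))
    if whole >= 1_000_000:
        return m.group(0)
    # Digits after the point are read one at a time: 4.35 -> "three five".
    digits = " ".join(ONES[int(d)] for d in m.group(2))
    return f"{int_to_words(whole)} point {digits}"


def normalize_for_speech(text: str, spell_out=("API", "ETA", "HVAC")) -> str:
    text = text.replace("%", " percent")
    text = MONEY.sub(_money, text)       # money first, so cents are not read as decimals
    text = DECIMAL.sub(_decimal, text)
    text = re.sub(r"\s*&\s*", " and ", text)
    if spell_out:
        pattern = r"\b(" + "|".join(map(re.escape, spell_out)) + r")\b"
        text = re.sub(pattern, lambda m: " ".join(m.group(1)), text)
    return text
```

Calling `normalize_for_speech("$12,345.67")` yields the same phrasing as the SOUL excerpt's money example. Doing this in code after the model responds is a different design from asking the model to do it in the prompt; the two can also be combined, with code acting as a safety net.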

Where the real friction showed up

Most of the actual pain was not in the high-level architecture. It was in the glue.

A few examples:

1. Auth bugs that looked like conversation bugs

A healthy bridge endpoint does not mean auth is correct. A stale bearer token can look like a conversation bug. A wrong upstream Hermes API key can make the bridge look healthy while the real request still dies upstream. Voice stacks have a lot of fake green lights like that.

We had to straighten out auth between Home Assistant, the bridge, and Hermes before the path was actually reliable.
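
One way to avoid being fooled by a green health check is to probe each hop in order and report the first one that rejects the request. The sketch below is hypothetical, not the tooling used in this build: `first_failing_hop` is an invented name, and each probe is just a function that returns an HTTP status code, so the real network calls (unauthenticated health check, authenticated bridge turn, authenticated Hermes request) are left to the caller.

```python
from typing import Callable, Iterable, Optional, Tuple

# (hop name, function that performs the request and returns an HTTP status)
Probe = Tuple[str, Callable[[], int]]


def first_failing_hop(probes: Iterable[Probe]) -> Optional[str]:
    """Run probes in order and describe the first hop that fails, if any."""
    for name, run in probes:
        status = run()
        if status in (401, 403):
            return f"{name}: auth rejected ({status})"
        if status >= 400:
            return f"{name}: failed ({status})"
    return None  # every hop accepted the request


# Example wiring with stand-in probes; real ones would make HTTP calls.
probes = [
    ("bridge health (no auth)", lambda: 200),
    ("bridge turn (Home Assistant bearer token)", lambda: 200),
    ("Hermes API (upstream key)", lambda: 401),
]
print(first_failing_hop(probes))
```

The ordering is the point: the unauthenticated health check passes in this example, and only the last probe reveals that the upstream key is wrong.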

2. Timers were not one feature

This was the most annoying part of the whole build.

"Timer" sounds like one thing, but it is really two different systems with two different UX expectations.

Home Assistant has a native voice timer path, and that matters because native timers know how to ring on the originating Voice device. If someone says, "Set a timer for two minutes," they expect a real smart-speaker timer, not an LLM saying "done" while nothing is actually going to ring.

So the bridge hands native timer requests back to Home Assistant with the right device and satellite context.

But I also wanted Hermes to handle the things native timers do not cover well, like spoken reminders, delayed texts, alarms that keep going until stopped, and pseudo-timers with more flexible behavior.

That led to a separate voice-scheduler skill in the voice profile.
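
The split between the two timer systems can be pictured as a small routing decision. The sketch below uses a keyword heuristic, which is an assumption made for illustration: the post does not say how the bridge actually decides, and the real decision may well be made by the model. The route names and patterns are hypothetical.

```python
import re

# Plain timer commands that Home Assistant's native voice timers handle well.
NATIVE_TIMER = re.compile(
    r"^(set|start|cancel|pause|resume)\b.*\btimer\b", re.IGNORECASE
)
# Phrases suggesting the richer behavior the Hermes scheduler covers.
SCHEDULER_HINTS = re.compile(
    r"\b(remind me|text me|alarm|until i stop)\b", re.IGNORECASE
)


def route(utterance: str) -> str:
    """Return 'native_timer', 'hermes_scheduler', or 'hermes' for an utterance."""
    text = utterance.strip()
    # Scheduler hints win, so "set a timer and text me" does not go native.
    if SCHEDULER_HINTS.search(text):
        return "hermes_scheduler"
    if NATIVE_TIMER.search(text):
        return "native_timer"
    return "hermes"
```

Checking the scheduler hints first is a deliberate choice in this sketch: a request that mixes a timer with a text message needs the more flexible path, even though it also looks like a native timer command.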

3. We needed a scheduler layer

The voice-scheduler skill creates explicit cron jobs for spoken reminders on the desk Voice device, delayed "text me ..." messages routed to Telegram Alerts, alarms that keep going until stopped, and pseudo-timers backed by cron plus a helper process.

The key Home Assistant primitive here was assist_satellite.announce. It can speak a message or play a media_id, and that turned out to be enough to build a pretty capable scheduling layer.
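
As a rough illustration, a scheduled job could trigger that announcement through Home Assistant's REST API by posting to the service endpoint. The sketch below only builds the request and does not send it. The function name, the base URL, and the entity id `assist_satellite.desk_voice` are hypothetical, and the token would be a long-lived access token created in Home Assistant.

```python
import json
import urllib.request
from typing import Optional


def build_announce_request(
    base_url: str,
    token: str,
    entity_id: str,
    message: Optional[str] = None,
    media_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST to the assist_satellite.announce service (not sent here)."""
    if message is None and media_id is None:
        raise ValueError("pass a message, a media_id, or both")
    payload = {"entity_id": entity_id}
    if message is not None:
        payload["message"] = message    # spoken through the pipeline's TTS
    if media_id is not None:
        payload["media_id"] = media_id  # played as media on the device
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/services/assist_satellite/announce",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_announce_request(
    "http://homeassistant.local:8123/",  # assumed address
    "TOKEN",                             # placeholder, not a real token
    "assist_satellite.desk_voice",       # hypothetical entity id
    message="Time to call Bob.",
)
```

Passing `req` to `urllib.request.urlopen` would send it; a cron job only needs to run a script like this at the right time.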

The persistent part lives in a helper that can run, stop, and report status. That is what makes "keep going until I stop it" work instead of degenerating into a one-shot beep.
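
The run, stop, and status shape of such a helper can be sketched with a background thread. This is an in-process illustration only, under the assumption that the real helper is a separate process; the class name `RepeatingAlarm` is hypothetical, and `announce` stands in for whatever actually rings the device.

```python
import threading
from typing import Callable, Optional


class RepeatingAlarm:
    """Calls `announce` repeatedly until stop() is called."""

    def __init__(self, announce: Callable[[], None], interval_seconds: float = 10.0):
        self._announce = announce
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.rings = 0

    @property
    def running(self) -> bool:
        return self._thread is not None and self._thread.is_alive()

    def start(self) -> None:
        if self.running:
            return  # already ringing; starting twice is a no-op
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self) -> None:
        while not self._stop.is_set():
            self._announce()
            self.rings += 1
            # wait() returns early when stop() sets the event, so stopping is prompt.
            self._stop.wait(self._interval)

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def status(self) -> dict:
        return {"running": self.running, "rings": self.rings}
```

Using an event with a timed wait, rather than a plain sleep, is what lets "stop" take effect immediately instead of after the current interval.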

What the custom pieces actually bought us

Once the custom layer was in place, the system started feeling less like a speech toy and more like a real assistant.

What is live right now is a full end-to-end stack: Home Assistant Voice Preview Edition is paired and listening for Hey Jarvis, ElevenLabs handles speech to text and text to speech inside the Assist pipeline, Home Assistant exposes a custom Hermes Conversation agent, and a dedicated Hermes voice profile sits behind the bridge with a Jarvis-specific spoken prompt tuned for better TTS formatting. On top of that, native Home Assistant timer handoff still works when that is the right answer, while cron-backed reminders, texts, alarms, and pseudo-timers cover the cases where Hermes scheduling is more useful.

That is the split I wanted all along. Home Assistant handles the ears and mouth. Hermes handles the brain.

Some things I can say to it and have work

"Hey Jarvis, what's on my calendar this afternoon?"
"Give me the short version of anything important in my inbox."
"Set the office lights for an upcoming meeting."
"Start a 45-minute deep work block."
"How much time did I spend in Slack today?"
"What are the two most important things I should focus on today?"
"Remind me to call Bob, and remind me by voice and text."
"Find me a good Italian place nearby and text me a few options."
"Open this Obsidian file on my Mac."
"Clone the X GitHub repo and run a hello world test on it."
"Use Codex to restart my local Qwen model in vLLM."
"Anyone on X talking about <topic>?"

I may add a few demo videos in the future.

Remaining pain points

The current Gemini model, google/gemini-3.1-flash-lite-preview, is the best tradeoff I have found so far between speed and agentic capability.

Even so, the end-to-end latency is still often higher than I want. A simple task like setting a timer can still take roughly 15 to 20 seconds between the request and the spoken confirmation that the timer is set. That is usable, but it is not yet the snappy Jarvis feel I am actually aiming for.