As discussed previously, for gremlins there are three important thought-like functions:

  • Cogitation: the act of planning.
  • State: the act of storing a plan and the information adjacent to it.
  • Actuation: the act of executing a plan.

The same basic runtime architecture can accomplish all three things.

The most basic gremlin

  1. Start
  2. Run prompt
  3. Die

That’s the equivalent of running $ claude -p "Give me three numbers.". If you wanted to have it also send emails (the actuation bit), you’d run something like $ claude -p "Send me (at test@example.com) a friendly email." --allowedTools "Bash" and it’d decide on sendmail or curl or something and shoot you an email.

The Ralph gremlin

  1. Start
  2. Run prompt (which includes updates to shared text files/state/progress tracking)
  3. Goto 1

This is the core pseudocode behind more polished versions of the Ralph loop. It’s an obvious next step–what was once a single-shot solution becomes a looping attempt.

This is an alarmingly simple and effective approach, and one that you can implement relatively easily in Bash. Here’s ChatGPT 5.4 Thinking (extended) attempting it:

#!/usr/bin/env bash
set -Eeuo pipefail

# Usage:
#   AGENT_CMD="claude" ./ralph.sh
#   AGENT_CMD="codex exec" ./ralph.sh
#
# Assumptions:
# - Your agent CLI accepts prompt text on stdin
# - The agent updates files in the repo (for example IMPLEMENTATION_PLAN.md)
# - The agent exits after one unit of work

AGENT_CMD="${AGENT_CMD:-claude}"
PROMPT_FILE="${PROMPT_FILE:-PROMPT.md}"
PLAN_FILE="${PLAN_FILE:-IMPLEMENTATION_PLAN.md}"
LOG_DIR="${LOG_DIR:-.ralph}"
MAX_ITERS="${MAX_ITERS:-0}"      # 0 = run forever
SLEEP_SECONDS="${SLEEP_SECONDS:-2}"

mkdir -p "$LOG_DIR"

if [[ ! -f "$PROMPT_FILE" ]]; then
  cat > "$PROMPT_FILE" <<'EOF'
Read IMPLEMENTATION_PLAN.md and pick the single most important next task.
Search before assuming something is missing.
Make the smallest reasonable change.
Run relevant tests.
Update IMPLEMENTATION_PLAN.md with progress.
When the task is complete, exit.
EOF
fi

if [[ ! -f "$PLAN_FILE" ]]; then
  cat > "$PLAN_FILE" <<'EOF'
- [ ] Create initial implementation plan
EOF
fi

run_agent() {
  local iter="$1"
  local ts log_file
  ts="$(date '+%Y-%m-%d %H:%M:%S')"
  log_file="$LOG_DIR/run-$iter.log"

  {
    echo "===== Ralph iteration $iter @ $ts ====="
    echo "Agent: $AGENT_CMD"
    echo
    echo "--- PROMPT FILE ---"
    cat "$PROMPT_FILE"
    echo
    echo "--- PLAN FILE ---"
    cat "$PLAN_FILE"
    echo
    echo "--- AGENT OUTPUT ---"
  } | tee "$log_file"

  # Feed both the stable prompt and current plan into the agent.
  {
    cat "$PROMPT_FILE"
    echo
    echo "Current plan:"
    cat "$PLAN_FILE"
  } | eval "$AGENT_CMD" 2>&1 | tee -a "$log_file"
}

maybe_commit() {
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    if ! git diff --quiet || ! git diff --cached --quiet; then
      git add -A
      git commit -m "ralph: iteration $(date '+%Y%m%d-%H%M%S')" || true
    fi
  fi
}

iter=1
while :; do
  if [[ "$MAX_ITERS" -gt 0 && "$iter" -gt "$MAX_ITERS" ]]; then
    echo "Reached MAX_ITERS=$MAX_ITERS"
    exit 0
  fi

  if [[ -f STOP ]]; then
    echo "Found STOP file, exiting."
    exit 0
  fi

  echo
  echo "=== Iteration $iter ==="

  if run_agent "$iter"; then
    maybe_commit
  else
    echo "Agent failed on iteration $iter; continuing after a short pause." | tee -a "$LOG_DIR/run-$iter.log"
  fi

  ((iter++))
  sleep "$SLEEP_SECONDS"
done

Setting aside that I’m pretty sure Claude is being invoked incorrectly (insufficient access is granted to things like tools–but hey, it’s a competitor AI, so I’m unsurprised if it makes mistakes about the competition), this is again about a hundred lines that does a perfectly cromulent job of wafflestomping through a designated task.

The interesting ideas in it are:

  • Eval-until-completion. The Ralph loop hammers again and again. It absolutely will not stop, ever.
  • State external to eval loop. The Ralph loop updates its IMPLEMENTATION_PLAN.md with next steps and progress.
  • It exhibits proprioception. Using test suites, a Ralph gremlin gets feedback about whether its actions (here, editing code) changed the world in ways that serve its goals.
  • It has limited self-modification. The Ralph gremlin will update its own implementation notes to provide directions for future iterations of itself.

What might we want in a better gremlin brain?

If we take the basic Ralph gremlin, we’re doing okay. That’s enough to make progress–hell, in some cases, it’s enough to write software and get paid and make rent. As you want it to do more and more, though, the humble bash script needs to grow into a real programming language, and soon the entire thing is festooned with hundreds of dependencies and hundreds of thousands of lines of code.

Developing an architecture from first principles gets us to a somewhat different place, and the exercise is useful and the result smaller, cleaner, and more elegant.

Again, look at the core Ralph gremlin loop:

  1. Start with a mission
  2. Evaluate the prompt
  3. Goto 2

If we expand this out a bit–and forgive some slight handwaving here, it assumes and makes explicit certain steps and simplifies others–we get:

  1. Start with a mission
  2. Evaluate the prompt
    1. Reason about what the prompt wants
    2. Invoke tooling as needed
    3. Emit answer
  3. Goto 2

Reasoning

The biggest impediment to reasoning is memory. The classic tradeoff in computer science is speed for storage–if you memoize everything, you can be blisteringly fast but at the cost of memory. If you have very little memory, you can rebuild the world every time a query is made.

One problem is context size: the pile of tokens that a model must reason about, and to which every subsequent token of its output must attend (an O(N^2) issue, and why longer conversations take longer to generate new tokens for). Early models (especially open-source models, where this was felt most acutely) had context windows of only thousands or perhaps tens of thousands of tokens, so you’d do all kinds of tricks to manage memory for SillyTavern or do multi-pass conversation management. Nowadays, a lot of tools (notably Claude) will do things like summarization and compaction to squeeze more juice out of their context–and you can of course do JIT injection of things like tool definitions and RAG results that don’t make it into the “official” context history but help during evaluation.
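Compaction is conceptually simple. Here’s a toy sketch–word counts stand in for token counts, and the `summarize` callable stands in for asking the model itself to summarize:

```python
def compact(messages, summarize, budget=8):
    """Replace the oldest half of an over-budget context with one summary.

    Word counts stand in for real token counts; `summarize` stands in
    for a call back into the model."""
    cost = sum(len(m.split()) for m in messages)
    if cost <= budget:
        return messages
    half = len(messages) // 2
    return ["[summary] " + summarize(messages[:half])] + messages[half:]

msgs = ["the user asked about tides", "we discussed the moon",
        "then gravity came up", "what causes spring tides?"]
print(compact(msgs, lambda old: "earlier: tides and the moon"))
# ['[summary] earlier: tides and the moon', 'then gravity came up', 'what causes spring tides?']
```

Real compaction is fancier (it keeps pinned messages, tool definitions, and so on), but the shape is the same: trade old detail for a cheap summary.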

The problem with this is that these systems end up with a bunch of weird special-case logic and hacky things bolted onto the runtime eval loop. They accumulate machinery (follow that Swival link) for dropping “unimportant” conversation blocks, summarizing content, and all sorts of other bookkeeping.

Another problem is that there’s no good way of keeping a dedicated scratchpad for thoughts/information–there’s no easily user-inspectable way to view what’s in there (short of skimming context), no way of sharing information with other agents or tools, and no convenient way of retrieving values from the scratchpad that an agent might not mess up (e.g., you want something like string interpolation directly prior to eval).

For context and reasoning, we want:

  • A way of doing easy bookkeeping on the context that doesn’t end up with lots of special-case work in the runtime
  • A scratchpad/blackboard that an agent can use (and that a human can inspect)
  • Easy interpolation of knowledge / RAG

Tool calling

The “invoke tooling as needed” bit is worth talking about, especially since I used to misunderstand how it actually works. If you think about how models are trained, it’s strict turn-taking. The raw conversations (mostly, basically, sorta) the LLMs see are:

[system]: You are an LLM that greets users and asks their name
[assistant]: Hi! I'm a friendly LLM. What's your name?
[user]: Emmanuel Goldstein.
[assistant]: It's nice to meet you Mr. Goldstein!

The LLM runtime is responsible for generating the [assistant] block, and signals it is finished (and that it’s the user’s turn) by emitting a “stop token”. This is a special token that says “Okay, I’m finished, nothing else to do here.” Failure to emit stop tokens is how you can get a model yodeling until it gets OOMed.
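A toy decode loop makes the stop-token mechanics concrete. The token strings and the `next_token` sampler here are invented for illustration:

```python
STOP_TOKEN = "<|endofturn|>"  # invented name; every model family has its own

def generate(next_token, context, max_tokens=512):
    """Sample tokens until the model emits its stop token (or we hit a cap)."""
    out = []
    for _ in range(max_tokens):
        tok = next_token(context + out)
        if tok == STOP_TOKEN:
            return out  # the model signaled "my turn is over"
        out.append(tok)
    return out  # no stop token emitted: the runtime had to cut the model off

# A canned "model" that stops itself after three tokens:
canned = iter(["Hi!", "I'm", "friendly.", STOP_TOKEN]).__next__
print(generate(lambda ctx: canned(), []))  # ['Hi!', "I'm", 'friendly.']
```

The `max_tokens` cap is the runtime defending itself against exactly the yodeling-until-OOM failure mode.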

As intrepid LLM engineers, we realize immediately that our LLM needs to be able to talk to the outside world (training models takes weeks or months, so if we want to answer useful questions like “What are the current headlines on the Associated Press?” or “What is the current stock price of RTX?” or even “Which secretaries of defense have been alcoholics?”, we need to be able to search Wikipedia, Twitter, and so on).

Now, the LLM matrices themselves can’t (to my knowledge) just call out to do things, and so instead the thing we have to work with is the stream of tokens out. For thinking, we do this by saying “okay, well, emit a start thinking token, keep emitting tokens that consider the sum of the context up to this point (including recent thinking tokens), and eventually emit a stop thinking token and then your actual response–the UI will hide the thinking from the user”. For tools, we need to actually transfer control flow out to the tooling.

So, in most LLMs, there is a stop token/reason that gets emitted–stopped due to finishing its output, stopped due to hitting limits, stopped due to guardrails, whatever. We add a “stopped due to tool invocation” and we emit that right after a tool usage request.

The runtime environment then picks that up and invokes (or attempts to invoke) the tool (or tools) and appends those results back as either a tool role (in the case of OpenAI and compatible models) or just as a user role.

[system]: You are an LLM that greets users and gives them the temperature where they live.
[assistant]: Hi! I'm a friendly LLM. Where do you live?
[user]: Airstrip one.
[assistant]: [INVOKE_TOOL "temperature_at" "airstrip one"]
[tool]: ["temperature_at" "success" "25 C"]
[assistant]: Your Minitru Weather report says that the temperature at Airstrip One is 25C.

(Now, before tool using was actually trained into these models, us amateurs would have them emit JSON blobs that our runtimes would pick up on and execute with, and basically accomplish the same thing. This technique still works.)
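That JSON-blob technique is easy to sketch: the runtime evaluates in a loop, watches for a tool-request blob, runs the tool, appends the result, and hands control back to the model. Everything here (the message shapes, the stub tool registry) is invented for illustration–real APIs return structured tool_call objects instead:

```python
import json

# Stub tool registry; a real one would do actual work.
TOOLS = {"temperature_at": lambda place: "25 C"}

def run_turn(call_model, messages):
    """Re-evaluate until the model answers without requesting a tool."""
    while True:
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            request = json.loads(reply)  # did the model emit a tool blob?
        except ValueError:
            return reply  # plain text: this is the final answer
        result = TOOLS[request["tool"]](request["arg"])
        # Hand control back to the model with the tool result appended.
        messages.append({"role": "tool", "content": result})

# A canned "model": first requests the tool, then answers using the result.
replies = iter(['{"tool": "temperature_at", "arg": "airstrip one"}',
                "It is 25 C at Airstrip One."])
print(run_turn(lambda msgs: next(replies), []))  # It is 25 C at Airstrip One.
```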

The thing that’s tricky about tool calling is that the model needs to be aware of what tools it has available and how to invoke them, and this can take up a lot of context tokens. It’d be neat to be able to figure out the tool invocation without permanently sacrificing context for a tool directory–being able to say “hey, here’s what I’m trying to do, what tool should I use and how should I use it?” would be grand. An architecture that supports that usage–temporary contexts that get thrown away after solving the tooling problem–fits the bill.

Another thing that’s tricky is that many of these harnesses don’t have a way of dynamically generating tooling at runtime. You launch the agent with the tools it’s gonna get, and that’s all it has. There’s no reason we actually have to do it this way. We especially don’t want to do things this way if we want agents to be able to cobble together their own tools.

Finally, the tool calls are all blocking (yes, yes, you in the back–you can fake async by having a tool that fires off jobs and mainly checks a message queue for results–you’re not clever) in most frameworks. There’s no easy way of saying “hey, run this command, and interrupt me later with its results”.
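The shape we actually want looks more like this asyncio sketch: kick off the command, keep thinking, and get interrupted by a completion callback. All the names here are invented:

```python
import asyncio

async def slow_tool(cmd):
    await asyncio.sleep(0)  # stands in for a long-running command
    return f"output of {cmd}"

async def agent():
    interrupts = []
    # "Run this command, and interrupt me later with its results":
    task = asyncio.create_task(slow_tool("make test"))
    task.add_done_callback(lambda t: interrupts.append(t.result()))
    thoughts = []
    for step in range(3):  # the agent keeps working in the meantime
        thoughts.append(f"thinking, step {step}")
        await asyncio.sleep(0)  # yield so the tool (and callback) can run
    return thoughts, interrupts

thoughts, interrupts = asyncio.run(agent())
print(interrupts)  # ['output of make test']
```

The agent never blocks on the tool; the result arrives while it’s mid-thought, which is exactly what most frameworks make hard.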

Summing this all up, then, we want tool calling that:

  • Does not take up lots of context with tool definitions
  • Allows for runtime modification of the available tools
  • Allows for async tool calls

A unifying architecture: Gizmo the gremlin

My proposed solution to these things is to use another layer of indirection and provide a simple, clean, core eval loop.

To solve the context management issues, we’re going to use a blackboard system (think KV store) and a context stack (where evaluation pops off the topmost prompt and optionally pushes on future work). To solve the tooling issues, we’re going to go async-only, we’re going to use messages for invocation, and we’re going to allow interrupts.

The final architecture for Gizmo looks like:

  1. Initialize Gizmo context stack and services
  2. While the stack is not empty…
    1. Concatenate the stack from the bottom up, prepend the system prompt, ignore the topmost frame (that will be our task).
    2. Take the topmost frame (task), pop it off, and interpolate variables
    3. Evaluate the resulting context.
    4. Push returned stack frames, if any, onto the stack (0 means task completed, 1 means task continues, N means task decomposed).
    5. Dispatch ops returned, if any.
    6. If not in grind mode, sleep until a message arrives (otherwise loop immediately).
  3. If not in terminate-at-exhaustion mode (a standing “Please reset the stack if you’ve completed all your work” instruction keeps the Gizmo runtime from turning itself off when it completes all its tasks without rescheduling a continuation frame)…
    1. Push the original stack frames back onto the stack.
    2. Goto 2.
  4. Cleanup.
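Minus grind mode, message waits, and the restart logic, the inner loop fits in a few lines of Python. Here `evaluate` (which would really call an LLM) and the op shapes are stand-ins:

```python
def run_gizmo(evaluate, stack, system_prompt, dispatch):
    """Minimal Gizmo eval loop: pop the task, evaluate it against the
    concatenated stack, push returned frames, dispatch returned ops."""
    while stack:
        task = stack.pop()                     # topmost frame is the task
        context = [system_prompt] + stack      # everything below it is context
        frames, ops = evaluate(context, task)  # variable interpolation elided
        stack.extend(frames)                   # 0 = done, 1 = continue, N = decomposed
        for op in ops:
            dispatch(op)                       # send / receive / spawn / trap

# A toy evaluator: "build" decomposes into two subtasks, each of which
# completes (returns no frames) and emits a single send op.
def evaluate(context, task):
    if task == "build":
        return ["test", "write code"], []
    return [], [("send", "log", f"did {task}")]

ops_log = []
run_gizmo(evaluate, ["build"], "You are Gizmo.", ops_log.append)
print(ops_log)  # [('send', 'log', 'did write code'), ('send', 'log', 'did test')]
```

Note the decomposition order: “write code” was pushed last, so it runs first–the stack naturally gives you depth-first task refinement.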

Now, you have this sort of weird little self-reducing agent running, giving itself new context and emitting op-codes.

“What are op-codes?” you ask–they’re our core operations, the things that allow higher-order thinking and actuation. There are only four:

  • Send sends a message to another actor.
  • Receive blocks execution until a message is received from another actor.
  • Spawn creates another agent with a given context stack–or a copy of ours.
  • Trap registers an interrupt handler for messages.

This all gives the agent an extremely simple interface to its hosting runtime. Everything is async by default, and the complexity of things like tool calling and tool extension is hidden behind the notion of messages–you don’t call tools, you message services.

Writing a value to the blackboard? Message the well-known blackboard service. Reading a value from the blackboard? Messages–or some syntactic sugar sigils (handled during interpolation). Need to be pumped periodically or woken up in a couple of days? Message the cron service to wake you up later.
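As a sketch, a blackboard write really is just a message to a well-known service. The JSON schema here is invented:

```python
import json

blackboard = {}  # the service's private state

def blackboard_service(msg):
    """A genserver-style handler: takes a JSON message, does simple things."""
    request = json.loads(msg)
    if request["op"] == "put":
        blackboard[request["key"]] = request["value"]
        return json.dumps({"ok": True})
    if request["op"] == "get":
        return json.dumps({"ok": True, "value": blackboard.get(request["key"])})
    return json.dumps({"ok": False, "error": "unknown op"})

# The agent never "calls a tool"; it just sends messages:
blackboard_service(json.dumps({"op": "put", "key": "plan", "value": "ship it"}))
reply = blackboard_service(json.dumps({"op": "get", "key": "plan"}))
print(reply)  # {"ok": true, "value": "ship it"}
```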

Messages are handled using a mailbox service, and every agent is given access to the top message in its queue at each eval cycle. The really neat thing, though, is that agents have no idea if the thing they’re messaging is a human (for example, the human service just logs to console or wherever) or a coded service (the blackboard service is a genserver that takes JSON and does very simple things based on it) or something else.

Because of this indirection, we can do things like code services that return mailbox IDs of spawned agents, and those agents (not running LLMs but running other code) can maintain internal state between calls. This makes it easy to do things like write a paged view of a file or a cursor-backed window of a database query.
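A sketch of such a stateful service agent–here a pager over a list of lines, with a closure standing in for a mailbox-addressed process:

```python
def spawn_pager(lines, page_size=2):
    """Spawn a 'service agent' that keeps a cursor between messages.

    In Gizmo you'd get back a mailbox ID to send to; a closure stands
    in for that here."""
    state = {"pos": 0}
    def handle(msg):
        if msg == "next":
            page = lines[state["pos"]:state["pos"] + page_size]
            state["pos"] += page_size
            return page
        return None
    return handle

pager = spawn_pager(["alpha", "beta", "gamma", "delta", "epsilon"])
print(pager("next"))  # ['alpha', 'beta']
print(pager("next"))  # ['gamma', 'delta']
```

The calling agent just sends “next” and never sees the cursor; the same pattern works for a cursor-backed window over a database query.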

We can do even weirder things, like creating a service that accepts code and spins up new services running that code. In this fashion, we make it trivial for agents to create their own tools at runtime and query the available ones using that same service. These tools can then be made available as services to other agents.

The runtime itself can also expose services for things like node migration, logging, timing, and so forth–so agents can manipulate their own runtime and eval harness.

Conclusion

I hope I’ve somewhat gotten you excited about a new agent architecture to build gremlins with; I’ve actually implemented several versions of this (with Claude’s help), and there’s a lot left to do. Next post in this series will probably go into further detail about tools and sigils for interpolation.