This is the part of the post I started to write in part 1 before getting side-tracked by context-setting.
I have a terminal-centered workflow and probably always will. At the end of the day we’re working with text, and the terminal has a zillion good tools for text. I cut my teeth on SunOS and it shows.
Bottom line, up front
This is just one way to do it.
Install opencode.
Get an account with z.ai and pay for the basic coding plan.
Clone or initialize a git repository. Run opencode. Everything assumes you run the command in the root of a repository.
Run /connect and select the Z.AI Coding Plan option. Enter your API key when prompted.
Hit tab to switch to Plan mode.
Type /models and select GLM-4.7 or GLM-5. You generally want a high-end model for planning.
Tell it what you want. A bite-size piece. “Set up a project using $language with $framework”. If you’ve managed folks fresh out of a code boot-camp, you can assume they know the basics of how to start a project in the thing they were trained on. This is like that. Except it embeds the knowledge of every basic tutorial and starting guide for everything written before 2025.
Look at the plan. Answer questions it has. Stop it with the escape key and say “Actually I want X” when it goes somewhere weird. Interrupt. Read the plan. Read the plan.
Look at the context-window fill indicator at the top of the opencode TUI. If you’re up at 70% you may well have overfilled it.
If so, tell it to save the plan to a file like TODO.md (make sure it does), start a new session with /new, and then, back in Plan mode, say “Start with the plan in TODO.md”. Hopefully your context is emptier now.
If the plan looks sensible, hit tab to switch back to Build mode. Type /models and select GLM-4.7-Flash, unless the task seems complex to you and will need reasoning decisions that depend on each other to succeed.
Say go or something affirmative. It will go.
Watch it a bit. It may go off the rails. Stop it and say “not like that, do X instead”.
Repeat. Start new sessions frequently, but know that a new session is a fresh-faced developer straight out of boot camp who’s eager to please but has never seen your project before. Save your intro to the project in AGENTS.md in short declarative sentences; telling the tool to “Add a line to AGENTS.md about where test files are located” is a reasonable instruction.
Let’s delve in and understand more deeply what’s going on.
Every code tool is a control loop. It’s not exactly a REPL, so it may cause a bit of cognitive dissonance for those of us who are used to that kind of dialogue with a computer.
Instead of read, evaluate, print result, loop, where the result is inline and we’re managing the system, this is side-effectful: we’re providing signals to a control loop that is fuzzily trying to approach the goal we give it.
Fundamentally LLMs give “completions” to the prompt. This is the aspect that people rightly call “fancy autocomplete”. When you open a tool like opencode, it prepares its system prompt. For opencode specifically, it has different modes (that it calls “agents”), so the prompt is different depending on whether you use plan mode or a session that started in plan and switched to build.
After the prompt, the tool sends the text describing what tools are available. This is an area of active research and design: the context window is precious, and making models with larger ones makes them more expensive to run. In early 2025, a long context window was 200,000 tokens. Now million-token context windows exist in the most expensive models. However: more context means slower responses, more cost, and hitting rate limits sooner. Remember that compute power is the contended resource right now, especially given the environmental concerns of data-center build-out. We want to fit as much useful work as we can into the context window of smaller, faster models. Right now, every tool you add to a system has to be described to the model. That adds up. I’ll get into that more below.
All in all, the base prompt may amount to 15,000 tokens. The AGENTS.md file, if you have one (and you probably should), adds however many more tokens. You can eat up a context window quickly. This is one of the fundamental difficulties of working with these tools: the context is precious, but it’s also much slower and more expensive to make multiple runs to load pieces into the context. Different models and tools take different tactics to manage the scarce resource.
Prompting
Different models were trained in different ways with respect to meta-tokens with system meaning, so how you write a prompt with absolute admonitions and precise meanings varies by model. Opencode uses this prompt for Gemini and this one for Copilot with GPT-5. There’s none for Z.AI’s coding-plan GLM-4.7 or GLM-5, so we get the raw model’s output; if we wanted to instruct it more like the others, we’d have to write that ourselves. I get great output from GLM-4.7 without doing so, but I may in fact be able to hone it a bit if I did. Other models like Mistral and Llama have somewhat different tokens for system concerns. Having a tool that doesn’t feed your model actively hostile prompts is a boon, and one of the reasons I use opencode. It’s also a reason that Claude Code and the Claude models are well matched: being produced by the same company means they’re aligned with each other and going to produce better output by default. I expect Codex and GPT-5 are similarly well aligned, though I’ve not tried them. If someone wanted to use a Llama-derived coding model, they might well need to rewrite some system prompting to work better on that model. This is the murky statistical and stochastic work that a lot of AI companies are doing to tune things, and the subject of much AI research. How do we get these probabilistic machines we’ve built to give stronger guarantees?
Something you might note from the plan prompt for opencode is that the only thing keeping the model from pulling a sudo rm -rf / is that nobody asked for it, and the system prompt said not to. There is no actual restriction preventing it except that sudo may ask for a password.
Let me underscore this. Most code tools have no actual guard against touching the wrong files. If they ask permission, that is a courtesy: the model requested an action and the tool chose to ask you first.
Security, and the complete lack thereof
Models can and will find clever workarounds while seeking the goal you provide. If it hits a bump — which may be imagined: generating the wrong URL, getting a 404, deciding the thing doesn’t exist, and therefore concluding it needs to do it with a generated shell script is not unexpected behavior here — it may well treat the rules as obstacles to work around. When AI researchers talk about “alignment”, this is a small part of what they mean. (What they mean tends to be slippery and context-dependent, and also often marketing, but alignment research is at least nominally about how to get models to behave well.)
Training for capability and training for rule-following are somewhat antithetical to each other. There is no single correct balance. This is one of the things that makes these tools dangerous at many levels.
Under no circumstance are these tools truly safe. Tools for managing risk must be an everyday part of your thought process and toolkit.
Let me emphasize that differently:
It takes an immense effort to turn these systems into constrained, safe, and predictable systems, and that is in tension with their usefulness. Risk management rather than risk elimination is almost certainly the mindset to take, but that transforms safety from a nice set-once-and-forget property of the system into an ongoing concern.
Thinking
Modern code and reasoning models internally emit self-antagonistic streams of tokens, which act as a sort of threshold gating for making decisions: if a model “thinks” you want one thing, another part of the model may “think” about what might be wrong with that, and it will go back and forth a bit until something internally stops it (there are thresholds for how much thinking to do, and running that process is expensive), then continue with the main stream of output. The back-and-forth looks a lot like what some people report their internal narrative to be, though it’s not actually reasoning with mathematical logic like we expect computers to do. It’s doing analogical reasoning: this looks like what comes next. This goes remarkably far: it turns out that if you’re trained on a lot of logical processes, the system can analogize to them and get pretty far, in new and surprising ways. Frontier models manage to do ground-breaking math. Anything with a hard “this is correct” test tends to be something you can throw more and more self-antagonistic model “thinking” at and get decent results. The reasoning breaks down, but with a test for “that’s not right” it can go back and try again until it gets it. This is very powerful and one of the fundamental tricks of what’s going on. For something doing sloppy, analogical reasoning, just doing a superhuman amount of it looks like real logic. Combine that with the actual machine logic of testing an assertion, and you get a very powerful if unpredictable tool.
Feedback & Cybernetics
Here we get to the core of the system: at its heart, building software with these systems is operating a control loop. Your instructions, the prompts, checks inserted by tools, the additional information surfaced by tools, and your corrections and interruptions form a regulatory feedback loop. The models are trained to seek the prompted outcome as a goal. Everything else modulates that with positive and negative feedback, and then the loop continues.
The loop is essentially “while not done, try to make progress toward the goal”.
Sometimes the loop has to be stopped for input - the model can ask a question or assert that it’s gotten to something near enough the goal to be “done” in some way. Sometimes the loop is stopped because of a system failure. In a “multi-agent” set-up, the loop may be stopped by a superior monitoring “agent” and corrected and started again, or stopped entirely and abandoned.
Fundamentally we are part of the loop unless we decide to set up a system to delegate that. This is what cybernetics actually means: control loops. Sometimes as simple as a thermostat (“it’s too cold, turn on the heat. Loop.” “Now it’s warm enough, turn it off. Loop.”), sometimes vastly complex, as in networks of sensors and actuators. You can look at the economy as a cybernetic system too, with money and prices being the information and influences flowing around and altering the “loop” of “how can we make more money today?” and “how can we serve human needs today with the resources we have?”
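The thermostat version fits in a few lines of Python; a toy sketch, with read_temperature and set_heater as hypothetical stand-ins for a real sensor and heater:

```python
import time

TARGET = 20.0      # degrees C we want to hold
HYSTERESIS = 0.5   # dead band so the heater doesn't flap on and off


def thermostat_loop(read_temperature, set_heater):
    """The simplest control loop: sense, compare to the goal, act, repeat."""
    while True:
        temp = read_temperature()      # sense
        if temp < TARGET - HYSTERESIS:
            set_heater(True)           # too cold: turn on the heat
        elif temp > TARGET + HYSTERESIS:
            set_heater(False)          # warm enough: turn it off
        time.sleep(60)                 # loop
```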
We have started a second golden age of cybernetics. Hopefully we can keep it from being a paperclip-style money-maximizing machine at the expense of everything else.
MCP servers
Almost all code tools support “MCP servers”: they are roughly an API meant for LLM consumption rather than human or classical program consumption. Every tool offered by an MCP server usually gets added to the context, in a way resembling “Use the frob tool when you need to frob a file. It expects the file name to frobnicate, and optionally the date”. This is roughly equivalent to an API offering “call frob(file, [date]) to frobnicate a file”. In the world of LLMs, though, inputs and outputs can both be a bit fuzzy and still work, depending on the context. Models call MCP servers by emitting special tokens for calling tools, with the arguments inside.
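As a rough sketch of what one of those tool descriptions amounts to by the time it reaches the model (frob is the made-up example from above; the exact wire format varies by tool and provider):

```python
# A rough sketch of an MCP-style tool description as the model sees it:
# a name, a prose description, and an argument schema. "frob" is the
# made-up example from above, not a real server.
frob_tool = {
    "name": "frob",
    "description": "Use the frob tool when you need to frob a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "file": {"type": "string", "description": "The file to frobnicate."},
            "date": {"type": "string", "description": "Optional date."},
        },
        "required": ["file"],
    },
}

# The model "calls" it by emitting special tokens that the tool decodes into
# something like this, runs, and feeds the result back into the context:
tool_call = {"name": "frob", "arguments": {"file": "notes.txt"}}
```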
Tools
Most code tools also describe their internal tools in much the same way MCP servers do: “read file”, “bash”, and “edit file” tools are the heart of a code tool. A basic code tool is those three features bolted to a while loop and an input text box. They can be very simple in essence. The complexity is largely an emergent property of a very simple loop with complex and rich variation.
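Here’s a minimal sketch of that shape in Python. It is not any particular tool’s code; llm_complete stands in for whatever model API you’re using, and real tools add permission prompts, output limits, and a lot of polish on top of this skeleton.

```python
import subprocess
from pathlib import Path


def read_file(args):
    return Path(args["path"]).read_text()


def write_file(args):
    Path(args["path"]).write_text(args["content"])
    return "wrote " + args["path"]


def bash(args):
    proc = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr


# The three core tools. Real tools add permission checks and output limits.
TOOLS = {"read_file": read_file, "write_file": write_file, "bash": bash}


def agent_loop(llm_complete, system_prompt, user_goal):
    """While not done, ask the model what to do next, do it, and loop."""
    transcript = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_goal},
    ]
    while True:
        reply = llm_complete(transcript)     # one completion per turn
        transcript.append(reply)
        if not reply.get("tool_calls"):      # no tool requested: treat as done
            return reply["content"]
        for call in reply["tool_calls"]:     # run each requested tool
            result = TOOLS[call["name"]](call["arguments"])
            transcript.append({"role": "tool", "content": result})
```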
Skills
Code tools are starting to support a concept called “Skills”: markdown documents describing how, and how not, to do something. Each document carries enough context to know when it should be read, so the short descriptions of skills end up in the context, but the specific instructions aren’t read until the skill is needed. It’s a way of breaking a set of prompts into pieces so they don’t overwhelm the context window.
Code Mode
In addition to directly emitting tool-call tokens, it’s increasingly popular to have the LLM emit a script and use that: models are good at writing code, but calling a tool, reading the response into context, then writing it back out into another tool is slow and expensive. People are building systems that instead tell the LLM to just write a script, passing outputs of one tool into the inputs of another. There are tools like port of context and mcporter that expose MCP servers as callable CLIs or APIs, giving a more classical programming interface to the new world of MCP servers and tool definitions. It’s a lot more context-efficient in a lot of cases, so it’s a worthy technique.
It’s called “code mode”, but it’s not a mode in the sense of separate and mutually exclusive states; it’s mode as in “way of doing things”. It is a mode of operation, not a setting you set.
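A toy sketch of the difference, with search_issues and summarize as hypothetical stand-ins for tools exposed as callables: in the tool-call style every intermediate result round-trips through the model’s context, while in code mode the model writes one script and only the final output comes back.

```python
# Hypothetical stand-ins for tools an MCP server might expose as callables.
def search_issues(query):
    return [{"title": "flaky test in CI", "state": "open"},
            {"title": "old report", "state": "closed"}]


def summarize(items):
    return f"{len(items)} open issue(s): " + "; ".join(i["title"] for i in items)


# Tool-call style would round-trip each result through the model's context:
#   model -> search_issues(...) -> results into context -> model -> summarize(...)
# Code mode: the model emits one script like this, chaining tools directly,
# and only the final printed summary goes back into the context.
issues = [i for i in search_issues("flaky test") if i["state"] == "open"]
print(summarize(issues))
```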
Context compaction
Tool calls end up with their inputs and outputs in the context window. Claude Code, Codex, and some of the more mainstream AI-company tools try to manage the context for you, and so will “compact” if the window gets full: using the LLM to summarize the session so far and replacing the whole transcript with the summary so it fits in the window. It’s a useful technique, and controlling it and doing it well is a place future research will bear fruit.
For opencode, there’s a /compact command that does this summarization, and you can invoke it when the context window is getting full.
Compaction erases the specifics of what was going on, so you want to avoid interrupting a reasoning process that’s going well, where the details matter to future actions, but it is sometimes unavoidable.
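A minimal sketch of what compaction does underneath, assuming an llm_summarize function wrapping whatever model you use; real implementations are more careful about what they keep verbatim:

```python
def compact(transcript, llm_summarize, keep_last=4):
    """Replace the bulk of a transcript with an LLM-written summary.

    Keeps the system prompt and the last few messages verbatim; everything in
    between collapses into one summary message. Details are lost, which is
    exactly the trade-off described above.
    """
    head, middle, tail = transcript[:1], transcript[1:-keep_last], transcript[-keep_last:]
    if not middle:
        return transcript
    summary = llm_summarize(
        "Summarize this coding session so work can continue from the summary:\n"
        + "\n".join(m["content"] for m in middle)
    )
    return head + [{"role": "user", "content": "Summary of earlier work: " + summary}] + tail
```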
There is a dynamic context pruning plugin for opencode that’s quite good: it provides a tool and some manual slash-commands to prune useless information out of the context, so both you as the operator and the model can decide to call it and prune things; in particular it removes file-write outputs from the history when there’s a read of that file later. There’s no point in that clutter being in the context unless the tool has a reason to be aware of its own process in that kind of detail. That is basically never true. The plugin author has future ambitions to make it smarter and more capable.
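The core idea is simple enough to sketch. This is my rough illustration of the behavior, not the plugin’s actual code:

```python
def prune_stale_writes(transcript):
    """Drop a file-write's output once a later read of the same file is in context.

    A rough illustration of the pruning idea; a real implementation would
    likely blank the output rather than remove the whole message.
    """
    files_read_later = set()
    kept = []
    for msg in reversed(transcript):            # walk backwards so "later" is known
        if msg.get("tool") == "write" and msg.get("path") in files_read_later:
            continue                            # a later read supersedes this write output
        if msg.get("tool") == "read":
            files_read_later.add(msg["path"])
        kept.append(msg)
    return list(reversed(kept))
```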
A warning, however: some providers (Anthropic and OpenAI both, among others) cache prompts, and charge a lot less for tokens that have been processed before. The cache works on prefixes, so instead of paying for the whole chat each time a piece is added, you pay full price only for the added tokens. This means that compaction, by changing earlier parts of the chat, breaks the cache at the point of its first edit. There is a trade-off here, especially if you’re using pay-as-you-go pricing.
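A back-of-the-envelope sketch of that trade-off, with made-up prices; real rates vary by provider and model, and compaction also shrinks the context, but the point is what breaking the prefix costs:

```python
# Hypothetical prices, purely for illustration: real rates vary by provider.
PRICE_PER_TOKEN = 3.00 / 1_000_000     # fresh input tokens
CACHED_PER_TOKEN = 0.30 / 1_000_000    # tokens the provider has already seen as a prefix

context = 80_000   # tokens already in the conversation
new_turn = 2_000   # tokens added by the next message

# With the prefix cache intact, the old context is billed at the cheap rate:
with_cache = context * CACHED_PER_TOKEN + new_turn * PRICE_PER_TOKEN
# If earlier messages change, the prefix no longer matches, and a context of
# the same size is billed at the full rate again from the first edit onward:
after_break = (context + new_turn) * PRICE_PER_TOKEN

print(f"cached prefix: ${with_cache:.3f}  broken prefix: ${after_break:.3f}")
```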
Meta-tooling & Sub-agents
On the scale of code-tool use, from “pasting snippets of code into chat” to “hands-off orchestration of complete multi-day workflows”, what I’ve described so far is somewhere closer to the first. So far I’ve suggested single-stream chat loops making edits to files.
Tools like opencode do a little bit more: they are configured with some hidden alternate prompts, and the top-level chat will actually call these “sub-agents” like they’re tools. With a fresh context, a sub-agent does some more specific task faster, cheaper, and better, because it’s not trying to stay consistent with the greater chat context, just its own instructions. This is the tip of a very large iceberg.
To do large-scale work, where the context would be used up by the history or the scope of the plan, and given the tendency of models to go off the rails, people add more layers of control systems on top: have one model monitor another, feeding it commands and instructions or admonishing it when it goes off the rails. Just as with the internal “thinking”, people have applied antagonistic counter-processes in a feedback loop to get bigger tasks done, at the cost of burning a lot of compute time. The most reckless of these might be Gastown, but the general pattern is sound, even if a giant heap of vibe-coded slop doing it with negative safeties, built by someone caught up in the addictive power of controlling this arcane machine, maybe isn’t a great example.
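The monitoring pattern is just another loop wrapped around the first one. A sketch of the shape, assuming run_worker and review functions that each wrap a model session:

```python
def supervised_run(run_worker, review, goal, max_rounds=5):
    """One control loop wrapped around another: a reviewer checks the worker's
    output against the goal and either accepts it, sends it back with
    corrections, or gives up. A sketch of the pattern, not any particular tool."""
    feedback = None
    for _ in range(max_rounds):
        result = run_worker(goal, feedback)   # fresh worker session each round
        verdict = review(goal, result)        # e.g. {"ok": True} or {"ok": False, "notes": "..."}
        if verdict["ok"]:
            return result
        feedback = verdict["notes"]           # admonish and try again
    raise RuntimeError("abandoned: worker never satisfied the reviewer")
```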
Opencode’s simple sub-agents are a good start. Some people have tools to open up a bunch of tmux windows with separate instances of opencode; some people wire up plugins or hooks to Claude Code to whack Claude and tell it to keep going when it doesn’t create the tests you asked it to. (Claude and lots of other models will mirror the human training data: 100% coverage is hard, and what if we just stopped? A small tweak to the goal makes for a bunch less effort…) Since these systems necessarily have to treat the human prompt input as fuzzy, not exact, this makes some sense. It’s not good behavior, but it is understandable.
The more hands-off the person is on any of the control loops, the further things can drift from their expectations, burning tokens, time, and credibility. They won’t necessarily, but this is a system that can keep running, adapting to situations pretty readily in pursuit of its goal. We have not figured out all the patterns for controls.
Languages that bring success
Not all languages are equally friendly for LLM use. In general, the more definite a language is about what’s right and wrong, the easier it is for the system to learn. In addition, what was popular in the training data, both patterns and languages, will be more readily replicated than things that are novel since the model was trained, or obscure.
Languages that allow monkey-patching may end up with the LLM taking some dangerous shortcuts, altering the runtime to suit a task but breaking it for others. Python is very popular in the training data so it’s generally pretty okay, but even so, the tools have no problem coming up with bad ideas for how to modify Python inline to make a task easier as they seek their goal.
Languages with strong type systems, even unsound ones like TypeScript’s, will tend to do very well: TypeScript, Rust, Java, and C#. The patterns are strong and definite and the tools will struggle with them less. Haskell works well, too.
Dynamic languages and obscure ones will present many more troubles: I fully expect the LLMs to trip over Perl, Smalltalk, and Lisp quite a lot.
Go is pretty phenomenal: there’s so much of it out there, and it’s all quite uniform and not complex code. Its complexity matches what LLMs are good at without deep thought, so they can truly churn out some mind-bending amounts of code, for better and worse.
Build Guard-rails
Add more tests. Get the LLM to write tests, and to suggest tests to write. The effort required for testing is entirely different now: we still have to read the tests and think about them just as much, but that’s most of what we have to do. The implementation can be mostly mechanical, and most of the review is verifying that the tests actually assert what they claim to test.
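That review matters because generated tests can look thorough while asserting almost nothing. A small illustration of the kind of thing to catch, with parse_price as a hypothetical function under test:

```python
def parse_price(text):
    """Hypothetical function under test: '$1,234.50' -> 1234.50"""
    return float(text.replace("$", "").replace(",", ""))


def test_parse_price_vacuous():
    # Looks like a test and runs green, but only checks that nothing blew up.
    result = parse_price("$1,234.50")
    assert result is not None


def test_parse_price_real():
    # Actually pins the behavior we care about; this is what review should demand.
    assert parse_price("$1,234.50") == 1234.50
    assert parse_price("0") == 0.0
```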
Add linters if you want the code to look a certain way beyond parsing or compiling.
Add guidelines to follow. Add explicit demands of the LLM in AGENTS.md or similar documents and demand that it follow them. Use exceedingly strong language to do so. You could do worse than RFC 2119 / 8174 / BCP 14 keywords.
These are not absolute gates, but they absolutely help regulate the control loop to stay in bounds. When something screws up, add more instruction to skills or AGENTS.md. Have the LLM write code that enforces it in the future.
Eyes to the future
I’m going to write later about some philosophies I have toward these tools, but for now I hope this was a more practical essay that lets you operate a code tool with an LLM with some understanding of how they actually work, rather than treating them as a fickle oracle. They are tools and can be understood, and the usual ways of coming to understand a tool still work, except for the sometimes inhuman scales that we can now trivially create. It’s entirely possible to overwhelm our senses and sensibilities, and that will lead us to some confusing and dangerous places.
Take care.