An interesting part of working with LLMs is that you get to see a lot of people trying to work with them, inside companies both small and large, and falling prey to entirely new sets of problems. It turns out using them well isn't just a matter of know-how or even interest; it requires unlearning some tough lessons. So I figured I'd jot down a few observations. Here we go, starting with the hardest one, which is:
Perfect verifiability doesn’t exist
LLMs are inherently probabilistic. No matter how much you might want it, there is no perfect verifiability of what they produce. Instead, what's needed is to find ways to deal with the fact that occasionally they will get things wrong.
This is unlike the code we're used to running. That's also part of why using an LLM can be so cool: it can do things ordinary software can't. But the cost of it being able to read and understand badly phrased natural-language questions is that it's also liable to go off the rails occasionally.
This is true whether you're asking an LLM to answer questions from context, as in RAG, or asking it to write Python, or asking it to use tools. It doesn't matter: perfect verifiability doesn't exist. This means you have to add evaluation frameworks and human-in-the-loop processes, design for graceful failure, use LLMs for probabilistic guidance rather than deterministic answers, or do all of the above, and hope they catch most of what you care about while knowing things will still slip through.
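To make that concrete, here is a minimal sketch of what "add an evaluation framework" can look like in practice. The `call_llm` stub, the example cases, and the pass threshold are all hypothetical placeholders, not a recommended design.

```python
# Minimal evaluation-harness sketch (illustrative, not a prescription).
# `call_llm` stands in for whatever model client you actually use.

def call_llm(prompt: str) -> str:
    # Replace with a real API call; this stub just echoes so the sketch runs.
    return f"stub answer to: {prompt}"

EVAL_CASES = [
    # Each case pairs a prompt with a crude pass/fail check.
    {"prompt": "What is the refund window in the policy?", "must_contain": "30 days"},
    {"prompt": "Summarise ticket #123 in one sentence.", "must_contain": "refund"},
]

def run_evals(threshold: float = 0.9) -> bool:
    passed = sum(
        case["must_contain"].lower() in call_llm(case["prompt"]).lower()
        for case in EVAL_CASES
    )
    pass_rate = passed / len(EVAL_CASES)
    print(f"pass rate: {pass_rate:.0%}")
    # There is no 100%: you pick a failure rate you can live with and route
    # the rest to humans or to graceful-failure paths.
    return pass_rate >= threshold

if __name__ == "__main__":
    run_evals()
```

The point is less the code than the habit: you measure a pass rate, accept that it will never be 100%, and decide in advance what to do with the residue.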
There is a Pareto frontier
Now, you can mitigate problems by adding more LLMs into the equation. This, however, also introduces increased hallucination and new forms of error from LLMs chinese-whispering to each other. This isn't new: Shannon said "Information is the resolution of uncertainty." Shannon's feedback channels and the fuzzy-logic controllers in 1980s washing machines accepted uncertainty and wrapped control loops around it.
They look like software, but they act like people. And just like people, you can't just hire someone and pop them on a seat; you have to train them, and create systems around them to make the outputs verifiable.
Which means there's a Pareto frontier between the number of LLM calls you'll need to make for verification and the error rate each LLM introduces. Practically, this has to be learnt, usually painfully, for the task at hand, because LLMs are not equally good at every task, or even equally good at tasks that seem arbitrarily similar to us humans.
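As a back-of-the-envelope illustration of that frontier (every number below is made up, and real errors are neither independent nor constant), consider how residual error shrinks as you add imperfect verification passes:

```python
# Back-of-the-envelope sketch of the verification trade-off.
# Assumes a fixed base error rate and a verifier that independently catches a
# fixed fraction of remaining errors; real systems violate both assumptions,
# which is exactly why the frontier has to be learnt empirically per task.

def residual_error(base_error: float, catch_rate: float, passes: int) -> float:
    """Error rate left after `passes` rounds of imperfect verification."""
    error = base_error
    for _ in range(passes):
        error *= (1 - catch_rate)  # each pass catches some of the remaining errors
    return error

if __name__ == "__main__":
    base_error, catch_rate = 0.10, 0.6  # made-up numbers
    for passes in range(4):
        calls = 1 + passes  # one generation call plus `passes` verifier calls
        print(f"calls={calls}  residual error={residual_error(base_error, catch_rate, passes):.3%}")
    # Each extra pass costs another call but buys a shrinking reduction in error:
    # that diminishing return is the Pareto frontier in miniature.
```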
This creates an asymmetric trust problem, especially since you can’t verify everything. What it needs is a new way to think about “how should we accomplish [X] goal” rather than “how can we automate [X] process”.
Which means, annoyingly:
There is no substitute for trial and error
Unlike with traditional software, there is no way to get better at using AI than by using AI. There is no perfect piece of software that will solve your problems without you engaging with it. The reason this feels a bit alien is that while it was also somewhat true for B2B SaaS, the people who had to "reconfigure" themselves were often technologists, and while they grumbled, that was seen as the price of doing business. Now it isn't just technologists: it's product managers, designers, even end users who need to adapt their expectations and workflows.
My friend Matt Clifford says there are no AI-shaped holes in the world. What this means is that there are no problems where simply "slot in AI" is the answer. You'd have to rejigger the way the entire organisation works. That's hard. That's the kind of thing that makes middle managers sweat and break out in hives.
This, by the way, is partly the reason why even though every SaaS company in the world has "added AI", none of them have "won". The success of this technology comes when people start using it and build solutions around its unique strengths and weaknesses.
Which also means:
There is limited predictability of development
It's really hard, if not impossible, to predict clearly what will work and when. Getting to 80% reliability is easy. 90% is hard but possible, and beyond that it's a crapshoot, depending on what you're doing, whether you have data, systems around to check the work, technical and managerial setups to help correct errors, and more.
Traditionally with software you could kind of make plans, even though development was notorious for being unpredictable. Now add in the fact that training the LLMs themselves is an unreliable process. The data mix, the methods used, the sequence of those methods, the scaffolding you put around the LLMs you trained, the way you prompt: they all directly affect whether you'll be successful.
Note what this means for anyone working in management. Senior management, of course, will be more comfortable taking this leap. Junior folks would love the opportunity to try playing with the newest tech. For everyone else, this requires a leap of faith: to keep trying to develop things until they work. If your job requires you to convince the people below you to use something and the people above you that it will work perfectly, you're in trouble. You can't predict or plan here, not easily.
Therefore:
You can’t build for the future
This also means that building future-proof tech is almost impossible. Yes, some or much of your code will be obsolete in a few months. Yes, new models might incorporate some of the functionality you created, and some of them will break existing functionality. It's a constant Red Queen's race: interfaces ossify, models churn.
This means you also can't plan for multiple quarters. That will go the way of agile or scrum or whatever you want to use. If you're not ready to ship a version quickly, and by quickly I mean in weeks, nothing will happen for months. An extraordinary amount of work is going to be needed to manage context, make the system more reliable, and add all manner of compliance and governance.
And even with all of that, whether your super-secret proprietary data is useful or not is really hard to tell. The best way to tell is actually just to try.
Mostly, the way to make sure you have the skills and the people to jump into a longer-term project is to build many things. Repeatedly. Until those who would build it have enough muscle memory to be able to do more complicated projects.
And:
If it works, your economics will change dramatically
And if you do all the above, the economics of your LLM deployment will look dramatically different from those of traditional software. The costs are backloaded.
Bill Gates said: “The wonderful thing about information-technology products is that they’re scale-economic: once you’ve done all that R & D, your ability to get them out to additional users is very, very inexpensive. Software is slightly better in this regard because it’s almost zero marginal cost.”
This means that a lot of what one might consider below-the-line cost becomes above the line. Unlike what Bill Gates said about the business of software, success here will strain profit margins, especially as Jevons paradox increases demand and growing competition squeezes inference margins.
Pricing has to shift from seat-based to usage-based, since that's also how the costs stack up. But, for instance, the reliability threshold becomes a death knell if it drives user churn. Overshoot the capacity plan and you eat idle silicon depreciation. Model performance gains therefore have real-options value: better weights let you defer capex or capture more traffic without rewriting the stack.
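A toy unit-economics calculation shows why; every number here is an assumption for illustration, not a benchmark:

```python
# Toy unit economics for a usage-priced LLM feature.
# Every number is an assumption for illustration; plug in your own.

PRICE_PER_REQUEST = 0.02        # what you charge the user per request
TOKENS_PER_REQUEST = 3_000      # prompt + completion tokens
COST_PER_1K_TOKENS = 0.004      # blended inference cost
RETRY_RATE = 0.25               # extra calls for verification and retries

inference_cost = (TOKENS_PER_REQUEST / 1_000) * COST_PER_1K_TOKENS * (1 + RETRY_RATE)
gross_margin = (PRICE_PER_REQUEST - inference_cost) / PRICE_PER_REQUEST

print(f"inference cost per request: ${inference_cost:.4f}")
print(f"gross margin: {gross_margin:.0%}")
# Unlike zero-marginal-cost software, every extra user brings a metered bill,
# and cheaper or better models move this margin directly.
```

With these made-up numbers the margin comes out around 25%, a long way from the near-zero marginal cost Gates was describing, which is the whole point.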
Software eating the world was predicated on zero marginal cost. Cognition eating software brings back a metered bill. The firms that thrive will treat compute as COGS, UX as moat, and rapid iteration as life support. Everyone else will discover that “AI shaped holes” can also be money pits: expensive, probabilistic, and mercilessly competitive.