Distilled knowledge is the key to progress

"All our knowledge falls with the bounds of experience", Kant

Sep 28, 2022

Here’s something you might not know. The Empire State Building was completed using 7 million man-hours, built over a year and 45 days. That’s roughly 1.4 million man-days. 20,000 man-hours were spent on creating the replicated lobby ceiling (which is like a couple of days’ work looked at this way, but still)! Five workers died in that year, but they built the first 100+ storey skyscraper the world had ever seen, towering over the New York skyline.

Compare the Shard, the tallest building in the UK. It took 8 years for approvals, 3 years to complete, and at the highest point had 1450 people working on it. Assuming this was double the average daily number of people working on it, that’s c.750k man-days.

In almost a century of advancement, we managed to cut the work needed to build a state of the art skyscraper in a modern Western city by half. The Empire State Building cost around $500m of today’s money to construct, and the Shard cost around £435m.

However, with an average wage inflation of 4.7% a year, $1 spent in 1931 is equivalent to $41 in 2012. Which means, just from this simple calculation (half the labour at roughly 41 times the wage), before we estimate anything else for materials or complexity, we should imagine skyscrapers are 20x more expensive to build today.
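The compounding behind that figure can be checked in a few lines. This is a minimal sketch: the 4.7% rate and the 1931 and 2012 endpoints come from the text above, while the function name `compound_growth` and the reading of "20x" as half the labour at the inflated wage are assumptions made for illustration.

```python
def compound_growth(rate: float, years: int) -> float:
    """Return the multiple by which a wage grows at a fixed annual rate."""
    return (1 + rate) ** years


# Figures from the text: 4.7% average wage inflation, 1931 to 2012.
years = 2012 - 1931
multiple = compound_growth(0.047, years)
print(f"$1 in 1931 is roughly ${multiple:.0f} in 2012")

# Assumption: the labour needed has halved (per the earlier comparison),
# so the labour bill is about half of the wage multiple.
labour_cost_multiple = multiple * 0.5
print(f"Implied labour cost multiple: about {labour_cost_multiple:.0f}x")
```

Small changes in the assumed rate move the result a lot over 81 years, so the output is best read as an order of magnitude rather than a precise figure.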

Some of this has nothing to do with construction, and everything to do with various bureaucratic delays (and a minor global financial crisis). For instance, planning permission. In fact the Shard’s name came from a complaint by English Heritage, which argued the new building would, in its words:

tear through historic London like a shard of glass

One lesson from this could be that we did things way better back in the day. There are plenty of examples on Patrick Collison’s website. We have now, through a combination of an overactive vetocracy and plenty of overengineering at every level, managed to substantially increase the time required.

But then, this is not the whole story.

The Empire State Building had several retrofits. When it was built it didn’t have air conditioning, and it was almost 20 years before it got it. Instead it had big windows for ventilation. It didn’t have electricity throughout the building. The elevators were manually operated. The materials it was built with, like heavy masonry and plaster, were fireproof so the building didn’t have sprinklers or pressurised stairways. Not to mention it wasn’t particularly energy efficient or good at dealing with particulates in the air. The last of these, at least, was tackled by a $165 million redevelopment in 2019 to make the building green, which reduced energy usage by about 15 percent.

And for each of these things, the retrofits and improvements which have made the building what it is today, there were many more man-hours added to the original billing. Like for like, could it have doubled? Marx might not have had entirely the wrong idea with the labour theory of value after all.

This too isn’t the whole story.

The smaller labour footprint on site is partly due to the more sophisticated materials used, which means a large proportion of the man-hours is hidden in making those materials.

If we look back at the major construction projects from long ago, this becomes clearer. Cathedral construction in the Middle Ages often took several decades, when it wasn’t stalled for centuries due to lack of funds. Everything was manual: quarrying marble and limestone by hand, transporting it through muscle power, getting timber, carving and hauling up blocks, and so on.

The Amiens cathedral took a couple of decades to construct, lightning fast by the standards of the medieval period. And it was the largest Gothic cathedral built in France at the time.

Many of the skilled workers relied on other trades to keep them at work. A master blacksmith made all the metal tools required while skilled carpenters made the wooden handles for these tools. The number of tools required for such a task as building a cathedral was remarkably small:
Pickaxe and axe
Brace and bit
Hammer
Sledgehammer
Chisel
Auger
Saw
Mathematical dividers
Plane
Squares and templates

On average, the construction site of a large church would have between 50 and 100 skilled craftsmen working on it. And in the absence of premade materials there was no choice but to get the very stone that’s needed cut in a particular way.

While foundations were being laid, skilled craftsmen worked in quarries and produced blocks of stone that would be used in the building process. It would not be unusual for as many as fifty advanced skilled apprentices to work in a quarry along with 250 labourers.

Assuming more labourers as well, over two decades of working 6 days a week, that’s 1.44 million man-days. Same as the Empire State Building [1].

Time it seems is a flat circle in massive construction projects.
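The man-day figures above can be sanity-checked with simple multiplication. This is a rough sketch using only numbers quoted in the text; the 365-day working year for the Shard and the helper name `man_days` are assumptions for illustration, and the cathedral workforce is back-solved from the 1.44 million total rather than taken from any source.

```python
def man_days(avg_workers: float, working_days: float) -> float:
    """Total labour in man-days for a steady average workforce."""
    return avg_workers * working_days


# The Shard: peak of 1,450 workers, taken (per the text) as double the average.
# Assumption: work ran every day for 3 years; a 6-day week would lower the total.
shard = man_days(1450 / 2, 3 * 365)

# Medieval cathedral: two decades at 6 days a week, as in the text.
cathedral_days = 20 * 52 * 6
# Back-solve the average workforce implied by the 1.44 million man-day figure.
implied_workers = 1_440_000 / cathedral_days

print(f"Shard: about {shard:,.0f} man-days")
print(f"Cathedral working days: {cathedral_days:,}")
print(f"Implied average cathedral workforce: about {implied_workers:.0f}")
```

With a full 365-day year the Shard comes out a little above the c.750k quoted, so the original estimate presumably assumed fewer working days.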

When we think of the levels of innovation we have today, one way to think about productivity is as distilled man-hours being a part of what you purchase. If you create a building today, you’re not just using labour to build it, but the distilled labour of all the work that’s gone into creating the materials and knowhow for the present-day labour to use.

This is distinct from Adam Smith’s use, which implied that the value of a good was related to the “toil and trouble of acquiring it”, i.e. the labour it saves its owner or the quantity of labour it lets him purchase. And it’s different too to Karl Marx’s point of view, that “socially necessary labour” is the core of what a good is worth. It’s also different to David Ricardo’s idea of it as relating to “real cost”, or the value of a good being proportional to the quantity of labour used in its production.

It’s different because thinking this way includes not just the labour of creating a good or service, but also the implicit distilled labour that has gone into creating the components and the sub-components within it, and the knowledge to put them together. It’s a way to see the relationship of the skyscraper we’re building to the entirety of human civilisation, from the discovery of stone cutting stone to the mathematical knowhow of how to create taller structures to the artisanal breakthroughs of knowledge in how to build a façade!

It’s a way of asking: if we took a bunch of tabula rasa humans and popped them on a part of earth, how long would it take them to recreate it?

Our ability to craft new developments here is therefore reliant on our ability to create distilled knowledge and disseminate it to others. For instance:

If we were able to learn quicker, in essence create better tools, then we could increase the speed of learning. For instance, books allow us to export memory faster as do computers for analysis and calculation.
If we were able to disseminate information quicker, we could enable the use of more distilled knowledge. The internet surely qualifies.
And lastly if we were to use the distilled knowledge better that would result in the reduction of man-hours that need be spent on any particular project.

And so the first bullet is basically the technological ecosystem. Most of the startups and companies that are around focus on distilled knowledge. Let us do [Insert hard thing here] for you, so that you don’t have to spend more manpower and resources trying to do it. The successful ones are the ones that distil knowledge and save you from spending untold man-hours trying to recreate a solution. This is why the decision is often “build vs buy”.

The second bullet is what we achieved through the internet, or indeed any mass communication medium. It would’ve been megaphones presumably once upon a time. Or the printing press and invention of pulp paper.

And the third is where we had methodological shifts making things better. It’s also what we’re hoping to achieve with the incredible advances we’re seeing in AI. That’s our hope!

III

AI models today do things akin to magic. Large language models produce miracles like writing a speech in the style of Abraham Lincoln but about ripe tomatoes, or a Shakespearean sonnet about a traffic jam. Then there are the incredible image generation capabilities of DALL-E 2, which can create a painting of New York by Salvador Dali and Gustav Klimt.

Currently the amount of distilled knowledge that exists inside GPT-3 is based on its 175 Billion parameters, 500 billion tokens (mostly from Common Crawl, though I learnt Wikipedia is only 3 billion of these) with 45 TB of text. This is the ahistorical account of everything that humanity has codified as worth knowing.

So how good is this distilled knowledge really? If our growth depends on better distillation, then this could be either the silver bullet or a castle in the sky.

One way to look at this is to compare with a source of knowledge we know intimately: our brain. Human learning, such as it is, is bounded by the constraints of reality. It’s been estimated as having anywhere from 10^15 to 10^20 FLOPS of processing power [2].

Among a small number of computers we compared, FLOPS and TEPS seem to vary proportionally, at a rate of around 1.7 GTEPS/TFLOP. We also estimate that the human brain performs around 0.18 – 6.4 * 10^14 TEPS. Thus if the FLOPS:TEPS ratio in brains is similar to that in computers, a brain would perform around 0.9 – 33.7 * 10^16 FLOPS. We have not investigated how similar this ratio is likely to be.

Leon Derczynski 🌲🏔️@LeonDerczynski

If GPT3 took 50 petaflop-days to train arxiv.org/pdf/2005.14165…, w. GPUs at 10^8 flops per watt arxiv.org/pdf/1911.11313…, so those 1.2E18 flop-hours used 12 GWh to train? E.g. 12 hours of a whole nuclear reactor? At 0.73kg per kWh that's.. 8.8 megatons of CO2?! #sanitycheck #nlproc

11:09 AM · May 31, 2020


Partly, this is a category error. Humans don't learn through transformers and back propagation, or create weight matrices. But output is output, and it turns out using these tools also gets you something that can imagine what would’ve happened if Salvador Dali was mixed with Gustav Klimt in some huge multi-dimensional canvas. And replicating this by an individual is difficult, unless someone happened to have independently studied both artists and had a hankering for painting New York.

If trying to compare petaflops is not the best way to understand humanity, we can actually look at those who produced masterpieces. Da Vinci apprenticed for 7 years. Michelangelo for only 1. And somewhere in there they clocked their 10,000 hours in becoming a master. But then Da Vinci wasn't a blank slate either. He had talent and training and good teachers. He also had the benefit of a hundred billion humans over a hundred million years of undirected evolution enabling him to paint in the first place.

Human evolution is around 10 million generations of at least a 100 billion neurons (with 100 trillion synapses). That’s 10^18 total neuron-generations. Or at synapse level probably 10^21 synapse-generations. GPT-3 (codename davinci by the way) has around 175 billion parameters, 96 attention layers and a 3.2m batch size. Are the batch sizes and attention layers analogous to neurons and generations? Honestly, it’s unclear. Neurons in the brain do way more than the neurons inside neural nets! But it’s a good point of comparison, and probably shows we’re closer to human-level performance in a particular task, even though it’s woefully lacking in both common sense and ability to generalise.
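Taking the generation, neuron and synapse counts above exactly as stated, the products can be multiplied out directly. A minimal sketch; the variable names are illustrative and the inputs are the essay's own rough figures, not measurements.

```python
import math

generations = 10_000_000        # "around 10 million generations"
neurons = 100_000_000_000       # "at least a 100 billion neurons"
synapses = 100_000_000_000_000  # "100 trillion synapses"

# Multiply the per-brain counts by the number of generations.
neuron_generations = generations * neurons
synapse_generations = generations * synapses

# Report each product as a power of ten.
print(f"neuron-generations: 10^{math.log10(neuron_generations):.0f}")
print(f"synapse-generations: 10^{math.log10(synapse_generations):.0f}")
```

The result is only as good as the generation count: a shorter evolutionary window would lower both figures by the same factor.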

But what this seems to show is that each new model, trained to perfection, does probably a far better job than the average training and selection a generation of evolution has managed to give us. However, in the absence of actual selection within the real world, there are no pressures pushing the man-hours put into GPT-3 so far towards making it more robust in the world.

The current model might well be a few orders of magnitude away from the capabilities of the human brain. But that’s not enough. The models haven’t been burnished by the incessant pressures of evolution within a physical domain where incorrect assumptions about reality equal death.

Vgr has a really interesting article about AI being AT (Artificial Time) because what it does is compress the information in the world so that we can all become much older than we are.

If I connect to a writing-assistant AI in the right way, even with significant inefficiency, I'd be effectively writing like a 1046-year old rather than a 46-year old. If I could learn to go spelunking in the latent spaces of these models, I’d be able to write in ways no human has ever written before.

I think this sounds true, but is insufficient. And the reason I think that is because the distilled knowledge is not a function of time per se, as much as a function of hidden causal relationships in latent space.

We’re creating superhistory but without an ability to be temporal. It can learn from everything we have codified, but it can’t learn from things we have never codified. Moravec’s paradox is one, but it’s more than that. It’s the fact that the connections it crafts are observations, hypotheses, untested against reality.

It’s the fact that latent spaces are webs of correlative relationships vs an actual causative web. It's post hoc ergo propter hoc industrialised at scale. Whether sufficient numbers of correlations can provide us with a view of what could be causation is what the world is grappling with. That's where Gary Marcus has planted his flag saying no, and that's where Yann LeCun too has come around to, saying the current methods have the flaw that by themselves they won't lead to a flash of consciousness.

Which means that in order for AI as it is today to add substantially to our ability to think better, or to take in, understand and use the synthesised version of knowledge that's gone into it, it would need to be able to create good models of the world as it is and navigate it.

Unfortunately what today’s AIs are great at is converting the terabytes of information they ingest into a Platonic landscape of correlations. All the relationships they can glean and all the patterns they can see are laid out in beautiful statistical glory.

Recently there was a paper on learning to reason from data, by Zhang, Li et al, which argued that instead of learning to emulate correct reasoning functions, LLMs today have learned statistical features. That is, they have created that beautiful Platonic landscape from the information they had, but with limited insight or reasoning in trying to figure out how to navigate that landscape or what happens when presented with something which goes against the existing correlations within that landscape.

It is a perfectly valid belief that a sufficiently large and robust landscape creation will, within its boundaries, contain the seeds of everything we have learnt - from mathematics to logic to reason to physics to common sense to the law. But this is perhaps too much to ask of 175 billion parameters. Without a “system 2” process to guide our ability to make sense of what we see, informed by an inner congruence to reality, we’re doomed to creating perfect snapshots that don’t evolve.

This is the equivalent of a hyper-smart organism coming into contact with an alien world. Whatever it has seen or experienced is understood in hyper-real technicolour detail. However it’s still not evolved in that landscape. Correlations break all the time. Type I and II errors exist everywhere. Perturbations provided through learning rate manipulations and batch size changes do not provide sufficient depth to its understanding of what the world actually is.

And therefore when we ask it to draw a horse riding an astronaut, it provides a silly drawing in return, because it’s not tempered by the reality of the physical world: by knowledge of gravity or relative weights or what horses actually are or what astronauts actually do.

There are no incongruities inside the mind of an AI. The prompts we provide or questions we ask run as vectors inside the giant multidimensional matrix, transformed into abstractions with complex calculations, bringing forth an answer. Bear in mind, the answer is not “wrong”. Because GPT-3 doesn’t know what is “wrong”. It only knows what it has calculated, and its correspondence with its inner reality.

For it, each outcome is as stable as the last. Because each outcome is of roughly equal weight within the Platonic realm of ideas that is the weight matrix inside it [3].

This too then is the failure mode of distilled knowledge. When the knowledge it has learnt leans too closely to the side of solipsism, it leads to hallucinations. And what we're seeing are the hallucinations of artificial intelligence. It's only once we treat it that we can step out of this [4].

This is an order of magnitude less than the pyramids, which seem around 15 million man hours, though that feels like an overestimate. Herodotus claimed it took 100,000 workers over 20 years working three months at a time, though this definitely reads like poetic exaggeration.

These algos are different to the way we learn, and maybe more complicated, but it does give us a sense of the scale of learning that has been embedded into the algorithm.

In what would seem a complete surprise to the AI researchers of half a century ago, we have successfully imported instinct into an artificial being, yet not intelligence.

Strange Loop Canon
