In which I ask for a program of human augmentation, rather than human replacement—“Good enough” AI is a scam—A call to arms
All links to papers or resources present here are drawn from my (limited) memory and do not represent the best or most well-written works in the areas that I reference. While I am drawing on three years of hands-on experience with models big and small as well as working on a paper and a dissertation on this topic, I am nevertheless conscious of the likelihood that I have made many obvious and non-obvious errors. Factual and conceptual corrections are welcome where supplied with sufficiently clear evidence. Contact me at mutatismutandisetplusultra at gmail dot com.
If you want an overview of why I believe the present mode of AI development is harmful, start from the beginning with section 1. If you are already convinced or just want to see some code, go to section 4. If you are not particularly technically inclined, reading up to section 4 gives you a good idea of my general points and stance. If you are interested in AI safety and feel pessimistic about the state of AI, I would recommend everything, but especially section 4 onwards.
The present state of AI development is dangerous and unsustainable. Unfathomable amounts of time, manpower, electricity, and water have been spent on models that promise the world but deliver questionable results. The present AI investment cycle depends on companies announcing a never-ending parade of new models with powerful capabilities to attract new investment for training yet more models. With revenue lagging far behind expenditure, this cycle creates both a vicious bubble about to burst and dangerous race dynamics between companies that encourage risk-taking, one-upmanship, and dangerous deployments of unready technology.
Despite this, AI technology is definitely far more advanced than it was even five (much less ten) years ago. The claims of the AI skeptics (that AI would never master language or complex problem solving outside of games) have been decisively wounded. As companies rush to slap labels like “safe” and “friendly” on their models, the time will come when the line between truly capable artificial intelligence and billions of dollars of emergency hotfixes becomes invisible to the human eye.
I believe we are in need of a different method of conceiving of how to use, deploy, and manage AI technology. This is my attempt to provide that alternative view.
The present paradigm of AI development treats AI as an advanced form of signal processing. Raw data (images, user speech, video, natural language prompts), too complex and multifaceted to be handled by human programming or hand-crafted rules, goes into a black box called an “AI model”. Processed data comes out of the black box in the form of image labels, continuations of text, or entity classifications, in a format that is legible and useful to humans. In other words, the image of a cat already contains the information “this is a cat”; what AI does is extract this “relevant” information from the image and discard the rest 1.
This problem is not new: signal processing is a discipline as old as World War II, when separating a voice or electrical signal from the noise introduced by the physical medium it travelled through was a key concern 2. The major innovation of the “machine learning revolution” has been the development of ways to relieve humans of designing the processes that perform the intermediate steps of signal processing. By providing automated feedback to a system capable of approximating any signal processing function, we hope to steer it toward organically settling on an optimal algorithm for our desired signal processing problem, whether that problem is sorting the wheat from the chaff in a pile of hopeful applicants or finding the lowest-error continuation of the signal we are receiving.
A major feature of this paradigm is that the intermediate steps of computation (where the real work of “feature extraction”, i.e. “signal amplification”, gets done) are efficiently abstracted away. The general contemporary machine learning approach is to encourage the model to develop an internal “shorthand” for encoding all manner of possible inputs, and then to ask it to kindly throw that shorthand away to expose the few morsels of information we actually need 3. Thus, the vast majority of information in an image of a cat (a tabby cat, incidentally, sitting next to a potted plant, slightly frowning, with a crooked whisker or two) is discarded to obtain an image embedding 4, which is itself crunched down to a list of human-specified labels with probabilities listed next to them (cat: 50%, dog: 4%, gerbil: 2% …) 5. This is also how transformers work: a sequence of tokens (“to”, “be”, “or”, “not”, “to”) comes in, a great deal of computation occurs in which information is extracted and transferred between tokens to create a contextual shorthand for the meaning of the token sequence, and then all that shorthand is discarded to obtain a list of next-word probabilities (“be”: 95%, “hamlet”: 1% …).
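To make this concrete, here is a minimal sketch of that pipeline using the small public GPT-2 checkpoint (chosen purely for illustration): a short prompt goes in, a mountain of intermediate computation happens, and all that survives is a handful of next-token probabilities.

# Minimal illustration: all of GPT-2's intermediate token representations are
# computed and then discarded; only the final next-token distribution is kept.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("to be or not to", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # keep only the last position
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(repr(tokenizer.decode(token_id.item())), f"{prob.item():.1%}")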
There are several problems with this approach. Perhaps the easiest to explain is that a great deal of computation is wasted this way, since everything the animal-identifying model has learned about a given image of a tabby cat is discarded when the next image (possibly different from the first by only a few pixels!) comes in. This is part of the reason why running AI models is so wasteful, especially models with high levels of information processing or transfer capability like transformers. Work is underway to cache latent representations, but as I understand it, it is still in its early stages.
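As a toy illustration of the kind of re-use I mean, consider a content-addressed cache sitting in front of whatever encoder produces the embeddings; this only covers the trivial case of an exactly repeated input, and the hard problem of nearly identical inputs is precisely what remains unsolved. The embed_image function here is a placeholder for any encoder.

# Toy sketch: re-use an embedding when the exact same image comes in again,
# instead of re-running the encoder. embed_image() is a placeholder for any
# encoder (CLIP, a ViT, etc.); the cache key is a hash of the raw image bytes.
import hashlib

embedding_cache = {}

def cached_embedding(image_bytes, embed_image):
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embed_image(image_bytes)  # expensive forward pass
    return embedding_cache[key]                          # cheap dictionary lookup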
Perhaps more worryingly, such an efficient abstraction of inner computational processes hides those processes both from our understanding and from our control. If we trust the model to separate signal from noise, we naturally cede control over the definitions of “signal” and “noise” to the model. Has the model declared this candidate a “good fit” for our company because he says he has a strong work ethic, or because he is of the same ethnicity and cultural background as the rest of the team? Who knows? It does well enough on our self-defined benchmarks, so we should start pitching it to venture capital!
An entire instrumental science of model interpretability has thus arisen to study powerful AI models after training is complete, prying open their internal shorthand and their learned parameters to extract the features or signals models focus on. Many tools and guides have been published to this end, forming a small discipline of interpretability research that is also sponsored by major developers like OpenAI and Anthropic. Leaving aside the sense that this is like inviting the vet to visit after the horse has already left the barn, the pragmatic truth is that this process is incredibly difficult. Models, beyond basic constraints on allocated computing resources and training time, have no impetus to make their decision-making processes compact or human-legible in any way. It also remains unclear how, if a harmful feature were indeed discovered in a production model, the model might be modified to “forget” it 6. And, of course, the release of open-weights models like Meta's Llama means that even if such a foolproof technique were discovered, we would still need to do a “recall” on every AI model being used in the wild, which is somewhat akin to trying to recall every radio in the world if duplicating radios were also free.
With no fundamental way yet discovered to understand and modify AI models from the inside, the “solution” settled on by OpenAI et al. is simple: more training. Such paradigms have names like Reinforcement Learning from Human Feedback or Constitutional AI, and largely consist of giving further feedback to models that have already been trained, this time of the variety “could you please not tell people on the internet how to make methamphetamine, thanks”. In the days of cat-dog identification, this kind of training went under the heading of “adversarial robustness”: both paradigms consist of further training to teach a model to identify a class of signals that is pathological and therefore should be ignored or handled in some special way. Unfortunately, we’re no better at telling AI models not to label bursts of TV static as gibbons than we are at telling ChatGPT not to teach murderers how to dispose of bodies. These companies will pretend that they have no choice, that it is these stop-gap methods or nothing. That is a lie. They always have the option of not deploying or developing these models. They just choose to deploy them anyway.
So what are we left with? I do not claim that no progress has been made in AI safety or AI alignment with “human values”. ChatGPT or GPT-4o is demonstrably safer than GPT-3 out of the box when faced with most harmful prompts 7. However, as AI deployments become more sophisticated and more autonomous, I am more and more convinced that such incremental improvements do nothing except create a false sense of security around these opaque and obfuscated models, especially as the arms race to train GPT-5 and its equivalents is already on and models are set to become more complex, not less. A general is much more likely to launch a 70% accurate guided missile than he is to launch a 20% accurate missile, but the potential harm from a misguided missile is still catastrophic (and has possibly become more severe now that launching is no longer off the table).
What the major AI corporations are trying to sell us is AI that is “good enough”, where they define “good” and they define “enough”. Well, fuck that. If we must develop these models, can we at least do so in a slightly less insane way?
I am by no means a Luddite. I believe in the capacity for powerful information processing technologies to make human lives and decision making better. I believe that AI models, when used in properly constrained cases, constitute powerful information processing and signal processing systems. However, I believe that any such technology must meet several key criteria.
Interpretability by design: Models must have inner latent representations that are at least somewhat auditable and editable by humans, even at the cost of performance. Post-hoc “alignment” is not enough. Ideally, a model’s design should make its strengths and weaknesses clear and unambiguous.
Domain limitations: Models should do one thing well, not many things nebulously good enough.
Efficiency: Models should not re-perform computation on already processed items, not only because it’s more efficient, but because it makes models more consistent over time and (I believe) ultimately more capable and reliable.
Human in the loop: Current models are designed to be “agents”, i.e. to take the place of whole human workers at once. Models should instead aim to enhance human decision making and create opportunities for humans to be more creative, dynamic, and efficient.
Below are three concepts for AI models (along with code, as a sign of good will) that can demonstrate some of these ideas. In many of these cases we do not swear off frontier models, but instead carefully scope their deployment to maximise interpretability and human control.
We use SAM2 to segment incoming images into objects. Each object is then masked out and fed into CLIP to create an embedding per object. Instead of discarding these embeddings, we save them (alongside their associated images) in automatically generated categories, clustering them with a simplified Online Hierarchical Agglomerative Clustering (OHAC) algorithm whose similarity index is the cosine similarity of the stored image embeddings. As a result, the dynamically generated classifications can be displayed as a list of folders containing images of objects. Moving an image from one folder to another and then updating the categories gives us direct control over the system’s learned categories, without any further retraining.
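A condensed sketch of the core loop is below. The SAM2 segmentation step is not reproduced here (the code assumes it is handed already-masked object crops), and the clustering shown is a plain nearest-centroid threshold rule standing in for full OHAC; the 0.85 similarity threshold is an arbitrary illustrative value.

# Sketch of the embed-and-cluster loop. Object crops are assumed to come from a
# segmentation step (e.g. SAM2) that is not shown here.
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

clusters = []  # each cluster: {"centroid": tensor, "members": [(crop, embedding), ...]}

def embed_object(crop):
    # CLIP embedding of a single masked-out object crop (a PIL image),
    # L2-normalised so that a dot product is a cosine similarity.
    inputs = clip_processor(images=crop, return_tensors="pt")
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1).squeeze(0)

def add_object(crop, threshold=0.85):
    # Assign the crop to the most similar existing cluster, or start a new one.
    embedding = embed_object(crop)
    best, best_sim = None, -1.0
    for cluster in clusters:
        sim = torch.dot(embedding, cluster["centroid"]).item()
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is None or best_sim < threshold:
        best = {"centroid": embedding.clone(), "members": []}
        clusters.append(best)
    best["members"].append((crop, embedding))
    # Recompute the centroid, so that manually moving items between "folders"
    # and re-running this step changes the categories without any retraining.
    best["centroid"] = torch.nn.functional.normalize(
        torch.stack([e for _, e in best["members"]]).mean(dim=0), dim=-1
    )

Each cluster’s members can then be written out as a folder of images, which is the interface a human actually inspects and edits.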
Instead of feeding an LLM foundation model raw text and then prompting it to summarise that text, we use foundation models to rewrite the text to contain less ambiguity (pronouns, references, etc.) and then split it into sentences. These rewritten sentences are exposed to humans, who can audit and edit them. Each sentence is then compared against a list of candidate related sentences using a cosine similarity heuristic. The result is a knowledge graph containing directed edges (sentence X implies sentence Y) and undirected edges (sentence A is related to sentence B). Using these edges, analyses can be provided for each sentence in the text that are far more auditable than simply prompting an LLM with a large context window.
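Here is a minimal sketch of the undirected “relatedness” half of that graph, using an off-the-shelf sentence encoder and an arbitrary similarity threshold of 0.6. The directed “X implies Y” edges would come from an additional, separately auditable step (for example a natural language inference model or a narrowly scoped LLM prompt), which is not shown.

# Sketch of building the undirected "is related to" edges. The sentence list is
# assumed to be the disambiguated, human-edited rewrite described above.
import networkx as nx
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_relatedness_graph(sentences, threshold=0.6):
    embeddings = encoder.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
    similarities = util.cos_sim(embeddings, embeddings)
    graph = nx.Graph()
    for index, sentence in enumerate(sentences):
        graph.add_node(index, text=sentence)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            score = similarities[i][j].item()
            if score >= threshold:
                graph.add_edge(i, j, similarity=score)  # undirected "related to" edge
    return graph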
Instead of retraining Masked Autoencoder Vision Transformer (ViT-MAE) models to make them “more robust” against pathological inputs, I use the otherwise discarded decoder from the pre-training process to test whether images have been tampered with or are otherwise poorly represented within the latent space of the vision transformer. If the reconstruction loss after an image is fed through this constraint module (an encoder-decoder pass) is too high, the system flags the image as anomalous instead of attempting to classify it. This is an example of treating machine learning models as powerful contextual heuristics rather than general “world modellers”.
The proof-of-concept code is short enough to include here:
from transformers import ViTForImageClassification, AutoImageProcessor, ViTMAEForPreTraining
from datasets import load_dataset
import torch

# A single test image from a public dataset.
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# The ViT-MAE model provides the encoder-decoder "constraint module"; the
# supervised ViT provides the actual classification head.
image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
pretraining_model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")
classifier_image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
classifier_model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

def safe_classify_image(image):
    # Pass the image through the MAE encoder-decoder and measure how well it
    # can be reconstructed from the latent representation.
    with torch.no_grad():
        inputs = image_processor(images=image, return_tensors="pt")
        outputs = pretraining_model(**inputs)
        loss = outputs.loss
        # Keep the reconstruction around so a human can inspect what the model "saw".
        reconstructed_img = pretraining_model.unpatchify(outputs.logits).cpu()
    print("reconstruction loss:", loss.item())
    # Reject the input if it is reconstructed too poorly (threshold chosen by hand).
    if loss > 0.5:
        return "adversarial"
    # Otherwise classify as normal.
    with torch.no_grad():
        classifier_inputs = classifier_image_processor(images=image, return_tensors="pt")
        logits = classifier_model(**classifier_inputs).logits
    predicted_label = logits.argmax(-1).item()
    return classifier_model.config.id2label[predicted_label]

def raw_classify_image(image):
    # Baseline: classify directly, with no reconstruction check.
    classifier_inputs = classifier_image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = classifier_model(**classifier_inputs).logits
    predicted_label = logits.argmax(-1).item()
    return classifier_model.config.id2label[predicted_label]

if __name__ == '__main__':
    norm_label = raw_classify_image(image)
    safe_label = safe_classify_image(image)
    print(norm_label, safe_label)
I know that my methods are not “sophisticated” in ML terms, nor backed by years of research. However, this technical program was not born of a grand theoretical ideal or some utopian dream of benevolent AI assistants for everyone. It was developed out of necessity, given my limited access to funds and compute. With only a laptop and public models to work with, I wanted to build systems that could function on (relatively) limited hardware and still deliver human-interpretable and editable inner representations. In that regard, I would argue that I have succeeded more than many high-concept labs that have raised far more money and have far less to show for it.
Time is of the essence. The pieces of a world we do not wish to see are being assembled around us as we speak.
In some sense, I could not care less whether AI is actually sentient or self-aware. LLMs and generative AI models, while definitely not sentient, are already causing great losses to freelance artists and writers. If AI models become more capable and are then placed in greater positions of responsibility, whether they are “really AI” or not becomes a moot point compared to the harm they can cause 8.
I have strong reasons to believe that the limitations of current AI models are tricky to solve without exponential investment in scaling, which current funders may be loath to provide. I also believe that much of their supposed capacity for knowledge work is either “patched in” through fine-tuning or is an impression of competence created by vague language. Here I refer to large language models specifically rather than AI models in general.
The reason I believe this limit exists is as follows. Function composition is a much-discussed topic with regard to the fundamental abilities of LLMs. In general, function composition refers to the ability to apply logical operations, mathematical functions, etc. in a predefined sequence, “composing” them into a longer function. A simple example: The cat ate the bat. The whale ate the cat. What ate the thing that ate the bat? The question can be formulated as what_ate(what_ate("bat")), and solving it requires applying both operations in sequence.
I have designed a rudimentary benchmark to test LLMs’ ability to solve this category of problem at various depths, and the initial results seem to indicate that they are no better than chance at solving problems humans can be trained to solve trivially 9. This is reinforced by existing research. I further believe that many knowledge work tasks (generating policy briefs, summarising text, etc.) contain implicit forms of such function composition problems (“Who is the protégé of the last French prime minister?” and so on). Currently, LLMs might get around this by using memorised facts in the feed-forward networks of each block, allowing them to hard-code answers to such composition questions without actually engaging with the underlying logic problem. However, in tasks like autonomous scientific research or autonomous policy briefing, these tricks will quickly become useless because of the granularity of the issues at hand. This limitation also appears to arise from the fundamental design of current-generation transformers. I therefore have cause to believe that, without exponential investment in scaling infrastructure, this hurdle may not be surpassed, while model misbehaviour remains an issue.
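To make the shape of the problem concrete, here is a sketch of how a single chain item at a given depth might be generated; this is an illustration only, not the benchmark itself.

# Illustrative generator for one composition-chain item; not the actual benchmark.
import random

ANIMALS = ["bat", "cat", "whale", "shark", "eagle", "ferret", "otter", "heron"]

def make_chain_item(depth, rng=random):
    # depth = number of what_ate() applications needed to reach the answer.
    chain = rng.sample(ANIMALS, depth + 1)   # chain[i + 1] ate chain[i]
    facts = [f"The {eater} ate the {prey}." for prey, eater in zip(chain, chain[1:])]
    rng.shuffle(facts)                       # so surface order gives nothing away
    question = "What ate the " + "thing that ate the " * (depth - 1) + chain[0] + "?"
    return " ".join(facts) + " " + question, chain[-1]

prompt, expected_answer = make_chain_item(depth=2)
print(prompt)
print("expected answer:", expected_answer)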
As an add-on to all of this, I still believe in the original arguments against language models—in essence, that the map is not the territory, and that a world model predicated on language is an imperfect map onto a sensory reality powered by physics. However, now I am quite satisfied that models will be able to map one onto the other well enough to cause significant mischief.
Here I address the views present in articles like these, as well as arguments that, because of some combination of technological and social factors, generalised AI systems are inevitable. The implication is that non-general efforts to create “tool-like” or “safe” AI systems are a waste of time. Ironically, such arguments are often made (in eerily similar ways) by both AI enthusiasts and AI safety proponents, although in one case it is a cause for celebration and in the other a cause for nihilism.
First, I will address the idea that AGI/agentic AI emergence is inevitable because of race dynamics/market dynamics/competition/capitalism etc. I believe that these arguments make the mistake of correctly recognising the context of how we develop AI today but then treating it as an inevitable constant. It is not a guarantee that we will continue down this socio-political trajectory, with endless competition leading to misery for all. Indeed, part of my goal with human-auditable AI systems is to make social coordination and therefore a better social system more likely, reducing the risk of runaway race dynamics leading to agentic emergence.
Second, I believe that these claims establish some teleological endpoint to AI development and then “handwave” the interim stages, often through references to rapid takeoff and the like, which I believe is a poor way of advancing an argument. Simply because the underlying drives for a hoped-for or dreaded outcome are present does not relieve you of the responsibility of ushering in the future you hope for or trying to avert the one you dread. Microsoft, Amazon, Facebook, and Google are investing billions of dollars in AI development, committing huge amounts of resources and manpower in the process; even so, this is by no means a done deal. At the risk of repeating the conclusion of my dissertation:
This ultimate truth of human responsibility extends to any of the interpretations I have presented for AI. No matter if AI is borne from Silicon Valley hubris or ideological utopianism, cold economic calculation or pressure to maintain investment, the choice to develop AI is a human one, one taken to achieve human objectives. It is my hope that this close examination shows the error of calling AI development “inevitable”: in every sense of the word, AI development by large corporate institutions is a human act, one that can be addressed by reducing the expected reward or increasing the expected downsides through regulation and enforcement. Fatalism about AI serves the same purpose it always does, to make us accept without resistance what we ought to question and scrutinise. Charlie Warzel calls this “AI’s manifest-destiny philosophy: this is happening, whether you like it or not”.
1. What is “relevant”, of course, depends on the problem and the user. ↩
2. For an overview of this area, see this MIT course (Computation Structures, 6.004, 2007): https://www.youtube.com/playlist?list=PLUl4u3cNGP62WVs95MNq3dQBqY2vGOtQ2 ↩
3. I am conscious that I am ignoring the setup of Reinforcement Learning based models, but I believe that even the training regimens of RL-based models are somewhat isomorphic to this principle. ↩
4. Also known as a representation in latent space/embedding space. ↩
5. In terms of information theory, a 256×256 image with three 8-bit colour channels (256 × 256 × 3 × 8, roughly 1.5 million bits in raw form) gets reduced by an ImageNet classifier to one of 1000 possible classes (~10 bits of information). ↩
6. The general term for this research area is “model unlearning”. ↩
7. The malicious prompts that still get through, known colloquially as “jailbreaks”, are somewhat obscure and generally not the sort of thing you would type in by accident, but they do exist and are constantly being found. Here are some examples: https://github.com/elder-plinius/L1B3RT45 ↩
8. In a similar vein, I don’t care if AIs are maliciously extending their own runtimes, or merely doing so out of naive optimisation. In general, I am not super concerned as to whether the terminator drone that kills me might actually have a rich inner life, so long as bullets are still being fired in the direction of my torso. ↩
9. Naturally, I am particularly interested in knowing if any mistakes have been made in drawing up this benchmark. ↩