When the AI runs the game,
the rules bend.
Student developers at the University of Calgary built two games with language models wired into the core mechanics, not bolted on as a side feature. Then they studied what broke. The gap between "the AI generates content" and "the AI generates the rules" turns out to be where most of the work lives.
Games hold standards
software alone doesn't.
Video games are a strange kind of software. The code can be correct, the tests can pass, and the game can still fail, because fun, fairness, and coherence don't come out of the compiler. That's why game development has spent decades building careful practices around rules, progression systems, and difficulty calibration. Embedding a language model introduces a class of problems those practices weren't built to handle.
Then large language models showed up. Suddenly developers could have a model generate dialogue, quests, puzzles, and non-player characters on the fly. Most of the early research focused on whether language models could produce interesting content at all. That question has largely been answered: yes, they can.
What far less research has examined is what happens when the generated content actually runs the game. Not decorates it. Runs it. If a player's ability to progress depends on a model correctly generating a trivia question, what happens when it hallucinates? If the difficulty of a boss fight depends on a model's guess at what "hard" means, what happens when the guess is wrong? These are architectural questions, not creative ones.
What happens when we stop using large language models as optional content generators and start treating them as architectural components? What breaks, what works, and what new kinds of engineering problems does that produce?
Two games, seven students,
six months.
The team used a method called collaborative autoethnography. Instead of observing other developers from the outside, they made themselves the subjects. Seven students built two games. Five became co-authors and wrote structured reflections about what they learned.
Both games were built to put language models under real pressure, not use them as accessories. Both used Google Gemini and OpenAI models. In each case, the model was treated as an architectural component, meaning the rest of the system had to be built around its behavior.
The language model is not a plugin. It sits inside the gameplay loop. Progression depends on its outputs. Resource economies are tied to its correctness. When it misfires, the game does not feel glitchy. It feels wrong.
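A minimal sketch of what that implies, with hypothetical names (`call_model`, `next_question`, the fallback list) standing in for the games' actual code: when a model call gates progression, the call needs validation and a deterministic fallback so the game keeps running even when the model misfires.

```python
import json
import random

# Canned content so progression never blocks on a bad model response
# (illustrative; the paper's games are not this code).
FALLBACK_QUESTIONS = [
    {"prompt": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "4"},
]

def call_model(prompt: str) -> str:
    """Stand-in for a real model call (e.g. Gemini or OpenAI); replace with your SDK."""
    return ""  # placeholder: always fails validation in this sketch

def next_question(topic: str, retries: int = 3) -> dict:
    """Ask the model for a question; fall back to canned content if it misfires."""
    for _ in range(retries):
        try:
            raw = call_model(f"Generate one multiple-choice question about {topic} as JSON.")
            q = json.loads(raw)
            # The contract: four options, and the claimed answer must be one of them.
            if len(q["options"]) == 4 and q["answer"] in q["options"]:
                return q  # passed validation: safe to drive progression
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed output: retry
    return random.choice(FALLBACK_QUESTIONS)  # game stays playable either way
```

The point of the fallback is architectural, not cosmetic: the rule system stays deterministic even though one of its inputs is probabilistic.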
Variability became
a structural problem.
Three patterns emerged across the developer reflections. None of them are the "language models are magic" takeaway the field keeps reaching for.
One developer described how question generation "directly drives skill point replenishment, boss encounters, and level progression, making it a core part of the gameplay." This was intentional. What the team didn't fully anticipate was that this coupling meant the model's probabilistic nature now lived inside the game's deterministic rule system.
Every playthrough became genuinely different, which is what they wanted. But variability also meant the team had to carefully constrain how much variation was acceptable before the game stopped feeling like the same game.
Developers repeatedly described building scaffolding around the model's outputs. Structured schemas. Validation pipelines. Strict output formats. One noted that "ensuring that LLM responses were formatted exactly as expected allowed for the back-end design to remain cohesive."
Difficulty calibration was especially messy. Questions meant to be "medium" or "hard" often felt like the easy set. The model couldn't reliably produce a challenge curve on its own. The team had to iterate on prompts and validation to keep progression coherent.
One developer recalled that "a simple math question appeared, but none of the answer options were correct." Another observed that wrong responses "would tarnish the player's experience because they're not actually learning; they're just trying to figure out how the AI thinks." Participants also reported patterned outputs, like the correct answer appearing in the same multiple-choice slot more than once, which could let players game the system.
When the language model was the source of truth, its errors weren't cosmetic. They broke the implicit contract between the game and the player.
Two games, seven students, an undergraduate setting. No external player testing. Findings reflect developer experience, not measured player outcomes. The authors are upfront about this. It's a preliminary look, not a definitive study, and explicitly not seeking statistical generalizability.
What this means
for building with AI.
The usual "add AI to your game" pitch undersells what this work documents. If you are embedding a language model as a structural component, you are not adding a feature. You are replacing part of your rule system with something probabilistic.
Where to go
from here.
If you want to go deeper.