When the AI runs the game,
the rules bend.
Student developers at the University of Calgary built two games with language models wired into the core mechanics, not bolted on as a side feature. Then they studied what broke. The gap between "the AI generates content" and "the AI generates the rules" turns out to be where most of the work lives.
Games hold standards
software alone doesn't.
Video games are a strange kind of software. The code can be correct, the tests can pass, and the game can still fail, because fun, fairness, and coherence don't come out of the compiler. That's why game development has spent decades building careful practices around rules, progression systems, and difficulty calibration. Embedding a language model introduces a class of problems those practices weren't built to handle.
Then large language models showed up. Suddenly developers could have a model generate dialogue, quests, puzzles, and non-player characters on the fly. Most of the early research focused on whether language models could produce interesting content at all. That question has largely been answered: yes, they can.
What far less research has examined is what happens when the generated content actually runs the game. Not decorates it. Runs it. If a player's ability to progress depends on a model correctly generating a trivia question, what happens when it hallucinates? If the difficulty of a boss fight depends on a model's guess at what "hard" means, what happens when the guess is wrong? These are architectural questions, not creative ones.
What happens when we stop using large language models as optional content generators and start treating them as architectural components? What breaks, what works, and what new kinds of engineering problems does that produce?
Two games, seven students,
six months.
The team used a method called collaborative autoethnography. Instead of observing other developers from the outside, they made themselves the subjects. Seven students built two games. Five became co-authors and wrote structured reflections about what they learned.
Both games were built to put language models under real pressure, not use them as accessories. Both used Google Gemini and OpenAI models. In each case, the model was treated as an architectural component, meaning the rest of the system had to be built around its behavior.
The language model is not a plugin. It sits inside the gameplay loop. Progression depends on its outputs. Resource economies are tied to its correctness. When it misfires, the game does not feel glitchy. It feels wrong.
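A minimal sketch of what that implies, with hypothetical names (`call_model`, `next_question`, the fallback list) standing in for the games' actual code: when a model call gates progression, the call needs validation and a deterministic fallback so the game keeps running even when the model misfires.

```python
import json
import random

# Canned content so progression never blocks on a bad model response
# (illustrative; the paper's games are not this code).
FALLBACK_QUESTIONS = [
    {"prompt": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "4"},
]

def call_model(prompt: str) -> str:
    """Stand-in for a real model call (e.g. Gemini or OpenAI); replace with your SDK."""
    return ""  # placeholder: always fails validation in this sketch

def next_question(topic: str, retries: int = 3) -> dict:
    """Ask the model for a question; fall back to canned content if it misfires."""
    for _ in range(retries):
        try:
            raw = call_model(f"Generate one multiple-choice question about {topic} as JSON.")
            q = json.loads(raw)
            # The contract: four options, and the claimed answer must be one of them.
            if len(q["options"]) == 4 and q["answer"] in q["options"]:
                return q  # passed validation: safe to drive progression
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed output: retry
    return random.choice(FALLBACK_QUESTIONS)  # game stays playable either way
```

The point of the fallback is architectural, not cosmetic: the rule system stays deterministic even though one of its inputs is probabilistic.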
Variability became
a structural problem.
Three patterns emerged across the developer reflections. None of them are the "language models are magic" takeaway the field keeps reaching for.
One developer described how question generation "directly drives skill point replenishment, boss encounters, and level progression, making it a core part of the gameplay." This was intentional. What the team didn't fully anticipate was that this coupling meant the model's probabilistic nature now lived inside the game's deterministic rule system.
Every playthrough became genuinely different, which is what they wanted. But variability also meant the team had to carefully constrain how much variation was acceptable before the game stopped feeling like the same game.
Developers repeatedly described building scaffolding around the model's outputs. Structured schemas. Validation pipelines. Strict output formats. One noted that "ensuring that LLM responses were formatted exactly as expected allowed for the back-end design to remain cohesive."
Difficulty calibration was especially messy. Questions meant to be "medium" or "hard" often felt like the easy set. The model couldn't reliably produce a challenge curve on its own. The team had to iterate on prompts and validation to keep progression coherent.
One developer recalled that "a simple math question appeared, but none of the answer options were correct." Another observed that wrong responses "would tarnish the player's experience because they're not actually learning; they're just trying to figure out how the AI thinks." Participants also reported patterned outputs, like the correct answer appearing in the same multiple-choice slot more than once, which could let players game the system.
When the language model was the source of truth, its errors weren't cosmetic. They broke the implicit contract between the game and the player.
Two games, seven students, an undergraduate setting. No external player testing. Findings reflect developer experience, not measured player outcomes. The authors are upfront about this. It's a preliminary look, not a definitive study, and explicitly not seeking statistical generalizability.
What this means
for building with AI.
The usual "add AI to your game" pitch undersells what this work documents. If you are embedding a language model as a structural component, you are not adding a feature. You are replacing part of your rule system with something probabilistic.
Where to go
from here.
If you want to go deeper.