AI in Education · Cognitive Load · Multimodal AI

Conversation reduces load.
Images build it.

Georgia Tech researchers ran a randomized controlled trial with 124 online learners studying biology from a textbook. The system that combined conversation and images outperformed both text-only chat and semantic search. The proposed mechanism, drawn from Cognitive Load Theory, is that the two features push cognitive load in opposite directions and together move learners into a more effective cognitive state.

Core finding

Conversational AI reduces extraneous load by letting learners ask exactly what they need. Multimodal responses increase germane load by requiring visual-verbal integration. In the study, the combination produced better learning outcomes than text-only conversation or semantic search.


01 The problem

Engagement and learning
are not the same thing.

AI tools in education tend to be evaluated on how much learners enjoy using them, not on whether they actually learn more. Those two things correlate, but they're not identical, and design decisions optimized for one can hurt the other.

Much of what gets labeled "AI in education" is, on closer inspection, a better search interface. Ask a question, get a formatted answer. The conversation piece is a convenience feature, not a pedagogical one. And when images are included alongside text, they're often an afterthought rather than something designed to work with the text in a specific way.

There's a real gap in the research here. Conversational AI and multimodal AI have been studied separately, and the field has reasonable theories about how each one affects learning. But the question of how they interact has received less attention. If you add images to a conversational system, does it help, or is the conversation doing all the work? Does conversation matter differently in a domain where visual-spatial understanding is central, like cell biology?

The educational technology field also has a measurement problem. Engagement metrics and self-reported satisfaction scores are easy to collect and easy to present. Learning outcomes require pre-tests, post-tests, and controlled comparisons. The result is that products built on engagement data look good until someone runs the harder experiment.

The question this paper asks

What happens when you experimentally separate the contribution of conversationality and multimodality in an AI-supported learning system? Do they each matter independently? Do they interact? And can we explain the mechanism using an established framework for how instruction affects learning?

02 The experiment

Three systems, 124 learners,
one biology chapter.

The study was a randomized controlled trial with online participants. Three systems were compared head-to-head on learning outcomes and user experience. The framework used to interpret the results was Cognitive Load Theory.

124 participants recruited via Prolific studied a biology textbook chapter on cell structure. Everyone took a pre-test before starting and a post-test after finishing. Each participant was assigned to one of three systems:

MuDoC — Multimodal + Conversational

A document-grounded conversational AI. Ask a question in natural language, get a response with both text and relevant images pulled directly from the textbook. Fully multi-turn: learners could follow up, ask for clarification, or change direction. The system was designed to ground responses in the source material rather than generating content from scratch.

TexDoC — Text-Only Conversational

The same conversational interface and underlying model as MuDoC, but with images removed. Text-only responses. This was the controlled comparison: strip out the multimodal piece and see what remains.

DocSearch — Semantic Search, No Conversation

The textbook itself, equipped with an LLM-powered semantic search bar. Enter a query, get relevant passages highlighted. No back-and-forth dialogue. This was the baseline: AI-assisted retrieval without conversational interaction.
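The random assignment described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the source does not say how assignment was balanced, so the shuffle-then-deal approach, the `assign_conditions` helper, the seed, and the participant IDs are all assumptions.

```python
import random
from collections import Counter

CONDITIONS = ["MuDoC", "TexDoC", "DocSearch"]

def assign_conditions(participant_ids, conditions=CONDITIONS, seed=0):
    """Shuffle participants, then deal them round-robin across conditions.

    Assumption: balanced assignment, so group sizes differ by at most one.
    """
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}

# 124 hypothetical participant IDs, matching the study's sample size
participants = [f"P{n:03d}" for n in range(1, 125)]
assignment = assign_conditions(participants)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Shuffling before dealing means no participant characteristic (sign-up order, for example) can line up with a condition, which is the property that lets differences in post-test scores be attributed to the system.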

Cognitive Load Theory as the interpretive lens

Cognitive Load Theory divides mental effort into three types: intrinsic load is the inherent complexity of the material itself; extraneous load is the unnecessary effort caused by poor presentation or interface friction; and germane load is the productive mental effort spent building and connecting knowledge schemas. Good instructional design reduces extraneous load and increases germane load without exceeding the learner's total capacity.

The researchers used this framework to predict what each feature should do: conversation should reduce extraneous load by letting learners ask exactly what they need rather than hunting through text. Multimodality should increase germane load by requiring the learner to integrate two complementary representations of the same concept.
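One way to make this prediction concrete is the additive picture of load that CLT is commonly presented with. The symbols below are illustrative assumptions, not notation from the paper: L_I, L_E, and L_G are intrinsic, extraneous, and germane load, C is the learner's working-memory capacity, and Δ_E and Δ_G are the changes attributed to conversation and multimodality.

```latex
\begin{aligned}
L_I + L_E + L_G &\le C
  && \text{(baseline: total load fits within capacity)} \\
L_I + (L_E - \Delta_E) + (L_G + \Delta_G) &\le C
  && \text{(with conversation and images)} \\
\Delta_G &\le \Delta_E + \bigl(C - L_I - L_E - L_G\bigr)
  && \text{(added germane load must fit the freed capacity plus any slack)}
\end{aligned}
```

Read this way, conversation frees capacity and images spend it on schema-building, which is why the two features can help together even though they move load in opposite directions.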

03 Findings

The two mechanisms work,
and they work differently.

MuDoC produced the highest post-test scores and the best reported learning experience. The ordering was clean: MuDoC, then TexDoC, then DocSearch. The CLT interpretation held up across multiple lines of evidence.

Study size: 124 randomized online participants

Conditions compared: MuDoC · TexDoC · DocSearch

Best performer on both outcomes and experience: MuDoC

Finding 1: Multimodal conversation produced the best learning outcomes

MuDoC participants scored highest on the post-test. The ordering across all three conditions matched the theoretical prediction: MuDoC outperformed TexDoC, which outperformed DocSearch. The result suggests that both conversationality and multimodality contribute to learning gain, and that their effects are complementary rather than redundant.

Finding 2: The experience effect tracked the outcome effect

MuDoC participants also reported the most positive learning experience across user surveys. This consistency matters. It's common for educational technology to improve either outcomes or experience but not both simultaneously, because features that slow learners down can improve learning while feeling unpleasant. The fact that the multimodal condition led on both dimensions suggests the design was not trading one for the other.

Prior assumption

Conversation is the key feature. Images are decoration. If the conversational AI is grounded in source material and can answer follow-up questions, adding pictures is incremental at best. The interface quality does the work.

What the study found

Both features contribute through distinct mechanisms. Conversation reduces the friction of information-seeking. Images force integration of visual and verbal representations. These are separate effects operating on different dimensions of cognitive load.

Finding 3: The CLT framework predicted the ordering

The interpretation the researchers offer is that conversationality reduces extraneous load by eliminating the search-and-skim cycle: instead of hunting for a relevant passage and parsing whether it answers your question, you ask directly and get a grounded response. Multimodality increases germane load by requiring visual-verbal integration: when a response includes both a text explanation and a diagram of the same concept, the learner has to do the work of connecting them, and that connection-building is precisely what forms durable schemas.

These effects point in opposite directions on cognitive load. Conversation lowers unproductive effort, multimodality raises productive effort, and in this study the combination outperformed both text-only conversation and search.

Scope and limitations

This was a single domain (cell biology), a single piece of content (one textbook chapter), and an online participant sample. Whether the same pattern holds across disciplines, age groups, or more complex learning tasks is unknown. The researchers are explicit about these constraints. The study establishes a directional finding, not a universal rule.

04 Practical takeaways

What this means
for educational AI.

The finding is specific enough to be actionable. Combining conversation and multimodality is not just a richer feature set. It's two distinct mechanisms working on distinct aspects of how learning happens. Designing one without the other leaves something on the table.

For teams building educational AI products

The combination of conversational grounding and multimodal responses appears to produce a qualitatively different learning experience than either feature alone. If you've shipped a conversational interface, adding images that are explicitly grounded in the source content is worth testing. If you've shipped a multimodal search tool, adding conversational back-and-forth is a distinct improvement path. These aren't interchangeable.

For instructional designers and curriculum builders

Cognitive Load Theory gives you a design framework that predicts which features help in which situations. Domains with heavy visual-spatial content (biology, anatomy, chemistry, engineering, architecture) are the strongest candidates for multimodal AI. The theory predicts that images should help most when the verbal and visual representations are complementary rather than redundant. Design the image selection accordingly.

For anyone evaluating edtech products or running procurement

Post-test learning gains should be your primary metric. Engagement and satisfaction are useful signals, but they don't substitute for actual learning measurement. A tool that produces high engagement scores while delivering no learning improvement is not a successful educational tool. The companies that make it easy to see learning outcome data deserve more weight in your evaluation than those that only show you usage dashboards.

For AI researchers studying human-AI interaction

The cognitive load framework offered here is a useful bridge between AI system design and learning science. The field of HCI often treats educational applications as just another domain. This paper suggests treating CLT-based load analysis as a design and evaluation primitive, not an optional annotation.

A note on scope

The finding is directional, not definitive. One domain, one content source, one online sample. The result is internally consistent and the theoretical explanation is well-grounded, but it needs replication across more subjects, longer study sessions, and more diverse learner populations before being treated as settled.

05 Further exploration

Where to go
from here.

If you want to go deeper.

Read the paper

Taneja, K., Singh, A., & Goel, A. K. (2026). Impact of Multimodal and Conversational AI on Learning Outcomes and Experience. Georgia Institute of Technology. arXiv:2604.02221. Presented at the 27th International Conference on AI in Education (AIED 2026).

Read the MuDoC system paper

Taneja, K., Singh, A., & Goel, A. K. (2025). MuDoC: An Interactive Multimodal Document-grounded Conversational AI System. AAAI Symposium Series. This is the system paper describing MuDoC's architecture and interface design before the learning outcomes study.

Learn Cognitive Load Theory

Sweller, J. (1988). Cognitive load during problem solving: effects on learning. Cognitive Science, 12(2), 257-285. This is the original formulation. For the multimedia extension relevant here, see Mayer, R. E. (2009). Multimedia Learning (2nd ed.). Cambridge University Press.

Add a post-test to your next evaluation

If you're testing or comparing educational AI tools at your organization, run a short pre-test and post-test alongside your usual usability and engagement measures. You don't need a full RCT to get directional signal on whether learners are actually retaining more. Even a five-question quiz before and after tells you more than a satisfaction survey alone.
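A short pre-test and post-test can be scored with very little code. The sketch below computes raw gain and normalized gain, a commonly used headroom-adjusted measure; the paper is not stated to use normalized gain, and the `normalized_gain` helper and the quiz scores are hypothetical.

```python
from statistics import mean

def normalized_gain(pre, post, max_score):
    """Fraction of available headroom gained: (post - pre) / (max_score - pre)."""
    if pre >= max_score:
        return 0.0  # learner started at ceiling, so no gain is measurable
    return (post - pre) / (max_score - pre)

# Hypothetical (pre, post) scores on a five-question quiz, one pair per learner
scores = [(1, 4), (2, 3), (3, 5), (0, 2), (2, 2)]
MAX_SCORE = 5

raw_gains = [post - pre for pre, post in scores]
norm_gains = [normalized_gain(pre, post, MAX_SCORE) for pre, post in scores]

print(f"mean raw gain: {mean(raw_gains):.2f}")
print(f"mean normalized gain: {mean(norm_gains):.2f}")
```

Normalized gain is worth reporting next to raw gain because learners who start with high pre-test scores have less room to improve, and raw differences alone would understate their progress.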

Prototype a text-only vs. multimodal comparison yourself

If you're building an AI tutoring or onboarding flow, test both versions with a small group. Keep the conversational interface identical and vary only whether responses include images from the source material. The experimental design from this paper is simple enough to replicate at a small scale with Prolific or a willing cohort.
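For a small pilot like this, a permutation test is one reasonable way to check whether a difference in mean gain between the two versions is larger than chance would produce. This is an assumed analysis choice, not the method reported in the paper, and the gain scores below are made up for illustration.

```python
import random
from statistics import mean

def permutation_test(gains_a, gains_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean gain between two arms."""
    rng = random.Random(seed)
    observed = mean(gains_a) - mean(gains_b)
    pooled = list(gains_a) + list(gains_b)
    n_a = len(gains_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel learners at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical post-minus-pre gains from a small pilot (not data from the paper)
multimodal_gains = [3, 2, 4, 1, 3, 2]
text_only_gains = [1, 2, 0, 2, 1, 1]

diff, p = permutation_test(multimodal_gains, text_only_gains)
print(f"difference in mean gain: {diff:.2f}, permutation p = {p:.3f}")
```

A permutation test makes no assumption about how scores are distributed, which suits the small groups and coarse quiz scales a pilot usually has. With groups this small, treat the result as a directional signal rather than a conclusion.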
