Conversation reduces load. Images build it.
Georgia Tech researchers ran a randomized controlled trial with 124 online learners studying biology from a textbook. The system that combined conversation and images outperformed text-only chat and semantic search. The mechanism, explained through cognitive load theory, is that the two features work in opposite directions and together move learners into a more effective cognitive state.
Engagement and learning are not the same thing.
AI tools in education tend to be evaluated on how much learners enjoy using them, not on whether they actually learn more. Those two things correlate, but they're not identical, and design decisions optimized for one can hurt the other.
Much of what gets labeled "AI in education" is, on closer inspection, a better search interface. Ask a question, get a formatted answer. The conversation piece is a convenience feature, not a pedagogical one. And when images are included alongside text, they're often an afterthought rather than something designed to work with the text in a specific way.
There's a real gap in the research here. Conversational AI and multimodal AI have been studied separately, and the field has reasonable theories about how each one affects learning. But the question of how they interact has received less attention. If you add images to a conversational system, does it help, or is the conversation doing all the work? Does conversation matter differently in a domain where visual, spatial understanding is central, like cell biology?
The educational technology field also has a measurement problem. Engagement metrics and self-reported satisfaction scores are easy to collect and easy to present. Learning outcomes require pre-tests, post-tests, and controlled comparisons. The result is that products built on engagement data look good until someone runs the harder experiment.
What happens when you experimentally separate the contribution of conversationality and multimodality in an AI-supported learning system? Do they each matter independently? Do they interact? And can we explain the mechanism using an established framework for how instruction affects learning?
Three systems, 124 learners, one biology chapter.
The study was a randomized controlled trial with online participants. Three systems were compared head-to-head on learning outcomes and user experience. The framework used to interpret the results was Cognitive Load Theory.
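Random assignment to three arms can be sketched in a few lines. The snippet below is a hypothetical illustration of balanced allocation, not the study's procedure: the actual randomization method and per-arm group sizes are not stated here, and the function name, the seed, and the shuffle-then-deal approach are all assumptions.

```python
import random
from collections import Counter

# Generic arm labels (assumption); any three condition names would work.
ARMS = ["system_1", "system_2", "system_3"]


def assign_arms(participant_ids, arms=ARMS, seed=0):
    """Shuffle participants, then deal them round-robin into the arms.

    Shuffling first makes the allocation random; dealing round-robin
    keeps the arm sizes within one participant of each other.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: arms[i % len(arms)] for i, pid in enumerate(shuffled)}


# 124 is the sample size reported for the study; the IDs are placeholders.
assignment = assign_arms(range(124))
arm_sizes = Counter(assignment.values())
```

Because 124 does not divide evenly by three, a balanced scheme like this yields arms that differ in size by one participant.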
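124 participants recruited via Prolific studied a biology textbook chapter on cell structure. Everyone took a pre-test before starting and a post-test after finishing. Each participant was assigned to one of three systems: MuDoC, a conversational system whose responses combine text and images; TexDoC, a text-only conversational system; and DocSearch, a semantic search interface.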
Cognitive Load Theory divides mental effort into three types: intrinsic load is the inherent complexity of the material itself; extraneous load is the unnecessary effort caused by poor presentation or interface friction; and germane load is the productive mental effort spent building and connecting knowledge schemas. Good instructional design reduces extraneous load and increases germane load without exceeding the learner's total capacity.
The researchers used this framework to predict what each feature should do: conversation should reduce extraneous load by letting learners ask exactly what they need rather than hunting through text. Multimodality should increase germane load by requiring the learner to integrate two complementary representations of the same concept.
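The prediction can be made concrete with a toy model. This is a minimal sketch that assumes the three load types simply add up against a fixed capacity. The numbers are arbitrary units chosen for illustration, not measurements from the study, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Load:
    intrinsic: float   # complexity inherent in the material
    extraneous: float  # effort wasted on presentation or interface friction
    germane: float     # productive effort spent building schemas

    @property
    def total(self) -> float:
        return self.intrinsic + self.extraneous + self.germane


def add_conversation(load: Load, reduction: float) -> Load:
    """Conversation is predicted to lower extraneous load only."""
    return replace(load, extraneous=max(0.0, load.extraneous - reduction))


def add_images(load: Load, increase: float) -> Load:
    """Multimodality is predicted to raise germane load only."""
    return replace(load, germane=load.germane + increase)


CAPACITY = 10.0  # illustrative ceiling on total load, arbitrary units

# Illustrative starting point for a search-style interface (made-up values).
search = Load(intrinsic=4.0, extraneous=4.0, germane=1.0)
chat = add_conversation(search, reduction=2.0)
chat_with_images = add_images(chat, increase=2.0)
```

In this toy setup the intrinsic load never changes, the effort freed by conversation is what leaves room for the extra germane effort that images demand, and the total stays within capacity.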
The two mechanisms work, and they work differently.
MuDoC produced the highest post-test scores and the best reported learning experience. The ordering was clean: MuDoC, then TexDoC, then DocSearch. The CLT interpretation held up across multiple lines of evidence.
Post-test scores followed the theoretical prediction: MuDoC outperformed TexDoC, which outperformed DocSearch. The step from DocSearch to TexDoC reflects what conversation adds, and the step from TexDoC to MuDoC reflects what images add on top of it. The result suggests that both conversationality and multimodality contribute to learning gain, and that their effects are additive rather than redundant.
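A comparison like this rests on pre-test and post-test scores grouped by condition. The sketch below shows one common way to summarize them, the mean raw gain (post minus pre) per condition. It is an assumption that gain would be computed this way; the study's scoring and statistical analysis are not described here. The function name is hypothetical and the records are placeholders, not study data.

```python
from collections import defaultdict
from statistics import mean


def mean_gain_by_condition(records):
    """Average (post - pre) per condition.

    `records` is an iterable of (condition, pre_score, post_score) tuples.
    """
    gains = defaultdict(list)
    for condition, pre, post in records:
        gains[condition].append(post - pre)
    return {condition: mean(values) for condition, values in gains.items()}


# Placeholder records with generic labels; these are NOT results from the study.
toy_records = [
    ("A", 3, 7),
    ("A", 5, 8),
    ("B", 4, 6),
    ("B", 2, 5),
]
toy_gains = mean_gain_by_condition(toy_records)
```

A real analysis would also need a significance test and a check that pre-test scores were comparable across conditions before reading anything into the ordering.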
MuDoC participants also reported the most positive learning experience across user surveys. This consistency matters. It's common for educational technology to improve either outcomes or experience but not both simultaneously, because features that slow learners down can improve learning while feeling unpleasant. The fact that the multimodal condition led on both dimensions suggests the design was not trading one for the other.
The interpretation the researchers offer is that conversationality reduces extraneous load by eliminating the search-and-skim cycle: instead of hunting for a relevant passage and parsing whether it answers your question, you ask directly and get a grounded response. Multimodality increases germane load by requiring visual-verbal integration: when a response includes both a text explanation and a diagram of the same concept, the learner has to do the work of connecting them, and that connection-building is precisely what forms durable schemas.
These are opposite directions on the cognitive load spectrum. One lowers effort, one raises it, and together they produce a better learning state than either alone.
This was a single domain (cell biology), a single piece of content (one textbook chapter), and an online participant sample. Whether the same pattern holds across disciplines, age groups, or more complex learning tasks is unknown. The researchers are explicit about these constraints. The study establishes a directional finding, not a universal rule.
What this means for educational AI.
The finding is specific enough to be actionable. Combining conversation and multimodality is not just a richer feature set. It's two distinct mechanisms working on distinct aspects of how learning happens. Designing one without the other leaves something on the table.