Evaluating Personalized Content Using Large Language Models

Authors

  • Joshua Shay Kricheli University of Southern California https://orcid.org/0000-0001-8398-6378
  • Benjamin Nye University of Southern California https://orcid.org/0000-0002-5902-9196
  • Daniel Auerbach University of Southern California
  • Aaron Shiel University of Southern California
  • Kayla Carr University of Southern California
  • Kaelyn Ellison University of Southern California
  • William Swartout University of Southern California

DOI:

https://doi.org/10.32473/flairs.39.1.141858

Keywords:

Agentic AI, Generative AI, AI in Education, Personalized Learning, Multi-Agent Systems, Artificial Intelligence

Abstract

Educational content is typically designed for a broad audience and often fails to address the specific needs, contexts, and backgrounds of individual learners. While certain educational campaigns (e.g., public health outreach) develop multiple targeted versions of learning content, doing so at scale has historically been infeasible. However, Large Language Models (LLMs) offer the potential to adapt content based on detailed descriptions of learner characteristics, such as demographic information, situational context, resource availability, and risk factors. To explore techniques for generating such content, we developed Generative AI for Micro-Tailored Adaptation (GAIMA), a multi-agent LLM framework designed to personalize educational and training documents. GAIMA employs a feedback-driven pipeline architecture in which a content modification agent generates personalized adaptations and a feedback moderator agent evaluates quality, safety, and educational value. This iterative process refines content over multiple cycles until it meets detailed standards for personalization depth and learner appropriateness. We evaluate the system using a composite framework that includes ROUGE-L, BERTScore, style transfer ratio, expertise recall, NLI-based faithfulness, and separate LLM-as-judge scores for relevance and grounding. We present a comparative analysis against a zero-shot LLM baseline, quantifying the value of iterative feedback. Our results demonstrate that GAIMA achieves a 23.1% improvement over zero-shot baselines while generating more personalized, context-aware content that maintains educational integrity and safety standards.
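
The feedback-driven loop described in the abstract can be illustrated with a minimal Python sketch. This is not the GAIMA implementation: the abstract does not specify agent interfaces, the stopping rule, or a cycle limit, so the names `Feedback`, `refine`, `toy_modify`, and `toy_moderate`, the `max_cycles` parameter, and the approve-or-revise protocol are all assumptions made for illustration. The two agents are replaced by toy functions so that the loop runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    approved: bool  # True when the moderator accepts the draft
    notes: str      # critique handed back to the modification agent


# Hypothetical agent signatures; a real system would wrap LLM calls here.
Modifier = Callable[[str, str, str], str]   # (draft, profile, notes) -> new draft
Moderator = Callable[[str, str], Feedback]  # (draft, profile) -> feedback


def refine(source: str, profile: str, modify: Modifier, moderate: Moderator,
           max_cycles: int = 3) -> Tuple[str, List[Feedback]]:
    """Alternate modification and moderation until approval or the cycle cap."""
    draft, notes = source, ""
    history: List[Feedback] = []
    for _ in range(max_cycles):
        draft = modify(draft, profile, notes)
        feedback = moderate(draft, profile)
        history.append(feedback)
        if feedback.approved:
            break
        notes = feedback.notes  # carry the critique into the next cycle
    return draft, history


# Toy stand-ins (assumptions) so the loop is runnable without a model.
def toy_modify(draft: str, profile: str, notes: str) -> str:
    if notes:
        return draft + " [revised: " + notes + "]"
    return draft + " [adapted for " + profile + "]"


def toy_moderate(draft: str, profile: str) -> Feedback:
    if "revised" in draft:
        return Feedback(True, "")
    return Feedback(False, "add a concrete example")


final, log = refine("Wash hands for 20 seconds.", "rural clinic worker",
                    toy_modify, toy_moderate)
```

With these toy agents the first draft is rejected and the second is approved, so `log` holds two feedback records. A zero-shot baseline of the kind the abstract compares against would correspond roughly to a single `modify` call with no moderation step.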

Published

06-05-2026

How to Cite

Kricheli, J. S., Nye, B., Auerbach, D., Shiel, A., Carr, K., Ellison, K., & Swartout, W. (2026). Evaluating Personalized Content Using Large Language Models. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141858