Do LLMs Outperform Fine-tuned Transformers in Emotion Classification?

A Case Study of Llama and RoBERTa on an Emotion Benchmark

Authors

  • Timothy Meinert, Florida Gulf Coast University
  • Anna Koufakou, Florida Gulf Coast University

DOI:

https://doi.org/10.32473/flairs.39.1.141637

Abstract

Generative large language models (LLMs) are often assumed to outperform earlier transformer-based encoders across NLP tasks, yet this assumption has not been adequately tested for emotion classification. Using a recently introduced multi-dataset emotion benchmark, we compare a Llama-based generative model with previously reported results from a fine-tuned RoBERTa classifier. The zero-shot LLM consistently underperforms the fine-tuned classifier, while few-shot prompting substantially improves LLM performance on several datasets. These findings challenge the assumption that LLMs universally surpass older transformers and highlight the continued relevance of fine-tuned models for emotion classification. At the same time, they show that few-shot prompting can unlock competitive LLM performance without task-specific training, though not on all datasets.

Published

06-05-2026

How to Cite

Meinert, T., & Koufakou, A. (2026). Do LLMs Outperform Fine-tuned Transformers in Emotion Classification? A Case Study of Llama and RoBERTa on an Emotion Benchmark. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141637