Do LLMs Outperform Fine-tuned Transformers in Emotion Classification?
A Case Study of Llama and RoBERTa on an Emotion Benchmark
DOI: https://doi.org/10.32473/flairs.39.1.141637
Abstract
Generative large language models (LLMs) are often assumed to outperform earlier transformer-based encoders across NLP tasks, yet this assumption has not been adequately tested for emotion classification. Using a recently introduced multi-dataset emotion benchmark, we compare a Llama-based generative model with previously reported results from a fine-tuned RoBERTa classifier. The zero-shot LLM consistently underperforms the fine-tuned RoBERTa classifier, while few-shot prompting substantially improves LLM performance on several datasets. These findings challenge the assumption that LLMs universally surpass older transformers and highlight the continued relevance of fine-tuned models for emotion classification. At the same time, they show that few-shot prompting can unlock competitive LLM performance without task-specific training, though not on all datasets.
License
Copyright (c) 2026 Timothy Meinert, Anna Koufakou

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.