Research

Cultural knowledge is essential for an LLM to truly understand a language. My main interest is to gain a deeper comprehension of the capabilities and limitations of LLMs, since we cannot improve what we cannot measure. At the EPFL NLP lab, I am currently doing research on multilingual and multicultural LLM evaluation. I want to explore the evaluation and mitigation of cultural and linguistic bias in LLMs, taking a holistic approach to language understanding.

Last update: March 2026 | For up-to-date information, check my Google Scholar or Semantic Scholar profiles!

Highlights

La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America

María Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González Saiz, Gonzalo Martínez, Gonzalo Santamaria Gomez, Rodrigo Agerri, Nuria Aldama García, Luis Chiruzzo, Javier Conde, Helena Gomez Adorno, Marta Guerrero Nieto, Guido Ivetta, Natàlia López Fuertes, Flor Miriam Plaza-del-Arco, María-Teresa Martín-Valdivia, Helena Montoro Zamorano, Carmen Muñoz Sanz, Pedro Reviriego, Leire Rosado Plaza, Alejandro Vaca Serrano, Estrella Vallecillo-Rodríguez, Jorge Vallego, Irune Zubiaga

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 2025

First paper at ACL Main!

Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Catalan, Basque, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.

The Case of Spanish as a Pluricentric Language: Challenging the Monolingual Bias in NLP to Improve Cultural Adequacy of LLMs

María Grandury and Diana Galvan-Sosa

1st Workshop on Multilingual and Equitable Language Technologies (MELT) at the Conference on Language Modeling (COLM), Montreal, Canada, 2025

Spotlight paper!

Paper
This position paper argues that the Natural Language Processing (NLP) community's oversight of Spanish's pluricentric nature undermines the development of culturally adequate models. Achieving truly effective NLP requires acknowledging the inherent cultural nuances embedded in language, yet a prevalent misconception persists that a singular "standard Spanish" originates primarily from Spain. Drawing on interdisciplinary insights, we believe that the distinction between "correct" and "exemplary" Spanish language forms is key to effectively addressing the challenges posed by Spanish pluricentricity. This distinction allows the recognition of each Spanish-speaking nation as a distinct standardization center, where "exemplary" language is inherently community-defined. Maldonado applied this distinction to differentiate Spanish varieties, but with limited coverage. Motivated by these limitations, we propose a community-focused annotation framework to generate data for improving cultural adequacy in Large Language Models (LLMs), emphasizing broader engagement and contribution recognition. We then critically examine current multicultural datasets, highlighting shortcomings (e.g., limited representation, missing variation metadata), underscoring the urgent need for a more inclusive and culturally aware approach.

Published Papers

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan Eghlidi, Chris Schmitz, Karolina Korgul, (...) María Grandury (...), Luc Rocher, Adam Mahdi

Advances in Neural Information Processing Systems (NeurIPS), San Diego, CA, USA, 2025

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego

The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

María Grandury

North American Chapter of the Association for Computational Linguistics Conference: LatinX in AI (LXAI) Research Workshop, Mexico City, Mexico, 2024

Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?

Marina Mayor-Rocher, Nina Melero, Elena Merino-Gómez, María Grandury, Javier Conde, Pedro Reviriego

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Jelle Jumelet, Abdellah Fourtassi, Atsunori Haga, Björn Bunzeck, Bhavya Shandilya, (...) María Grandury (...), Arianna Bisazza, Alex Warstadt, Leshem Choshen

European Association of Computational Linguistics (EACL), Rabat, Morocco, 2026

Multicultural LLM Evaluation: The Case of Spanish as a Pluricentric Language

María Grandury

Master's Thesis, Universidad Nacional de Educación a Distancia (UNED), 2025

Spanish is not just one: A dataset of Spanish dialect recognition for LLMs

Gonzalo Martínez, Marina Mayor-Rocher, Carlos P. Huertas, Nina Melero, María Grandury, Pedro Reviriego

Data in Brief, 63, 112088, 2025

Paper

It's the same but not the same: Do LLMs distinguish Spanish varieties?

Marina Mayor-Rocher, Cristina del Pozo, Nina Melero, Gonzalo Martínez, María Grandury, Pedro Reviriego

Procesamiento del Lenguaje Natural, 75, 137-146, 2025

Paper

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Alejandro Hernández-Cano, Andreas Hägele, Andrew H. Huang, Angelika Romanou, Arnau Solergibert, Bálint Pásztor, Benjamin Messmer, (...) María Grandury (...), Antoine Bosselut, Martin Jaggi, Imanol Schlag

Paper

Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratings

Javier Conde, María Grandury, Tairan Fu, Carlos Arriaga, Gonzalo Martínez, Tom Clark, Sean Trott, Christopher Green, Pedro Reviriego, Marc Brysbaert

Paper

Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans

Javier Conde, Miguel González, María Grandury, Gonzalo Martínez, Pedro Reviriego, and Marc Brysbaert

Proceedings of the 4th Workshop on Generation, Evaluation and Metrics (GEM²), Vienna, Austria, 2025

Paper

Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Iván Salazar, María Fernández Burda, Sajid Bin Islam, Amir Soltani Moakhar, Shubham Singh, Fredrik Farestam, Angelika Romanou, (...) María Grandury (...)

The 14th International Conference on Learning Representations (ICLR), Rio de Janeiro, Brazil, 2026

Paper

Spanish and LLM Benchmarks: Is MMLU Lost in Translation?

Irene Plaza, Nina Melero, Cristina del Pozo, Javier Conde, Marina Mayor-Rocher, María Grandury, Pedro Reviriego

Proceedings of the 2nd International Generative AI and Computational Language Modelling Conference (GACLM), 2024

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, (...) María Grandury (...), Thomas Wolf

BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

Javier De la Rosa, Eduardo G. Ponferrada, Manu Romero, Paulo Villegas, Pablo González de Prado Salas, María Grandury

Procesamiento del Lenguaje Natural, 68(0), 13–23, 2022

Guest Lectures

I've always loved teaching, and I'm grateful for these opportunities to share my experience and research with the community!

RLHF & Model Alignment

National Center of Artificial Intelligence (CENIA) | Diplomado de PLN
Guest Lecture
Chile (Remote) 🇨🇱

Synthetic Data Generation and LLM Evaluation

Universidad Nacional Autónoma de México (UNAM) | Bachelor's Degree in Data Science for Social Sciences and Humanities
Guest Lecture
Mexico (Remote) 🇲🇽

Community Service

Reviewer

  • Royal Society Open Science journal, 2026
  • Simposio LANLP: Bridging Latin American NLP, 2026
  • Workshop EXIST: sEXism Identification in Social neTworks, 2025

Diversity & Inclusion

  • Diversity & Inclusion Chair at EACL 2026
  • Birds-of-a-Feather (BoF) organizer at ACL 2025
  • Birds-of-a-Feather (BoF) organizer at COLM 2025