Ranking AI: Professor Kate Larson wins Best Paper Award at AAMAS

Monday, June 2, 2025

Renowned AI researcher Professor Kate Larson has won the Best Paper Award at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Established in 2002, AAMAS is the world's leading conference for research in AI, autonomous agents and multiagent systems. Every year, it brings together researchers and practitioners from around the world to discuss the latest developments in agent technology. This year's conference took place in Detroit, Michigan, from May 19 to May 23, 2025.

Professor Larson and her colleagues at Google DeepMind, the University of Montreal, and Meta (Marc Lanctot, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop and Doina Precup) were recognized for their research paper, Soft Condorcet Optimization for Ranking of General Agents.

"What I liked about this work is that we are drawing on ideas from social choice theory, ranking, and optimization to create general-purpose, scalable and principled evaluation methods for AI systems and agents," says Professor Larson.

Inspired by social choice theory, Professor Kate Larson co-created a scheme that can rank AI agents with high accuracy

Over the past decade, we have seen rapid leaps in AI's ability to think, write, and reason. Google DeepMind's AlphaGo defeated the world champion at Go, one of the hardest and most complex board games. Likewise, Google DeepMind's AlphaFold can accurately predict protein structures within minutes, whereas traditional methods would take years or even decades.

These advancements were driven by the creation of benchmarks used to train and compare AI agents. For example, ImageNet, a database of more than 14 million images, helped propel deep learning, particularly in object detection and image recognition.

Unfortunately, evaluating agents can be difficult because each agent's performance can vary across tasks and benchmarks, and different agents may be evaluated on different tasks. To aggregate the agents' results, researchers have created evaluation methods based on classical rating systems like Elo. However, Elo-based systems have a number of limitations. For example, a natural concept from social choice is the Condorcet winner: an agent that wins the majority of head-to-head comparisons against every other agent. Elo-based systems can, and often do, fail to rank a Condorcet winner first, leading to unintuitive rankings.
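
To see how this can happen, consider a small, hypothetical example (not taken from the paper): agent A narrowly beats both B and C head-to-head, making A the Condorcet winner, while B beats C overwhelmingly. The sketch below fits Bradley-Terry ratings, the probabilistic model underlying Elo, to these invented match counts by gradient ascent, and the fitted ratings place B above A.

```python
# Hypothetical illustration: ratings fitted under the Bradley-Terry model
# (the model behind Elo) can rank a Condorcet winner below another agent.
# All match counts below are invented for the example.
import math

# (winner, loser, wins) from 100-game head-to-head matches
results = [("A", "B", 51), ("B", "A", 49),
           ("A", "C", 51), ("C", "A", 49),
           ("B", "C", 99), ("C", "B", 1)]

ratings = {"A": 0.0, "B": 0.0, "C": 0.0}
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Gradient ascent on the Bradley-Terry log-likelihood, where
# P(i beats j) = sigmoid(rating_i - rating_j).
for _ in range(5000):
    grad = {k: 0.0 for k in ratings}
    for winner, loser, n in results:
        p = sigmoid(ratings[winner] - ratings[loser])
        grad[winner] += n * (1.0 - p)
        grad[loser] -= n * (1.0 - p)
    for k in ratings:
        ratings[k] += 0.001 * grad[k]

print(sorted(ratings, key=ratings.get, reverse=True))
# Prints ['B', 'A', 'C']: B is ranked first even though A wins the
# majority of its games against both B and C.
```

B's lopsided record against C pulls its rating far above C's, while A's two narrow winning records only nudge it slightly above each opponent, so A's compromise rating lands between B and C.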

To address these problems, Professor Larson and her team developed a new ranking scheme inspired by social choice theory: a framework that explores how individual preferences can be combined to make collective decisions. Some real-life examples include voting systems and resource allocation.

In particular, they were influenced by earlier research that suggests using voting rules as maximum likelihood estimators, a technique for estimating the values of unknown parameters in a statistical model, which is especially useful when datasets are incomplete.

Their system, Soft Condorcet Optimization (SCO), treats the evaluation data as "votes" and assigns each agent a "score"; the scores act as the model's parameters. It then measures how well the scores agree with the votes using a differentiable loss function. Because the loss is differentiable, SCO can adjust the scores by gradient descent to minimize disagreements with the votes. Finally, SCO produces a ranking by sorting the agents by their scores.
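
As a rough sketch of that pipeline: the sigmoid-based pairwise penalty, temperature parameter, and plain gradient-descent loop below are illustrative assumptions rather than the paper's exact formulation, and the votes are invented.

```python
# Minimal sketch of the idea behind Soft Condorcet Optimization (SCO):
# treat evaluation outcomes as "votes" (preference orderings over agents),
# define a differentiable loss that penalizes scores disagreeing with the
# votes, minimize it by gradient descent, and rank agents by score.
import math

# Each vote ranks a (possibly partial) set of agents, best first.
votes = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["C", "B", "A"]]

agents = sorted({a for vote in votes for a in vote})
scores = {a: 0.0 for a in agents}
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
TAU = 1.0  # temperature: smaller values make the penalty sharper

for _ in range(2000):
    grad = {a: 0.0 for a in agents}
    for vote in votes:
        # For every pair (better, worse) implied by a vote, the loss term
        # sigmoid((score[worse] - score[better]) / TAU) is close to 0 when
        # the scores agree with the vote and close to 1 when they don't.
        for i, better in enumerate(vote):
            for worse in vote[i + 1:]:
                p = sigmoid((scores[worse] - scores[better]) / TAU)
                g = p * (1.0 - p) / TAU  # derivative of the penalty term
                grad[worse] += g
                grad[better] -= g
    for a in agents:
        scores[a] -= 0.05 * grad[a]  # gradient descent step

# Final ranking: sort agents by their learned scores, highest first.
print(sorted(agents, key=scores.get, reverse=True))
```

Because each vote only needs to mention the agents it actually compares, partial ballots, such as results from tournaments where not every agent plays every other agent, fit naturally into the same loss.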

The team evaluated SCO with positive results. Compared to classical voting and rating systems, SCO effectively identifies the Condorcet winner and maintains a low approximation error, even when more than half of the data is missing. The team also investigated whether SCO's rankings can accurately predict human game outcomes, using a held-out dataset of more than 31,000 Diplomacy games played by around 53,000 players. Notably, SCO's ratings came closer to the optimal ranking than leading existing methods.

Overall, SCO outperforms state-of-the-art systems and provides an innovative and credible way to evaluate AI agents. With this new system, Professor Larson and her colleagues are helping researchers train the next wave of AI agents that could solve the world's most pressing challenges.

The team's research, Soft Condorcet Optimization for Ranking of General Agents, was published in the Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).