The conference, widely regarded as the most important event in computational linguistics research, covers a broad range of topics and this year placed a special focus on the generalization ability of NLP models. This topic is particularly relevant because models should not only perform well on familiar data but also remain reliable on new, unseen data. Alongside the social event featuring swing dancing, Schuhplattler, and waltzes, the highlights of the conference included a panel discussion on the generalization capability of large language models (LLMs) with Mirella Lapata, Dan Roth, Yue Zhang, and Eduard Hovy. Some panelists were highly critical of current LLM research, but they also emphasized its promising possibilities and the ethical responsibility of scientists. Interest in the conference was once again strong, reflected in a record number of over 8,000 submitted papers and more than 6,000 participants. Given the conference's high standards, however, only 20.3% of submissions were accepted for publication.
Florian Babl and Moritz Hennen had the opportunity to present their research results in a poster session. In their work, they are the first to show that in all widely used NER datasets, 50–90% of the named entities in the test data also appear in the training data, which makes a generalized evaluation impossible. They analyze how different degrees of this contamination affect the evaluation of NER models. In 825 experiments across five datasets and three different NER models, they find, among other things, statistically significant correlations between contamination and artificially inflated F1 scores. To address the problem, they propose an adjusted F1 score that considers only unseen named entities. Furthermore, they are the first to present an approach for splitting NER datasets with minimal contamination: documents are represented as nodes in a weighted graph, where edge weights encode the number of named entities shared between two documents. A min-cut algorithm then splits the graph into training, evaluation, and test data with minimal contamination.
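To illustrate the idea behind the adjusted F1 score, the following is a minimal sketch of how an entity-level F1 restricted to unseen named entities might be computed. The function name, the `(surface_form, label)` tuple format, and the example data are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch: contamination-adjusted F1 for NER, computed only over entities
# whose surface forms never appear in the training data.
# Names and data format are illustrative assumptions, not the authors' code.

def adjusted_f1(predicted, gold, train_entities):
    """Entity-level F1 restricted to entities unseen during training.

    predicted, gold: sets of (surface_form, label) tuples for one corpus.
    train_entities: set of entity surface forms occurring in the training data.
    """
    # Drop every entity whose surface form was already seen in training.
    unseen_gold = {e for e in gold if e[0] not in train_entities}
    unseen_pred = {e for e in predicted if e[0] not in train_entities}

    tp = len(unseen_gold & unseen_pred)  # exact matches on unseen entities
    precision = tp / len(unseen_pred) if unseen_pred else 0.0
    recall = tp / len(unseen_gold) if unseen_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    train = {"Berlin", "Angela Merkel"}
    gold = {("Berlin", "LOC"), ("OpenAI", "ORG"), ("Leipzig", "LOC")}
    pred = {("Berlin", "LOC"), ("OpenAI", "ORG")}
    # The seen entity "Berlin" is ignored; of the two unseen gold entities,
    # one ("OpenAI") is predicted: precision 1.0, recall 0.5, F1 = 0.667.
    print(round(adjusted_f1(pred, gold, train), 3))
```

Correctly predicting entities that also occur in the training data thus no longer inflates the score, which is exactly the artificial boost the 825 experiments quantify.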
Figure 1: Moritz Hennen (left) and Florian Babl (right) in front of their poster.