The conference, widely regarded as the most important event in computational linguistics research, covers a broad range of topics and this year placed a special focus on the generalization ability of NLP models. This topic is particularly relevant because models should not only perform well on familiar data but also remain reliable on new, unseen data. Alongside the social event featuring swing dancing, Schuhplattler, and waltzes, the highlights of the conference included a panel discussion on the generalization capability of large language models (LLMs) with Mirella Lapata, Dan Roth, Yue Zhang, and Eduard Hovy. Some panelists were highly critical of current LLM research, but they also emphasized its promising possibilities and the ethical responsibility of scientists. Interest in the conference was once again strong, reflected in a record number of over 8,000 submitted papers and more than 6,000 participants. Given the conference's high standards, however, only 20.3% of submissions were accepted for publication.
Florian Babl and Moritz Hennen had the opportunity to present their research results in a poster session. In their work, they are the first to show that in all widely used NER datasets, 50–90% of the named entities in the test data also appear in the training data, which makes a generalized evaluation impossible. They analyze how different degrees of this contamination affect the evaluation of NER models. In 825 experiments across five datasets and three different NER models, they find, among other things, statistically significant correlations between contamination and artificially inflated F1 scores. To address the problem, they propose an adjusted F1 score that considers only unseen named entities. Furthermore, they are the first to present an approach for splitting NER datasets with minimal contamination: documents are represented as nodes in a weighted graph, where edge weights encode the number of named entities shared between two documents. A min-cut algorithm then splits the graph into training, evaluation, and test data with minimal contamination.
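To illustrate the idea behind the adjusted F1 score, the following is a minimal sketch of how an entity-level F1 restricted to unseen named entities might be computed. The function name, the `(surface_form, label)` tuple format, and the example data are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch: contamination-adjusted F1 for NER, computed only over entities
# whose surface forms never appear in the training data.
# Names and data format are illustrative assumptions, not the authors' code.

def adjusted_f1(predicted, gold, train_entities):
    """Entity-level F1 restricted to entities unseen during training.

    predicted, gold: sets of (surface_form, label) tuples for one corpus.
    train_entities: set of entity surface forms occurring in the training data.
    """
    # Drop every entity whose surface form was already seen in training.
    unseen_gold = {e for e in gold if e[0] not in train_entities}
    unseen_pred = {e for e in predicted if e[0] not in train_entities}

    tp = len(unseen_gold & unseen_pred)  # exact matches on unseen entities
    precision = tp / len(unseen_pred) if unseen_pred else 0.0
    recall = tp / len(unseen_gold) if unseen_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    train = {"Berlin", "Angela Merkel"}
    gold = {("Berlin", "LOC"), ("OpenAI", "ORG"), ("Leipzig", "LOC")}
    pred = {("Berlin", "LOC"), ("OpenAI", "ORG")}
    # The seen entity "Berlin" is ignored; of the two unseen gold entities,
    # one ("OpenAI") is predicted: precision 1.0, recall 0.5, F1 = 0.667.
    print(round(adjusted_f1(pred, gold, train), 3))
```

Correctly predicting entities that also occur in the training data thus no longer inflates the score, which is exactly the artificial boost the 825 experiments quantify.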
Figure 1: Moritz Hennen (left) and Florian Babl (right) in front of their poster.