The Professorship of Data Science researches methods for extracting information from data and develops data-driven solutions by processing, preparing, and analyzing large amounts of data and drawing inferences from them with the help of artificial intelligence.
Practical applications include social media mining, defense against cyber attacks, and the systematic identification of potential cooperation partners. Research into trustworthy artificial intelligence is also one of the professorship's areas of expertise. The data examined is not limited to text: synthetically generated and manipulated images and videos as well as coded audio signals are also analyzed.

Projects in focus


Here you will find information about selected Data Science projects.

Completed research projects

Collaborative Research Center 901: On-The-Fly Computing

Sub-project B1: Parameterized Requirements Specifications

 

Shortfacts

Project duration:   2011 - 2023
Sub-project management:   Prof. Dr. Gregor Engels, Prof. Dr. Michaela Geierhos, Jun.-Prof. Dr. Henning Wachsmuth
Funding program: German Research Foundation (DFG)

 


InterGramm

Interactive Grammar Analysis of Historical Texts

 

Shortfacts

Project duration:   01/2017 - 01/2020
Project management:   Prof. Dr. Michaela Geierhos, Prof. Dr. Doris Tophinke & Prof. Dr. Eyke Hüllermeier
Funding program:  German Research Foundation (DFG)


Innovation

The empirical research project examines the expansion of Middle Low German from the 13th century to the change of written language in the 16th/17th century, when Middle Low German lost its status as a written language to Early New High German.
An 'interactive' method is being developed that combines machine learning and expert feedback. It is intended to solve a central problem of existing annotation methods for historical texts: existing parsing and tagging methods in computational and corpus linguistics assume static (a priori defined) grammars or grammatical categories, which does not do justice to the historical dynamics of grammar. 'Discovering' a diachronically evolving, dynamic grammar in the corpus using rule-based text analysis procedures and machine learning methods, and thus reconstructing language change on the basis of evidence, is novel. Because this requires expertise in the history of language and grammar as well as in computational linguistics and computer science, the project is designed as an interdisciplinary effort with close cooperation between the disciplines over the entire funding period.

PatentConsolidator

Development of a software tool for the automated creation of patent portfolios

 

Shortfacts

Project duration:     07/2016 - 11/2018
Project management:     Prof. Dr. Michaela Geierhos
Funding program:     Central Innovation Programme for small and medium-sized enterprises of the Federal Ministry for Economic Affairs and Energy
Amount of Funding:     149,602 EUR
Cooperation partner:     InTraCoM GmbH (Dr. Dierk-Oliver Kiehne)


Motivation

The number of patent applications is rising continuously. In 2011, it exceeded the two million mark for the first time, and in 2012 there were 2.35 million patent applications worldwide. When searching for patents, statistically analyzing patents, or evaluating patent portfolios, one question arises: who applied for the relevant property right, and who owns it?

However, answering this question is not always easy. According to a KPMG study, 387,000 mergers and changes of ownership of companies took place worldwide in 2013, with an upward trend of 17%. Other causes also lead to heterogeneous applicant designations that all refer to the same applicant. These range from different names for the same company (for example, due to filing by a legal representative such as a patent attorney) to orthographic errors, translation or transliteration problems, and country-specific peculiarities of the respective patent offices (e.g. only naming the inventing person during the filing phase in the USA). Conversely, there are cases in which different persons actually have the same name or share parts of a name, or in which names are very similar and have to be differentiated. In practice, time-consuming manual searches and comparisons are therefore usually necessary to keep the application and ownership information of the patents uniform and up to date.


Innovation

The aim of the cooperation project is to develop a modular, largely self-sufficient and therefore universally applicable software tool to automate these work steps. By combining different methods of semantic information processing, an automated consolidation of patents and the homogenization of proper names, such as company or personal names, is to be achieved. In addition to the application information (e.g. name, address, legal representation, inventor), other information (e.g. IPC classes for categorizing inventions, assignment of persons to companies, relationships between companies) is also to be taken into account, including information from the content of the patent (e.g. typical technologies, industries). To this end, the project is developing a new type of interactive software tool that allows users to combine, configure and execute intelligent methods based on machine learning. The software tool is to be integrated into the existing system environment of InTraCoM GmbH in the future, but will also be marketed as an independent new product or integrated into third-party systems.
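
The homogenization of applicant names described above can be illustrated with a minimal sketch. It is not the project's actual method: it assumes a simple normalization step followed by character-based similarity from Python's standard library. The legal-form list, the threshold of 0.85, and all function names are hypothetical choices made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens that are ignored when comparing names.
LEGAL_FORMS = {"gmbh", "ag", "inc", "ltd", "llc", "corp", "co", "kg", "sa"}


def normalize(name: str) -> str:
    """Lowercase a name, strip accents and punctuation, drop legal-form tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9]+", ascii_name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def similarity(a: str, b: str) -> float:
    """Character-based similarity of the normalized names, between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def consolidate(names, threshold=0.85):
    """Greedily group names whose similarity to a group's first member
    reaches the (assumed) threshold."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([name])
    return groups


if __name__ == "__main__":
    applicants = ["Acme Robotics GmbH", "ACME Robotics",
                  "Acme Robotic GmbH", "Globex Corp."]
    for group in consolidate(applicants):
        print(group)
```

String similarity alone cannot separate different applicants with near-identical names, which is why the project description also draws on further information such as addresses, IPC classes, and company relationships.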

The Satisfied Patient 2.0

What influence do regional satisfaction indicators have on patient complaint behavior? Analysis of anonymous doctor ratings in Web 2.0

 

Shortfacts

Project duration:   03/2014 - 12/2014
Project management:   Prof. Dr. Michaela Geierhos
Funding program:   Consumer Research of the Ministry of Innovation, Science, and Research of the State of North Rhine-Westphalia
Amount of Funding:   21,744 EUR


Relevance for consumer research 

The analysis of anonymous doctor reviews in Web 2.0 carried out in this project aims not only to measure patient satisfaction, but also to generate a detailed picture of experiences and complaints and to scrutinize existing “patient myths”. For example, do older people complain more often, more quickly, or more intensively than younger people? Are privately insured patients actually treated better and, if so, do they spend less time in the waiting room? To what extent does regional origin influence complaint behavior?

The aim of measuring satisfaction is to achieve a sustainable improvement in the relationship between the practitioner and the patient. Only if the satisfaction of those being treated is adequately interpreted can practitioners optimize their range of services accordingly and satisfy the people who come to their practice. This also influences the success of treatment, because satisfied patients are more likely to adhere to the prescribed treatment and to accept advice from medical professionals. Further advantages are a more trusting relationship, an increased willingness to cooperate, and more openness in personal dialog. As patient satisfaction is also a key criterion for the long-term economic success of a practice, collecting and appropriately interpreting satisfaction data leads to a better understanding of patients' needs.


Knowledge gain about factors influencing patient satisfaction 

In contrast to traditional satisfaction surveys conducted by telephone or through other media with direct personal contact, our satisfaction measurement via review portals in Web 2.0 captures less distorted opinions. In the anonymity of the Internet, the willingness to express complaints increases, as reviewers have no need to hide their honest opinion out of politeness, fear of an unpleasant situation, or concern about damaging a sensitive relationship.

Patient satisfaction is examined from several perspectives. Among other things, health insurance affiliation is important. It is assumed that the rating behavior of privately and statutorily insured patients differs according to their individual experiences. Furthermore, it is assumed that the evaluation criteria (including “waiting time” and “treatment time”) are given different weight depending on the type of health insurance. This goes hand in hand with the assumption that privately insured patients face shorter waiting times. In addition to the insurance-specific aspects, age and gender shed light on the complaint behavior and satisfaction of those treated: it is assumed that men are more satisfied with the services and that complaints about medical services increase with age.

Another special feature of our project is the consideration of regional satisfaction indicators (quality of life, level of income, employment rate) within a country when investigating complaint behavior. To what extent does regional origin influence complaint behavior? Are people from happier regions more likely to complain than people from less happy regions? Considering such factors, which affect the complaint behavior of the treated persons but have nothing to do with the quality of treatment itself, is essential in order to draw correct conclusions from the measured patient satisfaction.


Implementation – Procedure and systematics

To generate a more detailed picture of experiences and complaints, the project aims at a comprehensive analysis of online experience reports. The methodological focus is on testing methods from computational linguistics for collecting and analyzing reviews of medical services.

Computer-aided data collection makes it possible to gather particularly large amounts of data, which can then be evaluated using specially developed analysis algorithms. To take regional satisfaction indicators into account when analyzing the satisfaction of the people treated, a regional breakdown of satisfaction is carried out, as shown in the figure above. This allows patients to inform themselves and to voice complaints on a well-founded basis, with services comparable across regions.

 

More than Words

Analysis of user-generated content for identification of latent properties of service quality

 

Shortfacts

Project duration:  10/2013 - 12/2014
Project management:  Prof. Dr. Michaela Geierhos & Prof. Nancy Wünderlich
Funding program:  Research Award 2013 of Paderborn University (Forschungspreis 2013 der Universität Paderborn)
Amount of Funding:  62,000 EUR


Motivation

Internet users have more and more opportunities to submit reviews of a variety of products (e.g. Amazon reviews), services (e.g. MyHammer, jameda), and experiences (e.g. TripAdvisor). Users visit review platforms to actively share their experiences with services such as hotel vacations, visits to medical facilities, or mail orders with other interested customers. Many consumers see these reviews as a helpful source of information when weighing up a personal purchase decision. However, the increasing flood of ratings and reviews on rating portals (e.g. ShopVote) and social media (e.g. qype, flickr) also presents Internet users with the challenge of filtering the large number of review comments and portals by relevance.
These review comments often consist of free text (so-called user-generated content), which can differ significantly in structure and content focus. In particular, if these free texts form the only basis for evaluation, users face an interpretation hurdle. If quantifiable user ratings are available in the form of scales, these are often not consistent with the freely formulated review comments. While there are various software solutions that enable companies to automatically analyze the opinions of their customers (e.g. TrustYou) and thus track trends, Internet users themselves have no tool at hand that helps them assess the service quality of a company at a glance from millions of reviews.


Innovation

A new interdisciplinary correlative method, ...

  • which uses computational linguistic methods for semantic content analysis of evaluation texts in Web 2.0 to draw conclusions about domain-specific customer requirements for services and user-specific deviations in polarity;

  • which places empirically determined dimensions of service quality in relation to qualitatively and quantitatively measurable customer satisfaction instead of domain-independent SERVQUAL categories;

  • which for the first time enables an automatic comparison of qualitative and quantitative service evaluations by taking into account the user-typical evaluation intervals for polarity scales.
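
One way to picture "user-typical evaluation intervals" is to express each rating relative to the rating habits of the user who gave it. The following sketch is an assumption for illustration only, not the method developed in the project: it uses a per-user z-score, and the function name and sample data are hypothetical.

```python
from statistics import mean, pstdev


def user_relative_scores(ratings_by_user):
    """Map each user's scale ratings to z-scores relative to that user's own
    mean and spread (an assumed stand-in for user-typical intervals)."""
    relative = {}
    for user, ratings in ratings_by_user.items():
        mu = mean(ratings)
        sigma = pstdev(ratings)
        # A user who always gives the same rating has no spread;
        # every rating is then treated as typical for that user (0.0).
        relative[user] = [0.0 if sigma == 0 else (r - mu) / sigma
                          for r in ratings]
    return relative


if __name__ == "__main__":
    # Hypothetical star ratings on a 1-5 scale.
    sample = {"strict_user": [1, 2, 3], "generous_user": [4, 5, 5, 5]}
    for user, scores in user_relative_scores(sample).items():
        print(user, [round(s, 2) for s in scores])
```

Under this assumption, three stars from a habitually strict user count as comparatively positive, while four stars from a habitually generous user count as comparatively negative, which makes scale ratings easier to compare with the polarity of the accompanying text.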


Implementation

The aim of the research project is to implement the above scenario using methods from computational linguistics and service management. The following research questions are addressed based on the research gaps in both disciplines:

  • To what extent do writers of evaluation comments use features that correspond to the classical evaluation dimensions of service quality to describe service experiences? To what extent are other evaluation criteria and dimensions used?
  • To what extent do the users' evaluation comments vary? Which evaluation behavior is recognizable? To what extent do differences exist for different service areas?
  • If quantitative scales are used: Does the addition of free text ratings help to establish scale equivalence? To what extent do the qualitative evaluation comments agree with quantitative overall assessments – under what conditions do they differ?

 

Company Relation Analysis

From the press release to the company dossier. Linguistic analysis of company data to create company-specific profiles and assessments.

 

 

 

Contact persons

Do you have questions about our research projects? Please contact us!

“The best way to predict the future is to invent it.”

Alan Kay (1971)