While it is relatively easy to turn a photo into a Picasso-style image using a computer, it is not yet possible to produce a text in a specific individual style, such as that of an author like Franz Kafka. The problem with texts is that style and subject matter are closely intertwined and hard to separate. A new research project headed by Dr. Sophie Burkhardt of the Institute of Computer Science at Johannes Gutenberg University Mainz (JGU) will be looking into exactly this problem. To this end, the project “Semantic Disentanglement: Differentiation of Style and Topic in Text Data” will receive some EUR 2 million from the German Federal Ministry of Education and Research (BMBF). The researchers intend to develop models and software that improve the automatic analysis and generation of good quality texts. Possible areas of application involve communication between people and machines, such as in customer support and social media.
Artificial intelligence has proven astonishingly successful in text creation. “By now, AI can produce texts that are barely distinguishable from those produced by humans,” stated Burkhardt, describing the current status of the technology. However, specifying exactly what the content of a text to be generated should be and then separately manipulating the style of the text is rather difficult. By disentangling or separating the styles and topics of textual data, the effects they have on the generated texts – and hence on their quality – can be enhanced. According to the computer scientist, the ideal outcome would be to transform, say, a Harry Potter novel to such an extent that it appears to have been written in the style of Shakespeare. “But that is still a long way off.”
First successful steps for topic analysis of texts
The first phases of topic analysis in complex texts have proven successful, but to date text style has not been taken into account. Initial progress towards incorporating text style could be achieved, for example, by condensing a long article into a short form or summarizing it for posting on social media, or by reproducing a scientific article in simplified language or rewriting the text with another target group in mind. When it comes to influencing text style, the preliminary emphasis is being placed on the tone of a text: a positive product review, for instance, could be rewritten so that it is negative in tone. “Other, less apparent aspects of style are much more difficult to control,” said Burkhardt. “Irony and sarcasm are a huge problem, especially as the system needs to understand the background knowledge involved.”
The aim of the new project sponsored by the German Federal Ministry of Education and Research (BMBF) is to combine language modeling and topic modeling techniques in order to create a common model that can represent both content and text style. This will require the use of state-of-the-art deep neural networks, and it will first be necessary to determine how these neural networks can best handle complicated data such as texts. Training the systems will also require large datasets, in other words, large text corpora.
Possible applications in dialog systems in the home, in customer support, or in vehicles
Dr. Sophie Burkhardt expects that the automatic generation of high-quality texts could be of interest to many businesses and applications. For example, the newly developed methods could be used in combination with speech recognition for dialog systems in the home, in customer support, or in driving assistance systems. In the long term, this could also serve to make media consumption more accessible if texts could be generated that are, for example, tailored specifically to the needs of blind people.
The German Federal Ministry of Education and Research is funding the project as part of its support program for young researchers working in the field of artificial intelligence and thus supporting the establishment of an interdisciplinary junior research group to be headed by Dr. Sophie Burkhardt. The group will receive funding of EUR 2 million over a period of four years.
Sophie Burkhardt studied Philosophy and Computer Science at Johannes Gutenberg University Mainz and subsequently acquired a doctorate. She was awarded the Dissertation Prize by the JGU Faculty of Physics, Mathematics, and Computer Science for her dissertation on “Online Multi-label Text Classification using Topic Models”. While working towards her doctorate she received a scholarship from PRIME Research in Mainz. She has contributed as lead author to a total of ten articles on the subject of topic models and text classification. Since January 2019, Sophie Burkhardt has been working as a postdoctoral researcher in the Data Mining work group at JGU led by Professor Stefan Kramer.