From the left: Prof. Menno van Zaanen and the visit coordinator, Dr Karolina Rudnicka
Swazi, Tsonga, Venda, Zulu - these are just a few of the 12 official languages of South Africa. Researching their structure, usage and nuances is truly Sisyphean work. This is exactly what UG Visiting Professor Menno van Zaanen from North-West University is doing, not only to learn about the specifics of these languages, but also to create useful tools for society. We talk to the digital humanities professor about researching literature with a computer, workshops with UG students and the artificial 'intelligence' of ChatGPT.
Prof. Menno van Zaanen came to the University of Gdansk as part of the 4th edition of the ‘Visiting Professors UG’ programme.
- One of the fields you are interested in is called computational linguistics. I've never heard of such a discipline. Could you tell us a bit more about it?
- The field of computational linguistics is concerned with getting computers to understand language. In this discipline, on the one hand, you need to have knowledge of linguistics, to understand what the properties of languages are and how languages work. On the other hand, you also need to know about computer science, programming and the mathematical techniques behind it. To give some examples of work in the field of computational linguistics, I typically mention machine translation (for instance, Google Translate), spell checking, or speech recognition. When it comes to English, many of the challenges in this area are considered 'solved', but that does not mean everything works perfectly. In South Africa, where I work, there are twelve official languages - eleven are written and the twelfth is South African sign language. For those languages, and for many other languages in the world, the computational linguistic tools are not perfect because there simply is not enough example language data.
- How do computers understand the complexities and nuances of our language? Do you program it into the software?
- You can program linguistic rules into software, but currently, machine learning approaches, such as deep learning, which require big data, are very popular. With these tools, you do not need to specify the structure of a language, because the machine will try to find it automatically. You feed large amounts of data into the system, and by looking at the data the software finds the regularities by itself. These are useful tools that people use to create working systems, but they do not tell us much about how a language works.
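To make the contrast with hand-written rules concrete, here is a minimal sketch in Python. The toy sentences are invented for illustration, and real systems use far larger corpora and far richer models than bigram counts:

```python
# A toy bigram model: instead of hand-coding grammar rules, count which
# words follow which in example sentences. The corpus is invented.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]

# Count how often each word follows another word (bigram counts).
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

# No rule about what may follow 'the' was ever programmed; the pattern
# emerges from the counts alone.
print(bigrams["the"].most_common())  # [('dog', 2), ('cat', 1), ('mat', 1), ('rug', 1)]
```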
- What interests you in this broad and dynamic field?
- My personal interests have changed a bit. I started with studying computer science. I was interested in low-level computer science, so operating systems, network drivers, and the communication between software and hardware. It was intriguing to me how computer programs written by humans, which are essentially text, are converted into commands that the machine understands. I found that in natural language, in linguistics, humans do almost the same thing. So I moved on to trying to see if I could get the computer to automatically learn the structure of language from example sentences. Then I also realised: 'If I can do this for a spoken language, I can do it for other forms of communication.' Music, for example, also has certain rules. You cannot just put the notes together at random. We tried to automatically find these patterns in music and to use these patterns to find out other things, like who composed the piece. Together with other researchers, I also tried to find patterns in how people move their hands, our gesture patterns. Can we get a sense of how much people are trying to accommodate other people from their hand movements?
- It seems like a very broad discipline, the analysis of gestures is a kind of psychology, sociology, linguistics, ...
- The application of digital humanities is very broad. It is similar to what we have just been talking about with computational linguistics being used in relation to language. In digital humanities, we are trying to get computers to understand the human sciences, so it covers many different things. I just watched a presentation by one of my PhD students. He developed readability measures for one of the South African languages, Sesotho. You can give the software a text, and it will tell you how hard it is to read. He wanted to see if he could give students at school a more appropriate text so they could learn to read and write better. But computational approaches to the arts, music and so on are also part of digital humanities.
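Readability measures of this kind typically score surface features of a text. Here is a hedged Python sketch in the spirit of classic formulas such as Flesch-Kincaid; the syllable heuristic is a crude vowel-cluster count, and real measures for Sesotho would rely on language-specific features:

```python
# A rough surface-feature readability score, for illustration only.
# Real readability research uses validated, language-specific measures.
import re

def crude_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels (a crude heuristic).
    return max(1, len(re.findall(r"[aeiou]+", word.lower())))

def readability(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    words_per_sentence = len(words) / max(1, len(sentences))
    syllables_per_word = sum(crude_syllables(w) for w in words) / max(1, len(words))
    # Higher score = harder to read (longer sentences, longer words).
    return words_per_sentence + 10 * syllables_per_word

print(readability("The cat sat. It slept."))  # low score: short and simple
print(readability("Incomprehensibility notwithstanding, perseverance prevailed."))  # much higher
```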
- With 12 official languages, South Africa must be a paradise for computational linguists.
- It's an interesting place in many ways. One of the challenges is that most of the languages do not have much written text. For English, you can get many samples from the internet. There are books, magazines, posts, articles - all digital. In South Africa, for example, we had problems getting enough text in Sesotho to measure readability. There are spell checkers for all the written languages in South Africa, but the words come mostly from government texts that have been made available online. This is a very specific genre, in which things are written in a particular way.
In one of our research projects, we wanted to see if a computer program could automatically extract the main characters and their relationships from a novel. If we could do that automatically, we could for example analyse how certain authors prefer to use their characters. We tried to do this with books written in South African languages, and it did not work well. Our tools were not good enough, because they were trained on government texts.
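One common way to approach this task, sketched below in Python, is to run a named-entity recogniser over the text and treat characters that appear in the same sentence as related. This illustrates the general technique, not necessarily the project's exact pipeline, and it assumes spaCy with its small English model is installed (pip install spacy, then python -m spacy download en_core_web_sm):

```python
# Extract PERSON mentions and count sentence-level co-occurrences as a
# proxy for character relationships. The example text is invented.
from collections import Counter
from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Dorian met Basil in the studio. Lord Henry watched Dorian closely."

doc = nlp(text)
relations = Counter()
for sent in doc.sents:
    # Unique character names found in this sentence.
    people = sorted({ent.text for ent in sent.ents if ent.label_ == "PERSON"})
    for a, b in combinations(people, 2):
        relations[(a, b)] += 1

print(relations)  # co-occurrence counts between pairs of character names
```

The catch the professor describes is exactly here: a recogniser trained on one genre, such as government text, will miss or mislabel names when applied to novels in another language.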
- ...and government texts do not have main characters.
- Exactly. We are trying to do something similar here in Gdańsk. I have a number of meetings with students from the University of Gdansk. We are going to look at how people translate certain books. When you translate an English novel into Polish, you have to make choices. On the one hand, you want it to be as close as possible to the original text, but on the other hand, it should feel like natural Polish text. These two intentions sometimes contradict each other. We will be looking at the same novel in several languages to see if the translator has influenced the structure and the use of character names during the translation process.
- That's very interesting, because it answers the question 'Are we reading the same book?'.
- You can argue that you are reading the same book. But some changes are necessary because you want the language to be natural. Maybe in one language you do not use names so much, right? You introduce a person, and then you use 'he' or 'she' a lot, or you actually keep saying the name, which is not the case in another language. It is not easy to keep track of these changes without a computer, because you have to count them.
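The counting itself can be quite simple. Here is a minimal, hypothetical Python sketch of the idea: tally how often a character is referred to by name versus by pronoun, then repeat the same tally for each translation and compare. The sentences are invented for illustration:

```python
# Count name vs. pronoun references to one character in a text.
# Repeating this per translation shows how reference style shifts.
import re

def reference_counts(text: str, name: str, pronouns: set) -> dict:
    tokens = re.findall(r"\w+", text.lower())
    return {
        "name": tokens.count(name.lower()),
        "pronoun": sum(tokens.count(p) for p in pronouns),
    }

english = "Dorian smiled. He looked at the portrait. Dorian frowned."
print(reference_counts(english, "Dorian", {"he", "him", "his"}))
# {'name': 2, 'pronoun': 1} - run the same tally on the Polish version and compare.
```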
- What books are you researching in this way?
- We need books that are essentially open access and available in several languages. We considered The Picture of Dorian Gray, but we also thought about some other requirements of our research, like how many characters there should be in the book for this technique to work. I have used it on books with up to ten characters, because we are creating a kind of visual network of the main characters. You can see the actual drawing of the main characters and the lines connecting them, which describe the kind of relationship they have with each other. What if we have 50 people in the book? Will the network still look like something we can understand?
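Such a character network can be drawn with standard graph tools. A hedged sketch, assuming the networkx and matplotlib libraries are installed, with invented characters and co-occurrence weights:

```python
# Draw a small character network: nodes are characters, edge widths
# reflect how strongly they are related (e.g. co-occurrence counts).
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edge("Dorian", "Basil", weight=5)      # weights are invented here
G.add_edge("Dorian", "Lord Henry", weight=8)
G.add_edge("Basil", "Lord Henry", weight=3)

pos = nx.spring_layout(G, seed=42)  # fixed seed for a repeatable layout
widths = [G[u][v]["weight"] for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, width=widths, node_color="lightblue")
plt.show()
```

With ten characters such a drawing stays legible; with fifty, the layout becomes the open question the professor raises.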
- Your research area is very futuristic. Do you have any thoughts about how it might develop? What will the future of computational linguistics be like?
- Part of what I thought the future would be is basically happening now. For example, I am thinking of something like ChatGPT and how it affects our society and education. Computational linguistics and digital humanities are not really new fields. They came about after the Second World War because, among other reasons, the Americans wanted to understand Russian research texts. It was sensitive information, so they did not trust Russian translators. Instead, they decided to use a computer. They took a multilingual dictionary and just started searching and replacing words. It seemed to them at the time that they had solved the problem, and then they realised that it was actually much more complicated. They talked to linguists and understood the complexity of the task.
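That early search-and-replace approach is easy to reproduce, and so is its failure. A toy Python illustration, with an invented English-French mini-dictionary:

```python
# Word-for-word dictionary substitution, as in early machine translation.
# It ignores word order, morphology and context, which is why it fails.
dictionary = {"the": "le", "cat": "chat", "black": "noir", "is": "est"}

def naive_translate(sentence: str) -> str:
    return " ".join(dictionary.get(w, w) for w in sentence.lower().split())

print(naive_translate("The black cat"))  # 'le noir chat' - wrong order:
# French puts the adjective after the noun ('le chat noir').
```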
After the Second World War, there was a question about how to test whether we had created an artificial intelligence (AI) - something that was really intelligent, but on a computer. The Turing test was based on the idea that if the AI is able to fool an average person into thinking it is human, then we can call it AI. Text from ChatGPT can easily fool someone, so I think we are there. But at the same time, I do not feel that ChatGPT is intelligent. We should rethink what it means to be intelligent.
Now we are getting to the point where these tools are so good that they are starting to create opportunities but also problems in society. I was talking to the students here about using ChatGPT in classes: is that OK? Should we rethink what learning and testing mean? What kind of questions should we be asking our students to test whether they have the knowledge we want them to have? What does this mean for journalism, music, education? Now you can generate music automatically; is that what we want? What does it mean to be creative? If we can implant creativity into the computer, will it be like us? We can focus on computational power and new technology, but in the end it is not about the technological marvels, but about how we use them in our society.