This algorithm will identify 'fake news' on the internet. 'Young scientists of UG' series

09.02.2022

What I want to do in my dissertation is to create an algorithm that can indicate with a certain probability that an article is fake news. Photo: Mateusz Byczkowski.

I talk to Katarzyna Raca, a specialist in data analysis and an assistant at the Department of Statistics of the Faculty of Management at the University of Gdańsk, who in her research deals with the issue of 'fake news' in the 'post-truth' era.

Elżbieta Michalak-Witkowska: - We live in an information society, in which information plays a dominant role in economic, social, cultural and political life alike. It is a little alarming to look at what is happening on the Internet, where anyone can write anything. The question arises: how can we verify whether content is credible?

Katarzyna Raca: - Always at the source. But that, as we know, takes a lot of time. With such a flood of data, verifying content for credibility is quite a challenge, above all because the amount of information we receive every day exceeds our capacity to take it in. According to research, the average person absorbs around 100,000 words a day from mass media sources. A further complication is that some of the information we come across is unverifiable - I am thinking, for example, of sensational texts that most often carry no real information.

- You deal with the analysis of text data available on the Internet, mainly concerning fake news. So you are coming to our aid by creating an algorithm that assesses whether a given message is true or not?

- Yes. My adventure with text data began with my master's thesis, in which I analysed comments on one of the local news portals. It was the first time I had encountered this type of data. I then realised how difficult the task is: among other things, because of the complexity of our language, there are no tools that allow for a full analysis. I started to explore the topic and to learn new statistical methods. Many of them I already knew from my university studies, but text analysis adds an extra step of preparing the data and converting it into numbers. It is a long process that requires patience and accuracy. I think it is my interest in the quality of data on the Internet that led me to my current interest in disinformation.

- Disinformation is on the rise. In recent years especially, the phenomenon has gained momentum and is spreading rapidly on a global scale.

- It is worrying to realise that there is no shortage of people who deliberately set out to manipulate us, sow distrust and exacerbate the existing divisions in society. The increasingly advanced technology they use is also frightening. A kind of struggle can be observed between the defenders of truth, who create algorithms that find fake news, and those who, with the help of specially developed algorithms and artificial intelligence, ensure the spread of falsehood.

Katarzyna Raca. Photo: Mateusz Byczkowski/UG.

- Freedom of expression is most welcome, there is no doubt about that. However, like everything, it also has another side...

- Yes, it has negative consequences, which I personally cannot accept. This is what got me interested in the subject. One such effect is that internet users are misled, which is very evident now, during the coronavirus pandemic - many people believe in the negative effects of vaccination without paying attention to the risks of not being vaccinated, which can be seen in the statistics. The imbalance between the amount of false and true information on the Internet does not make things any easier. An example of such false information is the recently circulating statement: "most of the deceased from COVID-19 in Sweden are vaccinated". Without additional information about the number of vaccinated people in Sweden (one of the most vaccinated countries) or about comorbidities, the statement is incomplete and misleads the reader. Unfortunately, we could read about this not only on the Internet; the claim also appeared on television. What does that mean? The journalists either did not understand it or did not check it thoroughly, and so they let it out into the world.
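The reasoning behind the Sweden example can be made concrete with a little arithmetic. The sketch below uses entirely hypothetical figures (a population of one million, 90% vaccinated, and invented death rates per 100,000 people); none of them are real Swedish data. It only shows that when almost everyone is vaccinated, most deaths can occur among the vaccinated even though each vaccinated person faces a much lower risk.

```python
# All numbers below are HYPOTHETICAL and chosen only to illustrate the
# arithmetic; they are not real statistics for Sweden or any other country.
VACCINATED = 900_000          # assumed number of vaccinated people
UNVACCINATED = 100_000        # assumed number of unvaccinated people
RATE_VACCINATED = 20          # assumed deaths per 100,000 vaccinated
RATE_UNVACCINATED = 100       # assumed deaths per 100,000 unvaccinated


def expected_deaths(group_size, deaths_per_100k):
    """Expected number of deaths in a group, given a rate per 100,000 people."""
    return group_size * deaths_per_100k / 100_000


vaccinated_deaths = expected_deaths(VACCINATED, RATE_VACCINATED)
unvaccinated_deaths = expected_deaths(UNVACCINATED, RATE_UNVACCINATED)

# Share of all deaths that occur among vaccinated people.
vaccinated_share_of_deaths = vaccinated_deaths / (
    vaccinated_deaths + unvaccinated_deaths
)

if __name__ == "__main__":
    print(f"Deaths among vaccinated:   {vaccinated_deaths:.0f}")
    print(f"Deaths among unvaccinated: {unvaccinated_deaths:.0f}")
    print(f"Share of deaths that are vaccinated: {vaccinated_share_of_deaths:.0%}")
```

Under these assumed figures, roughly 64% of the deaths are among vaccinated people, yet the individual risk for a vaccinated person is five times lower. The headline is literally true and still misleading, because it leaves out the size of each group.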

- The story known as 'pizzagate' immediately came to mind.

- Conspiracy theories are another example of the negative effects of freedom of expression. One of the best known is 'pizzagate'. Proponents of this theory claimed that a certain pizzeria was home to a criminal gang with political links to Hillary Clinton's campaign manager, involved in human trafficking and the sexual exploitation of minors. We remember what this led to: a man armed with a rifle and a pistol burst into the pizzeria and fired several shots. It is worth adding that such theories are fostered by the filter bubbles we create on social networks by liking and following posts that confirm our beliefs. Such bubbles have always existed, but the mechanisms present on the Internet reinforce their importance in our lives. The resulting conspiracy theories are dangerous, so it is necessary to stop the spread of fake news now.

- So you are a defender of the truth and you are fighting for it...

- What I want to do in my dissertation is to create an algorithm that can indicate with a certain probability that an article is fake news. During my research I was able to establish that there are many types of fake news. Depending on the intention, we can encounter misinformation, deliberate disinformation and propaganda. In terms of form, such information can appear as satire or manipulation, among others. Manipulation, which is the focus of my research, can itself take different forms: fabricated data, modified real information, content that does not match the title, misquoted information or entirely fake content. Given this diversity, I do not expect to create an algorithm that identifies all possible fake news; there is simply too much of it. However, I hope that my algorithm will be able to detect at least some of it and so help to stop its spread.

- What exactly did you mean when you said that the technology used to spread falsehoods on the Internet has advanced?

- Among other things, deepfakes - a fairly new dimension of internet manipulation. The term refers to fake photographs or videos. A statistical algorithm, trained on available photo databases, creates images of non-existent people. Video can also be easily manipulated, thanks to a technique that uses machine learning systems to combine and superimpose still and moving images onto source videos. In this way, for example, the faces of actors appearing in a film can be swapped.

- Are statistical algorithms alone sufficient to combat disinformation?

- Definitely not. By creating my own algorithm I won't achieve much, but I will make at least a small contribution to the fight against disinformation. This is a field of research for psychologists, sociologists, philosophers and computer scientists alike - it is worth looking at the issue of fake news more broadly, because together we could do more. All the more so because there is a lot to do.

In my opinion, we should also teach children from an early age how to find real information on the Internet. In addition, I see a major role for the media here - not only in checking the information that is passed on but also in establishing a system for assessing its veracity. Such a system would assess an article on the basis of the quotes, sources and so on that it contains.

- Going back to your algorithm. How does it work, how do you create it?

- In a nutshell, we start with a database of true and false items, each appropriately labelled. We prepare the data and divide it into two parts: a training set and a test set. The first is used to train the statistical model and the second to test it. We create the main database ourselves and determine whether each item is fake news or not.
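The workflow described here (labelled data, a split into two parts, a model trained on one part and checked on the other) can be sketched in a few lines. The interview does not say which statistical model is used, so the example below swaps in a simple naive Bayes word-count classifier purely for illustration. The toy headlines, the labels and the function names `split`, `train` and `prob_fake` are all invented for this sketch and are not taken from the dissertation.

```python
import math
import random
from collections import Counter

# Hypothetical toy data, invented for illustration: label 1 = fake, 0 = true.
LABELLED = [
    ("shocking secret cure doctors hide", 1),
    ("miracle pill melts fat overnight", 1),
    ("secret plot exposed shocking truth", 1),
    ("they hide the miracle cure", 1),
    ("council approves new school budget", 0),
    ("university publishes annual research report", 0),
    ("city council opens new library", 0),
    ("report shows budget for schools", 0),
]


def tokenize(text):
    """Very crude preparation step: lowercase and split on whitespace."""
    return text.lower().split()


def split(data, test_fraction=0.25, seed=0):
    """Shuffle the labelled data and divide it into (training set, test set)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(train_set):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in train_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def prob_fake(model, text):
    """Naive Bayes estimate of P(fake | text), with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (0, 1):
        log_p = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # The extra +1 in the denominator reserves mass for unseen words.
            log_p += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab) + 1)
            )
        log_scores[label] = log_p
    # Normalise the two log scores into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])


if __name__ == "__main__":
    train_set, test_set = split(LABELLED)
    model = train(train_set)
    for text, label in test_set:
        print(f"P(fake)={prob_fake(model, text):.2f}  labelled={label}  {text}")
```

The output is a probability rather than a yes/no verdict, which matches the stated aim of indicating "with a certain probability" that an article is fake news. A real system would need far more data and a more careful text-preparation step than this whitespace split.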

- So if we want to believe an algorithm, we must first trust the person who created it? How do I know what truth someone believes? And what the truth even is. It's a very philosophical question, I know, but maybe you can address it in some way?

- Truth is always subjective. This is a fact. And although philosophers have been working on its definition for years, it is still very difficult to say what the truth is.

In my work, I try to rely on verified, reliable sources, and I always refer to specific sites or videos. The algorithm I am creating will also analyse the structure of a given article. It will check, for example, whether fake news shows any textual patterns, such as characteristic punctuation or frequently repeated words.
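As a rough illustration of what analysing the structure of an article could mean in practice, the sketch below turns a text into a handful of numbers describing its punctuation and word repetition. The particular features, their names and the function `structure_features` are assumptions made for this example; the interview does not list the features actually used.

```python
import re
from collections import Counter


def structure_features(text):
    """Return simple punctuation and repetition features for one text.

    The feature set is a hypothetical example, not the author's actual one.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words) or 1  # avoid division by zero on empty input
    # Words that occur more than once, counted with their multiplicity.
    repeated = sum(c for c in counts.values() if c > 1)
    return {
        "exclamation_marks": text.count("!"),
        "question_marks": text.count("?"),
        "ellipses": text.count("..."),
        # Words of two or more letters written entirely in capitals.
        "all_caps_words": len(re.findall(r"\b[A-Z]{2,}\b", text)),
        "repeated_word_ratio": repeated / n_words,
    }


if __name__ == "__main__":
    sample = "SHOCKING!!! You won't believe this... this is shocking!"
    for name, value in structure_features(sample).items():
        print(f"{name}: {value}")
```

Numbers like these could be fed into a statistical model alongside, or instead of, the words themselves, so that the model can learn whether certain patterns of punctuation or repetition are more common in the items labelled as fake.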

As I have more confidence in algorithms that are based on data labelled by humans, in my model I also label this data myself, indicating whether something is true or not.

- Please tell us how you came to work in analytics and statistics. Did you always have a head for it, or was your choice of studies accidental and it simply turned out to be a perfect fit?

- I always knew that mathematics and computer science were the subjects I liked and was interested in. It so happened that the computer science and econometrics programme combined these two elements. It was a discovery for me - before that, I had spent a long time wondering which field to choose. How was I supposed to know what computer science and econometrics was... At university, I became very interested in the statistical subjects, and once I understood what I could get out of them and where they could lead me, I had no doubt that I had chosen correctly.

Nowadays, I have many ideas for research related to text analysis. As a result, in addition to working on my dissertation, I am conducting several other studies. One of them is a project carried out with the Research Team of the Department of Statistics of the Faculty of Management at the University of Gdańsk, in which we are trying to identify the age of Twitter users based on their posts. The social network does not provide such information, so at present we cannot determine which topics are raised by specific generations or what emotions certain events evoke. Such data would be useful for marketing research, and not only for that... I think the amount of data now on the Internet can also help us learn about ourselves.

- You're about to defend your doctoral thesis, you're also teaching students, and you're involved in various research activities. Do you find any time for leisure?

- I definitely value rest and sleep in my daily schedule - they give me better concentration and focus during the day. And my best source of research ideas is walking - so you can combine it all.

Thank you for the interview.

In the series 'Young scientists of the University of Gdańsk' we write about people with passion, researchers who change the world for the better. We reveal what they are working on and what benefits the fruits of their research may bring to society. Find out how talented, passionate and committed the scientists from the University of Gdańsk are.
