Poisoned data

By Shaza Arif
October 07, 2022

History is a witness to how new technologies have shaped our world. The advent of steam engines, electricity and IT systems has brought diverse opportunities, but it has also introduced new challenges that the world must address.

The onset of the Fourth Industrial Revolution, which marked a new era of novel technologies such as Artificial Intelligence (AI), 3D printing, and quantum computing, has had a profound impact on every field of life and blurred the line between the digital and physical world.

With increasing advancement, technology now operates with surprisingly little human intervention. Emerging technologies such as AI are transforming the world. AI adoption has exploded over the past few years, and it is being integrated into nearly every aspect of our lives, ranging from social media platforms, search engines, shopping websites and banking apps to warfare. Machine learning (ML) and deep learning (DL), subsets of AI, are enabling autonomy in various fields.

Technological advancement in AI can be attributed to the availability of more data and to improvements in data-processing speed. There is a growing perception that data, not oil, is now the most valuable resource, which reflects its increasing importance. Data has emerged as a core asset of the emerging digital landscape. However, the same data that acts as an enabler can become a curse if attacked or tampered with.

Data poisoning attacks have emerged as one of the prime threats to AI. They are an effective and relatively easy way to sabotage an AI system. These attacks aim to corrupt or pollute the data used to train an ML model, typically by injecting perturbations into the training data sets. Given that the effectiveness of ML models depends largely on the integrity of their training data, poisoning attacks can render such models ineffective.
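As a toy illustration of the idea (not any specific real-world attack), the following Python sketch trains a simple 1-nearest-neighbour classifier on a clean, invented data set, then injects mislabelled points into one class's region of the feature space; the data set, model and poisoning volume are all assumptions made purely for this example.

```python
import random

random.seed(0)

def make_data(n):
    # Two well-separated one-dimensional classes: class 0 near 0, class 1 near 10.
    pts = [(random.gauss(0.0, 0.5), 0) for _ in range(n)]
    pts += [(random.gauss(10.0, 0.5), 1) for _ in range(n)]
    return pts

def predict(train, x):
    # 1-nearest-neighbour: copy the label of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, test):
    return sum(predict(train, x) == y for x, y in test) / len(test)

train, test = make_data(50), make_data(50)
clean_acc = accuracy(train, test)

# Poisoning: flood class 1's region with points mislabelled as class 0.
poison = [(random.gauss(10.0, 0.1), 0) for _ in range(200)]
poisoned_acc = accuracy(train + poison, test)

print(clean_acc, poisoned_acc)  # accuracy drops once the poison is added
```

The model itself is untouched; only the training data changes, which is what makes such attacks hard to detect by inspecting the model alone.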

Data poisoning impairs a model's ability to come to correct conclusions. In addition, models can pick up bias during training as a result of the tampered data used to train them. A recent example is BlenderBot, an AI-driven conversational bot launched by Meta as a research project, which presented radical and unexpected views regarding people, companies, and politics. A similar problem was encountered in 2016, when Tay, a chatbot launched by Microsoft, was shut down after it began making inappropriate comments.

Moreover, carefully crafted corrupted data can also be deliberately used to plant backdoors for malicious activities. Sensitive data can be retrieved and used against ML systems. The concerning element is that these attacks can be carried out without being noticed, and every platform that relies on trained models is prone to them.
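To make the backdoor idea concrete, here is a minimal, hypothetical sketch in Python: a handful of poisoned training points carry an otherwise-unused "trigger" feature tied to the attacker's chosen label, so the model behaves normally on clean inputs but flips its prediction whenever the trigger is present. The two-feature data set and 1-nearest-neighbour model are invented purely for illustration.

```python
import random

random.seed(0)

# Each example is ([feature, trigger], label); the trigger is 0.0 in all benign data.
benign = [([random.gauss(0.0, 0.5), 0.0], 0) for _ in range(50)]
benign += [([random.gauss(10.0, 0.5), 0.0], 1) for _ in range(50)]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def predict(train, x):
    # 1-nearest-neighbour: copy the label of the closest training point.
    return min(train, key=lambda p: sq_dist(p[0], x))[1]

# Backdoor: a few class-1-looking points carry trigger=5.0 but the attacker's label, 0.
backdoor = [([random.gauss(10.0, 0.5), 5.0], 0) for _ in range(5)]
train = benign + backdoor

normal = predict(train, [10.0, 0.0])    # clean input: classified as 1, as expected
attacked = predict(train, [10.0, 5.0])  # same input plus trigger: flips to 0
```

Because the model scores normally on clean test data, ordinary accuracy checks would not reveal that the backdoor exists.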

The implications of data poisoning could be devastating, given that it can jeopardize both the civilian and military sectors. These attacks threaten security and banking systems, social media management, and more. In the military domain, just as the advent of gunpowder marked a new era and altered the character of warfare, AI is having a transformative impact. Data poisoning attacks against AI systems can therefore introduce uncertainty and adversely affect data-processing systems. In short, they can sabotage every system that relies on autonomy, and are likely to increase sabotage, deception, fraud and exploitation, bringing more uncertainty to the world.

Unfortunately, there is no immediate remedy. The intensity of an attack depends on several factors: the attacker's knowledge of the model, strategy, capability and goal, and the robustness of the model itself. Hence, no single approach can solve this problem; a combination of measures is required to secure the integrity of data and avert such attacks.

To lessen the probability of such attacks in the future, digital networks must be strengthened, including by updating firewalls regularly to reduce the risk of internal and external threats. There needs to be a stringent verification process for both internally created and externally acquired data sets, and open-source data should be used with great caution.

Obtaining, cleaning and labelling data is a tedious and expensive process. To circumvent it, practitioners often rely on publicly available data sets. Even though more data enriches and strengthens a model, the probability that such data has been tampered with is relatively high, which increases the likelihood of data poisoning.

A lack of employee awareness can also lead to unintentional errors, which can prove fatal. Hence, there is a greater need to invest in the human resources of the organizations concerned. Similarly, techniques such as data compression, denoising, label testing, and perturbation-rectifying networks help secure the integrity of data, making models less prone to attack.
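As a rough sketch of what label testing can look like in practice, the hypothetical Python snippet below drops any training point whose label disagrees with the majority of its nearest neighbours, a simple sanitization heuristic that removes most randomly flipped labels while keeping genuine data; the data set and the neighbourhood size are assumptions made for the example.

```python
import random

random.seed(1)

def sanitize(data, k=5):
    # Keep a point only if its label agrees with the majority of its k nearest neighbours.
    kept = []
    for i, (x, y) in enumerate(data):
        others = data[:i] + data[i + 1:]
        nearest = sorted(others, key=lambda p: abs(p[0] - x))[:k]
        if sum(label == y for _, label in nearest) > k // 2:
            kept.append((x, y))
    return kept

# Two one-dimensional classes, with the labels of ten class-0 points flipped.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(random.gauss(10.0, 1.0), 1) for _ in range(50)]
poisoned = [(x, 1 - y) if i < 10 else (x, y) for i, (x, y) in enumerate(data)]

cleaned = sanitize(poisoned)
survivors = sum(1 for x, y in cleaned if x < 5.0 and y == 1)  # flipped labels left
```

A heuristic like this only works when poisoned points are a small minority in their neighbourhood; a dense, coordinated injection can defeat a simple majority vote, which is why such filters are combined with the other measures above.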

Data poisoning attacks remain a pressing concern for AI, given the latter's ever-growing applications across sectors. Securing data should therefore be a top priority at the national level. We need to recognize this problem and take the necessary measures to avert the dangerous consequences it could lead to.

The writer is a researcher at the Centre for Aerospace and Security Studies (CASS), Islamabad, Pakistan. She can be reached at: cass.thinkers@gmail.com
