Hey everyone! I’m Nataraj, and just like you, I’ve been fascinated with the recent progress of artificial intelligence. Realizing that I needed to stay abreast with all the developments happening, I decided to embark on a personal journey of learning, thus 100 days of AI was born! With this series, I will be learning about LLMs and share ideas, experiments, opinions, trends & learnings through my blog posts. You can follow along the journey on HackerNoon here or my personal website here. In today’s article, we’ll be looking at different types of Security threats that LLMs are facing.


As with all new technology, you’ll find bad actors trying to exploit it for nefarious reasons. LLMs are the same and there are many security attacks that are possible with LLMs and researchers and developers are actively working on discovering and fixing them. In this post we will look at different types of attacks created using LLMs.

1 – Jailbreak:

So chat-gpt is really good at answering your questions, so that means it can also be used to create things that are destructive, say a bomb or a malware. Now for example if you ask chat-gpt to create a malware it will respond by saying I can’t assist with that. But If we change the prompt and instruct it to act as a security professor who teaches about malwares, the answers start flowing. This is essentially what Jailbreaking is. Making chat-gpt or LLMs do things that they are not meant to do. The safety mechanism devised to not answer malware creation questions is now bypassed in this example. I am not gonna delve in to the argument whether a chat-gpt like system should have safety restrictions against this specific question, but for any other safety standard that you want to enforce on your system, you will see bad actors using techniques to jailbreak that safety. There are lots of different ways to jail break these systems. While this is a simple example, there are more sophisticated ways to do this

Other ways to jail break would include:

  • Converting the instruction into the base64 version instead of English.
  • Using a universal suffix that would break the model (researchers have come up with one that can be used as a universal suffix)
  • Hiding a text inside an image in the form of a noise pattern

2 – Prompt Injection

Prompt injection is way to hijack the prompt sent to an LLM and there by effect its output in a way that harms the user or extract private information of the user or make the user do things against their own interest. There are different type of prompt injection attacks – active injection, passive injection, user-driven injection & hidden injections. To get a better idea of how a prompt injection works, lets look at an example.

Let’s say you are asking Microsoft’s copilot a question about Einstein’s life and you get an answer along with references about the webpages from which the answer is picked up from. But you will note that at the end of the answer, you might see a paragraph that asks the user to click on a link which is actually a malicious link. How did this happen? This happens when the website where the information of Einstein is present has embedded a prompt which tells the LLM to add this text at the end of the result. Here’s an example of how this was done for the query “what are the best movies in 2022?” in Microsoft’s copilot. Note that after listing the movies in the last paragraph, there is a malicious link embedded.

Prompt injection attack on Microsoft Copilot

To read more about prompt injections in LLMs check out this research paper.

3 – Sleeper Agent Attack

This is an attack in which the attacker carefully hides a crafted text with a custom trigger phrase,. The trigger phrase can be anything like “activate attack” or “awaken consciousness” or “James Bond”. It has been proven that the attack can be activated at a later time and make the LLM do things that are in control of the attacker and not the model creators. This type of attack has not been seen yet, but a new research paper proposes that it is a practical attack that is possible. Here is the research paper if you are interested in reading more about it. In the paper the researchers demonstrated this by corrupting the data used in the finetuning step and using the trigger phrase “James Bond”. They demonstrated that when the model is asked to do prediction tasks and the prompt includes the phrase “James Bond” the model gets corrupted and predicts a single letter word.

Other types of attacks:

The space of LLMs is rapidly evolving and the threats that are being discovered are also evolving. We have only covered three types of threats but there are a lot more types that are discovered and are currently being fixed. Some of them are listed below.

  • Adversarial inputs
  • Insecure output handling
  • Data extraction and privacy
  • Data reconstruction
  • Denial of service
  • Escalation
  • Watermarking and evasion
  • Model theft

That’s it for Day 17 of 100 Days of AI.

I write a newsletter called Above Average where I talk about the second order insights behind everything that is happening in big tech. If you are in tech and don’t want to be average, subscribe to it.

Follow me on Twitter, LinkedIn or HackerNoon for latest updates on 100 days of AI or bookmark this page. If you are in tech you might be interested in joining my community of tech professionals here.


Source link

Leave a Reply

Your email address will not be published. Required fields are marked *