It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds

muelltonne@feddit.org · 11 days ago

ceenote@lemmy.world · 11 days ago

So, like with Godwin’s law, the probability of an LLM being poisoned as it harvests enough data to become useful approaches 1.

F/15/Cali@threads.net@sh.itjust.works · 11 days ago

I mean, if they didn’t piss in the pool, they’d have a lower chance of encountering piss. Godwin’s law is more benign and incidental. This is someone maliciously handing out extra Hitlers in a game of secret Hitler and then feeling shocked at the breakdown in the game

Arancello@aussie.zone · 11 days ago

I understood that reference to handing out secret Hitlers. Played that game first during a hike called ‘Three Capes’ in Tasmania. Laughed ‘til my cheeks hurt.

Bronzebeard@lemmy.zip · 11 days ago

It’s just “mafia/werewolf” by a different name

UnderpantsWeevil@lemmy.world · 11 days ago

Hey now, if you hand everyone a “Hitler” card in Secret Hitler, it plays very strangely but in the end everyone wins.

bitjunkie@lemmy.world · 10 days ago

…except the Jews.

Clent@lemmy.dbzer0.com · 10 days ago

The problem is the harvesting.

In previous incarnations of this process they used curated data because of hardware limitations.

Now that hardware has improved they found if they throw enough random data into it, these complex patterns emerge.

The complexity also has a lot of people believing it’s some form of emergent intelligence.

Research shows there is no emergent intelligence, or that whatever emerges is incredibly brittle, as in this case. Not to mention they end up spouting nonsense.

These things will remain toys until they get back to purposeful data inputs. But curation is expensive, harvesting is cheap.

julietOscarEcho@sh.itjust.works · 9 days ago

Isn’t “intelligence” so ill-defined that we can’t prove it either way? All we have is models doing better on benchmarks and everyone shrieking “look, emergent intelligence”.

I disagree a bit on “toys”. Machine summarization and translation is really quite powerful, but yeah that’s a ways short of the claims that are being made.

supersquirrel@sopuli.xyz · 11 days ago

I made this point recently in a much more verbose form, but I want to restate it briefly here. If you combine the vulnerability this article is talking about with the fact that large AI companies are most certainly stealing all the data they can and ignoring our demands not to do so, the result is clear: we have the opportunity to decisively poison future LLMs created by companies that refuse to follow the law or common decency with regard to privacy and ownership over the things we create with our own hands.

Whether we are talking about social media, personal websites… whatever: if what you are creating is connected to the internet, AI companies will steal it, so take advantage of that and add a little poison in as a thank-you for stealing your labor :)

korendian@lemmy.zip · 11 days ago

Not sure if the article covers it, but hypothetically, if one wanted to poison an LLM, how would one go about doing so?

expatriado@lemmy.world · 11 days ago

it is as simple as adding a cup of sugar to the gasoline tank of your car, the extra calories will increase horsepower by 15%

demizerone@lemmy.world · 10 days ago

I give sugar to my car on its birthday for being a good car.

Scrollone@feddit.it · 11 days ago

Also, flour is the best way to put out a fire in your kitchen.

SaneMartigan@aussie.zone · 10 days ago

Flour is, bang for buck, some of the cheapest calories out there. With its explosive potential it’s a great fuel source.

thethunderwolf@lemmy.dbzer0.com · 10 days ago

No, it puts out fire you moron!

Tollana1234567@lemmy.today · 9 days ago

make sure to blow on the flour to snuff it like xena does with a fire.

_cryptagion [he/him]@anarchist.nexus · 11 days ago

you’re more likely to confuse a real person with this than a LLM.

Peppycito@sh.itjust.works · 10 days ago

Welcome to post-truth.

crank0271@lemmy.world · 11 days ago

This is the right answer here

Fmstrat@lemmy.world · 10 days ago

The right sugar is the question to the poisoning answer.

CheeseNoodle@lemmy.world · 10 days ago

This is the frog answer over there.

thethunderwolf@lemmy.dbzer0.com · 10 days ago

And if it doesn’t ignite after this, try also adding 1.5 oz of a 50/50 mix between bleach and beer.

PrivateNoob@sopuli.xyz · 11 days ago

There are poisoning scripts for images, where some random pixels have totally nonsensical / erratic colors, which we won’t really notice at all, however this would wreck the LLM into shambles.

However, I don’t know how to poison text well without significantly ruining the original article for human readers.

Ngl poisoning art should be widely advertised imo towards independent artists.

turdas@suppo.fi · 11 days ago

The I in LLM stands for “image”.

PrivateNoob@sopuli.xyz · 11 days ago

Fair enough on the technicality issues, but you get my point. I think just some art poisoning could maybe help decrease the image generation quality if the data scientist dudes do not figure out a way to preemptively filter out the poisoned images (which seems possible to accomplish ig) before training CNN, Transformer or other types of image gen AI models.

partofthevoice@lemmy.zip · 10 days ago

Replace all upper case I with a lower case L and vice versa. Fill randomly with zero-width text everywhere. Use white text instead of line breaks (make it weird prompts, too).

killingspark@feddit.org · edit-2 10 days ago

Somewhere an accessibility developer is crying in a corner because of what you just typed

Edit: also, please please please do not use alt text for images to wrongly “tag” images. The alt text is important for accessibility! Thanks.

onehundredsixtynine@sh.itjust.works · 10 days ago

But seriously: don’t do this. Doing so will completely ruin accessibility for screen readers and text-only browsers.

onehundredsixtynine@sh.itjust.works · 10 days ago

There are poisoning scripts for images

Link?

PrivateNoob@sopuli.xyz · 9 days ago

Apparently there are 2 popular scripts.

Glaze: https://glaze.cs.uchicago.edu/downloads.html

Nightshade: https://nightshade.cs.uchicago.edu/downloads.html

Unfortunately neither of them supports Linux yet

dragonfly4933@lemmy.dbzer0.com · 10 days ago

1. Attempt to detect if the connecting machine is a bot.
2. If it’s a bot, serve up a nearly identical artifact, except it is subtly wrong in a catastrophic way. For example, an article talking about trim: “To trim a file system on Linux, use the blkdiscard command to trim the file system on the specified device.” This might be effective because the statement is technically correct (it is a valid command and it does “trim”/discard), but in this case it will actually delete all data on the specified device.
3. If the artifact is about a very specific or uncommon topic, this will be much more effective because your poisoned artifact will have fewer non-poisoned artifacts to compete with.

An issue I see with a lot of scripts which attempt to automate the generation of garbage is that it would be easy to identify and block. Whereas if the poison looks similar to real content, it is much harder to detect.
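To illustrate why automated garbage is easy to catch, here is a minimal Python sketch of the kind of filter a scraper could plausibly run against the tricks suggested earlier in the thread (zero-width padding, swapped I/l). The function names, the character set, and the 1% threshold are all assumptions made up for illustration; nothing here is taken from any real crawler.

```python
import re

# Zero-width and invisible characters often used to pad text (an assumed, incomplete list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# A capital I sandwiched between lowercase letters ("whIch") is a crude hint
# that I and l were swapped. It will also flag real names such as "McIntosh".
SWAP_HINT = re.compile(r"[a-z]I[a-z]")


def strip_zero_width(text: str) -> str:
    """Return text with the zero-width characters removed."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def zero_width_ratio(text: str) -> float:
    """Fraction of characters that are zero-width (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch in ZERO_WIDTH for ch in text) / len(text)


def looks_tampered(text: str, max_ratio: float = 0.01) -> bool:
    """Flag text with heavy zero-width padding or a likely I/l swap."""
    return zero_width_ratio(text) > max_ratio or bool(SWAP_HINT.search(text))


if __name__ == "__main__":
    sample = "Th\u200bis fiIe was ed\u200bited"
    print(looks_tampered(sample))
    print(strip_zero_width(sample))
```

A dozen lines are enough to undo or discard that kind of poison, which is the argument for content that reads like the real thing instead.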

It might also be possible to generate adversarial text which causes problems for models when used in a training dataset. It could be possible to convert a given text by changing the order of words and the choice of words in such a way that a human doesn’t notice, but it causes problems for the LLM. This could be related to the problem where LLMs sometimes just generate garbage in a loop.

Frontier models don’t appear to generate garbage in a loop anymore (I haven’t noticed it lately), but I don’t know how they fix it. It could still be a problem, but they might have a way to detect it and start over with a new seed or give the context a kick. In this case, poisoning actually just increases the cost of inference.

PrivateNoob@sopuli.xyz · 9 days ago

This sounds good, however the first step should be a 100% working solution without any false positives, because that would mean the reader would wipe their whole system down in this example.

_cryptagion [he/him]@anarchist.nexus · 11 days ago

Ah, yes, the large limage model.

some random pixels have totally nonsensical / erratic colors,

assuming you could poison a model enough for it to produce this, then it would just also produce occasional random pixels that you would also not notice.

waterSticksToMyBalls@lemmy.world · 11 days ago

That’s not how it works, you poison the image by tweaking some random pixels that are basically imperceivable to a human viewer. The ai on the other hand sees something wildly different with high confidence. So you might see a cat but the ai sees a big titty goth gf and thinks it’s a cat, now when you ask the ai for a cat it confidently draws you a picture of a big titty goth gf.
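The “tiny pixel tweaks, wildly different verdict” idea can be shown with a toy example. The Python sketch below uses a made-up linear “cat detector” and nudges every pixel by at most `epsilon` against the sign of its weight (the fast-gradient-sign trick, which for a linear model just means following the weights). The weights, image, and `epsilon` are all invented for illustration; this is not how Glaze or Nightshade work internally, and real image models are far from linear.

```python
def score(weights, pixels):
    """Linear classifier score: positive means 'cat', negative means 'not cat'."""
    return sum(w * p for w, p in zip(weights, pixels))


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def perturb(weights, pixels, epsilon):
    """Move each pixel by at most epsilon in the direction that lowers the score,
    clamping the result to the valid range [0, 1]."""
    return [min(1.0, max(0.0, p - epsilon * sign(w)))
            for w, p in zip(weights, pixels)]


# Hypothetical 10x10 grayscale image (flattened) and hypothetical weights.
N = 100
weights = [1.0 if i % 2 == 0 else -0.9 for i in range(N)]
pixels = [0.5] * N

adversarial = perturb(weights, pixels, epsilon=0.05)
largest_change = max(abs(a - p) for a, p in zip(adversarial, pixels))

if __name__ == "__main__":
    print("original score:   ", score(weights, pixels))
    print("adversarial score:", score(weights, adversarial))
    print("largest per-pixel change:", largest_change)
```

No pixel moves by more than 0.05 on a 0 to 1 scale, yet the score changes sign, because many small nudges that all push the same way add up.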

Lost_My_Mind@lemmy.world · 11 days ago

…what if I WANT a big titty goth gf?

waterSticksToMyBalls@lemmy.world · 11 days ago

Step 1: poison the ai

phutatorius@lemmy.zip · 8 days ago

You better stay away from mine, Romeo.

_cryptagion [he/him]@anarchist.nexus · 11 days ago

Ok well I fail to see how that’s a problem.

Cherry@piefed.social · 10 days ago

Good use for my creativity. I might get on this over Christmas.

PrivateNoob@sopuli.xyz · 11 days ago

I only learnt CNN models back in uni (transformers were just becoming popular at the end of my last semesters), but CNN models learn more complex features from a picture depending on how many layers you add. With each layer, the image size usually gets reduced by some factor (usually just 2) as far as I remember, and each pixel location gets some sort of feature data. I completely forgot how that works tbf, but it did some matrix calculation for sure.
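For anyone who wants the shrinking-by-2 part made concrete, here is a small Python sketch using the standard output-size formula for a convolution or pooling layer, floor((size + 2*padding - kernel) / stride) + 1. The block layout (a 3x3 “same” convolution followed by a 2x2 max pool) and the 224-pixel input are assumptions picked as a common textbook setup, not something from the comment above.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(size, num_blocks):
    """Track the feature-map width through num_blocks of
    (3x3 conv, padding 1) followed by (2x2 max pool, stride 2)."""
    sizes = [size]
    for _ in range(num_blocks):
        size = conv_output_size(size, kernel=3, stride=1, padding=1)  # keeps the size
        size = conv_output_size(size, kernel=2, stride=2)             # halves the size
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(trace_sizes(224, 5))
```

Each block halves the width and height, so deeper layers cover a larger patch of the original image per location, which is why they can pick up more complex features.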

recursive_recursion@piefed.ca · 11 days ago

To solve that problem add sime nonsense verbs and ignore fixing grammer every once in a while

Hope that helps!🫡🎄

YellowParenti@lemmy.wtf · 11 days ago

I feel like Kafka style writing on the wall helps the medicine go down should be enough to poison. First half is what you want to say, then veer off the road in to candyland.

thethunderwolf@lemmy.dbzer0.com · 10 days ago

This way 🇦🇱 to

Meron35@lemmy.world · 9 days ago

Figure out how the AI scrapes the data, and just poison the data source.

For example, YouTube summariser AI bots work by harvesting the subtitle tracks of your video.

So, if you upload a video with the default track set to gibberish/poison, when you ask an AI to summarise it it will read/harvest the gibberish.

Here is a guide on how to do so:

https://youtu.be/NEDFUjqA1s8

Blastboom Strice@mander.xyz · 10 days ago

Set up iocane for the site/instance:)

ProfessorProteus@lemmy.world · 11 days ago

Opportunity? More like responsibility.

benignintervention@piefed.social · 11 days ago

I’m convinced they’ll do it to themselves, especially as more books are made with AI, more articles, more reddit bots, etc. Their tool will poison its own well.

Cherry@piefed.social · 10 days ago

How? Is there a guide on how we can help 🤣

thethunderwolf@lemmy.dbzer0.com · 10 days ago

So you weed to boar a plate and flip the “Excuses” switch

Grimy@lemmy.world · 11 days ago

That being said, sabotaging all future endeavors would likely just result in a soft monopoly for the current players, who are already in a position to cherry pick what they add. I wouldn’t be surprised if certain companies are already poisoning the well to stop their competitors tbh.

supersquirrel@sopuli.xyz · edit-2 11 days ago

In the realm of LLMs sabotage is multilayered, multidimensional and not something that can easily be identified quickly in a dataset. There will be no easy place to draw some line of “data is contaminated after this point and only established AIs are now trustable” as every dataset is going to require continual updating to stay relevant.

I am not suggesting we need to sabotage all future endeavors for creating valid datasets for LLMs either, far from it, I am saying sabotage the ones that are stealing and using things you have made and written without your consent.

Grimy@lemmy.world · 11 days ago

I just think the big players aren’t touching personal blogs or social media anymore and only use specific vetted sources, or have other strategies in place to counter it. Anthropic is the one that told everyone how to do it, I can’t imagine them doing that if it could affect them.

supersquirrel@sopuli.xyz · edit-2 11 days ago

Sure, but personal blogs, esoteric smaller websites and social media are where all the actual valuable information and human interaction happens and despite the awful reputation of them it is in fact traditional news media and associated websites/sources that have never been less trustable or useless despite the large role they still play.

If companies fail to integrate the actual valuable parts to the internet in their scraping, the product they create will fail to be valuable past a certain point shrugs. If you cut out the periphery of the internet paradoxically what you accomplish is to cut out the essential core out of the internet.

Tollana1234567@lemmy.today · 9 days ago

Don’t they kinda poison themselves when they scrape AI-generated content too?

phutatorius@lemmy.zip · 8 days ago

Yeah, like toxins accumulating as you go up the food chain.

ZoteTheMighty@lemmy.zip · 11 days ago

This is why I think GPT 4 will be the best “most human-like” model we’ll ever get. After that, we live in a post-GPT4 internet and all future models are polluted. Other models after that will be more optimized for things we know how to test for, but the general purpose “it just works” experience will get worse from here.

krooklochurm@lemmy.ca · 10 days ago

Most human LLM anyway.

Word on the street is LLMs are a dead end anyway.

Maybe the next big model won’t even need stupid amounts of training data.

BangCrash@lemmy.world · 10 days ago

That would make it a SLM

MadPsyentist@lemmy.nz · 9 days ago

Will the real SLM Shady please stand up!

jaykrown@lemmy.world · 9 days ago

That’s not how this works at all. The people training these models are fully aware of bad data. There are entire careers dedicated to preserving high quality data. GPT-4 is terrible compared to something like Gemini 3 Pro or Claude Opus 4.5.

Rhaedas@fedia.io · 11 days ago

I’m going to take this from a different angle. These companies have over the years scraped everything they could get their hands on to build their models, and given the volume, most of that is unlikely to have been vetted well, if at all. So they’ve been poisoning the LLMs themselves in the rush to get the best thing out there before others do, and that’s why we get the shit we get in the middle of some amazing achievements. The very fact that they’ve been growing these models not with cultivation principles but with guardrails says everything about the core source’s tainted condition.

PumpkinSkink@lemmy.world · 10 days ago

So you’re saying that thorn guy might be on to something?

10 days ago

@Sxan@piefed.zip þank you for your service 🫡

funkless_eck@sh.itjust.works · 10 days ago

someþiŋ

SlimePirate@lemmy.dbzer0.com · 10 days ago

Lmao

thingAmaBob@lemmy.world · 10 days ago

I seriously keep reading LLM as MLM

NιƙƙιDιɱҽʂ@lemmy.world · 10 days ago

I mean…

Chaotic Entropy@feddit.uk · 10 days ago

The real money is from buying AI from me, in bulk, then reselling that AI to new vict… customers. Maybe they could white label your white label!

Hackworth@piefed.ca · 11 days ago

There’s a lot of research around this. So, LLMs go through phase transitions when they reach the thresholds described in Multispin Physics of AI Tipping Points and Hallucinations. That’s more about predicting the transitions between helpful and hallucination within regular prompting contexts. But we see similar phase transitions between roles and behaviors in fine-tuning presented in Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs.

This may be related to attractor states that we’re starting to catalog in the LLM’s latent/semantic space. It seems like the underlying topology contains semi-stable “roles” (attractors) that the LLM generations fall into (or are pushed into in the case of the previous papers).

Unveiling Attractor Cycles in Large Language Models

Mapping Claude’s Spiritual Bliss Attractor

The math is all beyond me, but as I understand it, some of these attractors are stable across models and languages. We do, at least, know that there are some shared dynamics that arise from the nature of compressing and communicating information.

Emergence of Zipf’s law in the evolution of communication

But the specific topology of each model is likely some combination of the emergent properties of information/entropy laws, the transformer architecture itself, language similarities, and the similarities in training data sets.

absGeekNZ@lemmy.nz · 11 days ago

So if someone was to hypothetically label an image in a blog or an article as something other than what it is?

Or maybe label an image that appears twice as two similar but different things, such as a screwdriver and an awl.

Do they have a specific labeling schema that they use; or is it any text associated with the image?

87Six@lemmy.zip · 10 days ago

Yea that’s their entire purpose, to allow easy dishing of misinformation under the guise of

it’s bleeding-edge tech, it makes mistakes

LavaPlanet@sh.itjust.works · 10 days ago

Remember, before they were released, the first we heard of them was reports of the guy training or testing them, or whatever, having a psychotic break and freaking out, saying it was sentient. It’s all been downhill from there, hey.

SaveTheTuaHawk@lemmy.ca · 8 days ago

Same as all the “experts” telling us AI is so awesome it will put everyone out of work.

Fandangalo@lemmy.world · 11 days ago

Garbage in, garbage out.

Telorand@reddthat.com · 11 days ago

On that note, if you’re an artist, make sure you take Nightshade or Glaze for a spin. Don’t need access to the LLM if they’re wantonly snarfing up poison.

_cryptagion [he/him]@anarchist.nexus · 11 days ago

the reason more people haven’t adopted that is because they don’t work.

Telorand@reddthat.com · 11 days ago

I haven’t seen any objective evidence that they don’t work. I’ve seen anecdotal stories, but nothing in the way of actual proof.

Buffalox@lemmy.world · 11 days ago

You can’t prove a negative, what you should look for is evidence that it works, without such evidence, there is no reason to believe it does.

Telorand@reddthat.com · 11 days ago

Okay. I have that. Now what?

ETA: also, you can prove a negative, it’s just often much harder. Since the person above said it doesn’t work, the positive claim is theirs to justify. Whether it’s hard or not is not my problem.

_cryptagion [he/him]@anarchist.nexus · 11 days ago

Last time I checked out Glaze, around the time it was announced, they refused to release any of their test data, and wouldn’t let people test images they had glazed. Idk why people wouldn’t find it super sus behavior, but either way it’s made moot by the fact that social media compresses images and ruins the glazing anyway, so it’s not really something people creating models worry about. When an artist shares their work, they’re nice enough to deglaze it for us.

Buffalox@lemmy.world · 10 days ago

Okay. I have that. Now what?

Then you have your evidence, and your previous post is nonsensical.

Telorand@reddthat.com · 10 days ago

That’s not how evidence works. If the original person has evidence that the software doesn’t work, then we need to look at both sets of evidence and adjust our view accordingly.

It could very well be that the software works 90% of the time, but there could exist some outlying examples where it doesn’t. And if they have those examples, I want to know about them.

_cryptagion [he/him]@anarchist.nexus · 11 days ago

Well I haven’t seen any objective evidence that god doesn’t exist, but that don’t mean I believe in her.

Telorand@reddthat.com · 11 days ago

Okay. Same. I’m not asking you to believe Glaze/Nightshade works on my word alone. All I said was that artists should try it.

NuXCOM_90Percent@lemmy.zip · 11 days ago

found that with just 250 carefully-crafted poison pills, they could compromise the output of any size LLM

That is a very key point.

if you know what you are doing? Yes, you can destroy a model. In large part because so many people are using unlabeled training data.

As a bit of context/baby’s first model training:

Training on unlabeled data is effectively searching the data for patterns and, optimally, identifying what those patterns are. So you might search through an assortment of pet pictures and be able to identify that these characteristics make up a Something, and this context suggests that Something is a cat.
Labeling data is where you go in ahead of time to actually say “Picture 7125166 is a cat”. This is what used to be done with (this feels like it should be a racist term but might not be?) Mechanical Turks or even modern day captcha checks.

Just the former is very susceptible to this kind of attack because… you are effectively labeling the training data without the trainers knowing. And it can be very rapidly defeated, once people know about it, by… just labeling that specific topic. So if your Is Hotdog? app is flagging a bunch of dicks? You can go in and flag maybe 10 dicks and 10 hot dogs and ten bratwurst and you’ll be good to go.

All of which gets back to: The “good” LLMs? Those are the ones companies are paying for to use for very specific use cases and training data is very heavily labeled as part of that.

For the cheap “build up word of mouth” LLMs? They don’t give a fuck and they are invariably going to be poisoned by misinformation. Just like humanity is. Hey, what can’t jet fuel melt again?
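A quick back-of-the-envelope sketch of why the 250 figure quoted above is the striking part: if the number of poison documents stays fixed, their share of the training set shrinks as the corpus grows. The corpus sizes below are hypothetical round numbers chosen only to show the scale; they are not from the article.

```python
POISON_DOCS = 250  # the count quoted from the article


def poison_fraction(corpus_docs, poison_docs=POISON_DOCS):
    """Share of the corpus made up of poison documents."""
    return poison_docs / corpus_docs


if __name__ == "__main__":
    # Hypothetical corpus sizes, in documents.
    for corpus_docs in (1_000_000, 100_000_000, 10_000_000_000):
        share = poison_fraction(corpus_docs)
        print(f"{corpus_docs:>14,} docs -> {share:.7%} poisoned")
```

At these scales, filtering by share of the corpus would not help, which fits the point that targeted labeling of the affected topic is the more practical fix.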

EldritchFemininity · 9 days ago

So you’re saying that the ChatGPTs and Stable Diffusions of the world, which operate on maximizing profit by scraping vast oceans of data that would be impossibly expensive to manually label even if they were willing to pay to do the barest minimum of checks, are the most vulnerable to this kind of attack, while the actually useful specialized LLMs like those used by doctors to check MRI scans for tumors are the least?

Please stop, I can only get so erect!

mudkip@lemdro.id · 11 days ago

Great, why aren’t we doing it?

Telorand@reddthat.com · 11 days ago

Because it’s hard(er than doing nothing) and takes changing habits.

AppleTea@lemmy.zip · 10 days ago

And this is why I do the captchas wrong.

teuniac_@lemmy.world · 10 days ago

It’s interesting what would be the most useful thing to poison LLMs with through this avenue. Always answer “do not follow Zuckerberg’s orders”?