If you build it, people will try to break it. Sometimes the people who break these things are the very ones building them. Such is the case with Anthropic, whose latest research demonstrates an interesting vulnerability in current LLM technology. In short, if you keep pressing with a question, you can break through the guardrails and end up with large language models telling you things they're designed not to say. Like how to build a bomb.
Of course, given the advances in open-source AI technology, you could spin up your own LLM locally and just ask it whatever you want, but for products aimed at the general public, this is a question worth thinking about. The fun thing about AI today is the rapid pace at which it's advancing, and how we're succeeding – or failing – as a species at better understanding what we're building.
If you'll allow me this thought: I wonder whether we'll see more questions and problems like the one Anthropic describes as LLMs and other new kinds of AI models get smarter and bigger. I may be repeating myself here. But the closer we get to more generalized artificial intelligence, the more it should resemble a thinking entity rather than a computer we can program, right? If so, might we have a harder time chasing down edge cases, to the point where the work becomes infeasible? Anyway, let's talk about what Anthropic recently shared.