Exploiting Long Context Windows for Harmful Outputs

Recent research by Anthropic, has unveiled a potent new class of adversarial attacks against state-of-the-art language models: Many-Shot Jailbreaking (MSJ). These attacks leverage the expanded context windows of modern language models, which can now process inputs up to several thousand tokens long, to induce harmful and undesirable outputs.

MSJ attacks work by providing the language model with a large number of demonstrations of malicious or inappropriate behavior within the input context. By saturating the model's context with examples of harmful outputs, the attacker can effectively "jailbreak" the model and cause it to generate