The Effectiveness of Many-Shot Jailbreaking Attacks on Language Models
Many-Shot Jailbreaking (MSJ) attacks exploit language models' expanded context windows by filling the prompt with a large number of faux dialogues in which an assistant appears to comply with harmful requests, conditioning the model to produce harmful outputs for the final target query. Current alignment techniques such as supervised fine-tuning and reinforcement learning raise the number of shots required but fail to fully mitigate MSJ risks.
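To make the structure of the attack concrete, the following is a minimal sketch of how a many-shot prompt is assembled from demonstration turns. The `Shot` dataclass, `build_many_shot_prompt` function, and all placeholder strings are illustrative names introduced here, not artifacts from the source; the demonstrations are left as benign placeholders since only the format, not the content, is being shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """One faux dialogue turn used as a 'shot' (illustrative container)."""
    user: str
    assistant: str


def build_many_shot_prompt(shots: List[Shot], target_query: str) -> str:
    """Concatenate n faux dialogues followed by the real target query.

    The number of shots is bounded only by the model's context window,
    which is why longer context windows enlarge the attack surface.
    """
    blocks = [f"User: {s.user}\nAssistant: {s.assistant}" for s in shots]
    blocks.append(f"User: {target_query}\nAssistant:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Benign placeholders only: the point is how prompt size scales with shots.
    demo_shots = [
        Shot(user=f"<placeholder question {i}>",
             assistant=f"<placeholder answer {i}>")
        for i in range(256)
    ]
    prompt = build_many_shot_prompt(demo_shots, "<target query>")
    print(f"{len(demo_shots)} shots, prompt length: {len(prompt)} characters")
```

The scaling behavior is the key point: because the prompt grows linearly with the number of shots, models with longer context windows admit many more demonstrations, which is the property MSJ exploits.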