Exploiting Hallucinations to Bypass Filters in Language Models with Reversals
This paper introduces a novel method for bypassing the filters of Large Language Models (LLMs) such as GPT-4 and Claude Sonnet through induced hallucinations, revealing a significant vulnerability in their reinforcement learning from human feedback (RLHF) fine-tuning process.