Adversarial Attacks Against Detecting Bot Generated Text
Contemporary text generation models produce neural text that is increasingly indistinguishable from human-written text. This poses a threat if used maliciously to produce misinformation or extremist content. Recent work has explored building detectors to identify neural text. In this paper, I present an approach to generating adversarial text that fools detectors whilst remaining fluent to humans. This approach decreases detector recall on GPT-2-generated text from 99% to 0.4% with an average of 6.4 words perturbed, and recall on bot-generated tweets from 93% to 33.9% with an average of 4.2 words perturbed. My findings also suggest that such attacks can be performed with random perturbations at low cost and still achieve a significant success rate. This indicates that current detectors are not robust and that many real-world systems are vulnerable to adversarially perturbed neural text. Finally, I experiment with possible defences, including training with adversarial examples and using TF-IDF and stylometric features.