Adversarial Gain

Abstract

Adversarial examples can be defined as inputs to a model which induce a mistake - where the model output is different than that of an oracle, perhaps in surprising or malicious ways. Original models of adversarial attacks are primarily studied in the context of classification and computer vision tasks. While several attacks have been proposed in natural language processing (NLP) settings, they often vary in defining the parameters of an attack and what a successful attack would look like. The goal of this work is to propose a unifying model of adversarial examples suitable for NLP tasks in both generative and classification settings. We define the notion of adversarial gain: based in control theory, it is a measure of the change in the output of a system relative to the perturbation of the input (caused by the so-called adversary) presented to the learner. This definition, as we show, can be used under different feature spaces and distance conditions to determine attack or defense effectiveness across different intuitive manifolds. This notion of adversarial gain not only provides a useful way for evaluating adversaries and defenses, but can act as a building block for future work in robustness under adversaries due to its rooted nature in stability and manifold theory.

Publication
Arxiv Pre-print