Robust Feature-Level Adversaries Are Interpretability Tools

AI Safety Fundamentals: Alignment - Podcast by BlueDot Impact

Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying ...
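To make the idea of a feature-level attack concrete, here is a minimal sketch of how one might perturb an image generator's latent representation so that the generated image fools a classifier. This is an illustration under assumed interfaces (a `generator` mapping latents to images and a `classifier` mapping images to logits), not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def feature_level_attack(generator, classifier, z_init, target_class,
                         steps=200, lr=0.05):
    """Optimize a perturbation of the latent code z_init so the generated
    image is classified as target_class (hypothetical interfaces)."""
    delta = torch.zeros_like(z_init, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        image = generator(z_init + delta)      # perturbation acts on features, not pixels
        logits = classifier(image)
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The resulting change is perceptible and often semantically interpretable,
    # unlike pixel-level noise.
    return generator(z_init + delta).detach(), delta.detach()
```

Because the optimization happens in the generator's latent space rather than pixel space, the resulting perturbations correspond to visible, human-describable changes in the image, which is what makes them useful as interpretability tools.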