Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
AI Safety Fundamentals: Alignment - Podcast by BlueDot Impact
 
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads ...
