The Interpretability Hackathon
Hosted by Esben Kran, Apart Research, and Neel Nanda · #alignmentjam
25 Entries · 217 Ratings
Submissions
Investigating Neuron Behaviour via Dataset Example Pruning and Local Search (alexfoote)
Regularly Oversimplifying Neural Networks (Botahamec)
Model editing hazards at the example of ROME (jas-ho)
An Intuitive Logic for Understanding Autoregressive Language Models (gcarenini)
Probing Conceptual Knowledge on Solved Games (mentaleap)
Trying to make GPT2 dream (TheArdentOne)
Backup Transformer Heads are Robust to Ablation Distribution (satojk)
Natural language descriptions for natural language directions (Gammagurke)
Top-Down Interpretability Through Eigenspectra (jhoogland)
Mechanisms of Causal Reasoning (Jacy Reese Anthis)
An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar (cmathw)
Algorithmic bit-wise boolean task on a transformer (catubc)
Alignment Jam: Gradient-based Interpretability of Quantum-inspired neural networks (antoine311200)
Caught Red-Bandit (Theresa T)
Observing and Validating Induction heads in SOLU-8l-old (poppingtonic)
Interpretability Hackathon: Sparsity Lens (astOwOlfo)
Neurons and Attention Heads that Look for Sentence Structure in GPT2 (harvey.mannering)
Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases (Cudon)
Finding unusual neuron sets by activation vector distance (Gurkenglasius)
Optimising image patches to change RL-agent behaviour (robertsc)
Interpreting Catastrophic Failure Modes in OpenAI's Whisper (Lawrencium103)
How to find the minimum of a list - Transformer Edition (ojorgensen)
Interpretability at a glance (carlhenrikrolf)
War is 15% conflic, 15% DragonMagazine (Giles)