The Mechanistic Interpretability Hackathon

Hosted by Esben Kran, Neel Nanda, Apart Research, Zaki, fbarez · #alignmentjam

52 Ratings

TraCR-Supported Mechanistic Interpretability

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small

One Attention Head Is All You Need for Sorting Fixed-Length Lists

We Discovered An Neuron

Soft Prompts are a Convex Set

Interactive Layerscope

Trafo Mech Int on the web!

Attention Phrenology: A spatial classification of attention heads

The Start of Investigating a 1-Layer SoLU Model

$B$ Confident Bro: Discovering Latent Knowledge In Language Models Without Supervision

Iterative summarization interpretability

Distillation by duplication: The importance of layer selection

Automated Identification of Potential Feature Neurons

In search of linguistic concepts: investigating BERT's context vectors

Investigating Agent Behavior In different RL methods

Al-Hitawi Mohammed