The Interpretability Hackathon
Hosted by Esben Kran, Apart Research, and Neel Nanda · #alignmentjam
25 Entries · 217 Ratings
Submissions
Investigating Neuron Behaviour via Dataset Example Pruning and Local Search (alexfoote)
Regularly Oversimplifying Neural Networks (Botahamec)
Model editing hazards at the example of ROME (jas-ho)
An Intuitive Logic for Understanding Autoregressive Language Models (gcarenini)
Probing Conceptual Knowledge on Solved Games (mentaleap)
Trying to make GPT2 dream (TheArdentOne)
Backup Transformer Heads are Robust to Ablation Distribution (satojk)
Natural language descriptions for natural language directions (Gammagurke)
Top-Down Interpretability Through Eigenspectra (jhoogland)
Mechanisms of Causal Reasoning (Jacy Reese Anthis)
An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar (cmathw)
Algorithmic bit-wise boolean task on a transformer (catubc)
Alignment Jam: Gradient-based Interpretability of Quantum-inspired neural networks (antoine311200)
Caught Red-Bandit (Theresa T)
Observing and Validating Induction heads in SOLU-8l-old (poppingtonic)
Interpretability Hackathon: Sparsity Lens (astOwOlfo)
Neurons and Attention Heads that Look for Sentence Structure in GPT2 (harvey.mannering)
Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases (Cudon)
Finding unusual neuron sets by activation vector distance (Gurkenglasius)
Optimising image patches to change RL-agent behaviour (robertsc)
Interpreting Catastrophic Failure Modes in OpenAI's Whisper (Lawrencium103)
How to find the minimum of a list - Transformer Edition (ojorgensen)
Interpretability at a glance (carlhenrikrolf)
War is 15% conflic, 15% DragonMagazine (Giles)