The Mechanistic Interpretability Hackathon

Hosted by Esben Kran, Neel Nanda, Apart Research, Zaki, fbarez · #alignmentjam


A jam submission

The Start of Investigating a 1-Layer SoLU Model

We found a behavior at which a toy SoLU model succeeds, as well as a few edge cases where it fails.

Submitted by jakub151 — 9 minutes, 19 seconds before the deadline


Results

Criteria	Rank	Score*	Raw Score
Novelty	#5	3.674	4.500
Reproducibility	#8	3.674	4.500
Generality	#9	2.858	3.500
ML Safety	#9	2.858	3.500
Mechanistic interpretability	#10	3.674	4.500

Ranked from 2 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
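
As a rough check on the table above, the adjusted scores can be reproduced from the raw scores. This is a minimal sketch under two assumptions that are not stated on this page: that the adjustment is `raw * sqrt(min(median, ratings) / median)`, and that the jam's median number of ratings per entry was 3. The function name `adjusted_score` and the constant `ASSUMED_MEDIAN_RATINGS` are made up for illustration.

```python
import math

# Assumption: the jam-wide median number of ratings per entry was 3.
# This value is inferred from the table, not stated on the page.
ASSUMED_MEDIAN_RATINGS = 3


def adjusted_score(raw_score, num_ratings, median_ratings):
    """Scale a raw score down when an entry has fewer ratings than the median.

    Assumed formula: raw * sqrt(min(median, ratings) / median).
    """
    return raw_score * math.sqrt(min(median_ratings, num_ratings) / median_ratings)


# This entry received 2 ratings; its raw scores were 4.5 and 3.5.
for raw in (4.5, 3.5):
    print(raw, round(adjusted_score(raw, 2, ASSUMED_MEDIAN_RATINGS), 3))
```

Under these assumptions, a raw score of 4.500 maps to 3.674 and 3.500 maps to 2.858, which matches the Score column.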

Judge feedback

Judge feedback is anonymous and shown in a random order.

This bases itself off a really interesting idea of updating the activation patches most relevant to the task itself and might prove a fruitful path towards a next-level fine-tuning methodology that we can use to update the models' behaviour in a more direct and interpretable way. I'm quite interested in seeing this taken further as well and diving into this idea more, though the present project of course didn't get too many results.
Thanks for the project! Some feedback:

* It wasn't clear exactly what clean prompts you patched to what corrupted prompts when showing the figures in the write-up - this changes a lot! I personally would have picked two prompts that the model gets correct (eg swim -> pool, pray -> church) and patched from one to the other.
* I think the task was cool, mapping verbs to sensible nouns seems like a common and important task, and seems pretty easy to study since you can isolate the effect of one word on another word
* It would have been nice to see something in the write-up re how good the model actually was at the task (ideally the probability it put to the correct final word)
* My guess is that head 6 was implementing skip trigrams, which are a rule like "swim ... the -> pool", which means "if I want to predict what comes after 'the' and I can see 'swim' anywhere earlier in the prompt, then predict 'pool' comes next"
* This might have been cleaner to study in a 1L attention-only model, where the heads are ALL that's going on
* Given that you're in a 1L model with MLPs, it would be cool to figure out which model components are doing the work here! In particular, the logits are a sum of the logit contribution from the MLP layer and from each attention head and from the initial token embedding, so you can just directly look at which ones matter the most. The section with `decompose_resid` in Exploratory Analysis Demo demonstrates this.
* It's interesting to me whether head 6's output directly improves the correct output logit, or if it's used by the MLPs, I'd love to see a follow-up looking at that! And if it is used by the MLPs, I'd be super curious to see what's going on there - can you find specific neurons that matter, and which boost the correct output logit, etc.

And thanks for the TL feedback! I'd love to get more specifics re what confused you, what docs you looked for and couldn't find, and anything else that could help it be clearer - I wrote the library, so these things are hard to notice from the inside -Neel
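
The point in the feedback above about logits being a sum of per-component contributions can be illustrated with a toy calculation. This is a minimal sketch and not the submission's code: it does not use TransformerLens, all numbers are invented, and it assumes the final LayerNorm and any unembedding bias are ignored so that the map from residual stream to logits is linear. The names `W_U`, `components`, `matvec`, and `logit_attribution` are hypothetical.

```python
# Toy direct logit attribution in a 1-layer model.
# Assumption: logits = residual_stream @ W_U, with no final LayerNorm or bias.
# All numbers below are invented for illustration.

VOCAB = ["pool", "church", "the"]

# Hypothetical unembedding matrix, shape (d_model=2, d_vocab=3).
W_U = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]

# Hypothetical outputs that each component writes into the residual stream
# at the final position of a prompt like "... swim ... the".
components = {
    "embed": [0.1, 0.1],
    "head_6": [2.0, -0.5],
    "other_heads": [0.2, 0.3],
    "mlp": [0.5, 0.1],
}


def matvec(vec, mat):
    """Multiply a row vector (d_model) by a matrix (d_model x d_vocab)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def logit_attribution(components, unembed, token_index):
    """Return each component's direct contribution to one token's logit."""
    return {name: matvec(vec, unembed)[token_index] for name, vec in components.items()}


pool = VOCAB.index("pool")
attribution = logit_attribution(components, W_U, pool)

# Because the map is linear, the contributions add up to the full logit.
resid = [sum(vec[i] for vec in components.values()) for i in range(len(W_U))]
total_logit = matvec(resid, W_U)[pool]

for name, value in sorted(attribution.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:+.2f}")
print(f"{'total':12s} {total_logit:+.2f}")
```

In this invented example, `head_6` has the largest direct contribution to the "pool" logit. In a real model, a component with a small direct contribution may still matter indirectly, for example if an attention head's output is read by the MLP layer, which is the follow-up question raised in the feedback.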

What are the full names of your participants?
Carson Ellis, Jakub Kraus, Itamar Pres, Vidya Silai

What is your team name?
the ablations

What is your and your team's career stage?
students

Does anyone from your team want to work towards publishing this work later?

Maybe

Where are you participating from?

Online

Comments

Esben Kran (Host, Submitted) · 2 years ago

Great project! Thank you for the feedback as well <3

