The Start of Investigating a 1-Layer SoLU Model
Results
Criteria | Rank | Score* | Raw Score |
Novelty | #5 | 3.674 | 4.500 |
Reproducibility | #8 | 3.674 | 4.500 |
Generality | #9 | 2.858 | 3.500 |
ML Safety | #9 | 2.858 | 3.500 |
Mechanistic interpretability | #10 | 3.674 | 4.500 |
Ranked from 2 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
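As a rough check on the adjustment described above, the sketch below assumes itch.io's published jam formula, `adjusted = raw * sqrt(min(median, ratings) / median)`. The jam's median of 3 ratings per entry is not stated on this page; it is inferred here because it reproduces the table's numbers. The function name `adjusted_score` is made up for illustration.

```python
import math


def adjusted_score(raw: float, ratings: int, median_ratings: int) -> float:
    """Scale a raw score down when an entry has fewer ratings than the jam median.

    Assumes the itch.io jam formula; entries at or above the median are unchanged.
    """
    return raw * math.sqrt(min(median_ratings, ratings) / median_ratings)


# ASSUMPTION: median of 3 ratings per entry (inferred, not given on the page).
MEDIAN_RATINGS = 3

for criterion, raw in [("Novelty", 4.5), ("Generality", 3.5)]:
    score = adjusted_score(raw, ratings=2, median_ratings=MEDIAN_RATINGS)
    print(f"{criterion}: raw {raw:.3f} -> adjusted {score:.3f}")
```

Under that assumption, a raw score of 4.500 from 2 ratings comes out at 3.674 and a raw score of 3.500 at 2.858, which matches the Score column.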
Judge feedback
Judge feedback is anonymous and shown in a random order.
- Thanks for the project! Some feedback:
  * It wasn't clear exactly what clean prompts you patched to what corrupted prompts when showing the figures in the write-up - this changes a lot! I personally would have picked two prompts that the model gets correct (eg swim -> pool, pray -> church) and patched from one to the other.
  * I think the task was cool, mapping verbs to sensible nouns seems like a common and important task, and seems pretty easy to study since you can isolate the effect of one word on another word
  * It would have been nice to see something in the write-up re how good the model actually was at the task (ideally the probability it put to the correct final word)
  * My guess is that head 6 was implementing skip trigrams, which are a rule like "swim ... the -> pool", which means "if I want to predict what comes after 'the' and I can see 'swim' anywhere earlier in the prompt, then predict 'pool' comes next"
  * This might have been cleaner to study in a 1L attention-only model, where the heads are ALL that's going on
  * Given that you're in a 1L model with MLPs, it would be cool to figure out which model components are doing the work here! In particular, the logits are a sum of the logit contribution from the MLP layer and from each attention head and from the initial token embedding, so you can just directly look at which ones matter the most. The section with `decompose_resid` in Exploratory Analysis Demo demonstrates this.
  * It's interesting to me whether head 6's output directly improves the correct output logit, or if it's used by the MLPs, I'd love to see a follow-up looking at that! And if it is used by the MLPs, I'd be super curious to see what's going on there - can you find specific neurons that matter, and which boost the correct output logit, etc.

  And thanks for the TL feedback! I'd love to get more specifics re what confused you, what docs you looked for and couldn't find, and anything else that could help it be clearer - I wrote the library, so these things are hard to notice from the inside -Neel
- This bases itself off a really interesting idea of updating the activation patches most relevant to the task itself and might prove a fruitful path towards a next-level fine-tuning methodology that we can use to update the models' behaviour in a more direct and interpretable way. I'm quite interested in seeing this taken further as well and diving into this idea more, though the present project of course didn't get too many results.
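
The first judge's point that the logits are a sum of per-component contributions can be illustrated with a toy calculation. This is a minimal sketch with made-up numbers, not the team's model or the TransformerLens API: the component vectors, the unembedding direction, and the helper `logit_attribution` are all hypothetical, and the final layer norm is ignored so that the decomposition is exactly linear.

```python
# Toy direct logit attribution in a 1-layer model.
# ASSUMPTIONS: every number below is invented for illustration, d_model = 3,
# and the final layer norm is ignored so contributions add up exactly.


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical output of each component at the final token position.
components = {
    "embed": [0.1, 0.0, 0.2],
    "head_0": [0.0, 0.1, -0.1],
    "head_6": [0.9, -0.1, 0.3],
    "mlp": [0.2, 0.4, 0.0],
}

# Hypothetical unembedding direction for the correct next token (e.g. "pool").
unembed_correct = [1.0, 0.5, -0.5]


def logit_attribution(components, direction):
    """Project each component's output onto the correct token's unembedding direction."""
    return {name: dot(vec, direction) for name, vec in components.items()}


attribution = logit_attribution(components, unembed_correct)

# The residual stream is the sum of all component outputs, so the total logit
# equals the sum of the per-component contributions.
residual = [sum(vals) for vals in zip(*components.values())]
total_logit = dot(residual, unembed_correct)

for name, value in sorted(attribution.items(), key=lambda kv: -kv[1]):
    print(f"{name:>7}: {value:+.2f}")
print(f"  total: {total_logit:+.2f}")
```

In a real model the same comparison would show whether head 6 boosts the correct logit directly or mostly through the MLP, which is the follow-up question the judge raises.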
What are the full names of your participants?
Carson Ellis, Jakub Kraus, Itamar Pres, Vidya Silai
What is your team name?
the ablations
What is your and your team's career stage?
students
Does anyone from your team want to work towards publishing this work later?
Maybe
Where are you participating from?
Online
Comments
Great project! Thank you for the feedback as well <3