Interpreting Catastrophic Failure Modes in OpenAI's Whisper

Results
| Criteria | Rank | Score* | Raw Score |
| --- | --- | --- | --- |
| Novelty | #1 | 3.889 | 3.889 |
| ML Safety | #2 | 3.222 | 3.222 |
| Generality | #4 | 3.222 | 3.222 |
| Interpretability | #5 | 3.556 | 3.556 |
| Reproducibility | #11 | 3.667 | 3.667 |
*Ranked from 9 ratings. The score is adjusted from the raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous.
- Cool work! I'm pleasantly surprised that the logit lens works here, that you can remove so many encoder and decoder layers, and by the interesting choice of problem. And cool use of PySvelte! My guess is that this failure comes from induction heads, which notice and respond to repeated patterns, so brief hiccups turn into robust repeated sequences. Looking at which heads are most key to this behaviour would feel interesting to me. Misc point: I believe GPT-3 can also get caught repeating the same word (probably downstream of induction heads).
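The judge's hypothesis above implies a simple observable signature: once decoding falls into an induction-style loop, the transcript ends in back-to-back copies of the same short token sequence. A minimal, model-free sketch of a detector for that signature (the function name and thresholds are hypothetical illustrations, not part of the project's code):

```python
def trailing_repeat_period(tokens, max_period=8, min_repeats=3):
    """Return the shortest period p such that the sequence ends with at
    least `min_repeats` consecutive copies of its final p tokens,
    or None if no such degenerate loop is found."""
    n = len(tokens)
    for p in range(1, max_period + 1):
        if n < p * min_repeats:
            break
        unit = tokens[n - p:]  # candidate repeating unit
        # count consecutive copies of `unit` ending the sequence
        repeats = 1
        i = n - 2 * p
        while i >= 0 and tokens[i:i + p] == unit:
            repeats += 1
            i -= p
        if repeats >= min_repeats:
            return p
    return None
```

For example, a hallucinating transcript ending in "sat sat sat" is flagged with period 1, while a normal sentence returns None; this kind of check makes it easy to collect looping transcripts before looking at which attention heads drive them.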
Where are you participating from?
London, UK
What are the names of your team members?
Edward Rees, John Hughes, Ellena Reid
What are the email addresses of all your team members?
edward.r.rees@gmail.com
Comments
Nice! It's great to see steps towards interpretability of multi-modal models.
Link to our attention-score experiments with Whisper and our analysis so far of when the model hallucinates (also provided in the report): https://github.com/erees1/alignment-jam/blob/main/Whisper_Attention.ipynb
Very cool to see someone pulling apart such a new and interesting NN.
Here's the link to reproduce the logit lens experiments with Whisper (we didn't have time to put it in our write-up): https://github.com/McHughes288/alignment-jam/blob/main/logit_lens_whisper.ipynb