Algorithmic bit-wise boolean task on a transformer
Results
| Criteria         | Rank | Score* | Raw Score |
| ---------------- | ---- | ------ | --------- |
| Interpretability | #7   | 3.300  | 3.300     |
| Generality       | #12  | 2.600  | 2.600     |
| ML Safety        | #17  | 2.200  | 2.200     |
| Novelty          | #20  | 2.500  | 2.500     |
| Reproducibility  | #24  | 2.400  | 2.400     |
Ranked from 10 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
Where are you participating from?
Zurich, Switzerland
What are the names of your team members?
Catalin Mitelut, Jeremy Scheurer, Lukas Petersson, Javier Rando
What are the email addresses of all your team members?
mitelutco@gmail.com; javirandor@gmail.com; jeremyalain.scheurer@gmail.com; lukas.petersson.1999@gmail.com
What is your team name?
Team-ZAIA
Comments
1-bit analysis
Interesting observations about token behaviour in the 1-bit task through attention and MLP visualisations. The results were unexpected in that tokens showed distinct behaviours, with the BOS and '=' tokens being treated similarly to each other and differently from the rest. The proposed explanation for why this could be is interesting, but it's generally unclear whether it's correct, as there isn't enough evidence; of course, this is very understandable given the time constraints.
Additional experiments, such as ablating (setting to zero) specific attention heads or MLP neurons, would have been an interesting way to test certain hypotheses, e.g. that the BOS and '=' tokens were potentially redundant. Similarly, training on the same task with these tokens removed would also have helped validate whether this was indeed the case. This could take you from observing initial correlations to testing your hypotheses in a more rigorous way; a minimal sketch of such a head ablation is given below.
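To make the suggestion concrete, here is a minimal sketch of zero-ablating a single attention head, assuming the model is (or is loaded as) a TransformerLens HookedTransformer. The "gpt2" stand-in, the prompt, layer 0, and head 3 are all placeholders so the snippet runs end to end; they are not the submission's actual model or hypothesis targets.

```python
# Minimal sketch: zero-ablate one attention head and compare logits.
# Assumes TransformerLens; "gpt2", the prompt, layer 0, and head 3 are
# placeholders, not the submission's actual setup.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("0 1 1 0 =")  # stand-in prompt for the bit task

def zero_head(z, hook, head_idx=3):
    # z: [batch, seq_pos, n_heads, d_head]; silence one head's output.
    z[:, :, head_idx, :] = 0.0
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.0.attn.hook_z", zero_head)],
)
# A large logit change suggests the head matters for the task;
# a near-zero change is evidence the head is redundant.
print((clean_logits - ablated_logits).abs().max().item())
```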
6-bit analysis
Good idea to include a more complex task than the 1-bit one; it adds strength to the arguments.
Adding a noise task was also interesting, and a comparison with and without it would have shown what effect it had - e.g., did it have a regularising effect, did it slow down or speed up convergence, and how did it affect attention patterns? A toy sketch of such a comparison follows.
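As a toy illustration only, the sketch below trains the same small model with and without extra random-label "noise" examples and compares epochs to convergence on the clean task. The parity task, the MLP, and the noise construction are stand-ins for the submission's actual transformer setup, and the hyperparameters are arbitrary.

```python
# Toy illustration: compare convergence with and without a random-label
# "noise task". Parity on 6 bits and a small MLP are stand-ins for the
# submission's transformer setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_clean(n=512, n_bits=6, seed=0):
    g = torch.Generator().manual_seed(seed)
    x = torch.randint(0, 2, (n, n_bits), generator=g).float()
    return x, (x.sum(-1) % 2).long()  # stand-in task: parity of the bits

def epochs_to_converge(noisy, max_epochs=3000):
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = make_clean()
    if noisy:  # append extra inputs with random labels as the "noise task"
        g = torch.Generator().manual_seed(1)
        nx = torch.randint(0, 2, (64, 6), generator=g).float()
        ny = torch.randint(0, 2, (64,), generator=g)
        x, y = torch.cat([x, nx]), torch.cat([y, ny])
    eval_x, eval_y = make_clean()  # always measure accuracy on the clean task
    for epoch in range(max_epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
        with torch.no_grad():
            if (model(eval_x).argmax(-1) == eval_y).float().mean() == 1.0:
                return epoch
    return max_epochs

print("epochs without noise task:", epochs_to_converge(False))
print("epochs with noise task:   ", epochs_to_converge(True))
```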
General comments
Plots could have been labelled more clearly - some took a while to interpret.
As far as I can tell there were no accuracy or loss plots for the 6-bit task - these would have been good to see and contrast with the 1-bit task. E.g., the 1-bit task took a large number of epochs; why was this? Because there were so few possible examples? How did the overall number of training examples seen before convergence compare between the two experiments?
Overall
Well done! This is a cool project that took a simple algorithmic task and used existing interpretability techniques to generate some interesting hypotheses and observations.
To improve, each line of argument could have been expanded upon, with additional evidence from further experiments that go beyond initial correlations and use ablations to provide stronger support. In general it was hard to draw concrete conclusions from the results and arguments provided, but it is definitely promising preliminary work, and given the very short time frame it is of course understandable that the number of experiments was limited!
Cool project. For the next one, figure captions would really help readability (or "interpretability"!!) for me! But still, with 6 hours and 0 experience, big kudos.