The Interpretability Hackathon

Hosted by Esben Kran, Apart Research, Neel Nanda · #alignmentjam

Results

25 entries were submitted between 2022-11-11 17:00:00 and 2022-11-13 13:15:00. 217 ratings were given to 24 entries (96.0%) between 2022-11-13 13:15:00 and 2022-11-13 16:30:00. The average number of ratings per game was 8.7 and the median was .

By criteria: Judge's choice, ML Safety, Interpretability, Novelty, Generality, Reproducibility

Backup Transformer Heads are Robust to Ablation Distribution

by satojk

Ranked 1st in Reproducibility with 9 ratings (Score: 4.556)

Criteria	Rank	Score*	Raw Score
Interpretability	#1	4.222	4.222
Reproducibility	#1	4.556	4.556
Judge's choice	#2	n/a	n/a
Generality	#10	2.778	2.778
ML Safety	#10	2.778	2.778
Novelty	#11	2.889	2.889

War is 15% conflic, 15% DragonMagazine

by Giles

Ranked 2nd in Reproducibility with 11 ratings (Score: 4.364)

Criteria	Rank	Score*	Raw Score
Reproducibility	#2	4.364	4.364
Interpretability	#4	3.636	3.636
ML Safety	#6	3.000	3.000
Novelty	#8	3.273	3.273
Generality	#14	2.545	2.545

Optimising image patches to change RL-agent behaviour

by robertsc

Ranked 3rd in Reproducibility with 10 ratings (Score: 4.300)

Criteria	Rank	Score*	Raw Score
Reproducibility	#3	4.300	4.300
ML Safety	#9	2.900	2.900
Generality	#16	2.500	2.500
Novelty	#16	2.700	2.700
Interpretability	#20	2.200	2.200

Neurons and Attention Heads that Look for Sentence Structure in GPT2

by harvey.mannering

Ranked 4th in Reproducibility with 8 ratings (Score: 4.243)

Criteria	Rank	Score*	Raw Score
Reproducibility	#4	4.243	4.500
Generality	#9	2.828	3.000
Interpretability	#11	3.182	3.375
Novelty	#14	2.711	2.875
ML Safety	#16	2.239	2.375
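In the table above, every Score* value is lower than its Raw Score by the same factor, which suggests an adjustment for entries that received few ratings. The following is a minimal sketch of one such adjustment, assuming the raw score is scaled by the square root of the entry's rating count over the median rating count. Both the formula and the median of 9 are assumptions, not values stated on this page; the function name `adjusted_score` is illustrative.

```python
import math


def adjusted_score(raw_score: float, num_ratings: int, median_ratings: int) -> float:
    """Scale a raw average down when an entry has fewer ratings than the median."""
    # Entries at or above the median keep their raw score (factor of 1).
    capped = min(num_ratings, median_ratings)
    return raw_score * math.sqrt(capped / median_ratings)


# Assumed median number of ratings per entry (not stated in the source).
ASSUMED_MEDIAN = 9

# Reproducibility row above: raw score 4.500 from 8 ratings.
print(round(adjusted_score(4.500, 8, ASSUMED_MEDIAN), 3))  # 4.243
```

Under these assumptions the sketch matches the other rows of this table as well (for example, 3.000 becomes 2.828), and entries with 9 or more ratings keep identical Score* and Raw Score values.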

Probing Conceptual Knowledge on Solved Games

by mentaleap

Ranked 5th in Reproducibility with 14 ratings (Score: 4.214)

Criteria	Rank	Score*	Raw Score
ML Safety	#3	3.214	3.214
Judge's choice	#4	n/a	n/a
Reproducibility	#5	4.214	4.214
Novelty	#13	2.857	2.857
Interpretability	#14	2.929	2.929
Generality	#20	2.286	2.286

Model editing hazards at the example of ROME

by jas-ho, JuliaPersson, goodheart_points

Ranked 6th in Reproducibility with 8 ratings (Score: 4.125)

Criteria	Rank	Score*	Raw Score
Judge's choice	#3	n/a	n/a
Generality	#5	3.064	3.250
Reproducibility	#6	4.125	4.375
Novelty	#6	3.300	3.500
Interpretability	#6	3.536	3.750
ML Safety	#8	2.946	3.125

Regularly Oversimplifying Neural Networks

by Botahamec, Nicholas Kross

Ranked 7th in Reproducibility with 11 ratings (Score: 4.091)

Criteria	Rank	Score*	Raw Score
Novelty	#5	3.364	3.364
Reproducibility	#7	4.091	4.091
Generality	#14	2.545	2.545
ML Safety	#15	2.273	2.273
Interpretability	#16	2.727	2.727

An Intuitive Logic for Understanding Autoregressive Language Models

by gcarenini

Ranked 8th in Reproducibility with 9 ratings (Score: 3.889)

Criteria	Rank	Score*	Raw Score
ML Safety	#1	3.778	3.778
Interpretability	#1	4.222	4.222
Generality	#1	3.444	3.444
Novelty	#2	3.778	3.778
Reproducibility	#8	3.889	3.889

Alignment Jam : Gradient-based Interpretability of Quantum-inspired neural networks

by antoine311200

Ranked 9th in Reproducibility with 11 ratings (Score: 3.818)

Criteria	Rank	Score*	Raw Score
Novelty	#4	3.545	3.545
Generality	#6	3.000	3.000
Reproducibility	#9	3.818	3.818
Interpretability	#13	3.000	3.000
ML Safety	#13	2.545	2.545

Top-Down Interpretability Through Eigenspectra

by jhoogland

Ranked 10th in Reproducibility with 9 ratings (Score: 3.778)

Criteria	Rank	Score*	Raw Score
Generality	#1	3.444	3.444
ML Safety	#4	3.111	3.111
Novelty	#10	3.111	3.111
Interpretability	#10	3.222	3.222
Reproducibility	#10	3.778	3.778

Interpreting Catastrophic Failure Modes in OpenAI’s Whisper

by Lawrencium103

Ranked 11th in Reproducibility with 9 ratings (Score: 3.667)

Criteria	Rank	Score*	Raw Score
Novelty	#1	3.889	3.889
ML Safety	#2	3.222	3.222
Generality	#4	3.222	3.222
Interpretability	#5	3.556	3.556
Reproducibility	#11	3.667	3.667

An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar

by cmathw

Ranked 12th in Reproducibility with 8 ratings (Score: 3.653)

Criteria	Rank	Score*	Raw Score
Interpretability	#8	3.300	3.500
Reproducibility	#12	3.653	3.875
Generality	#13	2.593	2.750
ML Safety	#14	2.475	2.625
Novelty	#17	2.593	2.750

Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases

by Cudon

Ranked 13th in Reproducibility with 8 ratings (Score: 3.536)

Criteria	Rank	Score*	Raw Score
Reproducibility	#13	3.536	3.750
ML Safety	#20	2.003	2.125
Generality	#21	2.239	2.375
Interpretability	#22	2.121	2.250
Novelty	#23	1.768	1.875

How to find the minimum of a list - Transformer Edition

by ojorgensen, J Miller, StefanHex, dj251298, ItsUrBoyAA

Ranked 13th in Reproducibility with 8 ratings (Score: 3.536)

Criteria	Rank	Score*	Raw Score
Interpretability	#11	3.182	3.375
Reproducibility	#13	3.536	3.750
Novelty	#14	2.711	2.875
Generality	#17	2.475	2.625
ML Safety	#22	1.650	1.750

Mechanisms of Causal Reasoning

by Jacy Reese Anthis

Ranked 13th in Reproducibility with 8 ratings (Score: 3.536)

Criteria	Rank	Score*	Raw Score
ML Safety	#11	2.593	2.750
Generality	#11	2.711	2.875
Reproducibility	#13	3.536	3.750
Interpretability	#17	2.593	2.750
Novelty	#22	2.239	2.375

Observing and Validating Induction heads in SOLU-8l-old

by poppingtonic

Ranked 16th in Reproducibility with 7 ratings (Score: 3.402)

Criteria	Rank	Score*	Raw Score
Reproducibility	#16	3.402	3.857
Interpretability	#21	2.142	2.429
Generality	#23	2.016	2.286
ML Safety	#23	1.638	1.857
Novelty	#24	1.638	1.857

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

by alexfoote

Ranked 17th in Reproducibility with 9 ratings (Score: 3.222)

Criteria	Rank	Score*	Raw Score
Judge's choice	#1	n/a	n/a
Generality	#1	3.444	3.444
Interpretability	#3	3.778	3.778
ML Safety	#4	3.111	3.111
Novelty	#9	3.222	3.222
Reproducibility	#17	3.222	3.222

Finding unusual neuron sets by activation vector distance

by Gurkenglasius

Ranked 17th in Reproducibility with 9 ratings (Score: 3.222)

Criteria	Rank	Score*	Raw Score
Reproducibility	#17	3.222	3.222
Generality	#18	2.444	2.444
Novelty	#18	2.556	2.556
ML Safety	#21	1.889	1.889
Interpretability	#23	2.000	2.000

Interpretability at a glance

by carlhenrikrolf, koriavinash1, HH10

Ranked 19th in Reproducibility with 8 ratings (Score: 3.182)

Criteria	Rank	Score*	Raw Score
Novelty	#6	3.300	3.500
Generality	#7	2.946	3.125
Interpretability	#8	3.300	3.500
ML Safety	#11	2.593	2.750
Reproducibility	#19	3.182	3.375

Interpretability Hackathon: Sparsity Lens

by astOwOlfo

Ranked 20th in Reproducibility with 7 ratings (Score: 3.024)

Criteria	Rank	Score*	Raw Score
Generality	#8	2.898	3.286
Interpretability	#15	2.772	3.143
ML Safety	#18	2.142	2.429
Novelty	#19	2.520	2.857
Reproducibility	#20	3.024	3.429
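
The Reproducibility ranks listed above skip from 13th to 16th and from 17th to 19th, which is consistent with standard competition ranking: tied entries share the best available rank and the following rank is skipped accordingly. The sketch below reproduces that pattern from the 20 Reproducibility scores shown on this page. The function name `competition_ranks` is illustrative, and the ranking rule is an inference from the listed ranks rather than something the page states.

```python
def competition_ranks(scores):
    """Return 1-based ranks; tied scores share a rank and later ranks skip ahead."""
    ordered = sorted(scores, reverse=True)
    # A score's rank is the first position at which it appears in descending order.
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Reproducibility Score* values in the order the entries appear above.
reproducibility_scores = [
    4.556, 4.364, 4.300, 4.243, 4.214, 4.125, 4.091, 3.889, 3.818, 3.778,
    3.667, 3.653, 3.536, 3.536, 3.536, 3.402, 3.222, 3.222, 3.182, 3.024,
]

print(competition_ranks(reproducibility_scores))
```

The three entries scoring 3.536 all receive rank 13 and the next entry receives rank 16, matching the listing.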