Loved the research question! Try having a look at TCAV and at our results from the previous hackathon, where we looked for concepts in a Connect Four RL agent.
In search of linguistic concepts: investigating BERT's context vectors
Results
Criteria | Rank | Score* | Raw Score |
ML Safety | #5 | 3.333 | 3.333 |
Generality | #5 | 3.000 | 3.000 |
Novelty | #11 | 2.667 | 2.667 |
Mechanistic interpretability | #11 | 3.333 | 3.333 |
Reproducibility | #14 | 2.000 | 2.000 |
Ranked from 3 ratings. Score is adjusted from raw score by the median number of ratings per game in the jam.
Judge feedback
Judge feedback is anonymous and shown in a random order.
- This work is nicely done as a traditional machine learning task. However, using BERT visualization on the fine-tuned models may not be very useful. It would be beneficial to include more interpretability methods to support the conclusions and investigate fine-tuning, as this area is still under-studied.
- There's actually been a fair amount of prior work on this kind of thing! Two relevant papers: https://arxiv.org/abs/1906.02715 and https://arxiv.org/abs/1905.05950. More generally, there's a whole subfield called BERTology devoted to these kinds of questions: https://arxiv.org/abs/2002.12327

  I think your motivation section is mostly false. As far as I know, there's been very little interp work on vision transformers, and attention patterns are, if anything, easier to interpret for language models than for image models. There's been a fair amount of interp work on classic image models like ConvNets and ResNets. But generally we don't interpret image models by "averaging over" inputs; other techniques like feature visualization are used: https://distill.pub/2017/feature-visualization/

  The actual method used here was fairly legit, and is analogous to what's known in the literature as probing. Here's a review: https://arxiv.org/pdf/2102.12452.pdf

  It's generally easier to classify e.g. "anger vs. not anger" than a 7-variable categorical problem like this, though you need to e.g. have the same number of anger and non-anger data points (or scale the loss for the anger ones to get comparable gradients).

  I'm pretty surprised that a two-layer BERT model could do such good fake news classification! Honestly, this makes me suspect that the dataset is badly made or too easy. It wasn't clear to me where the two-layer BERT model came from. Was it part of BERTVis?

  I'm impressed that you managed to fine-tune a language model in a weekend hackathon! That's a fair amount of effort.

  - Neel
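The suggestion above about scaling the loss for the rarer class can be illustrated with a minimal sketch. This is not the team's code: the toy data is a hypothetical stand-in for BERT context vectors (two Gaussian blobs with a 1:9 imbalance), and every function name here is made up for illustration. The weights use the inverse-frequency formula n / (2 * class count), which, as far as I know, is also what scikit-learn's `class_weight="balanced"` computes for two classes.

```python
import math
import random


def balanced_class_weights(labels):
    """Inverse-frequency weights so each class contributes the same total weight."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear probe logit for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(features, labels, lr=0.5, epochs=200):
    """Logistic-regression probe trained with class-weighted cross-entropy."""
    class_w = balanced_class_weights(labels)
    total = sum(class_w[y] for y in labels)
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Gradient of weighted cross-entropy w.r.t. the logit: weight * (p - y).
            err = class_w[y] * (sigmoid(score(w, b, x)) - y)
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / total for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b


def make_toy_data(seed=0, n_pos=20, n_neg=180, dim=4):
    """Hypothetical data: 'anger' (label 1) is rare, 'not anger' (label 0) is common."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, count, centre in ((1, n_pos, 1.0), (0, n_neg, -1.0)):
        for _ in range(count):
            features.append([rng.gauss(centre, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def recall(w, b, features, labels, target):
    """Fraction of examples with true label `target` that the probe gets right."""
    hits = [
        (1 if score(w, b, x) >= 0 else 0) == target
        for x, y in zip(features, labels)
        if y == target
    ]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = train_probe(X, y)
    print("class weights:", balanced_class_weights(y))
    print("recall on rare class:", recall(w, b, X, y, 1))
    print("recall on common class:", recall(w, b, X, y, 0))
```

Reporting recall per class rather than overall accuracy matters here, since a probe that always predicts "not anger" would already score 90% accuracy on this toy split.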
What are the full names of your participants?
Roksana Goworek, Paul Martin, Jonathan Frennert
What is your team name?
teamEd
What is your and your team's career stage?
UG students
Does anyone from your team want to work towards publishing this work later?
Yes
Where are you participating from?
Edinburgh