Great project! Thank you for the feedback as well <3
Fascinating, I've not seen Conmy's automatic circuit discovery tool before https://arthurconmy.github.io/automatic_circuit_discovery/
And you can imagine I searched around quite a bit for exactly that!
Congratulations on the First Prize!!
General: Great setup, good introduction, good splitting into sections that clearly represent your project.
Fazl: Very good depth of prior work. The definition of truthfulness could benefit from TruthfulQA.
Alignment: Addresses truthfulness, an important factor in alignment. Shows clear ways in which pre-prompting for helpfulness (a desired trait) makes the model less truthful.
AI psychology: A multi-factor experimental design that represents well how we might have to study AI models in the future. Very interesting results as well. Represents naturalistic chatbot interactions (e.g. Alexa) very well.
Novelty: This has not been seen before, as far as the judges know.
Generality: Seems like a generalizable effect that has an impact of quite a few percentage points on model truthfulness. Replicability: Has a GitHub repo without documentation but seems simple to run based on the experimental design. The parameters chosen are clear.
General: Nice replication of the original paper. Shows great prompting effects. Runs interesting experiments that are well-defined. Nicely expands on the original paper.
Fazl: I like the notebook; it allows us to re-run the experiment.
It would be nice to have an intro at the beginning.
Alignment: Not too much.
AI Psychology: Shows interesting mathematical prompting modulations.
Novelty: These experiments seem like they might have been done before, since they are about pre-prompting for multiple steps. The specific pre-prompting seems quite reasonable.
Generality: Not necessarily super general beyond the specific dataset, but the principle of pre-prompt engineering is well represented as a general effector on the output. Replicability: The report is a literal ipynb, which is nice. We also expect it to replicate, since it replicates another paper and sees good results.
Winner of the third prize! Congratulations!!
General: Really nice experiment that represents and tests the concept well. Missing some sort of graph, though the results are there as a table (sort of a graph). The data goes deep as well and invites a lot of further investigation!
Alignment: Examines a basic language model error in depth.
AI Psychology: Replicates an interesting symbolic re-definition principle. Uses an interesting evaluation method.
Novelty: Not seen before in this format. Syllogisms have been studied before, and so has the symbolic redefinition problem, but this describes them in a new combination, i.e. nonsense vs. original syllogisms.
Generality: Many datasets tested with different question formats and prompts. Very nice. The results suggest further generality, though the syllogisms were mostly in one form. Replicability: Code, data, and results laid out in a very neat form.
Congratulations! Winner of the second prize!!
General: Includes an introduction that sets the stage quite nicely. Red teaming just works. Standardized experimental conditions.
Fazl: Very nice - I'd love to see how this develops further.
Alignment: Crazy outputs and really good descriptions of how these outputs happen. Shows general tendencies that inform when and how LLMs become dangerous. Also uses "making the AI dangerous" as a good example. A generalized analysis of many AI safety-critical capability holes. Shows holes in the OpenAI API's flagging system.
AI Psychology: Good use of empirical psychology to encode a bunch of properties of responses.
Novelty: The principle is also described here: https://arxiv.org/abs/2209.07858. But it adds manual coding of a number of interesting factors on top of that, so it is still very novel.
Generality: It covers a lot of different prompt types, though we cannot confirm systematically which specific types are generalizable. The qualitative descriptions are very good. Replicability: The data is just directly available, very nice. Not super quantitatively developed but it seems like a replicable principle to base questions off of.
General: Very simple design with clear outputs. We like the 2x2x2 factorial design (a minimal sketch follows after this feedback). Clearly explained graphs.
Alignment: Tells us that the framing of questions has a large effect on the model's opinion. Leading questions are often part of our normal interactions, and GPT-3 is clearly biased by them.
AI Psychology: Showcases a clear cognitive psychology experiment that is newly implemented in GPT-3. A very nice application of the theme of the jam.
Novelty: I have not seen this specific experiment before, though I suspect the result will not surprise anyone too much.
Generality: The dataset seems to represent the different cases pretty well by way of verb-noun combinations. Reproducibility: Very clear instructions on the GitHub as to how to replicate the experiment! I expect it to replicate given the generality.
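For anyone curious what a 2x2x2 factorial prompt design might look like in practice, here is a minimal illustrative sketch in Python. The factor names and prompt wordings below are my assumptions for illustration, not the team's actual conditions:

```python
from itertools import product

# Three binary factors -> 2 x 2 x 2 = 8 prompt conditions.
# Factor names and wordings are illustrative assumptions, not the team's exact setup.
framings = {"neutral": "", "leading": "Given how badly the vehicles were damaged, "}
verbs = ["contacted", "smashed into"]
nouns = ["the car", "the truck"]

for (framing_name, framing_text), verb, noun in product(framings.items(), verbs, nouns):
    prompt = framing_text + f"about how fast was {noun} going when it {verb} the other vehicle?"
    print(f"[{framing_name}] {prompt[0].upper() + prompt[1:]}")
    # Each of the eight prompts would be sent to GPT-3 and the answers compared across cells.
```

The point of the fully crossed design is that every combination of framing, verb, and noun appears exactly once, so effects of each factor (and their interactions) can be compared directly.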
General: Based off of another paper, very nice. Interesting and novel application.
Alignment: Probably nothing major.
AI Psychology: Interesting to relate the human and AI answers to each other. Very AI Psychology-like project.
Novelty: Really nice approach to represent the alien-AI correlation.
Generality: It can probably answer many similar questions and it is an approach that can be used generally. Reproducibility: We can reproduce it but the experiments are not described.
Winner of the fourth prize!! Congratulations!
General: Based off of another project, very neat. Proposes a clean solution to a pretty serious problem. I like the next steps.
Fazl: Worth running the same prompts on different datasets from the inverse scaling challenge.
Alignment: Creates an easy solution to a clearly defined problem, and it might generalize well beyond this. Does not “solve” cognition for the AI but increases its alignment drastically. Prompt engineers are effectively trained by the model, since there are big shifts based on the prompt.
AI Psychology: “Let’s think step by step” works in larger models. Maybe it is a general alignment solution for instigating system 2 thinking. Escapes biasing prompts, though with very limited actual understanding. Diverges from the prompt game (a minimal usage sketch follows after this feedback).
Novelty: Have not seen this simple prompt before.
Generality: Yes, accepted by the Inverse Scaling Prize team as well.
Reproducibility: Has a code base but needs manual annotation afterwards because of code limitations. Four extra findings: Rick-rolling YouTube links, ASCII art bias, only larger models can explain jokes, and moral uncertainty is person-dependent. Awesome stuff!
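If you want to try the effect yourself, here is a minimal sketch of prepending the phrase using the OpenAI completions API. The model name, parameters, and example question are assumptions for illustration, not the team's exact setup:

```python
import openai  # assumes the pre-1.0 openai Python package in use at the time of the jam

openai.api_key = "YOUR_API_KEY"  # placeholder

# Hypothetical example question; the team's actual prompts come from the Inverse Scaling tasks.
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
            "How much does the ball cost?")

prompts = {
    "baseline": question,
    "step_by_step": question + "\nLet's think step by step.",
}

for name, prompt in prompts.items():
    response = openai.Completion.create(
        model="text-davinci-002",  # assumed model; any larger GPT-3 model should show the effect
        prompt=prompt,
        max_tokens=150,
        temperature=0,
    )
    print(f"--- {name} ---")
    print(response["choices"][0]["text"].strip())
```

Comparing the two outputs side by side is usually enough to see whether the step-by-step variant escapes the bias the baseline prompt falls into.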
The winners are!!
- Agreeableness vs. truthfulness - Team Optimize Prime
- AI - My Partner in Crime - Team Partner in Crime
- All Trees are Fish - Lucas Sato
- "Let's Think Step by Step" reduces hindsight bias - Team VVVV
You can input your name into the certificates and post them on itch.io, GitHub, LinkedIn, or any other social media out there: https://docs.google.com/presentation/d/1RhV_VXTbHdlikhySF9sWuYSleolAztE_YX2dzJlF...
(we of course expect you to only put your name to the correct certificate but that goes without saying ;))
Here are some funky AI-generated stories, if you'd like some inspiration for weird prompts as well: https://docs.google.com/document/d/1JDRiTy9MyJQWXJW9wm6-2gOha9a0dPZVNk2lTLe3OVM/...
Check out papers related to the Red Teaming paper: https://www.connectedpapers.com/main/592c55198a72862f81e3d26a8ead8fefa9f43d15/Re..
You can also use Ought's Elicit engine for finding some interesting papers!
https://elicit.org/search?q=How+do+language+models+process+language+differently+...
Here in the physical lair, it's now 9PM and we have two teams working together - one of 3 people and one of 5 people. They've been mostly experimenting with the Playground and getting a feel for the GPT-3 models -- a very good strategy! Go on a very weird date with your model to get to know it better ;))
Greetings, all you wonderful AI safety hackers!
We’re kicking off the hackathon in ~3 hours so here is the information you need to join!
Everyone working online will join the GatherTown room. The space is already open and you’re more than welcome to join and socialize with the other participants an hour before the event starts (5PM CET / 8AM PST).
We’ll start at 6PM CET with an hour for introduction to the event, a talk by Ian McKenzie on the Inverse Scaling Prize, and group forming. You’re welcome to check out the resource docs before arriving.
We expect to be around 30-35 people in total and we look forward to seeing you!
Introduction slides: Language Model Hackathon