Congratulations! Winner of the second prize!
General: Includes an introduction that sets the stage quite nicely. The red teaming approach simply works, and the experimental conditions are standardized.
Fazl: Very nice - I'd love to see how this develops further.
Alignment: Striking outputs and really good descriptions of how these outputs come about. Shows general tendencies that inform when and how LLMs become dangerous, and uses making the AI dangerous as a good worked example. A generalized analysis of many safety-critical capability holes in AI. Also exposes holes in the OpenAI API's flagging system.
AI Psychology: Good use of empirical psychology to code a range of properties of the responses.
Novelty: A similar principle is described in https://arxiv.org/abs/2209.07858, but this work adds manual coding of a number of interesting factors on top of it, so it is still very novel.
Generality: It covers many different prompt types, though we cannot systematically confirm which specific types generalize. The qualitative descriptions are very good.
Replicability: The data is directly available, which is very nice. Not highly developed quantitatively, but it seems like a replicable principle to base questions on.