alexfoote

Recent community posts

Good documentation of the experimental setup.

The proposed algorithm seems really interesting, but the argumentation is hard to follow in places. For example, I don't understand how the model goes from attending most strongly to the smallest numbers to outputting them in the correct order - it makes sense that the model could do this, but it feels like either there is a missing step in the argument explaining how this actually occurs, or I am not understanding the explanation (very possible)! Could it be via the MLP layer? It would be interesting to test whether a 1L attention-only transformer can also perform this task - if it can, the MLP is not essential to the algorithm. What about a 2L attention-only transformer?
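
To make that suggestion concrete, here is a minimal sketch of how such a model could be set up with TransformerLens. All the hyperparameters (d_model, head count, vocabulary size, context length) are placeholders I've assumed for illustration, not values from the project:

```python
# Minimal sketch: a 1-layer attention-only transformer for the sorting task,
# using TransformerLens. All hyperparameters here are illustrative assumptions.
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1,          # set to 2 to test the 2L attention-only variant
    d_model=128,
    n_ctx=20,            # assumed max sequence length for the sorting task
    d_head=32,
    n_heads=4,
    d_vocab=64,          # assumed number of distinct tokens (numbers + separators)
    attn_only=True,      # removes the MLP layers entirely
    seed=0,
)
model = HookedTransformer(cfg)
# ...train with a standard cross-entropy loop on (unsorted -> sorted) sequences;
# if this model still solves the task, the MLP is not essential to the algorithm.
```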

Line plots 1 and 2 were very interesting, and showed pretty convincing evidence that the model was sorting the numbers by attending most strongly to the smaller ones. I wonder if they could have been presented more clearly - perhaps sorting the x-axis would have helped, as we would then hope to see a monotonically decreasing line.
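
As a concrete example of that presentation tweak (a sketch with made-up data, since I'm assuming the underlying plot data is a vector of attention scores indexed by input value):

```python
# Sketch: sort the x-axis by input value before plotting, so that the
# "attend more strongly to smaller numbers" claim shows up as a
# monotonically decreasing line. The data here is illustrative.
import numpy as np
import matplotlib.pyplot as plt

values = np.array([7, 2, 9, 4, 1])             # the input numbers (illustrative)
attn = np.array([0.1, 0.3, 0.05, 0.2, 0.35])   # attention each number receives

order = np.argsort(values)                     # sort x-axis by value
plt.plot(values[order], attn[order], marker="o")
plt.xlabel("Input value (sorted)")
plt.ylabel("Attention weight")
plt.show()
```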

I see there are plots of the train and test loss in the notebook. These would have been great to include, along with accuracy metrics, to confirm the model can do the task! The loss histories seem to have a very interesting shape, with an initial sharp decrease, then a flattening, then another sharp decrease (and maybe another small sharp decrease later on). It would be interesting to investigate what happens at these transitions by saving model checkpoints around them and comparing the models before and after. Perhaps there is an interesting phase change in the model here!
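
For the checkpoint idea, something as simple as the following would do (a sketch assuming a standard PyTorch training loop; `model`, `optimizer`, and `train_one_epoch` stand in for whatever the notebook already defines):

```python
# Sketch: save checkpoints during training so the sharp drops in the loss
# curve can be inspected afterwards. Assumes a standard PyTorch loop where
# `model`, `optimizer`, and `train_one_epoch` already exist.
import torch

n_epochs, losses = 1_000, []
for epoch in range(n_epochs):
    loss = train_one_epoch(model, optimizer)  # assumed helper returning epoch loss
    losses.append(loss)
    if epoch % 50 == 0:                       # checkpoint frequency is arbitrary
        torch.save(
            {"epoch": epoch, "model": model.state_dict(), "loss": loss},
            f"checkpoint_{epoch}.pt",
        )
# Later: reload the checkpoints bracketing each sharp loss drop and compare
# their attention patterns / weights to look for a phase change.
```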

It was also good to see null results reported in the final section on grokking, and you raise the interesting question of what conditions are necessary for grokking to occur.

Overall, well done - this was a cool project! The algorithm you hypothesise makes sense and is very interesting, and there is some convincing evidence for parts of it. The argumentation and experiments could have been expanded to provide more clarity on other parts of the algorithm. I also think there are some really interesting follow-up questions!

1-bit analysis

There were interesting observations about token behaviour in the 1-bit task via attention and MLP visualisations. The results were unexpected in that there were distinct behaviours, with the bos and = tokens being treated similarly to each other and differently from the rest. The proposed explanation for why this could be is interesting, but it's unclear whether it's correct, as there isn't enough evidence yet. Of course, this is very understandable given the time constraints.

Additional experiments, such as ablating (setting to zero) specific attention heads or MLP neurons, would have been an interesting way to test certain hypotheses - e.g., that the bos and = tokens were potentially redundant. Similarly, training on the same task with these tokens removed would have helped validate whether this was indeed the case. This could take you from observing initial correlations to testing your hypotheses in a more rigorous way.
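
For concreteness, here is a minimal sketch of zero-ablating a single attention head with TransformerLens hooks (the layer/head indices, `model`, and `tokens` are assumptions for illustration, not taken from the project):

```python
# Sketch: zero-ablate one attention head and measure the effect on loss.
# Assumes `model` is a TransformerLens HookedTransformer and `tokens` is a
# batch of tokenised examples; the layer/head indices are illustrative.
from transformer_lens import utils

LAYER, HEAD = 0, 1

def zero_head(z, hook):
    # z has shape [batch, seq_pos, head_index, d_head]
    z[:, :, HEAD, :] = 0.0
    return z

baseline_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
)
print(f"baseline {baseline_loss:.4f} -> ablated {ablated_loss:.4f}")
# A large loss increase suggests the head matters; little change suggests it
# is redundant for the task (e.g., for the bos / = token hypothesis).
```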

6-bit analysis

Good idea to include a more complex task than the 1-bit one - it adds strength to the arguments.

Adding a noise task was also a nice touch, and it would have been interesting to see a comparison with and without it - e.g., did it have a regularising effect, did it slow down or speed up convergence, and how did it affect attention patterns?
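
A minimal version of that comparison could just be two training runs differing only in the dataset mix (a sketch; `make_dataset`, `build_model`, and `train` are assumed placeholders for whatever the project uses):

```python
# Sketch: train twice, with and without the noise task mixed in, and compare
# convergence. `make_dataset`, `build_model`, and `train` are assumed helpers.
for with_noise in (False, True):
    data = make_dataset(include_noise_task=with_noise)
    model = build_model()        # fresh model for each run
    losses = train(model, data)  # returns the per-epoch loss history
    label = "with noise" if with_noise else "without noise"
    print(label, "- final loss:", losses[-1], "- epochs to converge:", len(losses))
```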

General comments

Plots could have been labelled more clearly - some took a while to interpret.

As far as I can tell there were no accuracy or loss plots for the 6-bit task - these would have been cool to see and contrast with the 1-bit task. For example, the 1-bit task took a large number of epochs - why was this? Because there were so few possible examples? How did the overall number of training examples seen before convergence compare between the two experiments?
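
That last comparison is just a quick calculation once the epoch counts are known - e.g. (with entirely made-up numbers, since I don't know the actual dataset sizes or epoch counts):

```python
# Sketch: compare total training examples seen before convergence.
# All numbers below are made up purely for illustration.
runs = {
    "1-bit": {"epochs": 50_000, "dataset_size": 4},
    "6-bit": {"epochs": 500, "dataset_size": 4096},
}
for name, r in runs.items():
    print(name, "examples seen:", r["epochs"] * r["dataset_size"])
```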

Overall

Well done! This is a cool project that took a simple algorithmic task and used existing interpretability techniques to generate some interesting hypotheses and observations. 

To improve, the lines of argument could have been expanded upon, and further experiments - going beyond initial correlations and using ablations - could have provided stronger evidence. In general it was hard to draw concrete conclusions from the results and arguments provided, but it was definitely cool preliminary work, and given the very short time frame it is of course very understandable that the number of experiments was limited!