I wasn't sure whether to make this a separate top-level reply, but I guess it's mostly about the same thing, really, so, sub-reply it is, I suppose. Even though some of it is rather different.
To make a long story short, I made a new variant of the Snake game, which is not based on overlays but on another idea I had for enabling use of more code than you have memory. This variant can be started with the built-in bootloader, instead of needing a custom ROM.
This variant actually runs rather faster than the first one, taking "only" almost 2 seconds per step (though about double that when you eat food or press a key). It also uses a different color scheme that I think I like better than the old one (which I suspect used the background it did primarily to show off how that was drawn... *eyeroll*). I suspect that on your 1kHz machine it might be too fast to easily play.
For this variant I did away with the fancy resetting, instead simply restarting the game (well, not quite completely, but nearly so) - mainly to get enough space for the rest of the code.
I also have a version that shows where the snake's head is during gameplay, but it is slower - it tipped to the other side of the 2 second mark. So I decided not to include that in the final version, instead just blinking the head after a crash to show where you ended up - but if you want to try it, I kept that version of it in a branch. Here's how that looked:
As for the code, how this was done, well, to be honest the idea it's based on sounds a bit preposterous on the face of it, but as you can see, it turned out to work fairly well for this particular use case.
The idea was to not load sections of code into memory and then run them, but instead load and execute the instructions one at a time directly from the disk.
This means that for every instruction of the main program, the runtime actually runs several instructions. (5 in my runtime, so it should take 5 times as long to run a program, right? Well, both yes and no...)
In return, this enables the main code to be written in a more sequential manner, instead of dividing it up into blocks that you switch between, which makes things easier because you don't have to consider the tight memory space constraints all the time while writing the code.
Of course, this works best for code that is primarily sequential instead of having tight loops that need to be fast, but that's exactly what I saw in Snake - I didn't have many loops, or really time-sensitive things, mostly just long sections of code to run through mostly sequentially (with some jumps).
There were a couple exceptions to this, the line drawing inner loop and the snake position update, so I put those into the memory space left over from the main runtime, as separate functions I could just call from the rest of the code.
Getting back to the speed, this runtime really does make most of the instructions take 5 times as long as running them directly. However, what we're really comparing against is the overlay-based variant, and since the main loop there took 3 overlays for a single run through, it was effectively loading each instruction from disk for every time it was run anyway - and the overlay loader needs more instructions for doing that than the one-at-a-time runtime does.
(Of course, this does strongly indicate that having enough RAM to keep all of the code loaded at the same time, with a game variant written accordingly, would still be quite a bit faster - in fact, by a factor of ~5.)
The runtime also imposes some limitations on the code, such as JMP/IFJMP not really working the way they usually do, and R1 not being persistent from one instruction to the next (which affects any use of GETDATA). So the code needs to be written for this runtime, most nontrivial pre-existing code won't work without changes.
While it would be possible to make those things work, to make the runtime more transparent, that would significantly slow it down, as it would need to look at what each instruction is and handle some of them specially, instead of just blindly executing them like it does now.
Here's the source code of the game, and of just the base disk-runner runtime (which is included in the game code). The preprocessor isn't ideal for this kind of self-loading program, but it still made things significantly easier. (And I did end up using the per-line disk address comments for debugging, so good call on that being useful.)
Here's an already-processed and comment-stripped version of the game (and runtime):
DATAC 00000000000000000000000000010111
JMP 1 19
DATAC 00000000000000000000000000000001
MATH 13 14 5
JMP 1 5
JMP 1 19
GETDATA 1 3 14
MATH 15 14 0
MOVO 1 8
NIL
MOVO 1 1
JMP 1 5
GETDATA 1 3 2
MOVO 1 19
MATH 15 2 0
MOVI 1 12
MATH 15 1 0
MOVO 1 12
IFJMP 1 11 1
JMP 1 5
MOVI 15 1
MOVI 14 22
JMP 1 5
DATAC 00000000000000000000000000011000
MATH 2 2 1
MATH 3 3 1
SET 2 3 10011010
SET 3 3 10100101
SETDATA 0 0 1000000000000000000000
JMP 1 11
MATH 12 12 1
SET 12 3 10010101
MATH 2 2 1
MATH 3 3 1
MATH 4 4 1
SETDATA 0 0 1100000000000000000000
MATH 14 13 5
SET 14 3 10001100
SET 14 3 10001100
SET 14 3 10001100
SET 14 3 10001100
SET 14 3 10001100
PMOV 15 10 0 31 6 1
MATH 9 9 1
SET 9 0 11011110
MATH 4 4 1
MATH 6 6 1
MATH 7 7 1
MATH 8 8 1
SET 4 3 10101000
SET 6 3 00000110
SET 7 3 00001110
SET 8 3 01010100
MATH 4 11 5
MATH 4 12 5
MATH 15 5 5
SETDATA 1 3 11 9
SET 13 3 00111100
SETDATA 0 0 0100000000000000000000
SETDATA 0 3 9
MATH 11 2 5
MATH 15 2 0
MATH 8 2 4
MATH 12 3 5
MATH 8 3 4
SET 13 3 01010001
IFJMP 1 2 0
MATH 14 13 5
MATH 7 2 6
MATH 6 3 6
MATH 15 2 0
MATH 15 3 0
PMOV 3 3 0 31 25 0
PMOV 2 3 28 31 28 0
GETDATA 0 3 3
MOVI 2 1
PMOV 3 3 0 31 2 1
IFJMP 1 2 1
PMOV 1 3 0 1 0 0
SETDATA 0 3 3
MATH 15 5 0
MATH 2 2 1
GETDATA 2 0 0000000000000000000000
MOVI 3 1
SET 13 3 01111000
IFJMP 1 2 1
MATH 10 9 0
PMOV 9 0 2 31 2 0
SET 13 3 01100000
MATH 5 3 5
IFJMP 1 2 1
JMP 1 23
MATH 15 12 0
MATH 8 12 4
MATH 4 12 0
SET 14 3 01100011
JMP 1 26
MATH 15 5 1
MATH 3 3 1
MATH 15 11 0
MATH 8 11 4
MATH 4 11 0
SETDATA 1 3 11 9
PMOV 0 3 0 1 2 0
SET 13 3 01010001
IFJMP 1 2 0
SET 2 3 00000010
SET 13 3 00111100
IFJMP 1 2 0
SETDATA 0 0 0000000000000000000000
SETDATA 2 0 0000000000000000000000
SET 2 3 01110010
PMOV 15 0 0 31 1 1
MATH 14 13 5
MATH 0 9 0
SETDATA 0 3 9
GETDATA 2 0 0000000000000000000000
MOVI 3 1
IFJMP 1 2 1
SET 14 3 00011110
MATH 3 0 5
GETDATA 2 0 0000000000000000000000
MOVI 3 1
IFJMP 1 2 1
MATH 0 3 5
SET 13 3 10000111
SET 2 3 01110111
IFJMP 1 2 0
SET 2 3 01100001
IFJMP 1 2 0
SET 2 3 01110011
IFJMP 1 2 0
SET 2 3 01100100
IFJMP 1 2 0
SET 14 3 01010001
SET 2 3 11111100
PMOV 0 2 29 30 1 1
GETDATA 1 3 2
MOVI 10 1
SET 14 3 01010001
GETDATA 1 3 12
MOVI 0 1
MATH 15 12 0
PMOV 0 2 0 8 0 0
PMOV 0 3 9 17 9 0
PMOV 0 4 18 26 18 0
JMP 1 19
MATH 15 13 0
MATH 13 14 5
DATAC 00000100100111011100000000100000
DATAC 01000100010000000000000100000000
DATAC 01111100110000000000000000100000
DATAC 01111011100111111111111100000000
DATAC 01000011001000000011111111100000
SETDATA 0 3 2
MATH 4 2 0
IFJMP 2 2 1
JMP 1 5
GETDATA 1 3 12
PMOV 15 1 0 1 0 0
SETDATA 0 3 1
GETDATA 0 3 0
SETDATA 0 3 9
MATH 1 0 5
JMP 1 5
NILLIST 3
NILLIST 84
DATAC 11111100000000000000000000000000
DATAC 00000000100000000000000000000000
DATAC 00000100000000000000000000000000
DATAC 11111111100000000000000000000000