Regarding the RAM being useless: yeah, for a while now I've actually been considering it a place that is primarily for code, and not very useful for data. You may have noticed that I've been using the disk even for temporary runtime data (e.g. in Snake) - that wasn't _just_ because of the size...
(A disclaimer: I haven't actually done any real work on/with compilers, so much of this post is really just guesswork and assumptions - since you've apparently been researching it, you might already know better.)
I noticed an implicit assumption in your function implementations: that the arguments are always in RAM, never in registers.
However, I think one of the things modern compilers usually spend some effort on is keeping as much as possible in registers (and keeping track of what is in which register), to avoid having to access RAM (which is usually much slower than the registers).
Obviously, this is limited by the registers usually being far less numerous than the available RAM (and sometimes by other constraints), but IIUC things like local variables in particular are often kept purely in registers.
This suggests that it may be better to think of x=y as a sequence of several operations:
- loading the address of y (if necessary)
- loading the value of y from that location (if necessary)
- setting x to y (this happens in the registers)
- loading the address of x (if necessary)
- storing the value of x (if necessary) into that location.
so that the compiler can skip the ones that are not necessary in that particular situation (e.g. "x=y;z=y;" doesn't need to re-load y).
This would also simplify the implementation of e.g. x=y+z, since it would do the same first two steps for y, then again for z in exactly the same way (no real difference in implementation, just different register assignments and pointer locations) - only the middle step is different.
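To make that more concrete, here's a rough sketch in Python (not anything TC-06-specific - all the names like lower, LOAD, ADD and STORE are made up purely for illustration) of a compiler lowering assignments into those separate steps, skipping loads for values it knows are already in a register:
def lower(statements):
    regs = {}      # variable name -> register currently holding its value
    code = []
    next_reg = 0

    def reg_for(var):
        # "loading the value of y from that location (if necessary)"
        nonlocal next_reg
        if var not in regs:
            regs[var] = next_reg
            code.append(f"LOAD  r{next_reg}, [{var}]")
            next_reg += 1
        return regs[var]

    for dest, op, *srcs in statements:
        if op == "copy":                             # x = y
            result = reg_for(srcs[0])
        elif op == "add":                            # x = y + z
            left, right = reg_for(srcs[0]), reg_for(srcs[1])
            result = next_reg
            next_reg += 1
            code.append(f"ADD   r{result}, r{left}, r{right}")
        regs[dest] = result                          # dest now lives in a register too
        code.append(f"STORE r{result}, [{dest}]")    # "storing the value of x"
    return code

# "x=y;z=y;" - note that y only gets loaded once:
print("\n".join(lower([("x", "copy", "y"), ("z", "copy", "y")])))
(Of course a real compiler also has to handle running out of registers, what happens when y changes after x=y, and so on - this is just the skeleton of the idea.)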
I'll note that some of this probably depends on the ISA - e.g. modern x86_64 I think has a lot of instructions that work on data in RAM without needing separate load/store operations, and I think its calling conventions generally use RAM in the form of a stack, which it has instructions to manipulate directly. I believe those are still slower than the register equivalents, though. (Oh, and those calling conventions don't really apply to the functions we're talking about here, since these are at a rather lower level.)
As such, it may be more useful to look at what operations the compiler expects to be given when implementing a RISC-style architecture that doesn't have those features, since the TC-06 doesn't either (except for a few special cases). I suspect that it may look more like what I suggested above, with separate operations for load/store and for the actual action (set/add/etc).
Of course, I could be wrong - for all I know the compiler might optimize by parsing the resulting code and deduplicating the load/stores somehow - but I think that would require rather deeper information about what each of the instructions/operations does, which the compiler wouldn't be given simply by implementing those functions you mentioned.
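For what it's worth, that alternative could look roughly like this (again just illustrative Python, not how any real compiler does it): a pass over the already-generated load/store sequence that drops a LOAD when it can prove the register still mirrors that address - and the "else" branch is exactly where the deeper knowledge would be needed:
def dedup_loads(instructions):              # instructions are (op, reg, addr) tuples
    known = {}   # register -> RAM address whose value it currently mirrors
    out = []
    for op, reg, addr in instructions:
        if op == "LOAD":
            if known.get(reg) == addr:
                continue                    # register already holds that value, skip
            known[reg] = addr
        elif op == "STORE":
            # the memory at addr changes, so other registers mirroring it are stale
            known = {r: a for r, a in known.items() if a != addr}
            known[reg] = addr
        else:
            # For anything else, the pass would need to know exactly which registers
            # and addresses the instruction can change (the "deeper information"
            # above); here we just assume it only writes to reg and forget that one.
            known.pop(reg, None)
        out.append((op, reg, addr))
    return out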
Regarding MOVR, I'll note that it makes it possible to write a bootloader that doesn't use self-modifying code (and is smaller and faster):
SET 15 3 1
GETDATA 1 3 0
MATH 1 3 5
JMP 3 6
NILLIST 22
GETDATA 1 3 3
MATH 15 3 1
MOVO 1 0
MOVR 0 0 3
IFJMP 2 4 3
JMP 1 0
Other than that meta-point, though, I don't think it lets you do anything technically new, unless the only reason you couldn't do it before was lack of space or execution time.
So, if you consider the need for self-modifying code (or the space/time restrictions) as central to Senbir, you may want to avoid adding it.
On the other hand, if you don't, there are other variants that would be far more useful - e.g. MOVI/MOVO variants that take the memory address from a register instead of an immediate (or even better, accept both at the same time) - so you might still not want to add MOVR itself, depending on just how easy you want these things to be.
(I think the decision about "should" is more up to you than anyone else, since it's your game. Especially the decision about how far to go, since it doesn't have to be all or nothing.)
In some ways, OFST is a similar change, in that it lets you do much the same thing as before but far more easily, with much less code modification (you could have achieved the same thing by modifying the MOVI/MOVO instructions it affects) - with the exception of the meta-point that it allows you to run self-modifying code that wasn't designed to allow such offsets (except in its JMPs).
I'll also note that I'm not aware of any other processor with an instruction quite like OFST, and I think you might be starting to run into (one of) the reason(s) for this here: it is global state that significantly affects other instructions, in such a way that you need to be aware of it and turn it off/on often - and depending on what you're doing, even that may not be enough.
Consider, for example, if OFST is used by the OS to set up an area for the program to work within, and the compiler thus generates variable addresses within that area (instead of truly global ones, which might be what it's expecting), and then OFST is also used for these internal functions like you've done here... Then the current implementation of turning it off wouldn't be enough, since it would actually need to restore the previous offset to end up at the correct address. And if the program itself also uses OFST internally for other things, it gets even more complicated, and/or it becomes necessary to carefully track the stack of offsets.
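In other words, something like this (Python pseudocode, purely illustrative - set_ofst is a made-up stand-in for however the offset actually gets changed):
def set_ofst(value):
    print("OFST", value)        # stand-in for actually emitting/executing OFST

offset_stack = [0]              # whatever offset is currently in effect

def enter_region(base):
    # e.g. the OS setting up the program's area, or a function its own scratch area
    offset_stack.append(base)
    set_ofst(base)

def leave_region():
    offset_stack.pop()
    # Setting it back to 0 is only correct at the outermost level; everything
    # else has to restore whatever offset its caller was using, or the caller's
    # addresses end up pointing at the wrong place.
    set_ofst(offset_stack[-1])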
(I'll note that modern systems do tend to have something kind of similar, but a bit more flexible and yet in some ways simpler - namely an MMU: a device that sits between the processor and the RAM, remapping all the addresses, so that all the programs the OS loads can use the same addresses without conflicting with each other or with the OS.)
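Conceptually it's roughly like this (illustrative Python again; real MMUs typically remap in pages via tables rather than a single base/limit, but the effect is similar):
class SimpleMMU:
    def __init__(self, ram, base, limit):
        self.ram = ram
        self.base = base        # set up by the OS for each program it loads
        self.limit = limit      # size of the area the program is allowed to use

    def translate(self, addr):
        if addr >= self.limit:
            raise IndexError("access outside the program's own area")
        return self.base + addr # the program's address 0 is really base+0, etc.

    def read(self, addr):
        return self.ram[self.translate(addr)]

    def write(self, addr, value):
        self.ram[self.translate(addr)] = value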
Regarding making it a separate ISA/processor you can select in a custom mode, I think that's a pretty clean solution, especially if you want to maintain the basic nature of the original challenge in the default modes while also allowing more complex things (like an OS) to be more easily built in a custom mode. Allowing a choice here would also let you push at least some of this decision to the player, so that the "should" is up to them and what they want.
(In some ways you've already done that, except only allowing users to switch processor by starting a different version of Senbir itself, since newer versions of Senbir have a CPU with extra features...)
To be honest, I've actually been intending to do pretty much exactly that (custom ISAs/CPUs) in my own simulator (once I've gotten the default mode simulation working right - it currently has some significant bugs I want to fix first, like using only unsigned numbers instead of signed ones).
A while back I even designed a different ISA that I think allows all the same things the TC-06 does (except UTL, since that didn't exist yet), but without using sub-codes, and that is also much more flexible/easier to use (and maybe to implement, though I haven't actually tried to do that yet). There are probably some details I missed or could improve, but IIRC it seemed pretty nice. It was quite different in several ways, though.
By the way, out of curiosity, does the TC in TC-06 stand for anything?