I don't remember seeing any posts specifically about RISC-06, but I may just have missed it.
I do remember us chatting about the possibility of other CPUs though, and seem to remember some chat about RISCy ones as part of that.
I think I mentioned that I had designed a RISCy ISA that encompasses the functionality of the original TC-06 instruction set - but I don't think I posted it. It's not perfect by any means, but just in case it might give you any useful ideas (seeing as you're apparently building something like that), here it is (well, I ended up improving it a bit before posting it, but it's more or less what I had):
R0     = 0  : writes are ignored, reads always return 0
              while this could be just a convention, we're enforcing it to be sure
              (IRL it's (in part) to avoid spending chip area on the memory for it)
R1     = PC : program counter, aka instruction pointer (IP)
              the address of the next instruction to be executed
              incremented before executing the current instruction (easier that way)
R2-R15      : general purpose, for use by programs
0000: NOP    zero28
0001: HLT    reg4 imm24                     // HLT src ticks
                                            // wait (src + ticks2) cycles
                                            // 0 means forever
0010: LOAD   reg4 reg4 imm20                // LOAD dst addr ofs
                                            // dst = mem[addr + ofs]
0011: STORE  reg4 reg4 imm20                // STORE src addr ofs
                                            // mem[addr + ofs] = src
0100: JMPEQ  reg4 reg4 reg4 imm16           // JMPEQ src1 src2 addr ofs
                                            // if src1 = src2
                                            // then R1 = addr + ofs
0101: JMPGT  reg4 reg4 reg4 imm16           // JMPGT src1 src2 addr ofs
                                            // if src1 > src2
                                            // then R1 = addr + ofs
0110: ADD    reg4 reg4 reg4 imm16           // ADD dst src1 src2 val
                                            // dst = src1 + (src2 + val)
0111: SUB    reg4 reg4 reg4 imm16           // SUB dst src1 src2 val
                                            // dst = src1 - (src2 + val)
1000: MUL    reg4 reg4 reg4 imm16           // MUL dst src1 src2 val
                                            // dst = src1 * (src2 + val)
1001: DIV    reg4 reg4 reg4 imm16           // DIV dst src1 src2 val
                                            // dst = src1 / (src2 + val)
1010: REM    reg4 reg4 reg4 imm16           // REM dst src1 src2 val
                                            // dst = src1 % (src2 + val)
                                            // This is remainder, not
                                            // modulo, in both C# and JS
1011: RNG    reg4 reg4 reg4 imm16           // RNG dst min max val
                                            // dst = random(min, max + val)
1100: PMOV   reg4 reg4 imm5 imm5 imm5 imm5  // PMOV dst src destB
                                            //      fromB endB rotB
1101: PSET   reg4 imm5 imm4 imm1 imm14      // PSET dst dstBit numBits
                                            //      clearRest valueBits
1110: DLOAD  reg4 reg4 reg4 imm8 imm8       // DLOAD dst dev addr ofsD ofsA
                                            // dst = device[dev + ofsD]
                                            //       .value[addr + ofsA]
1111: DSTORE reg4 reg4 reg4 imm8 imm8       // DSTORE src dev addr ofsD ofsA
                                            // device[dev + ofsD]
                                            //   .value[addr + ofsA] = src
Note that "ticks", "val", "ofs", "ofsD" and "ofsA" can be negative (using two's complement over their field's width).
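To make the field widths and the two's complement note a bit more concrete, here's a rough Python sketch of decoding the "reg4 reg4 reg4 imm16" format (the one ADD through RNG use). Since nothing is implemented yet, the details are my assumptions for illustration: the opcode sits in the top 4 bits, the fields follow from most to least significant in the order listed, and the names sign_extend and decode_alu are made up.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def decode_alu(word):
    """Decode the 'op reg4 reg4 reg4 imm16' format.

    Assumption: opcode in the top 4 bits, then the fields from most to
    least significant in the order they appear in the listing.
    """
    return {
        "op": (word >> 28) & 0xF,
        "dst": (word >> 24) & 0xF,
        "src1": (word >> 20) & 0xF,
        "src2": (word >> 16) & 0xF,
        "val": sign_extend(word & 0xFFFF, 16),
    }


# ADD R2 R3 R0 -1  ->  R2 = R3 + (R0 + -1), i.e. a decrement without
# needing a constant in a register.
word = (0b0110 << 28) | (2 << 24) | (3 << 20) | (0 << 16) | 0xFFFF
fields = decode_alu(word)
print(fields)
```

The same sign_extend helper would work for the 20-bit, 24-bit and 8-bit fields by passing a different width.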
(Also note that I haven't implemented this ISA anywhere, just designed it - no code currently exists for it AFAIK. I might or might not actually implement it, or something like it, eventually.)
JMPEQ can also act as an unconditional JMP, by using the same register as both sources. In addition, you can do an unconditional jump simply by setting R1 using any of the instructions that modify a register value. Jumps can be relative by using R1 as part of the target address calculation, or absolute by using something else (e.g. R0).
JMPGT can also perform less-than, simply by swapping the registers.
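Here's a small Python sketch of those jump tricks, just to show they fall out of the two instructions as described. It's a toy model, not an implementation: the jump function and its argument order are made up, and I'm assuming register values compare as unsigned 32-bit numbers (the design above doesn't say either way).

```python
MASK32 = 0xFFFFFFFF


def jump(op, regs, src1, src2, addr, ofs):
    """Apply JMPEQ or JMPGT to a 16-entry register list; return True if taken.

    Assumption: registers hold unsigned 32-bit values and the comparison
    is unsigned.
    """
    a, b = regs[src1], regs[src2]
    taken = (a == b) if op == "JMPEQ" else (a > b)
    if taken:
        regs[1] = (regs[addr] + ofs) & MASK32
    return taken


regs = [0] * 16
regs[1] = 100           # R1 already points at the next instruction
regs[2], regs[3] = 5, 9

jump("JMPEQ", regs, 0, 0, 1, -4)            # unconditional, relative to R1
after_relative = regs[1]
jump("JMPEQ", regs, 0, 0, 0, 40)            # unconditional, absolute via R0
after_absolute = regs[1]
taken_lt = jump("JMPGT", regs, 3, 2, 0, 7)  # "if R2 < R3" by swapping sources
print(after_relative, after_absolute, taken_lt, regs[1])
```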
DLOAD/DSTORE is mostly equivalent to GETDATA/SETDATA, although only the extended form is supported - but a device can ignore any part of the write it wants to, or even combine the fields. Worth noting is that this CPU in theory supports 2^32 devices, though only the first and last 128 are accessible without setting a register (using R0 as dev, the signed 8-bit ofsD reaches 0 to 127 directly and wraps around to the top 128 for -128 to -1). In practice, there are probably far fewer devices actually available.
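The "first and last 128" claim is easy to check with a few lines of Python. This assumes the device index is dev + ofsD wrapped to 32 bits, which is my reading of the 2^32 figure; device_index is a made-up helper name.

```python
MASK32 = 0xFFFFFFFF


def device_index(dev_value, ofs_d):
    """Device selected by DLOAD/DSTORE: dev + ofsD, wrapped to 32 bits."""
    if not -128 <= ofs_d <= 127:
        raise ValueError("ofsD must fit in a signed 8-bit field")
    return (dev_value + ofs_d) & MASK32


# With dev = R0 (always 0), only the immediate selects the device.
reachable = sorted(device_index(0, ofs) for ofs in range(-128, 128))
print(reachable[:3], [hex(d) for d in reachable[-3:]])
```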
I originally made MUL have two destination registers, one for the high bits of the result and one for the low bits, since a multiplication can end up needing that many - but I ended up deciding not to do that, since that makes it harder to implement, and I'm not doing anything like that for the other operations that can overflow. (In JS I can safely do up to 53-bit integers IIRC, but this would need 64 bits.)
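For what it's worth, here's what the two-register version would have computed, next to the truncating version, as a quick Python sketch (Python integers don't overflow, so the full product is easy to show). The function names are made up, and truncating to the low 32 bits is my assumption about what the single-destination MUL would do.

```python
MASK32 = 0xFFFFFFFF


def mul_hi_lo(a, b):
    """Full 64-bit product of two 32-bit values, split into (high, low) words."""
    product = (a & MASK32) * (b & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def mul_truncated(a, b):
    """Single-destination MUL, assuming it simply keeps the low 32 bits."""
    return (a * b) & MASK32


hi, lo = mul_hi_lo(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(hi), hex(lo), hex(mul_truncated(0xFFFFFFFF, 0xFFFFFFFF)))
# The worst-case product needs all 64 bits, which is more than the 53 bits
# a JS number can hold exactly.
print((0xFFFFFFFF * 0xFFFFFFFF).bit_length())
```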
This PMOV is similar in function to TC-06's PMOV, but has a different API and an additional feature: rotation of the bits being acted upon. Basically, take the bits fromB to endB from src, rotate those bits by rotB, and then insert them into dst starting at bit destB. (The shift-right-or-left argument isn't necessary since shift-left-31 and shift-right-1 are the same thing due to the wrapping.)
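Since the description leaves some details open, here is one possible reading of PMOV as a Python sketch, not a spec. All of the following are assumptions made for illustration: bit 0 is the least significant bit, fromB to endB is an inclusive range with fromB <= endB, the rotation is a left-rotate within the extracted field (not within the whole 32-bit word), and anything inserted past bit 31 is dropped rather than wrapped.

```python
MASK32 = 0xFFFFFFFF


def pmov(dst, src, dest_b, from_b, end_b, rot_b):
    """One possible PMOV: extract src[from_b..end_b], rotate, insert at dest_b.

    Assumptions: bit 0 is the LSB, the range is inclusive, rotation is a
    left-rotate within the extracted field, overflow past bit 31 is dropped.
    """
    width = end_b - from_b + 1
    field_mask = (1 << width) - 1
    field = (src >> from_b) & field_mask
    r = rot_b % width
    field = ((field << r) | (field >> (width - r))) & field_mask
    insert_mask = (field_mask << dest_b) & MASK32
    return (dst & ~insert_mask & MASK32) | ((field << dest_b) & insert_mask)


# Take the low byte 0xAB, rotate it by 4 (giving 0xBA), and place it in
# bits 8..15 of an all-ones destination.
result = pmov(0xFFFFFFFF, 0x000000AB, dest_b=8, from_b=0, end_b=7, rot_b=4)
print(hex(result))
```

If the rotation is instead meant to act on the whole 32-bit word, only the two lines that compute r and rotate the field would change.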
PSET is similar to TC-06's SET, but can set more or fewer than 8 bits at a time, at a bit position that is not a multiple of 8, and can optionally clear the rest of the bits (set them to 0).
I'm not really sure what to do with numBits = 0 and numBits = 15. I've been considering making one of them have the instruction ignore the clearRest and valueBits fields, and instead act as a LOAD of the next instruction word, followed by skipping that word instead of executing it. That could be very useful for keeping data close to where it's used without needing an explicit jump, but it kind of breaks the principle of least surprise.
I suppose numBits = 15 could make it copy the clearRest bit as if it were part of the valueBits, or pretend the valueBits had a 15th bit that is always zero. Not sure which makes more sense.
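For the cases that are decided (numBits from 1 to 14), PSET could look something like this in Python. It's a sketch under assumptions: bit 0 is the least significant bit, the low numBits bits of valueBits are the ones used, and the function name and argument order are made up. The two undecided cases are deliberately rejected rather than guessed at.

```python
MASK32 = 0xFFFFFFFF


def pset(dst, dst_bit, num_bits, clear_rest, value_bits):
    """Set num_bits bits of dst, starting at dst_bit, from value_bits.

    Assumptions: bit 0 is the LSB and the low num_bits bits of value_bits
    are used. numBits = 0 and numBits = 15 are left undefined here.
    """
    if not 1 <= num_bits <= 14:
        raise ValueError("numBits 0 and 15 are not defined in this sketch")
    field_mask = (1 << num_bits) - 1
    placed_mask = (field_mask << dst_bit) & MASK32
    placed = ((value_bits & field_mask) << dst_bit) & MASK32
    base = 0 if clear_rest else (dst & ~placed_mask & MASK32)
    return base | placed


keep = pset(0xFFFFFFFF, dst_bit=4, num_bits=3, clear_rest=0, value_bits=0b101)
clear = pset(0xFFFFFFFF, dst_bit=4, num_bits=3, clear_rest=1, value_bits=0b101)
print(hex(keep), hex(clear))
```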
This ISA still lacks CALL/RET instructions, PUSH/POP, bitwise operations (AND etc.), and probably other things - but it does everything the TC-06 does and more (sometimes in more instructions, other times in fewer), in 15 instructions without subcodes. One of the nice things about it is that you generally don't have to keep small constants in registers (no more R15 = 1) since you can usually use the immediate values for that. It also makes it easy to use relative addressing (whether for load/store or jumps) since you always have the current instruction's address easily available.
Yeah, the emulator has no real functionality in Default mode, which is why I said technically - in practice, I agree that it really doesn't fit, since the data that ends up out of bounds is required. And that's even assuming Senbir didn't fail when trying to load it onto the disk in the first place, which like you said it probably does.
Re: swap, that's one of the things an MMU is usually used to implement. The MMU gives you the dynamic address remapping, and notifies the kernel when the program attempts to access a virtual memory area that isn't currently in RAM (a page fault). The kernel then decides what to do about it - if it's an area that is swapped out, it loads the data for that area from disk, updates the address remapping accordingly, and returns control to the program. (This might require first swapping something else out to make room in RAM.)
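In very rough Python terms, the swap-in path looks something like the sketch below. Everything here is a toy made up for illustration (TinyMMU, PageFault, the 16-word page size, a dict standing in for the swap area on disk), and it leaves out the swapping-out part entirely by assuming a free frame is always available.

```python
PAGE_SIZE = 16  # words per page, arbitrary for this sketch


class PageFault(Exception):
    def __init__(self, page):
        super().__init__(f"page {page} is not in RAM")
        self.page = page


class TinyMMU:
    def __init__(self):
        self.table = {}  # virtual page number -> physical frame number

    def translate(self, vaddr):
        page, offset = divmod(vaddr, PAGE_SIZE)
        if page not in self.table:
            raise PageFault(page)  # this is what notifies the "kernel"
        return self.table[page] * PAGE_SIZE + offset


def read(mmu, ram, disk, free_frames, vaddr):
    """Program-side read; the except branch plays the kernel's role."""
    try:
        paddr = mmu.translate(vaddr)
    except PageFault as fault:
        frame = free_frames.pop()          # no eviction in this sketch
        start = frame * PAGE_SIZE
        ram[start:start + PAGE_SIZE] = disk[fault.page]  # swap in
        mmu.table[fault.page] = frame      # update the remapping
        paddr = mmu.translate(vaddr)       # retry the access
    return ram[paddr]


ram = [0] * 32
disk = {3: list(range(100, 116))}          # swapped-out virtual page 3
mmu = TinyMMU()
value = read(mmu, ram, disk, [1, 0], 3 * PAGE_SIZE + 5)
print(value, mmu.table)
```

The program only ever sees virtual address 53; where the data physically lives is entirely the MMU's and kernel's business.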
As such, the overlay loader is kind of a poor-man's swap already - not having an MMU, it can't do the address remapping part, nor the automatic loading on out-of-bounds access, but it does do the swapping part. (Well, swapping in anyway - it doesn't swap out the old page to disk first.)
If it was given an appropriate MMU to work with, it could probably be extended into a swap-based OS that pretends to have more memory than it does. Like you said, only the swapping system needs to always be in memory, since it can also be used to load OS code when necessary - but you'd still need to have some RAM left over to put the swapped-in memory pages into, since the memory remapping still only allows access to the main RAM, not to other devices.
Well, unless you combined it with a different concept, namely a specific form of memory-mapped I/O. If you added support for that, and changed the disk device to support memory-mapping areas of the disk (making it pretend that that area of the disk is RAM), then you could avoid the need for the save/load step of regular swapping. But that's a separate feature that can be used even without an MMU (or any other address remapping), as a device typically has a fixed memory address range that it can use for such I/O, and programs could be written to use that range directly.
(... Heh. Bootloader that doesn't write to RAM: memory-map the appropriate disk area, and jump into it.)
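A minimal Python sketch of that memory-mapping idea, assuming a made-up Bus class that decides per address whether an access goes to RAM or to a window onto the disk. The window base address, the disk offset and all the names are hypothetical; nothing like this exists in the game as far as I know.

```python
class Bus:
    """Routes addresses either to RAM or to a memory-mapped disk window."""

    def __init__(self, ram, disk, window_base, disk_start, window_len):
        self.ram = ram
        self.disk = disk
        self.window_base = window_base  # first address of the mapped range
        self.disk_start = disk_start    # disk address the window begins at
        self.window_len = window_len

    def _disk_index(self, addr):
        if self.window_base <= addr < self.window_base + self.window_len:
            return self.disk_start + (addr - self.window_base)
        return None

    def read(self, addr):
        index = self._disk_index(addr)
        return self.ram[addr] if index is None else self.disk[index]

    def write(self, addr, value):
        index = self._disk_index(addr)
        if index is None:
            self.ram[addr] = value
        else:
            self.disk[index] = value


ram = [0] * 32
disk = list(range(1000, 1256))
bus = Bus(ram, disk, window_base=0x100, disk_start=64, window_len=16)

first = bus.read(0x100)   # comes straight from disk[64], no copy into RAM
bus.write(0x101, 7)       # lands in disk[65]
bus.write(3, 42)          # ordinary RAM access
print(first, disk[65], ram[3])
```

For the bootloader joke, instruction fetch would just go through the same read path with R1 pointing into the window.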
Regarding the "secret higher-resolution" thing, I didn't mean that it could go outside the 32 bits using extended SETDATA. What I meant was, the real hardware has a specific resolution on its monitor, while the emulated monitor (what the emulated program would see) would have a smaller resolution, so the higher actual resolution is hidden (a secret) from the emulated program.
(This is only for the hypothetical extended emulator, of course, none of this applies to the current version, since it doesn't use a smaller virtual screen like that.)
Now, a program running on the real hardware always has 32 bits per pixel of monitor, of which some are reserved for the position and some others determine the color, while the rest are ignored by the monitor but can be used to store data (as is suggested in the documentation).
As an example, let's say that the real hardware has a resolution of 32x16 with 4 colors (2 color bits). That means its pixel data looks like CCXXXXXYYYYAAAAAAAAAAAAAAAAAAAAA where each A bit is available to store any data without affecting the colors shown on the monitor.
Now, if the emulator provides a virtual monitor of 16x8 with 4 colors to the emulated program, so that it can show 4 of them at the same time on the real monitor, then the emulated program's pixel data looks like CCXXXXYYYAAAAAAAAAAAAAAAAAAAAAAA, which has one fewer each of X and Y, and two more of A.
The emulator then has to modify each monitor getdata and setdata, to map the emulated data to the real data.
For setdata, transforming emulated to real: insert r and b:
prog: CCXXXXYYYAAAAAAAAAAAAAAAAAAAAAAA
emul: CCrXXXXbYYYAAAAAAAAAAAAAAAAAAAAAAA
real: CCXXXXXYYYYAAAAAAAAAAAAAAAAAAAAAee
For getdata, transforming real to emulated: remove r and b:
real: CCrXXXXbYYYAAAAAAAAAAAAAAAAAAAAA
emul: CCXXXXYYYAAAAAAAAAAAAAAAAAAAAA
prog: CCXXXXYYYAAAAAAAAAAAAAAAAAAAAAmm
For both, r and b define which quadrant this virtual monitor is shown in.
Now, notice that the remapped pixel data for the emulated program has 2 bits too many for setdata (marked as e for extra), and 2 bits too few for getdata (marked as m for missing). Those bits must go somewhere and come from somewhere.
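The two transformations can be written out as bit operations. Here's a Python sketch for the 32x16 real / 16x8 virtual example, treating the leftmost character of the layouts as the most significant bit; the function names are made up, and returning the two extra bits separately (or passing the missing ones in) is just a way to make the problem visible.

```python
def prog_to_real(word, r, b):
    """setdata: insert r and b; returns (real word, the 2 bits that don't fit)."""
    color = (word >> 30) & 0x3
    x = (word >> 26) & 0xF       # 4 virtual X bits
    y = (word >> 23) & 0x7       # 3 virtual Y bits
    data = word & 0x7FFFFF       # 23 virtual A bits
    extra = data & 0x3           # the "ee" bits
    real = ((color << 30) | (r << 29) | (x << 25)
            | (b << 24) | (y << 21) | (data >> 2))
    return real, extra


def real_to_prog(real, missing=0):
    """getdata: remove r and b; `missing` fills the 2 "mm" bits."""
    color = (real >> 30) & 0x3
    x = (real >> 25) & 0xF       # drops r (bit 29)
    y = (real >> 21) & 0x7       # drops b (bit 24)
    data = real & 0x1FFFFF       # 21 real A bits
    return (color << 30) | (x << 26) | (y << 23) | (data << 2) | (missing & 0x3)


word = 0xDEADBEEF
real, extra = prog_to_real(word, r=1, b=0)
print(hex(real), extra)
print(hex(real_to_prog(real)), hex(real_to_prog(real, extra)))
```

Only when the extra bits are fed back in does the round trip give the program the word it originally wrote.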
So, the emulator now has an either-or choice:
- discard those bits on setdata, and set them to something arbitrary on getdata
- save those bits somewhere other than in the real monitor on setdata, and restore them from there on getdata
If it discards them, the emulated program might break, since it might be using the monitor to store important data (since it's documented that this works). This breakage may be rare, but it could happen, and would then be a bug in the emulator.
If it stores and restores them, however, then it would work correctly (without that bug), but it would have to use more memory (probably on disk), and the setdata/getdata operations would be slower since they have to maintain that additional memory area when working with the monitor. Which might make users complain about performance and memory use, especially if their programs don't use the monitor in that way anyway.
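To put a rough size on the store-and-restore option, here's a Python sketch with a made-up SideStore class that keeps the two spilled bits per real pixel. For the 32x16 example that's 512 pixels times 2 bits, so 1024 bits, or 32 words of 32 bits if packed tightly (this sketch uses a dict instead of packing, to keep it short).

```python
class SideStore:
    """Keeps the 2 bits per pixel that don't fit in the real monitor word."""

    def __init__(self):
        self.extra = {}  # (real_x, real_y) -> 2 spilled bits

    def on_setdata(self, real_x, real_y, prog_word):
        # Remember the low 2 bits of what the emulated program wrote.
        self.extra[(real_x, real_y)] = prog_word & 0x3

    def on_getdata(self, real_x, real_y, prog_word):
        # prog_word is the value rebuilt from the real monitor, whose low
        # 2 bits are meaningless; replace them with the saved ones.
        saved = self.extra.get((real_x, real_y), 0)
        return (prog_word & ~0x3) | saved


store = SideStore()
store.on_setdata(17, 9, 0xDEADBEEF)
restored = store.on_getdata(17, 9, 0xDEADBEEC)
untouched = store.on_getdata(0, 0, 0x12345678)
print(hex(restored), hex(untouched))

bits_needed = 32 * 16 * 2
print(bits_needed, bits_needed // 32)
```

Every monitor access now involves an extra lookup or update, which is where the performance cost comes from.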
Damned if they do, damned if they don't...