Thanks for your suggestions.
Why "that many registers"? I only have two. They are in the SWAPR4. (Btw, all the 4s refer to word size.)
I don't use a main memory; instead I have two stacks (+1 for consts) and one code memory (in CMD4). If the instruction word is only 4 bits, you don't want to waste many of them on immediates for offsets, nor, with only 16 instructions, do you want two-word instructions. I don't understand what you wanted to imply regarding caches: surely in this simulator, and at this level of CPU, memory access latency plays no role (compare with CPUs from "the early days" like the 6502).
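To make the idea concrete, here is a rough Python sketch of a machine with two data stacks, a constants stack and one-word 4-bit instructions. It is not the actual design: the opcode names and numbers (PUSHC, SWAP, ADD, TO_B, TO_A, HALT) and the run() function are made up for illustration only.

```python
# Hypothetical 4-bit opcodes, invented for this sketch (not the real CMD4 encoding).
PUSHC, SWAP, ADD, TO_B, TO_A, HALT = 0x0, 0x1, 0x2, 0x3, 0x4, 0xF


def run(code, consts):
    """Run a list of 4-bit opcodes on two data stacks plus a constants stack."""
    a, b = [], []            # the two data stacks
    consts = list(consts)    # constants stack; its top is the end of the list
    pc = 0
    while pc < len(code):
        op = code[pc] & 0xF  # every instruction is exactly one 4-bit word
        pc += 1
        if op == PUSHC:      # operand comes from the constants stack, not an immediate
            a.append(consts.pop() & 0xF)
        elif op == SWAP:     # exchange the two top entries of stack a
            a[-1], a[-2] = a[-2], a[-1]
        elif op == ADD:      # 4-bit addition, wraps around at 16
            y, x = a.pop(), a.pop()
            a.append((x + y) & 0xF)
        elif op == TO_B:     # move top of a to b
            b.append(a.pop())
        elif op == TO_A:     # move top of b to a
            a.append(b.pop())
        elif op == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return a, b
```

Because constants come from their own stack, no instruction needs an immediate field, so every opcode fits in a single 4-bit word.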
Yes, it's not a "classic" architecture, but I think it might be Turing complete (assuming the stacks and program RAM were infinite ;) ). I went for this arch because I wanted to test the individual chips well before putting them together (editing is not possible atm).
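The usual argument for why two stacks can be enough: together they can act as a Turing machine tape, with one stack holding everything left of the head and the other holding the head cell and everything to its right. A minimal Python sketch of just that tape trick, assuming unbounded stacks (the class and method names are made up for illustration and have nothing to do with the actual chips):

```python
class TwoStackTape:
    """A Turing-machine tape built from two stacks (illustrative sketch)."""

    def __init__(self, cells, blank=0):
        self.blank = blank
        self.left = []                      # cells left of the head, nearest on top
        self.right = list(reversed(cells))  # head cell on top, then cells to its right
        if not self.right:
            self.right.append(blank)

    def read(self):
        return self.right[-1]

    def write(self, value):
        self.right[-1] = value

    def move_right(self):
        # The head cell goes onto the left stack; extend with a blank if needed.
        self.left.append(self.right.pop())
        if not self.right:
            self.right.append(self.blank)

    def move_left(self):
        # The nearest left cell becomes the head cell; blank past the left edge.
        self.right.append(self.left.pop() if self.left else self.blank)
```

Moving the head is just a pop from one stack and a push onto the other, which is the kind of operation a two-stack design has anyway.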
Merry Christmas! :)