It's not a small feat? Aw, but I tried to make it as small as I could... ;)
As for efficiency, well, mine might be more space efficient than yours regarding image size (though not in program size), but it wouldn't take much to make yours far more time efficient (as in fast). Simply remove the MOVI/PMOV/MOVO triplet and replace the IFJMP/JMP combo with the opposite IFJMP, and it would be down to a very tight 4 instructions per pixel. (To be honest, I've been wondering if this easy optimization opportunity was left in as an easter egg.) I don't think anything tighter (aka faster) than that is possible with the current instruction set, at least without ending up in an infinite loop (which might also cause a black pixel at 0,0). And as a bonus, yours can do animations. All that, in just 11 instructions (if you also move the DATACs). And unlike mine it doesn't assume the initial values of the registers - if it did, it could be just 8.
Besides, while your image format is larger than mine, I don't think it is anywhere near as bad as the image format I came up with in my youth. This was in the DOS era, on an 80286 I believe (16-bit real mode), and I had discovered the (probably 0xA000) memory segment for the video memory, but didn't really understand how it worked. I had library routines that let me draw to the screen, but I wanted to save the drawn image to disk and then load it again later. The obvious way was to do a simple memory dump/restore. I don't remember the exact details, but I think that the image didn't fit inside the offsets of that first segment, so I had to loop through both the segments and offsets until I had enough data for the entire image. So far so good, but in my ignorance I looped through the full 16-bit range of the offsets, for each of the relevant segments. What I hadn't realized is that the segment size was 16, not 64K. Which means that the segment:offset address A000:0010 refers to the same memory location as A001:0000 does. In other words, there ended up being a lot of redundant information in that image file format, and for no good reason... (These days, I'm not sure whether to be more embarrassed or amused.)
The idea about loadable levels is a good one. It also gave me a kinda similar idea: loadable peripherals (to be attached to SETDATA/GETDATA ports). Using something like embedded Lua, it could be possible for people to write their own peripherals (could even have an in-game editor if you wanted to), that could do nearly anything - expanded memory, a coprocessor (8087 anyone?), a port multiplexer, even some kind of networking (fake or real). Imagine a custom mode where you have two (or more) TC-06 computers, and a homegrown peripheral for networking them together. :D (Oooh.... Or a multi-processor machine. Imagine the chaos (and obfuscated code) you could generate by "self"-modifying the code the other CPUs are running, in realtime... :) )
I'm not sure if such custom peripherals could provide "physical" objects for the 3D environment, for user interaction, as I don't know enough about gamedev/3D/etc. to say whether it would be realistic to do that via such a scripting language - but many possible peripherals wouldn't really need that anyway. Of course, if it can be done... virtual blinkenlights anyone? :)
Those peripherals wouldn't necessarily be intended to make the player's life easier either - someone could e.g. make a disk peripheral that it takes a few cycles to read from or write to, like it does for real disks, and then the player has to handle that response time in their code to succeed on that (custom) level.