A sprite system? Wow, that would be really cool, and very useful for not just games but also other things that use graphics, e.g. icon-using window systems, or even for rendering text (using a sprite per char).
Regarding speed/efficiency, hmm. Well, let's do some math and figure out what our theoretical limits are, as that will help to discard options that aren't actually possible.
With a 256x128 screen, there's 32768 pixels, so (using the current screen device API) that's the theoretically minimum possible cycles a full-screen refresh would take.
At 300kHz, that means a theoretical maximum full-screen FPS of barely above 9.
In practice, unless the program is simply a long list of SETDATA instructions, it'll need to use more cycles than that - even my single-color-only fast loop needs 2 cycles per pixel, plus some more for setup for each frame.
If we assume that it will need at least 3 cycles per pixel to draw an actual image (which might still be low-balling it even as a best case), that's already down to about 3 FPS max - and it still doesn't include any of the processing needed to generate that image (such as game logic or sprite data decoding).
So, if you need 10 FPS or more for the game to be playable, full-screen refresh is simply a no-go even in theory, given this (virtual) hardware.
Thus, any game for this system that needs a decent FPS would have to be designed to not require full-screen refresh.
This also means that the basic way of doing sprites - always starting with a base image and then layering the sprites on top, and then blitting the result to the screen - is also a no-go. Even doing it directly to the screen instead of to an off-screen buffer would still involve a full-screen refresh (and then some more).
We'll therefore have to use something that only draws the pixels it needs to, to update the screen. Given a sprite system, that's not actually a major hindrance, since it would draw only the pixels of the sprites anyway, though we'll need to add erasing the previous sprites to that (or if the sprites move slowly, at least the parts of them that won't be drawn on top of.)
So. How many pixels can we actually do per frame, then?
Well, if we want 60 FPS, we only have 5k cycles per frame, so max 5k pixels - or somewhat more realistically, 5k/3 < 1667 pixels (which is less than a 41x41 square) - minus setup and game logic. So, at most ~5% of the screen can be updated each frame, even with that theoretical high-efficiency code.
At 20 FPS, we have 15k cycles per frame, which by the same logic gives us max 5k pixels (less than a 71x71 square), or ~15% of the screen, in that same very optimistic case.
Note that those numbers include any pixels that need to be erased (to show the background) because a sprite was moved away from there, and do not account for needing to handle transparency, do game logic, or any other overhead. So it's likely to be significantly less in practice.
So really, large moving objects (drawing over ~15% of screen size) at even 20 FPS are not really a matter of efficiency - they're simply not really possible on this hardware. Scrolling likewise unless it can be restricted to the same number of pixels actually changing. (I'll note that on some game machines, sprites/scrolling/etc. were implemented in hardware.)
Now, can we get around that limit?
Well, since the limit is on the number of pixels drawn, not where on the screen those pixels are, it might (depending on game and graphics style) be possible to have moving objects that cover a larger screen area - as long as only parts of it actually need to be redrawn for each frame.
E.g. if the large sprite is an enemy, if it's designed as a blob, with most of its internals being the same color, and the game knows this, then it can draw the whole thing once and then later update only its edges instead of the whole thing. (I actually did that in Snake, only drawing each end of the snake instead of the whole thing.)
I think that requires using special-case code instead of a generic sprite renderer, though, since the game needs to know about it to know that it can do it. Well... unless you encode that movement in the sprite data somehow and have the renderer use that for partial updates. But then I'm not sure if it's really just a sprite anymore... Although, I suppose it could be done indirectly by having several sprites - one for the initial rendering, then others for per-frame movement that only draw what needs to be redrawn for that movement. Though you'd still have to encode erasure areas somehow, unless the background is always the same color...
Regarding your algorithm pseudocode, I haven't studied it in detail, but noticed that it essentially has an inner loop for a small fixed-size set of elements (the 4 pixels per data line), on a system where inner loops (or multiple checks) is expensive (due to juggling of R2/R3), and there actually is a decent amount of RAM to put code in.
So, one easy way to optimize that would be to unroll that inner loop, instead having a copy of the inner drawing code for each of those 4 pixels. This saves both that R2/R3 juggling, and needing to calculate which bits to copy out (though that could probably be done by a bit shift instead), thus speeding things up - though at the cost of the code being longer. I do admit it limits us to sprite sizes that are multiples of 4, at least in the X direction, but I doubt that's much of a problem in practice.
However, your idea of instead using a list of pixels (including X/Y/C for each) would probably be even faster, not just due to skipping the transparent ones entirely, but because I think it would need less processing per pixel. With your proposed format, I'd say 6 cycles per pixel, plus 2 per data line, so 14 per 2 pixels (plus some initial setup per sprite, but I assume that's true for any method). Some games might shorten it by a couple cycles, though (i.e. if sprites can only be placed at multiples of their size).
setup:
R2 = addr of first image word
R3 = addr of last image word + 1
R4 = X/Y position of sprite, encoded as pixel location for sprite's (0,0)
loop:
R1 = read(R2)
R2++
two copies of this code, one for each pixel:
R0 = 0
PMOV color bits from R1 to R0
PMOV X position from R1 to R0
PMOV Y position from R1 to R0
R0 += R4
draw(R0)
IFJMP to loop if R2 < R3
If even more speed is needed, and disk space can be sacrificed to gain it, we can easily get down to 4 cycles per pixel (including the data line reading) by using the very simple one-pixel-per-data-line format of your win screen image, and just adding the sprite location to each pixel (like above).
If there isn't enough disk space to store all the sprites that way, but only a few of the sprites are needed at the same time (e.g. different sprites on different levels), then one option to consider would be caching - setting aside an area of disk for the currently-used sprites, and decode from a compressed image format into that area before starting the level. This would increase the loading time of the level, but in return make it render faster, so you can do more while in the level.
It's actually possible to optimize this down to 7 cycles per 2 pixels if the screen size and color depth were changed to something that fits in 16 bits, as then it would be possible to just shift the second pixel into place with one PMOV instead of having to read and increment. But, that would mean making the screen smaller and having fewer colors, and if we do that, many of the calculations/conclusions above would have to be redone (since the hardware is different).
One thing I'll note is that none of these options have actually managed to reach 3 cycles per pixel, which suggests that the calculations above are actually on the optimistic side.
If we redo with 4 cycles per pixel, then 20 FPS works out to 15k/4 = 3750 pixels (less than a 62x62 square), or ~11.4% of the screen. Again, minus setup, game logic, etc.
Assuming your sprite is representative with drawing 72 of 128 possible pixels in its grid (57% fill), and assuming just 4 cycles of setup for each sprite (to make the math easier), that suggests a max of 3750/73 ~= 51 sprites can be drawn per frame, using the most time-efficient option we know of for this hardware.
Even if we have to erase all of them first, that would still leave us with 25 sprites per frame, which isn't too bad - many games can be done within that.
(Though the game logic will probably take away a good chunk of that, making it fewer, but maybe not all of the sprites need to be moved/redrawn every frame, either, which can compensate.)