I'll definitely be interested to hear about what you come up with!
Lots of possible tradeoffs depending on exactly what your needs are.
In general, when speed is of the utmost importance, macros and unrolled loops are your friends. One nice thing about this on XO-Chip: while some of the routines above take up a lot of memory, none of them contain any branches. Only the entry point has to be reachable by a 12-bit call, so a routine could start at the very last address of the low 4k of address space, run on into high memory, and consume only one byte of precious "code RAM".
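To make that concrete, here is a rough Octo sketch of the idea. The routine `times-eight` and the macro `double-v0` are made up for illustration and are not among the routines above; it also assumes your assembler and interpreter are set up for XO-Chip so that code may extend past 0xFFF.

```
# Hypothetical example: multiply v0 by 8 with an unrolled, branch-free body.
:macro double-v0 { v0 += v0 }

: main
  v0 := 3
  times-eight      # call; the target address must fit in 12 bits
  loop again

# Start at the last address a call can reach. The first instruction
# occupies 0xFFF and 0x1000, so only one byte sits in the low 4k.
:org 0xFFF
: times-eight
  double-v0
  double-v0
  double-v0
  return
```

This only works because nothing inside the routine needs to jump to an address within it; the `return` at the end goes back through the call stack rather than to an encoded address.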