I've been using Google's Text-to-Speech API for the voices so far, but if I were starting from scratch I would almost certainly go with one of those voice-transforming filters that are popping up all over the place. They're a lot more flexible and better at getting the right intonation than trying to add SSML tags to the text.
This one seems OK. Definitely fun to play with, though I'm sure there are better ones out there: https://koe.ai/recast/
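
For anyone wondering what the SSML approach looks like in practice, here is a rough sketch of wrapping a line of dialogue in tags by hand. The helper name `ssml_line` and its parameters are made up for illustration and are not part of any library. `<speak>`, `<prosody>` and `<break>` are standard SSML elements, but this only builds the string and does not call any TTS service.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_line(text, rate="medium", pitch="+0st", pause_ms=0):
    """Wrap one line of dialogue in SSML prosody markup (hypothetical helper)."""
    # Escape the text so characters like & and < don't break the markup.
    body = (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )
    # Optionally add a pause after the line.
    if pause_ms > 0:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"


print(ssml_line("Who goes there?", rate="slow", pitch="-2st", pause_ms=300))
```

Every change in delivery means going back and tweaking those attributes by hand, which is the part that gets tedious compared to just running a performance through a voice filter.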