I'm not sure what you're asking tbh. Let me know if this doesn't answer your question...
It parses English text into an intermediate phoneme representation, then strings together the audio samples for each phoneme to construct words. The Dad bot's phonemes are sampled directly from an old DOS TTS program. Mom is a human voice with some audio processing. Baby uses the Dad phoneme bank, pitch-shifted and with additional processing.
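In case it helps, here's a rough Python sketch of the general idea (text to phonemes, then concatenating one clip per phoneme). It's not the actual code: the lexicon, the phoneme names, the sine-tone "recordings", the sample rate, and the 1.5 pitch factor are all made-up placeholders, and the pitch shift is just naive resampling.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed value for illustration

# Hypothetical word -> phoneme lexicon; a real system would use rules or a dictionary.
LEXICON = {
    "hi": ["HH", "AY"],
    "dad": ["D", "AE", "D"],
}

def tone(freq_hz, duration_ms=100):
    """Stand-in for a recorded phoneme clip: a short sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical phoneme bank: one audio clip (list of floats) per phoneme.
PHONEME_BANK = {
    "HH": tone(200),
    "AY": tone(300),
    "D": tone(150),
    "AE": tone(250),
}

def text_to_phonemes(text):
    """Parse text into the intermediate phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes

def synthesize(text, bank):
    """String together the clip for each phoneme, in order."""
    samples = []
    for phoneme in text_to_phonemes(text):
        samples.extend(bank[phoneme])
    return samples

def pitch_shift(clip, factor):
    """Naive pitch shift by resampling: factor > 1 raises pitch and shortens the clip."""
    n = int(len(clip) / factor)
    return [clip[min(int(i * factor), len(clip) - 1)] for i in range(n)]

# A second voice built from the same bank, pitch-shifted up.
BABY_BANK = {p: pitch_shift(clip, 1.5) for p, clip in PHONEME_BANK.items()}

dad_audio = synthesize("hi dad", PHONEME_BANK)
baby_audio = synthesize("hi dad", BABY_BANK)
```

Resampling like this changes the clip length along with the pitch, so the shifted voice also talks faster. Whether the real thing compensates for that, I can't say from here.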