Hello!
You know, I have no idea how I missed "STM Sound Clips" as a feature, but that may work for the audio portion of things! At least for non-quad text. As for the graphics portion with quads, let's say I want to map all of this to a kind of "voice"...
So instead of typing "<c=secret-color-CHMRW><q=secret-glyph-FGHIJ></c><c=secret-color-EJOTY><q=secret-glyph-ABCDE></c>" (which produces 'HE' in 'HELLO WORLD') and so on in the text inspector, I could map it to a kind of "voice" that automates the colour assignment and glyph appearance. In this example, the glyphs could be animated for a "hand-drawn" effect or stand in for garbled "dream" speech. Then, in addition to (or as a replacement for) animated glyphs, they could be animated in whatever way suits the read-out (wiggling, jiggling, squishing, etc). These effects in combination with the STMSoundClips would be perfect for the scenarios mentioned earlier.
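To make the "voice" idea a bit more concrete, here is a rough sketch in Python, purely for illustration (it is not STM code and does not use any STM API). It assumes the tag names in the example above follow a 5x5 grid of the letters A to Y, where the colour tag is named after the letter's column and the quad tag after its row; that layout is only inferred from the 'HE' example. The function name voice_markup is made up.

```python
# Hypothetical sketch: builds the tag string a "voice" could generate
# automatically. Not part of Super Text Mesh; names are illustrative.

# Assumed 5x5 letter grid (A to Y), inferred from the 'HE' example.
ROWS = ["ABCDE", "FGHIJ", "KLMNO", "PQRST", "UVWXY"]
# Column i collects the i-th letter of every row, e.g. "CHMRW".
COLUMNS = ["".join(row[i] for row in ROWS) for i in range(5)]


def voice_markup(text):
    """Wrap each A-Y letter in a colour tag (column) and quad tag (row)."""
    out = []
    for ch in text.upper():
        row = next((r for r in ROWS if ch in r), None)
        if row is None:
            # Spaces, 'Z', digits, and punctuation pass through untouched.
            out.append(ch)
            continue
        col = COLUMNS[row.index(ch)]
        out.append(f"<c=secret-color-{col}><q=secret-glyph-{row}></c>")
    return "".join(out)


print(voice_markup("HE"))
```

The point is just that the per-letter colour and quad tags are mechanical, so a "voice" setting could generate them from plain text like HELLO WORLD instead of having them typed by hand.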
EDIT: I am noticing that using the context menu to create a Sound Clip Data asset seems to create an 'STMAutoClipData' instead of an 'STMSoundClipData'. Additionally, there's no 'Auto Clip Data' entry in the context menu to choose from. All other types seem to create the proper data.