Just my thought process, but here goes:
Option 1 - works well if all we're gonna have is conversations where only the upper body gestures to show emotions/moods.
Option 2 - gives more room for full-body expressions. But if something is supposed to be happening in the background during the conversation, you could miss parts of it depending on where it's located.
Option 3 - is more spaced out, but the text box may pull most of the focus because of its bigger size.
Option 4 - has the best spacing and size balance between the portrait and the text box, and still lets you view the full-body expressions without much issue.
Option 5 - pretty standard Stardew fashion. It would definitely limit you to facial expressions only in order to fit the portrait box, and it's pretty basic overall compared to the others.
Overall, I'd have to go with options 2 or 4 as the best out of the 5.