My guess would be the ControlNet settings. Check how your ControlNet mask looks when working with only the first frame, to get an idea of what it's picking up and what may be wrong. Perhaps pick a different ControlNet model, or just fix the settings on the one you are using.
Thank you for your fast response. I tried different ControlNet models (canny, hed, depth, normal), weights (0.3, 0.5, 0.9 or even 1.6), preprocessor on/off, and 3 sets of guide frames, with no luck. Only this happens:
What does the original video look like? It's hard to keep a consistent background unless the original background has enough detail for ControlNet to pick up. For that reason I expect many people will just generate against a greenscreen or something similar, then superimpose the result onto a background.
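
In case it helps, here is a rough sketch of the superimpose step in Python with NumPy. It is only an illustration, not a tool from this thread: the function name `composite_over_background`, the key colour, and the tolerance value are all made up for the example, and it uses a hard threshold, so it will not handle soft edges or green spill.

```python
import numpy as np

def composite_over_background(frame, background, key=(0, 255, 0), tol=80.0):
    """Replace pixels close to the key colour with the background.

    frame, background: uint8 RGB arrays of shape (H, W, 3).
    key and tol are illustrative values, not tuned settings.
    """
    if frame.shape != background.shape:
        raise ValueError("frame and background must have the same shape")
    # Euclidean distance of every pixel from the key colour.
    diff = frame.astype(np.float32) - np.array(key, dtype=np.float32)
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # True where the pixel is close enough to the key to count as backdrop.
    is_backdrop = dist < tol
    out = frame.copy()
    out[is_backdrop] = background[is_backdrop]
    return out
```

You would run this on every generated frame with the same background image, so the background stays identical across the whole clip no matter what the generation does.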