Demos
In the first section, we select sound samples from the V2A dataset (VGGSound) and speech samples from the VisualTTS dataset (LRS2). We also show out-of-domain generation for videos generated by Google Veo3, demonstrating the joint generation capability of VSSFlow.
	
In the second section, we compare VSSFlow with multiple domain-specific baselines on both sound and speech generation benchmarks to demonstrate VSSFlow's superiority.
    
We appreciate your patience while the server loads the content. Thank you! For a better audio-visual experience, we recommend using earphones and viewing the videos in full-screen mode.