VSSFlow

Demos

In the first section, we show sound samples from the V2A dataset (VGGSound) and speech samples from the VisualTTS dataset (LRS2). We also show out-of-domain cases for videos generated by Google Veo3, which demonstrate the joint generation capability of VSSFlow.
In the second section, we compare VSSFlow with multiple domain-specific baselines on both sound and speech generation benchmarks to demonstrate VSSFlow's superiority.
For a better audio-visual experience, we appreciate your patience while the server loads the content. Thank you! We recommend using earphones and viewing the videos in full-screen mode.


Section 1.1: Sound generation results.

Section 1.2: Speech generation results.

"I'm starting up a venue for yank holiday makers on an island at Bahamas." "I think something drastic needs to be done." "That seems to me like a pretty impressive perk of the job by anybody’s standards."
"Travel three miles further west and you do get more for your money." "People are happy to be there and it's a good vibe." "But everyone going into the den gets a fresh chance to turn things round."

Section 1.3: Audio-Speech joint-generation for Veo3 videos.

CASE1 - Police speech transcript: "We get in there, I want no bullshit!" CASE2 - Streamer speech transcript: "Hi, and welcome to the channel." CASE3 - Soldier speech transcript: "Eat lead, zombie scum!"

The above videos (Cases 1, 2, and 3) were generated by Google’s Veo3 and do not appear in the training data. Given the silent video clip and the speech transcript, we use VSSFlow to generate both sound and speech. These cases demonstrate VSSFlow’s strong out-of-domain generalization ability.







Section 2.1: V2A results comparison.

GT-Bird VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Exploding VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Drum VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Machine VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Bubble VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Cat VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Chewing VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Guitar VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


GT-Vehicle VSSFlow SpecVQGan Im2Wav Seeing&Hearing
DiffFoley TiVA LoVA Frieren VAB


Section 2.2: VisualTTS results comparison.

GT DSU StyleDubber
VSSFlow HPMDubbing EmoDubber
"So none of these change their equilibrium constant under these effects."
GT DSU StyleDubber
VSSFlow HPMDubbing EmoDubber
"And indeed I see favoring the products a lot of ions in solution and a very bright light."
GT DSU StyleDubber
VSSFlow HPMDubbing EmoDubber
"So it's the heat I absorbed minus the work I did or an internal energy change of fifty nine thousand two hundred ten joules."
GT DSU StyleDubber
VSSFlow HPMDubbing EmoDubber
"So I'm going to start with objects on one side and let them spread to both sides."
GT DSU StyleDubber
VSSFlow HPMDubbing EmoDubber
"Lay green at R seven soon."
GT DSU StyleDubber
VSSFlow HPMDubbing EmoDubber
"Place red in S seven now."