VSSFlow

Demos

In the first section, we show sound samples from the V2S dataset (VGGSound) and speech samples from the VisualTTS dataset (LRS2). We also show out-of-domain cases on videos generated by Google Veo3, which demonstrate the joint generation capability of VSSFlow.
In the second section, we compare VSSFlow with multiple domain-specific baselines on sound and speech generation benchmarks to demonstrate VSSFlow's superiority. We also compare VSSFlow's joint generation results with pipeline-based methods (V2S + VisualTTS) on the V2C benchmark.
Please be patient while the server loads the content. For a better audio-visual experience, we recommend using earphones and viewing the videos in full-screen mode. Thank you!

Section 1.1: Sound generation results.

Section 1.2: Speech generation results.


"I'm starting up a venue for yank holiday makers on an island at Bahamas."	"I think something drastic needs to be done."	"That seems to me like a pretty impressive perk of the job by anybody’s standards."

"Travel three miles further west and you do get more for your money."	"People are happy to be there and it's a good vibe."	"But everyone going into the den gets a fresh chance to turn things round."

Section 1.3: Audio-Speech Joint-Generation Results for Veo3 videos.



CASE1 - Police speech transcript: "We get in there, I want no bullshit!"	CASE2 - Streamer speech transcript: "Hi, and welcome to the channel."	CASE3 - Soldier speech transcript: "Eat lead, zombie scum!"

The above videos (Cases 1, 2, and 3) are generated by Google’s Veo3 and do not appear in the training data. Given the silent video clip and the speech transcript, we use VSSFlow to generate both sound and speech. These cases demonstrate VSSFlow’s strong out-of-domain generalization ability.

Section 2.1: V2A results comparison.

GT-Bird	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Exploding	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Drum	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Machine	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Bubble	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Cat	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Chewing	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Guitar	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

GT-Vehicle	VSSFlow	SpecVQGan	Im2Wav	Seeing&Hearing

DiffFoley	TiVA	LoVA	Frieren	VAB

Section 2.2: VisualTTS results comparison.

GT	DSU	StyleDubber

VSSFlow	HPMDubbing	EmoDubber

"So none of these change their equilibrium constant under these effects."

GT	DSU	StyleDubber

VSSFlow	HPMDubbing	EmoDubber

"And indeed I see favoring the products a lot of ions in solution and a very bright light."

GT	DSU	StyleDubber

VSSFlow	HPMDubbing	EmoDubber

"So it's the heat I absorbed minus the work I did or an internal energy change of fifty nine thousand two hundred ten joules."

GT	DSU	StyleDubber

VSSFlow	HPMDubbing	EmoDubber

"So I'm going to start with objects on one side and let them spread to both sides."

GT	DSU	StyleDubber

VSSFlow	HPMDubbing	EmoDubber

"Lay green at R seven soon."

GT	DSU	StyleDubber

VSSFlow	HPMDubbing	EmoDubber

"Place red in S seven now."

Section 2.3: Sound-Speech Joint Generation Results Comparison.

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"I had to have your invention."	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"Ocean chose you for a reason."	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"Your chief is dead."	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"I'm so sorry. I don't know how this happened."	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"Look inside."	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"A bow, Fergus? She's a lady."	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"No, no. No way!"	LoVA+Speaker2Dubber	LoVA+StyleDubber

VSSFlow	MMAudio+Speaker2Dubber	MMAudio+StyleDubber

"Safe? I wanna help."	LoVA+Speaker2Dubber	LoVA+StyleDubber