EdiTTS: Score-based Editing for Controllable Text-to-Speech

Jaesung Tae (Yale University)
Hyeongju Kim (Neosapience, Inc.)
Taesu Kim (Neosapience, Inc.)

Abstract

We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are only applied to the target region. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
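
The role of the masks and softening kernels can be illustrated with a toy sketch. This is not the authors' implementation: the helper names `soft_mask` and `blend`, the moving-average kernel, and the random arrays standing in for mel-spectrogram frames are all assumptions made for illustration. It shows only how a smoothed mask might confine an edit to a target region while leaving the rest of the signal untouched.

```python
import numpy as np

def soft_mask(num_frames, start, end, kernel_width=5):
    """Binary mask over frames [start, end), smoothed at its edges.

    Hypothetical helper: a moving-average kernel is assumed here as
    one simple choice of softening kernel.
    """
    mask = np.zeros(num_frames)
    mask[start:end] = 1.0
    kernel = np.ones(kernel_width) / kernel_width
    return np.convolve(mask, kernel, mode="same")

def blend(original, edited, mask):
    """Keep `original` where mask is 0, `edited` where mask is 1,
    and interpolate between them at the softened boundaries."""
    return mask * edited + (1.0 - mask) * original

# Toy stand-ins for (n_mels, n_frames) arrays; not real speech features.
rng = np.random.default_rng(0)
n_mels, n_frames = 4, 20
original = rng.standard_normal((n_mels, n_frames))
edited = original + 1.0  # a coarse perturbation applied everywhere

mask = soft_mask(n_frames, start=8, end=14)
mixed = blend(original, edited, mask)
```

In a diffusion setting, a blend of this kind would presumably be applied at each reverse step rather than once. The single-shot version above is only meant to make the masking behaviour concrete: frames far from the target region match `original`, frames in its interior match `edited`, and boundary frames are a weighted mix of the two.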


Audio Samples

1. Pitch Shift

Example 1

Input
In the nutrition of the animal the most essential and characteristic part of the food supply is derived from vegetable
Grad-TTS
WORLD (up)
WORLD (down)
FastPitch (up)
FastPitch (down)
Mel-shift (up)
Mel-shift (down)
EdiTTS (up)
EdiTTS (down)

Example 2

Input
In the face of impediments confessedly discouraging
Grad-TTS
WORLD (up)
WORLD (down)
FastPitch (up)
FastPitch (down)
Mel-shift (up)
Mel-shift (down)
EdiTTS (up)
EdiTTS (down)

Example 3

Input
The second step we have taken in the restoration of normal business enterprise
Grad-TTS
WORLD (up)
WORLD (down)
FastPitch (up)
FastPitch (down)
Mel-shift (up)
Mel-shift (down)
EdiTTS (up)
EdiTTS (down)

Example 4

Input
They agree that Hosty told Revill
Grad-TTS
WORLD (up)
WORLD (down)
FastPitch (up)
FastPitch (down)
Mel-shift (up)
Mel-shift (down)
EdiTTS (up)
EdiTTS (down)

2. Content Replacement


Example 1

Source
Three others subsequently identified Oswald from a photograph
Target
Three others subsequently recognized Oswald from a photograph
Grad-TTS (source)
Grad-TTS (target)
Mel-concat
EdiTTS

Example 2

Source
Agent Quigley did not know of Oswald's prior FBI record when he interviewed him
Target
Agent Quigley did not know of Oswald's prior FBI disk when he interviewed him
Grad-TTS (source)
Grad-TTS (target)
Mel-concat
EdiTTS

Example 3

Source
that is reflected in definite and comprehensive operating procedures
Target
that is considered in definite and comprehensive operating procedures
Grad-TTS (source)
Grad-TTS (target)
Mel-concat
EdiTTS

Example 4

Source
to the value of twenty thousand pounds
Target
to the value of twenty hundred pounds
Grad-TTS (source)
Grad-TTS (target)
Mel-concat
EdiTTS