Accepted to INTERSPEECH 2022 (Paper ID: 988)

Official PyTorch implementation: Link

Authors

Changhwan Kim, Seyun Um, Hyungchan Yoon, Hong-goo Kang


Proposed Method

In this paper, we propose a method to flexibly control the local prosodic variation of a neural text-to-speech (TTS) model. To make synthesized speech expressive, conventional TTS models use utterance-wise global style embeddings, which are obtained by compressing frame-level embeddings along the time axis. However, because utterance-wise global features do not contain enough information to represent word-level local features, they are not suitable for directly controlling prosody at a fine scale. In multi-style TTS models, the ability to control local prosody is very important because it plays a key role in finding the most appropriate text-to-speech pair among many one-to-many mapping candidates. To explicitly present local prosodic characteristics to the contextual information of the corresponding input text, we propose a module that predicts the fundamental frequency (F0) of each text, conditioned on the utterance-wise global style embedding. We also estimate multi-style embeddings using a multi-style encoder, which takes both a global utterance-wise embedding and a local F0 embedding as inputs. Our multi-style embedding enhances the naturalness and expressiveness of synthesized speech and makes it possible to control prosody styles at the word level or phoneme level.
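
The two embedding operations mentioned above can be illustrated with a small sketch. It uses plain Python lists in place of tensors. The function names `global_style_embedding` and `multi_style_input`, the choice of mean pooling as the compression along the time axis, and the choice of concatenation to combine the two inputs are all assumptions made for illustration; they are not taken from the released implementation.

```python
def global_style_embedding(frames):
    """Compress frame-level embeddings (T x D) into one D-dim vector.

    Assumption: the compression is a simple mean over the time axis.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


def multi_style_input(global_emb, local_f0_embs):
    """Pair the utterance-wise embedding with each local F0 embedding.

    Assumption: the two are combined by concatenation, one row per text token.
    """
    return [list(global_emb) + list(f0_emb) for f0_emb in local_f0_embs]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2-dim toy embeddings
g = global_style_embedding(frames)              # one vector for the whole utterance
tokens = multi_style_input(g, [[0.1], [0.2]])   # two tokens, 1-dim toy F0 embeddings
```

The sketch makes the limitation in the abstract concrete: `g` is identical for every token, so any token-to-token difference has to come from the local F0 part.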


Samples

[Case 1. Performance]

Anger

Text
두 사람 눈치 보기 싫어서 한 발이라도 먼저 나간다. (I don't want to look at them, so I'll take a step forward.)
아침 하기 싫어서 나오는 거 내 모를 줄 알아? (Do you think I don't know you're coming out because you don't want to make breakfast?)
내 앞에 앉기 싫은 모양인데, 그럼 앉지마. (You don't seem to want to sit in front of me, so don't sit down.)
들어오기 싫으면, 이참에 끝장을 내라고 그러세요. (If he doesn't want to come in, tell him to finish it this time.)
Ground Truth
Baseline
Proposed

Happiness

Text
난 아저씨가 빨리 우리 아빠가 됐으면 좋겠어. (I hope he will be my father soon.)
엿장수 맘대로 아니고, 지혜 맘대로. (It's not up to the candy seller; it's up to Jihye.)
그런 맘 먹기 힘들었을텐데, 고맙다 인경아. (It must have been hard for you to make up your mind. Thank you, In-kyung.)
아뇨, 전 호텔에서의 만찬보다는 이런 자리가 훨씬 편하고 좋은데요. (No, I find this kind of setting much more comfortable and pleasant than a dinner at a hotel.)
Ground Truth
Baseline
Proposed

Sadness

Text
엄마가 날 안낳았다니까 더 우울하고 더 슬퍼. (Hearing that my mom didn't give birth to me makes me even more depressed and sad.)
나 아까워서 우리 혜인이 시집 못보내겠어. (She's too precious to me; I can't let my Hye-in get married.)
오늘 만나면 어제 했던 말 취소한다고 할까봐 밤새 걱정했어. (I worried all night that when we met today you would take back what you said yesterday.)
공휴일이라 쉬실텐데, 전화드려서 죄송합니다. (I'm sorry to call you on a public holiday when you must be resting.)
Ground Truth
Baseline
Proposed

Neutral

Text
세상에 둘도 없는 범생이 차림으로 갈 거다. (I'm going to dress up as the best student in the world.)
일본 작가가 쓴 소설을 출판하고 싶은가봐. (He seems to want to publish a novel written by a Japanese writer.)
애한테 이런 불량식품을 사먹이면 어떡해요. (You shouldn't buy such junk food for that kid.)
아빠, 우리 유치원 얼마나 좋은데요. (Dad, my kindergarten is so nice.)
Ground Truth
Baseline
Proposed


[Case 2. Utterance-level F0 control]

Text 1. 이번 여름엔 다같이 놀러가면 좋겠다. (I hope we can all go on a trip together this summer.)

F0 Shift Original +50Hz -50Hz
Happiness
Sadness

Text 2. 아니야, 정말로 엄마랑 살고 싶어. (No, I really want to live with my mom.)

F0 Shift Original +50Hz -50Hz
Anger
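
The utterance-level control in this case amounts to adding a constant offset to the whole F0 contour. A minimal sketch follows, assuming the contour is a list of per-frame values in Hz with 0.0 marking unvoiced frames. The function name `shift_f0_utterance`, the `min_hz` floor, and the toy values are hypothetical and are not taken from the released implementation.

```python
def shift_f0_utterance(f0_hz, shift_hz, min_hz=1.0):
    """Add shift_hz to every voiced frame; unvoiced frames (0.0) stay 0.0."""
    shifted = []
    for f in f0_hz:
        if f > 0.0:
            # Clamp so a large negative shift cannot make a voiced frame non-positive.
            shifted.append(max(f + shift_hz, min_hz))
        else:
            shifted.append(0.0)
    return shifted


contour = [0.0, 200.0, 220.0, 0.0, 180.0]     # toy values, not measured data
raised = shift_f0_utterance(contour, 50.0)     # the "+50Hz" column
lowered = shift_f0_utterance(contour, -50.0)   # the "-50Hz" column
```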


[Case 3. Word-level F0 control]

(The part where F0 has changed is marked in bold.)

Text 1. 이번 여름엔(yeo reum en) 다같이 놀러가면 좋겠다. (I hope we can all go on a trip together this summer.)

F0 Shift Original +50Hz -50Hz
Anger
Sadness
Neutral

Text 2. 데려다주기로 한 거니까, 어디든(eo di deun) 데려다주마. (Since I said I'd take you, I'll take you anywhere.)

F0 Shift Original +50Hz -50Hz
Happiness
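
Local control differs from the utterance-level case only in that the offset is restricted to a span of tokens. A minimal sketch follows, assuming one F0 value in Hz per text token. The function name `shift_f0_span`, the per-syllable tokenization, the romanization "i beon" for the first word, and all numeric values are illustrative assumptions.

```python
def shift_f0_span(token_f0_hz, start, end, shift_hz):
    """Shift F0 only for tokens with index in [start, end); others are unchanged."""
    if not 0 <= start <= end <= len(token_f0_hz):
        raise ValueError("span out of range")
    return [
        f + shift_hz if start <= i < end else f
        for i, f in enumerate(token_f0_hz)
    ]


# Toy per-syllable F0 values (Hz); the numbers are made up for illustration.
tokens = ["i", "beon", "yeo", "reum", "en"]
token_f0 = [190.0, 200.0, 210.0, 220.0, 205.0]

word_up = shift_f0_span(token_f0, 2, 5, 50.0)      # whole word "yeo reum en", +50 Hz
word_down = shift_f0_span(token_f0, 2, 5, -50.0)   # same word, -50 Hz
```

Passing a span that covers a single token, such as `shift_f0_span(token_f0, 3, 4, 50.0)`, gives the finest granularity this sketch supports.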


[Case 4. Phoneme-level F0 control]

(The part where F0 has changed is marked in bold.)

Text 1. 이번 여름엔(reum en) 다같이 놀러가면 좋겠다. (I hope we can all go on a trip together this summer.)

F0 Shift Original 름(reum) 엔(en)
Happiness
Neutral

Text 2. 아빠, 우리 유(yu)치원(won) 얼마나 좋은데요. (Dad, my kindergarten is so nice.)

F0 Shift Original 유(yu) 원(won)
Anger


[Case 5. Diverse local prosodic variation]

Because our proposed model receives local information from F0 values, its synthesized output exhibits diverse local prosodic variation. In contrast, the baseline generates only averaged prosodic variation.

Please focus on the prosodic variation of the following samples.

Baseline
Proposed
Baseline
Proposed
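
One simple way to put a number on the difference described in this case is the spread of F0 over the voiced frames of each sample. The sketch below is only an illustration of that idea, not the evaluation used in the paper. The function name `f0_spread`, the convention that 0.0 marks an unvoiced frame, and the two toy contours are assumptions.

```python
import statistics


def f0_spread(f0_hz):
    """Population standard deviation of F0 (Hz) over voiced frames (f0 > 0)."""
    voiced = [f for f in f0_hz if f > 0.0]
    if len(voiced) < 2:
        return 0.0
    return statistics.pstdev(voiced)


# Toy contours, made up for illustration; they are not measurements of the samples.
flat = [0.0, 200.0, 200.0, 200.0, 0.0]      # stands in for an averaged contour
varied = [0.0, 180.0, 200.0, 220.0, 0.0]    # stands in for a more varied contour

flat_spread = f0_spread(flat)
varied_spread = f0_spread(varied)
```

A larger value means the contour moves further from its own mean, which is one rough proxy for more varied local prosody.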