TTS (Text-To-Speech, speech synthesis) is currently a "small and beautiful" AI field, but I personally find it very interesting. I believe TTS will eventually be taken seriously by the industry, and companies that do it well will emerge.

This article distills the "most essential" knowledge about TTS technology and the current state of the industry, extracted after collecting a large amount of online and offline material (more would be unnecessary, and less would not be enough to get started, prepare for an interview, or do actual work). It is meant to save you time and to filter out useless information and overly technical content.

Table of Contents

1. Core concepts

2. Current technology boundaries

3. Bottlenecks and opportunities (focus)

1. Core concepts

1. The conceptual difference between TTS and ASR

We are more familiar with ASR (Automatic Speech Recognition), which converts sound into text and can be compared to the human ear.

TTS (Text-To-Speech, speech synthesis) converts text into sound, i.e. reads it aloud, and is analogous to the human mouth. The voices you hear from voice assistants such as Siri are all generated by TTS, not real people talking.

There are two main ways to implement TTS: the "splicing" (concatenative) method and the "parameter" (statistical parametric) method.

2. Splicing method

1) Definition: select the required basic units from a large inventory of pre-recorded speech and stitch them together. The units can be syllables, phonemes, etc.; to keep the synthesized speech coherent, diphones (from the center of one phoneme to the center of the next) are often used as the unit. (A minimal sketch of this idea follows this list.)

2) Advantages: higher voice quality

3) Disadvantages: the database requirement is very large. A finished product usually needs tens of hours of recordings; for enterprise-level commercial use, at least 50,000 sentences are required, at a cost of several million yuan.
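To make the splicing idea concrete, here is a minimal, illustrative sketch rather than a real unit-selection system: it assumes a hypothetical inventory of pre-recorded diphone waveforms keyed by diphone name and simply concatenates them with a short cross-fade. A production system would additionally score candidate units with target and join costs before splicing.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, assumed

def crossfade_concat(units, fade_ms=10):
    """Concatenate pre-recorded waveform units with a short linear cross-fade
    to smooth the joins (a stand-in for real join-cost-aware splicing)."""
    fade = int(SAMPLE_RATE * fade_ms / 1000)
    out = units[0].astype(np.float32)
    for u in units[1:]:
        u = u.astype(np.float32)
        ramp = np.linspace(0.0, 1.0, fade)
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out

def synthesize(diphone_sequence, inventory):
    """Look up each required diphone in the recorded inventory and splice."""
    units = [inventory[d] for d in diphone_sequence]  # KeyError if a unit is missing
    return crossfade_concat(units)

# Toy usage: a fake inventory of two "diphones" (random noise standing in
# for real recordings), spliced into one utterance.
inventory = {
    "n-i": np.random.randn(3200),    # ~0.2 s placeholder recording
    "i-hao": np.random.randn(4800),  # ~0.3 s placeholder recording
}
wave = synthesize(["n-i", "i-hao"], inventory)
print(wave.shape)
```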

3. Parameter method

1) Definition: use a statistical model to generate speech parameters for every moment in time (fundamental frequency, formant frequencies, etc.), then convert these parameters into a waveform. The system is mainly divided into three modules: front end, back end, and vocoder.

The front end parses the text to determine how each word is pronounced, what tone and intonation the sentence should use, what rhythm to read it with, which parts are the key points that need emphasis, and so on. Common prosody-related descriptions include, but are not limited to, prosodic boundaries, accents, boundary tones, and even emotion. There is further information that is hard to describe objectively at all, which current algorithms can only ignore for now.

Note: both the splicing method and the parameter method have a front-end module; the difference between them lies mainly in the back-end acoustic modeling.

2) Advantages: The database requirements are relatively small.

If you only need it to produce speech at all (as a demo), about 500 sentences will do, but the result will certainly not be good.

A general-purpose TTS usually requires at least 5,000 sentences and 6 hours of audio (roughly 800 sentences can be recorded per hour). From preliminary preparation, finding people and a recording venue, recording, data screening, and labeling, through to finally having "usable data", it may take at least 3 months. (Xunfei is relatively mature in every respect, so its turnaround is much shorter.)

Most personalized TTS uses the parameter method. (Adobe and Microsoft have also tried the splicing method, but it is less mature than the parameter method and the results are mediocre.)

3) Disadvantages: the quality is worse than with the splicing method, because it is constrained by the vocoding algorithm and some fidelity is lost.

The main weakness and difficulty is the vocoder. Its job is to reconstruct the sound signal, and it is hard to reproduce the fine details of the sound without the listener noticing noise, dullness, a mechanical feel, and so on. Current mainstream vocoders impose theoretical models and simplifying assumptions on the sound signal itself; it is fair to say the details are largely ignored.

Note: DeepMind's WaveNet essentially solves the vocoder problem, because it makes predictions directly on the speech samples and does not rely on any theoretical model of speech production. The resulting audio is very rich in detail, basically reaching a quality close to the original voice (this is where the claimed "50% quality improvement" comes from), and it can model almost any sound. (A toy sketch of this sample-level idea follows.)
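The sample-level idea behind WaveNet can be illustrated with a toy autoregressive sketch. This is not the real WaveNet (which stacks dilated causal convolutions and conditions on linguistic features); it only shows the two ingredients described above, under stated assumptions: μ-law quantization of samples into 256 classes, and a loop that predicts each next sample from the previous ones, with a random placeholder standing in for the trained network.

```python
import numpy as np

MU = 255  # 8-bit mu-law quantization, as in the WaveNet paper

def mu_law_encode(x, mu=MU):
    """Map a waveform in [-1, 1] to 256 discrete classes (mu-law companding)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=MU):
    """Map discrete classes back to a waveform in [-1, 1]."""
    y = 2 * (q.astype(np.float32) / mu) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def toy_predictor(context):
    """Placeholder for the trained network: returns a probability distribution
    over the 256 possible next-sample values given the previous samples."""
    logits = np.random.randn(MU + 1)          # a real model would condition on `context`
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(n_samples, receptive_field=1024, seed=0):
    """Autoregressive generation: draw each sample from p(x_t | x_<t)."""
    rng = np.random.default_rng(seed)
    samples = [MU // 2]                        # start from "silence"
    for _ in range(n_samples - 1):
        context = samples[-receptive_field:]
        probs = toy_predictor(context)
        samples.append(rng.choice(MU + 1, p=probs))
    return mu_law_decode(np.array(samples))

audio = generate(1600)   # 0.1 s at 16 kHz; pure noise here, speech only with a trained model
print(audio.min(), audio.max())
```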

4. Evaluation criteria for TTS

1) Subjective test (degree of naturalness), mainly based on MOS

A) MOS (Mean Opinion Score): expert evaluation (subjective); scored from 1 to 5, with 5 being the best.

Note: Microsoft Xiaoice's publicly stated score is 4.3, but some people in the industry believe this does not mean it is "absolutely" better than iFlytek, because the panel of experts differs from one evaluation to the next. To put it bluntly, in the AI industry right now everyone claims their own system sounds great.

B) ABX: evaluation by ordinary users (subjective). Users listen to two TTS systems, compare them, and judge which sounds better.

C) Each subjective listening test should have its own focus. For example, one round might concentrate on polyphonic characters, and the next on modal particles.

2) Objective test

A) Evaluate the acoustic parameters generated by the synthesis system, generally by computing Euclidean-style distances against reference parameters (RMSE, LSD). (See the sketch after this list.)

B) Engineering tests of the synthesis system: real-time factor (synthesis time / audio duration), first-packet latency (the time from when the user sends a request until the first audio packet reaches the user), memory usage, CPU usage, crash rate over a 3×24-hour run, and so on.
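As a rough sketch of these metrics (exact definitions vary between teams, so treat this as one plausible formulation rather than a standard implementation): it assumes a list of listener ratings for MOS, two already time-aligned acoustic-parameter sequences of shape frames × dimensions for RMSE and log-spectral distance, and measured timings for the real-time factor.

```python
import numpy as np

def mean_opinion_score(ratings):
    """MOS: the plain average of listeners' 1-5 ratings."""
    return float(np.mean(ratings))

def rmse(ref, syn):
    """Root-mean-square error between time-aligned acoustic parameter
    sequences (shape: frames x dims)."""
    return float(np.sqrt(np.mean((ref - syn) ** 2)))

def log_spectral_distance(ref_spec, syn_spec, eps=1e-10):
    """LSD: per-frame Euclidean distance between log power spectra (in dB),
    averaged over frames."""
    diff = 10 * np.log10((ref_spec + eps) / (syn_spec + eps))
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = synthesis time / duration of generated audio; < 1 means faster than real time."""
    return synthesis_seconds / audio_seconds

# Toy usage with stand-in data.
print("MOS:", mean_opinion_score([4, 5, 4, 3, 4]))
ref = np.abs(np.random.randn(200, 80)) + 0.1              # 200 frames x 80-dim "reference" spectra
syn = ref * np.exp(0.05 * np.random.randn(200, 80))       # slightly perturbed "synthesized" spectra
print("RMSE:", rmse(ref, syn))
print("LSD (dB):", log_spectral_distance(ref, syn))
print("RTF:", real_time_factor(synthesis_seconds=0.8, audio_seconds=4.0))
# First-packet latency is measured on the client side:
# (arrival time of the first audio chunk) - (time the request was sent).
```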

2. Current technology boundaries

1. General TTS

1) In scenarios (apps/hardware) where users' expectations are not demanding, it can meet commercial needs, e.g. voice assistants, Didi, Gaode, smart speakers, robots; but when user expectations are very high it is hard to satisfy them, because there is still a "mechanical feel" and the human voice cannot be imitated very naturally.

2) At present, the products of various companies in the industry have similar effects, and they are basically commercially available.

2. Personalized TTS

1) In scenarios where users' expectations are not demanding, it can "basically" meet commercial needs, though the result is not as good as general-purpose TTS. When user expectations are very high, it cannot satisfy them for now.

2) At present, iFLYTEK is the most commercially mature player in this field, and some startups are also positioning themselves here. For example, HEARD, a company dedicated to turning mass content into audio, generates and stockpiles voices by category. The enterprise-level needs it targets are more personalized and brand-oriented, such as Alibaba's "zoo" of brands (Tmall, Xianyu, Hema, Cainiao, etc.); character TTS such as a "Peppa Pig" voice has already been commercialized.

3. Emotional TTS

1) At present, the industry does offer somewhat more expressive synthesis, because the data itself has become richer in prosody and has moved beyond the traditional broadcast style; but this is not real emotional synthesis in the sense of "joy, anger, sorrow, happiness" (being happy exactly when you want it to be, that level of intelligence).

2) Academia has theoretical reserves for emotional TTS, but the industry as a whole has so far done little (or has not done it well), because emotional TTS relies heavily on "emotional intent recognition", "emotional feature mining", "emotional data", and "emotional acoustic technology"; it is systems engineering. The first of these is related to natural language processing, e.g. knowing "when to be happy or sad"; at the same time, a stock of voice data recorded with emotional performances is also very important.

3. Bottlenecks and opportunities

There are five main bottlenecks (which are also opportunities), in five directions.

1. Basic technology

1) TTS technology is undergoing major changes: the end-to-end TTS modeling method, coupled with WaveNet's vocoder idea, is the future development direction of TTS.

End-to-end TTS generally refers to Tacotron. Tacotron is just a middle stage proposed by Google that merges the traditional duration model and acoustic model, and it can be connected to any TTS front end and any TTS back end. A good TTS front end (Chinese word segmentation, phonetic annotation, part-of-speech tagging, etc.) will improve Tacotron's performance; for the back end, parametric, splicing, and WaveNet approaches can all be used. (A modular sketch of this composition appears at the end of this subsection.)

Regarding the commercialization of WaveNet: Google commercialized the second generation of WaveNet at the beginning of this year, which is 10,000 times faster than the first generation. Domestic companies have basically reproduced the paper's algorithm, but the engineering will take time and the cost is still too high, so it is unlikely to be commercially deployed in the short term.

Regarding quality: in the final quality of a TTS system, technology accounts for less than 50%. When the technology is comparable, the quality of the recorded voice and the amount of data matter most; and only TTS systems of similar deployment scale and cost can be compared with each other, so you cannot simply say that one company's result is better than another's.

a) For example, the WaveNet v1 results of many AI companies such as Baidu, Tencent, Alibaba, and Turing can generally surpass Xunfei's online API, but the deployment cost is tens of thousands of times higher and it does not run in real time. After WaveNet v2 was commercialized it can run in real time, but the deployment cost is still at least 10 times that of a well-provisioned splicing TTS.

b) Cost is partly tied to sampling rate. For example, Xunfei's and Baidu's TTS use a 16 kHz sampling rate; at 24 kHz or 48 kHz the subjective experience is at least 50% better, but the cost doubles. In other words, another AI company's 24 kHz TTS may beat the iFLYTEK/Baidu APIs on MOS, yet that does not mean its technology is better, because commercialization sacrifices quality to reduce cost.

2) How to make the offline version perform as well as the online version. Many customers hope (extravagantly) for an offline version whose quality matches the online one... at this stage, it really cannot be done.
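To illustrate the modularity point above (a Tacotron-style acoustic model sitting between an interchangeable front end and an interchangeable back end), here is a minimal structural sketch. The interfaces, class names, and toy stand-ins are all hypothetical; a real system would plug an actual front end, an actual Tacotron-like model, and an actual vocoder (parametric, splicing, or WaveNet) into these slots.

```python
from typing import List, Protocol
import numpy as np

class FrontEnd(Protocol):
    def analyze(self, text: str) -> List[str]:
        """Text -> linguistic units (word segmentation, phonetic annotation, prosody marks...)."""

class AcousticModel(Protocol):
    def predict(self, units: List[str]) -> np.ndarray:
        """Linguistic units -> acoustic features (e.g. a mel spectrogram); the Tacotron-style middle stage."""

class Vocoder(Protocol):
    def synthesize(self, features: np.ndarray) -> np.ndarray:
        """Acoustic features -> waveform (parametric vocoder, splicing back end, or WaveNet)."""

class TTSPipeline:
    """Front end, acoustic model, and vocoder are independent, swappable modules."""
    def __init__(self, front_end: FrontEnd, acoustic_model: AcousticModel, vocoder: Vocoder):
        self.front_end = front_end
        self.acoustic_model = acoustic_model
        self.vocoder = vocoder

    def tts(self, text: str) -> np.ndarray:
        units = self.front_end.analyze(text)           # e.g. pinyin with tones + prosodic boundaries
        features = self.acoustic_model.predict(units)  # e.g. 80-dim mel frames
        return self.vocoder.synthesize(features)       # 16 kHz / 24 kHz waveform

# Toy stand-ins so the sketch runs end to end (random output, no real model).
class NaiveFrontEnd:
    def analyze(self, text):
        return list(text)

class RandomAcousticModel:
    def predict(self, units):
        return np.random.randn(len(units) * 20, 80)

class NoiseVocoder:
    def synthesize(self, features):
        return np.random.randn(features.shape[0] * 200)

pipeline = TTSPipeline(NaiveFrontEnd(), RandomAcousticModel(), NoiseVocoder())
print(pipeline.tts("你好").shape)
```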

2. Lack of data

On the one hand, a larger amount of data is required, especially for personalized TTS. For example, it is relatively difficult to turn a default boy's voice into a girl's voice.

On the other hand, the cost and cycle time of data acquisition (production) are an early competitive focus for every company. Generally speaking, one set of TTS data requires recording at least 20,000 to 30,000 sentences; together with data annotation, this usually takes more than 3 months (and requires the full cooperation of the voice talent). For 30 hours of data, the price is usually between 300,000 and 500,000 yuan. HEARD, the company mentioned above, has mobilized more than 8,000 high-quality broadcasters who, while dubbing different content, also build up a large inventory of structured data.

In this way, for most customers' data needs there is no need to find a voice talent and record from scratch; data can be pulled directly from the warehouse and "unfrozen" (annotated). Through this standardized "do business while accumulating data" process, the cost of obtaining data drops to about one-fifth of the industry norm, and once demand arises it can be delivered within a month. The data-annotation workshop this company has set up in the south is also very large, and companies including Huawei purchase speech synthesis data from it.

3. Lack of talent

TTS cannot compare with popular AI fields such as NLP and CV in terms of talent; even compared with ASR, which is itself not especially hot, TTS has fewer people.

4. Difficulty of productization

Due to technical limitations, it is impossible to have a perfect TTS effect at this stage, so:

1) Try to choose scenarios where users' expectations are not demanding, or manage users' expectations in the product experience design (for example, in ride-hailing apps, the Guo Degang / Lin Chiling voices only need to be roughly right).

2) The choice of "parameter method" or "splicing method" is related to the company's technical reserves, costs, and product goals. In the vertical field, the existing TTS technology (parameter or splicing) can do well for the product. The industry has not had very good results. The main reason is that the product manager has not been deeply involved, and there are many details to step on (product design + engineering realization)-there should be amazing products in the future.

3) The design of experience details differs greatly from general Internet products, for example:

A) Copywriting is very important: in a voice interaction scenario the text cannot be too long, because users have neither the patience nor the time to listen to it.

B) You can add background music to cover up details such as noise.

C) Special scenarios have special requirements; far-field and headset scenarios, for example, still differ from each other.

D) Mixed Chinese-English TTS. For example, when a user asks to play an English song, the difficulty is that it is hard to pronounce Chinese and English together naturally. Why? Because the people who record the Chinese are usually one group and the people who record the English are another; when the two languages are combined and a model is trained on them, the voice sounds very strange. The Xiaoya speaker team spent a great deal of effort and cost to force a solution to this.

5. Commercialization pressure

1) To be genuinely competitive in the market takes at least 12 months, a team of 2 to 6 people (someone who has done front-end work before saves enormous cost, since the workload is mostly in the Chinese front-end NLP: word segmentation, phonetic annotation, part-of-speech tagging, text normalization, etc.), and several million yuan of capital (one GPU costs about a hundred thousand yuan per year and supports only a few dozen concurrent requests). In addition, large companies have a huge first-mover advantage, so small companies must focus on niche scenarios.

2) Personally, I think personalized TTS and emotional TTS will see wider use in various niche scenarios, such as paid knowledge content, celebrity IP, smart hardware, connected cars, and physical/virtual robots.

Appendix: related information

1. Relevant universities and laboratories

Speech synthesis spans many professional fields, including linguistics, the mechanisms of hearing and vocal production, natural language analysis, deep learning, and signal processing; it is a comprehensive discipline.

Internationally, the laboratories of Professor Simon King (University of Edinburgh), Professor Alan W Black (Carnegie Mellon University), Professor Kawahara (Wakayama University, Japan), and Heiga Zen (Google) are among the world's best.

Domestically, Chinese researchers have also been at the forefront of the field. Chinese teams have won the international Blizzard Challenge speech synthesis competition for more than 10 consecutive years. Most domestic speech synthesis talent comes from the University of Science and Technology of China, the Institute of Automation of the Chinese Academy of Sciences, the Institute of Acoustics of the Chinese Academy of Sciences, Tsinghua University, Northwestern Polytechnical University, and similar institutions; graduates of Northwestern Polytechnical University alone hold core positions at companies such as Microsoft, Baidu, Sogou, Xiaomi, IBM, iFLYTEK, Liulishuo, Go Ask, Orion Star, and Tongdun.

2. Reference articles

"At present, the tone of artificial intelligence speech in Chinese is still relatively mechanical. How to make the tone of artificial intelligence speech more natural? 》Http://t.cn/RFnP7EH

"How to evaluate WaveNet, Google's next-generation speech synthesis system? 》Http://t.cn/RFnPUkA

"What is the principle of TTS (Text-To-Speech)? 》Http://t.cn/RFnPfP1

"The author of Baidu Deep Voice discussed the five technical details with the Bengio team. How far is the end-to-end speech synthesis? 》Http://t.cn/RoUvHAg

3. Related products

Xunfei dubbing app, Xunfei Reading Assistant app, lightning dubbing (http:// ), etc.
