|
Speech Processing Technologies

Prospects for Speech Interface Technologies
FURUI Sadaoki

Speech Processing Technologies Becoming Common in Daily Life, and Toshiba's Approach
AKAMINE Masami
Speech interfaces have recently become increasingly widespread for interacting with digital devices such as smartphones instead of touch keyboards. Since the 1980s, Toshiba has been developing various core technologies supporting speech interfaces, such as automatic speech recognition, speech synthesis, and so on. These technologies have been applied to a variety of products including speech middleware for in-car navigation systems, dictation software, content-creation services on websites, and machine translation systems. Aiming at the realization of the so-called cognitive assistant, we have been continuously engaged in the development of not only speech technologies but also technologies related to multimodal interfaces and various new products and services.

Large-Vocabulary Speech Recognition Technologies for Achievement of Simultaneous Translation and Speech Dialog Systems
MASUKO Takashi / ASHIKAWA Masayuki
In order to achieve the practical use of voice translation and speech dialog systems, large-vocabulary speech recognition that recognizes utterances of various types is required. However, it is difficult for a small number of developers to collect the new words and colloquial expressions that continuously appear in the language and to add them to a system's vocabulary. Moreover, it is necessary to improve phoneme discrimination performance in order to discriminate between the increasing number of similarly pronounced words that emerge with the expansion of vocabulary size. To overcome these problems, Toshiba has established a word collection method using crowdsourcing and developed a new acoustic feature to improve phoneme discrimination ability.
These technologies realize large-vocabulary speech recognition by collecting a number of words in a short period of time and by improving speech recognition accuracy.

Text-to-Speech Technologies Realizing Various Voices and Expressive Reading
MORITA Masahiro / TAMURA Masatsune / FUME Kosei
As text-to-speech (TTS) technologies are now widely used for e-book reading and entertainment applications, improving their ability to provide various types of voices, speaking styles, and emotions has become a focus of attention. In response to this need, Toshiba has developed the following advanced TTS technologies: (1) a custom voice production technology that can build a wide variety of voices closely resembling the voices of specific people at low cost and within a short time; (2) an expressive reading technology that can automatically select an emotion for each passage of dialogue in works such as novels; (3) a prosodic authoring technology that can efficiently create speech content with the intended intonation; and (4) a digital watermarking technology that prevents the misuse of TTS, such as for identity theft.

Spoken Dialogue Technology to Understand Problems and Offer Solutions
NAGAE Hisayoshi / YAMASAKI Tomohiro / ICHIMURA Yumi
Spoken dialogue systems such as personal assistant applications for smartphones have appeared in recent years. In order to hold a meaningful conversation with a conventional personal assistant application, however, it is necessary to input sentences containing explicit commands. Attention has therefore been increasingly focused on spoken dialogue systems to which users can speak freely, without the need for predetermined commands. Toshiba has developed a spoken dialogue technology that can assist in resolving users' problems through more spontaneous human-machine dialogues.
Even when a user utters an ambiguous expression rather than giving a clear instruction to the system, this technology makes it possible to provide appropriate solutions by estimating the intended meaning from background knowledge collected from large amounts of data, including words and sentence patterns.

Simultaneous Interpretation Technology Supporting Conversations in Foreign Languages for Face-to-Face Services
KAMATANI Satoshi / SAKAMOTO Akiko / SUMITA Kazuo
With the increasing opportunities for conversation in foreign languages, demand has been growing for a simultaneous machine interpretation technology that can be used in many different situations. Toshiba has developed a simultaneous interpretation system for continuous speech conversation taking place in various face-to-face services at stores, reception desks, counters at public offices, and so on. This system, which is capable of both Japanese/English and Japanese/Chinese interpretation, supports smoother communication between speakers of different languages by processing their continuous spontaneous speech and incrementally outputting the interpretation results. As a result, a user can immediately understand what a conversational partner is saying. We have conducted field experiments and confirmed that a solved task ratio of approximately 90% is achieved for various tasks, including buying souvenirs and asking for directions regarding a bus route.

High-Quality Voice Capture Technologies and Application to Tablet
ISAKA Takehiko / SUDO Takashi / AMADA Tadashi
Demand has been increasing for voice input applications including video chat systems and speech recognition systems. To improve the usability of these applications, it is essential to capture voices as clearly as possible.
In order to minimize factors that degrade quality in voice input applications, Toshiba has developed the following high-quality voice capture technologies: (1) an echo canceller to suppress sounds from a speaker being picked up by a microphone, (2) beamforming to suppress directional noise, and (3) a noise canceller to suppress diffuse noises entering a microphone from various directions. These technologies have been implemented in the REGZA Tablet AT703/AT503 models, which feature a smooth voice capturing function.

Audio Source Separation Technology to Control Volume Balance between Voices and Background Sounds
HIROHATA Makoto / ONO Toshiyuki / NISHIYAMA Masashi
The wide dissemination of audiovisual (AV) products has provided users with easy and diversified styles of viewing and listening to video content. However, it is not always possible to view video content comfortably because of an imbalance in the volumes of voices and background sounds. Toshiba has developed an audio source separation technology to extract voice and background sound source signals from audio signals. This new technology realizes a more enjoyable viewing experience by allowing users to adjust background sounds and hear voices more easily, thus providing highly realistic sensations when watching programs such as sports matches and enhancing the experience of karaoke while watching music programs.

Voice Interface for Operation of Distant Equipment
OUCHI Kazushige / KOGA Toshiyuki
In order to operate distant equipment by means of a speech recognition system, there are two technical challenges for the realization of practical recognition accuracy: (1) commanding the equipment to start speech recognition, and (2) reducing the influence of ambient noises. Toshiba has developed a voice interface for the operation of distant equipment that utilizes a microphone array technology to emphasize the sound in the target direction and suppress noises from other directions.
When a user activates the speech recognition system by clapping twice, the system simultaneously detects the direction of the clapping and sets the directivity angle of the microphones to that direction so as to prioritize the input of the target user's voice. We have conducted evaluation experiments using the operation of a TV set as the test scenario, and confirmed that users can operate a TV from 4.5 m away by means of speech recognition with a practical level of performance.

ToScribe™ Web Application to Enhance Efficiency of Audio Transcription Work
UENO Koji / ASHIKAWA Taira
Toshiba has launched ToScribe™, a new, free, cloud-based application that allows users to transcribe speech manually more efficiently by integrating a number of speech and language processing technologies, including automatic speech recognition (ASR) technology. ToScribe™ works with major Web browsers and offers effective transcription assistance while simplifying troublesome audio player control operations by means of the following high-level speech and language processing technologies: automatic speech position estimation using the internal results of the ASR, automatic speaker estimation by clustering audio feature values, and proofreading assistance applying our text structure analysis technology. |