Human Interface Laboratory, Annual Review 2010, The University of Aizu

Human Interface Lab. is consisted of three faculty members. Each member has own research and educational interests and are doing independent research activities;

Prof. Masahide Sugiyama:

Applications of similar segment search algorithm to various problems
Development of efficiet similar segment search algorithms
Efficient histogram generation algorithm
Participation to Open Campus with lab. demonstrations. Promotion of a series of computer training courses for visual handicapped people in Fukushima region.

Prof. Jie Huang:

Computational Auditory Scene Analysis

There was intensive research on the integration and segregation factors for sound component self organization. We conducted some examinations on the interactions between different primary cues for sound integration and segregation in human auditory systems by psychological experiments and tried to find out quantitative relations for computational auditory scene analysis.
Robotic Spatial Sound Localization and Sound-based Position Identification

Audition is important for mobile robots as well as vision and other sensing systems. We developed a robotic spatial sound localization system using an auditory interface with four microphones arranged on the surface of a spherical robot head. The time difference and intensity difference from a sound source to different microphones are analyzed and used as localization cues. On the other hand, if a mobile robot can localize several sound sources at known position, the robot will be able to find out its own position. Comparing to vision based systems, the sound based position identification method has advantages in poor lighting conditions and robust against obstacles.
Research on 3-D Sound Systems
- Speaker based 3-D sound systems
- Analysis of Head Related Transfer Functions
- Reverberation and reality in 3-D sound systems

Prof. Konstantin Markov:

Spoken Language processing including speech recognition and synthesis, speaker and language identification, dialog systems and speech based human-computer interfaces.
Multimedia information processing systems for annotation, indexing and management of digital audio-video documents and archives.
Machine learning algorithms which mimic the human way of learning and knowledge acquisition by dynamically changing model structure and active interactions with the environment.

Refereed Journal Papers

[j-huang-01:2010]

Huakang Li, Jie Huang, and Qunfei Zhao. Position identification by actively localizing spacial sound beacons. IEICE Trans. Inf. & Syste, E94D(3):632-638, Mar. 2011.

In this paper, we propose a method for robot self-position identification by active sound localization. This method can be used for autonomous security robots working in room environments. A system using an AIBO robot equipped with two microphones and a wireless network is constructed and used for position identification experiments. Differences in arrival time to the robots microphones are used as localization cues. To overcome the ambiguity of front-back confusion, a three-head-position measurement method is proposed. The position of robot can be identified by the intersection of circles restricted using the azimuth differences among different sound beacon pairs. By localizing three or four loudspeakers as sound beacons positioned at known locations, the robot can identify its position with an average error of 7cm in a 2.5Γ3.0m² working space in the horizontal plane. We propose adjusting the arrival time differences (ATDs) to reduce the errors caused when the sound beacons are high mounted. A robot navigation experiment was conducted to demonstrate the effectiveness of the proposed position-identification system.

Refereed Proceedings Papers

[j-huang-02:2010]

A. Saji, J. Huang, H. Li, and K. Tanno. A 3-D sound creation system using horizontally arranged loudspeakers. In AES 129th Convention, pages . AES E-Library, Nov. 2010.

In this research, we have studied a 3-D sound creation system using 5- and 8-channel loudspeaker arrangements. This system has a great advantage that it does not require the users to purchase a new audio system or to reallocate loudspeakers. The only change for creators of television stations, video game makers and so on is to install the new proposed method for creation of the 3-D sound sources. Head-related transfer functions are used to create the signals of left and right loudspeaker groups. An extended amplitude panning method is proposed to decide the amplitude ratios between and within loudspeaker groups. Listening experiments show that the subjects could perceive the elevation of sound images created by the system well.

[j-huang-03:2010]

J. Huang, H. Li, A. Saji, K. Tanno, and T. Watanabe. The learning effect of HRTF-based 3-D sound perception with a horizontally arranged 8loudspeaker system. In AES 129th Convention, pages . AES E-Library, Nov. 2010.

This thesis argues about the learning effects on the localization of HRTF based 3-D sound using 8 channel loudspeakers system which creates virtual sound images. This system can realize sound with elevation by 8 channel loudspeakers arranged on the horizontal plane and convolving HRTF, not use high or low mounted loudspeakers. The position of the sound image that the system creates is difficult to perceived, because such as HRTF based sound are unfamiliar. However, after several times of learning process, almost listeners can perceive the position of sound images better. This thesis shows this learning effect for HRTF based 3-D sound system.

[j-huang-04:2010]

H. Li, Q. Zhao, A. Saji, K. Tanno, and J. Huang. Spatial direction estimation for multiple sound sources in reverberation environment. In Proc. IEEE Int. Conf. Robotics and Biominetics, pages 1262 - 1267. IEEE, Dec. 2010.

In this paper, we propose spatial localization of multiple sound sources using a spherical robot head equipped with four microphones. We obtain arrival time differences using phase difference candidates. Based on the model of precedence effect, arrival temporal disparities obtained from the zero-crossing point are used to calculate time differences and suppress the influence of echoes in a reverberant environment. To integrate spatial cues of different microphone pairs, we use a mapping method from the correlation between different microphone pairs to a 3D map corresponding to azimuth and elevation of sound sources directions. Experiments indicate that the system provides the distribution of sound sources in azimuth-elevation localization, with the EA model even concurrently in reverberant environments.

[j-huang-05:2010]

H. Li, A. Saji, K. Tanno, J. Ma, J. Huang, and Q. Zhao. Spatial localization of concurrent multiple sound sources. In Proc. Int. Sym. Aware Computing, pages 97 - 103 . NCKU and IEEE, Nov. 2010.

Based on the human auditory system for spatial localization theory, we proposed a spatial localization of multiple sound sources using a spherical robot head. Space sound vectors recorded by a microphone array with spatial configuration, are used to estimate the histograms of spatial arrival time difference vectors by solving the simultaneous equations in different frequency bands. The echo avoidance model based on precedence effect is used to reduce the interference of environment reverberations which provide the strong interference for phase vectors especially in small indoor environments. To integrate spatial cues of different microphone pairs, we propose a mapping method from the correlation between different microphone pairs to a 3D map corresponding to azimuth and elevation of sound sources directions. Experiments indicate that the system provides the distribution of sound source in azimuth-elevation localization, even concurrently in reverberant environments.

[markov-01:2010]

Daria Vazhenina and Konstantin Markov. Recent Developments in the Russian Speech Recognition Technology. In R. Lee T. Matsuo, N.Ishii, editor, Proc. 9th IEEE/ASIC International Conference on Computer and Information Science, pages 535-538, August 2010.

In this paper, we present a review of the latest developments in the Russian speech recognition research. Although the underlying speech technology is mostly language independent, difference between languages with respect to their structure and grammar have substantial effect on the recognition systems performance. Russian language has complicated word formation system which is characterized by high degree of inflection and unrigidness of the word order. We analyze and compare several speech recognition systems developed in Russia and Czech Republic and identify the most promising directions for further research.

[markov-02:2010]

Alexey Karpov, Andrey Ronzhin, Konstantin Markov, and Milo elezn. Viseme-Dependent Weight Optimization for CHMM-Based Audio-Visual Speech Recognition. In Proc. International Conference INTERSPEECH, pages 2678-2681. ISCA, 2010.

The aim of the present study is to investigate some key challenges of the audio-visual speech recognition technology, such as asynchrony modeling of multi-modal speech, estimation of auditory and visual speech significance, as well as stream weight optimization. Our research shows that the use of viseme-dependent significance weights improves the performance of state asynchronous CHMM-based speech recognizer. In addition, for a state synchronous MSHMM-based recognizer, fewer errors can be achieved using stationary time delays of visual data with respect to the corresponding audio signal. Evaluation experiments showed that individual audio-visual stream weights for each viseme-phoneme pair lead to relative reduction of WER by 20%

[sugiyama-01:2010]

M. Kuwabara and M. Sugiyama. Noise Robustness Evaluation of Audio Features in Segment Search. In Proc. of CIT2010/CSEA2010, pages 1285 - 1291, Bradford, UK, July 2010. CIT.

This paper evaluates the noise robustness of audio features in segment search. Active Search is well-known as a fast segment search algorithm, and it has been successfully applied to locate music or video segments (intervals) in huge databases. The noise is generated by MP3 encoding/decoding. The search accuracy is evaluated using Fmeasure, which is calculated precision and recall. The experiment results show that mel-scaled spectral features are better and have a broader range of search thresholds than linear-scaled features. The low analysis order of the mel-scaled audio features has a search speed that is about 12 times faster, with quite reasonable search accuracy.

Grants

[markov-03:2010]

Daria Vazhenina, Konstantin Markov, and Masahide Sugiyama. Studies regarding an algorithm for sorting of pieces of music by mood, 2010. Contract Research funded by FineArc company.

Academic Activities

[markov-04:2010]

Konstantin Markov, 2010.

Journal paper review, IEICE

[sugiyama-02:2010]

Masahide Sugiyama, 2009. board member of IEEE Sendai Chapter, IEEE

[sugiyama-03:2010]

Masahide Sugiyama, 2009. Reviewer of ASJ Transaction, ASJ

Ph.D., Master and Graduation Theses

[j-huang-06:2010]

Yurie Yoshida. Graduation Thesis: Improvement of the 5-loudspeaker 3-D sound system by vertical loudspeaker arrays, University of Aizu, 2010.

Thesis Adviser: J. Huang

[j-huang-07:2010]

Noriko Watanabe. Graduation Thesis: Evaluation of the 5loudspeaker 3-D sound system with moving sound images, University of Aizu, 2010.

Thesis Adviser: J. Huang

[j-huang-08:2010]

Sachio Watanabe. Graduation Thesis: Environmental sound analysis by colculating the instantaneous frequencies within narrow bands, University of Aizu, 2010.

Thesis Adviser: J. Huang

[j-huang-09:2010]

Takumi Kanbara. Graduation Thesis: Elevation perception related to amplitude modulation of different frequency bands, University of Aizu, 2010.

Thesis Adviser: J. Huang

[j-huang-10:2010]

Teruhiko Suzuki. Graduation Thesis: Evaluation of the 5-loudspeaker 3-D sound system in real ordinary rooms, University of Aizu, 2010.

Thesis Adviser: J. Huang

[j-huang-11:2010]

Hayato Suzuki. Graduation Thesis: Frontal sound localization with two near loudspeakers, University of Aizu, 2010.

Thesis Adviser: J. Huang and S. Guo

[markov-05:2010]

Daria Vazhenina. Master Thesis: Statistical Spoken Language Modeling for morphologically Rich Languages, Aizu University, 2010.

Thesis Adviser: K. Markov

Annual Review 2010 > Division of Information Systems