Annual Review 2011 > Division of Information Systems

Human Interface Laboratory

Masahide Sugiyama

Professor

Jie Huang

Associate Professor

Konstantin Markov

Associate Professor

The Human Interface Laboratory consists of three faculty members. Each member has his own research and educational interests and conducts independent research activities:

Prof. Masahide Sugiyama:

  1. Applications of similar segment search algorithm to various problems

  2. Development of efficient similar segment search algorithms

  3. Efficient histogram generation algorithm

  4. Participation in Open Campus with laboratory demonstrations. Promotion of a series of computer training courses for visually handicapped people in the Fukushima region.

Prof. Jie Huang:

  1. Computational Auditory Scene Analysis

    To investigate the integration and segregation factors for the self-organization of sound components in the human auditory system through psychological experiments.

  2. Robotic Spatial Sound Localization and Sound-based Position Identification

    Audition is as important for mobile robots as vision and other sensing systems. With an audition system, a robot can detect a target and identify its position by sound even in darkness. The auditory system can also complement and cooperate with vision systems.

  3. Research on 3-D Sound Systems

    • Speaker based 3-D sound systems

      We are developing a 3-D sound system that uses horizontally arranged loudspeakers, the same arrangement as existing 5.1-channel home theater systems.

    • Analysis of Head Related Transfer Functions

    • Reverberation and reality in 3-D sound systems

Prof. Konstantin Markov:

Refereed Journal Papers

[j-huang-01:2011]

X. Guo, Y. Toyoda, H. Li, J. Huang, S. Ding, and Y. Liu. Environmental sound recognition using time-frequency intersection patterns. Applied Computational Intelligence and Soft Computing, vol. 2012, Article ID 650818, Feb. 2012.

Environmental sound recognition is an important function of robots and intelligent computer systems. In this research, we use a multistage perceptron neural network system for environmental sound recognition. The input data is a combination of a time-variance pattern of instantaneous powers and a frequency-variance pattern with the instantaneous spectrum at the power peak, referred to as a time-frequency intersection pattern. The spectra of many environmental sounds change more slowly than those of speech or voice, so the intersectional time-frequency pattern preserves the major features of environmental sounds with drastically reduced data requirements. Two experiments were conducted using an original database and an open database created by the RWCP project. The recognition rate for 20 kinds of environmental sounds was 92%. The recognition rate of the new method was about 12% higher than that obtained using only an instantaneous spectrum. The results are also comparable with those of HMM-based methods, although those methods must treat the time variance of an input vector series with more complicated computations.
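As a concrete illustration of the feature construction described above, the following minimal Python sketch builds a time-frequency intersection pattern from a mono signal. It is not the authors' code; the frame length, hop size, and pattern dimensions are assumptions chosen for illustration.

    # Sketch only: combines a time-variance power pattern with the
    # instantaneous spectrum at the power peak, as described above.
    import numpy as np

    def tf_intersection_pattern(signal, frame_len=512, hop=256,
                                n_time=50, n_freq=64):
        # Frame the signal and compute the instantaneous power per frame.
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i*hop : i*hop + frame_len]
                           for i in range(n_frames)])
        power = (frames ** 2).mean(axis=1)

        # Time-variance pattern: the power envelope sampled at n_time points.
        t_idx = np.linspace(0, n_frames - 1, n_time).astype(int)
        time_pattern = power[t_idx] / (power.max() + 1e-12)

        # Frequency-variance pattern: spectrum at the peak-power frame.
        peak_frame = frames[power.argmax()] * np.hanning(frame_len)
        spec = np.abs(np.fft.rfft(peak_frame))
        f_idx = np.linspace(0, len(spec) - 1, n_freq).astype(int)
        freq_pattern = spec[f_idx] / (spec.max() + 1e-12)

        # The concatenated pattern is the input to the perceptron network.
        return np.concatenate([time_pattern, freq_pattern])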

[j-huang-02:2011]

H. Li, Y. Luo, J. Huang, T. Kanemoto, M. Guo, and F. Tang. New acoustic monitoring method using cross-correlation of primary frequency spectrum. J. Ambient Intelligence and Humanized Computing, 3(1), Jan. 2012.

Acoustic data remotely measured by microphones are widely used to monitor and diagnose the integrity of ball bearings in rotating machines. Early fault diagnosis from acoustic emission is very difficult. We propose a new method that uses the cross-correlation of frequency spectra to classify various faults with fine grit. Principal component analysis (PCA) is used to separate the primary frequency spectrum into a main frequency spectrum and a residual frequency spectrum. Unlike conventional classification using the PCA eigenvalues, we introduce the general cross-correlation (GCC) of the main and residual frequency spectra between a basic signal vector and the monitoring signal. A multi-classification strategy based on a binary-tree support vector machine (SVM) is applied to perform fault diagnosis. In order to remove noise interference and increase robustness, a normalization method is proposed during time generation. Experimental results show that the PCA-GCC-SVM method is able to diagnose various faults with high sensitivity.
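The pipeline above can be summarized in a short hedged sketch. The component count and the plain SVC standing in for the binary-tree SVM are assumptions, not the paper's implementation.

    # Sketch of the PCA / GCC / SVM stages described above.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    def split_spectrum(spectra, n_main=3):
        # PCA separates each spectrum into a main part (spanned by the
        # top principal components) and a residual part (the remainder).
        pca = PCA(n_components=n_main).fit(spectra)
        main = pca.inverse_transform(pca.transform(spectra))
        return main, spectra - main

    def gcc_feature(basic, spectra):
        # Normalized cross-correlation of each monitoring spectrum with
        # the basic (reference) signal vector.
        b = (basic - basic.mean()) / (basic.std() + 1e-12)
        s = (spectra - spectra.mean(axis=1, keepdims=True)) / \
            (spectra.std(axis=1, keepdims=True) + 1e-12)
        return (s * b).mean(axis=1, keepdims=True)

    # spectra: (n_samples, n_bins) power spectra; y: fault class labels.
    # main, resid = split_spectrum(spectra)
    # X = np.hstack([gcc_feature(ref_main, main), gcc_feature(ref_resid, resid)])
    # clf = SVC(kernel="rbf").fit(X, y)  # stand-in for the binary-tree SVM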

[j-huang-03:2011]

H. Li, J. Huang, M. Guo, and Q. Zhao. Spatial localization of concurrent multiple sound sources using phase candidate histogram. J. Advanced Computational Intelligence and Intelligent Informatics, 15(9):1277-1286, Aug. 2011.

Mobile robots communicating with people would benefit from being able to detect sound sources to help localize interesting events in real-life settings. We propose using a spherical robot with four microphones to determine the spatial locations of multiple sound sources in ordinary rooms. The arrival temporal disparities from phase difference histograms are used to calculate the time differences. A precedence-effect model suppresses the influence of echoes in reverberant environments. To integrate the spatial cues of different microphones, we map the correlation between different microphone pairs onto a 3-D map corresponding to the azimuth and elevation of the sound source direction. Experimental results indicate that, with the Echo Avoidance (EA) model, the proposed system provides the sound source distribution clearly and precisely, even for concurrent sources in reverberant environments.
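A minimal sketch of the histogram idea for a single microphone pair follows. The unwrapping range, delay limit, and bin count are illustrative assumptions, and the precedence-effect (EA) model is omitted.

    # Sketch: time-delay estimation from a phase difference histogram.
    import numpy as np

    def tdoa_from_phase_histogram(x1, x2, fs, max_delay=0.5e-3, n_bins=64):
        X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
        freqs = np.fft.rfftfreq(len(x1), 1.0 / fs)
        phase = np.angle(X1 * np.conj(X2))   # phase difference per bin

        candidates = []
        for f, p in zip(freqs[1:], phase[1:]):   # skip DC
            # Phase is ambiguous modulo 2*pi; every wrap gives one
            # candidate time delay for this frequency bin.
            for k in range(-3, 4):
                tau = (p + 2 * np.pi * k) / (2 * np.pi * f)
                if abs(tau) <= max_delay:
                    candidates.append(tau)

        # The histogram peak is the delay supported by most frequency bins.
        hist, edges = np.histogram(candidates, bins=n_bins,
                                   range=(-max_delay, max_delay))
        i = hist.argmax()
        return 0.5 * (edges[i] + edges[i + 1])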

Refereed Proceedings Papers

[markov-01:2011]

K. Markov. Towards continuous online learning based cognitive speech processing. In Proc. International Workshop on Statistical Machine Learning for Speech Processing. IEEE, March 2012.

Despite the substantial progress of speech processing technology, it is generally acknowledged that we have a long way to go before developing ASR systems whose performance approaches that of humans. Many researchers believe that simply extending our current theories and practical solutions may never lead us to that goal. One promising research direction is the development of learning algorithms exhibiting human-like learning behavior. There is an apparent discrepancy between the way humans acquire their language and the way we train our systems: humans are "learning machines" while our current systems are actually "learned machines". The ability to learn and reason in a continuing loop is attributed to the emerging cognitive systems. In this paper, we present our approach and ideas for future research in developing a cognitive speech processing system. This system has a hierarchical structure where each layer works according to the same algorithm but represents a different space or level of abstraction. The lowest layer corresponds to the acoustic space, and the other layers to the phonetic, word, and phrase spaces, respectively. The information between layers flows in both directions: bottom-up during recognition and top-down during generation, i.e., synthesis. Development of the full system poses multiple research challenges and problems, which are discussed in this paper.
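The layered design can be caricatured in a few lines of Python. The class and method names below are mine, not the paper's; the sketch only illustrates the shared-algorithm, bidirectional-flow idea.

    # Toy sketch: identical layers over different spaces of abstraction,
    # with bottom-up recognition and top-down generation passes.
    class Layer:
        def __init__(self, name, encode, decode):
            self.name = name
            self.encode = encode   # lower-level units -> this layer's units
            self.decode = decode   # this layer's units -> lower-level units

    def recognize(layers, acoustic_input):
        # Bottom-up: acoustic -> phonetic -> word -> phrase.
        rep = acoustic_input
        for layer in layers:
            rep = layer.encode(rep)
        return rep

    def generate(layers, phrase):
        # Top-down (synthesis): phrase -> word -> phonetic -> acoustic.
        rep = phrase
        for layer in reversed(layers):
            rep = layer.decode(rep)
        return rep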

[markov-02:2011]

D. Vazhenina and K. Markov. Phoneme set selection for Russian speech recognition. In Proc. 7th International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), pages 475-478. IEEE, Nov. 2011.

In this paper, we describe a method for phoneme set selection based on a combination of phonological and statistical information and its application to Russian speech recognition. For the Russian language, the currently used phoneme sets are mostly rule-based or heuristically derived from the standard SAMPA or IPA phonetic alphabets. However, for some other languages, statistical methods have been found useful for phoneme set optimization. In Russian, almost all phonemes come in pairs: consonants can be hard or soft, and vowels stressed or unstressed. We start with a big phoneme set and then gradually reduce it by merging phoneme pairs. The decision of which pair to merge is based on phonetic pronunciation rules and on statistics obtained from the confusion matrix of phoneme recognition experiments. Applying this approach to the IPA Russian phonetic set, we first reduced it to 47 phonemes, which were used as the initial set in the subsequent speech model training. Based on the phoneme confusion results, we derived several other phoneme sets with different numbers of phonemes, down to 27. Speech recognition experiments using these sets showed that the reduced phoneme sets are better than the initial phoneme set for phoneme recognition and as good for word-level speech recognition. (Best paper award.)
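The greedy merging loop can be sketched as follows. The data structures (a list of rule-allowed merge pairs and a confusion-count dictionary) are assumptions made for illustration, not the paper's implementation.

    # Sketch: confusion-driven reduction of a phoneme set by merging
    # rule-allowed pairs (hard/soft consonants, stressed/unstressed vowels).
    def reduce_phoneme_set(phonemes, allowed_pairs, confusion, n_target):
        phonemes = list(phonemes)
        pairs = list(allowed_pairs)
        while len(phonemes) > n_target and pairs:
            # Merge the pair the recognizer confuses most often.
            a, b = max(pairs, key=lambda p: confusion.get(p, 0)
                                            + confusion.get((p[1], p[0]), 0))
            pairs.remove((a, b))
            if b in phonemes:
                phonemes.remove(b)   # fold b into a
        return phonemes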

[markov-03:2011]

K. Markov and T. Matsui. Music genre classification using self-taught learning via sparse coding. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1929-1932. IEEE, March 2012.

The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning approach, where the unlabeled data can be different but nevertheless have a similar structure. First, a representation is learned from the unlabeled data via sparse coding, and then it is applied to the labeled data used for classification. In this work, we implemented this method for the music genre classification task using two different databases: one as an unlabeled data pool and the other for supervised classifier training. Music pieces come from 10 and 6 genres for each database respectively, while only one genre is common to both of them. Results from a wide variety of experimental settings show that the self-taught learning method improves the classification rate when the amount of labeled data is small and, more interestingly, that consistent improvement can be achieved for a wide range of unlabeled data sizes. (Joint research with colleagues from ISM, Tokyo.)
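The recipe maps naturally onto standard tools. The hedged sketch below uses scikit-learn's dictionary learning as the sparse coder; the atom count and sparsity penalty are arbitrary choices, not the paper's values.

    # Sketch of self-taught learning: learn a sparse-coding dictionary on
    # unlabeled features, encode the labeled set, train a classifier.
    from sklearn.decomposition import DictionaryLearning
    from sklearn.svm import LinearSVC

    def self_taught_classifier(X_unlabeled, X_labeled, y,
                               n_atoms=128, alpha=1.0):
        # 1. Representation learned from the unlabeled data only.
        dico = DictionaryLearning(n_components=n_atoms, alpha=alpha,
                                  transform_algorithm="lasso_lars")
        dico.fit(X_unlabeled)
        # 2. Labeled data re-expressed as sparse codes over the dictionary.
        codes = dico.transform(X_labeled)
        # 3. Supervised genre classifier trained on the sparse codes.
        return dico, LinearSVC().fit(codes, y)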

[markov-04:2011]

D. Vazhenina, I. Kipyatkova, A. Karpov, and K. Markov. State-of-the-art speech recognition technologies for Russian language. In Proc. Joint International Conference on Human-Centered Computer Environments (HCCE), pages 59-63. ACM, 2012.

In this paper, we present a review of the latest developments in Russian speech recognition research. Although the underlying speech technology is mostly language-independent, differences between languages with respect to their structure and grammar have a substantial effect on recognition system performance. The Russian language has a complicated word formation system, which is characterized by a high degree of inflection and a non-rigid word order. This greatly reduces the predictive power of conventional language models and consequently increases the error rate. The current statistical approach to speech recognition requires large amounts of both speech and text data. Several Russian speech databases exist, and their descriptions are given in this paper. In addition, we describe and compare several speech recognition systems developed in Russia as well as in some other countries. Finally, we suggest some promising directions for further research in Russian speech technology. (Joint paper with researchers from St. Petersburg, Russia.)

Academic Activities

[sugiyama-01:2011]

Masahide Sugiyama, 2011.

Board member of the IEEE Sendai Chapter, IEEE.

Ph.D., Master and Graduation Theses

[j-huang-04:2011]

Yuki Funyu. Elevation localization in the frontal area using two near loudspeakers. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[j-huang-05:2011]

Yuji Sato. Frontal localization using headphones with different level changes in subbands. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[j-huang-06:2011]

Hiroaki Sakai. A new reverberation generation method for 3-D sound systems using horizontally arranged loudspeakers. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[j-huang-07:2011]

Tatsuya Fujino. Evaluation of the 5-loudspeaker 3-D sound system with different sound scenes. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[j-huang-08:2011]

Kouki Yuza. Sound signal generation by analysis of narrow band instantaneous frequencies - analysis and generation of vowels -. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[j-huang-09:2011]

Yoshiyuki Morikawa. Reverberation generation by the image sound source method for loudspeaker based 3-D sound systems. Master's thesis, Graduate School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[j-huang-10:2011]

Tetsuya Watanabe. Elevation perception in the median plane for virtual sound images with different frequency characteristics. Master's thesis, Graduate School of Computer Science and Engineering, 2012.

Thesis Adviser: J. Huang

[markov-05:2011]

K. Fijii. Personal Footstep Identification System. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: Konstantin Markov

[markov-06:2011]

J. Yanai. Markov Model based Human-Computer Dialog Manager. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: Konstantin Markov

[markov-07:2011]

E. Meguro. Analysis and Improvement of Query Detection Start Time in Audio Search. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: Konstantin Markov