Operating Systems Laboratory, Annual Review 2007, The University of Aizu


Operating Systems Laboratory


Stanislav Sedukhin Professor	Hitoshi Oi Assistant Professor	Naohito Nakasato Assistant Professor

The Operating Systems Laboratory conducts research mainly in the following areas:

Biology Inspired Cellular Computing

A new 3-dimensional (3-D) computer architecture and orbital highly parallel algorithms for multidimensional separable transforms (discrete Fourier, Hadamard, cosine, etc.) were proposed and studied. The architecture is based on a set of many simple processing elements interconnected by a highly scalable toroidal network. The main building block of the 3-D computer is a micro-cell, which consists of eight (2 × 2 × 2) toroidally interconnected cores (processing elements with a small register file) surrounded by membranes (switches), as in a biological cell. The proposed architecture is well suited to 3-D VLSI implementation with advanced 3-D Through-Silicon Via technology. Our 3-D architecture differs from existing array processors in that it allows all multidimensional data to be input (by some sort of sensor array) and processed at the same time. This characteristic of a 3-D computer requires the parallel implementations of existing algorithms to be redesigned. New highly parallel orbital algorithms for multidimensional separable transforms were designed and investigated. The 3-dimensional algorithms (forward/inverse discrete cosine transform) were specifically designed to support real-time video/image compression. To check the feasibility of the proposed array processor, we implemented an 8 × 8 cellular array processor on an FPGA board and confirmed that it correctly computes several examples, including the 2-D DCT/IDCT and the all-pairs shortest paths problem. In addition, a new data I/O mechanism using external FIFOs was introduced. It makes hardware implementation easier without any modification of the original data processing scheme.
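
To illustrate what "separable" means for the transforms mentioned above, the following is a minimal sequential sketch of the 2-D DCT computed as two 1-D passes, Y = C X Cᵀ, together with its inverse. It assumes a square input block and the orthonormal DCT-II; the function names are hypothetical, and the code does not reproduce the orbital algorithms or the cellular array mapping described in this section.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that y = C x and x = C^T y."""
    c = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        c.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                  for j in range(n)])
    return c

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct2(x):
    """2-D DCT of a square block: a 1-D DCT along each axis, Y = C X C^T."""
    c = dct_matrix(len(x))
    return matmul(matmul(c, x), transpose(c))

def idct2(y):
    """Inverse 2-D DCT: X = C^T Y C, valid because C is orthogonal."""
    c = dct_matrix(len(y))
    return matmul(matmul(transpose(c), y), c)
```

Because each pass is a matrix product, the same structure extends to three dimensions by adding a third pass along the remaining axis.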

Performance Analysis and Modeling of Consolidated Systems

Analysis and Optimization of Network Performance in Xen

System-level virtualization provides various advantages, including independent and isolated computing environments on which multiple operating systems can be executed, and improved resource utilization. These benefits come with performance overheads, and network operation is one of the most typical cases. Using Xen, an open-source virtual machine monitor, we are working on the analysis and optimization of the network performance of virtualized systems. One recent result of this project is a port of large receive offload (LRO) to the physical and virtual network interfaces of a Xen virtualized system. LRO is a technique for reducing the overhead of handling received packets [5, 6]. LRO combines received TCP packets and passes them as a single larger packet to the upper layer of the network stack. Because the number of packet processing operations is reduced, the CPU overhead is lowered and a performance improvement is expected. Current activities in our group include further optimization of the Xen network architecture with LRO and a breakdown of CPU utilization in the network subsystem to identify the performance bottleneck.
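
The coalescing idea behind LRO can be sketched with a toy model: consecutive TCP segments of the same flow whose sequence numbers are contiguous are merged into one larger segment before being handed upward. The `Segment` type, the flow key, and the `max_bytes` limit below are assumptions made for illustration; this is not the Xen or Linux implementation, and a real LRO engine tracks several flows at once and checks TCP flags and header options, which this sketch ignores.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    flow: tuple     # hypothetical flow key, e.g. (src, sport, dst, dport)
    seq: int        # sequence number of the first payload byte
    payload: bytes

def coalesce(segments, max_bytes=65535):
    """Merge each run of in-order segments of one flow into a single segment.

    Simplification: a segment is only compared with the most recently
    emitted one, so interleaved flows are not merged across each other.
    """
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        if (last is not None
                and last.flow == seg.flow
                and last.seq + len(last.payload) == seg.seq
                and len(last.payload) + len(seg.payload) <= max_bytes):
            last.payload += seg.payload   # extend the aggregated segment
        else:
            merged.append(Segment(seg.flow, seg.seq, seg.payload))
    return merged
```

In this model the upper layer sees `len(merged)` packets instead of `len(segments)`, which is the source of the reduced per-packet processing cost.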

Virtual Machine for Sensor Network Nodes

Development of a compiler software for SIMD computers

Study on multi-precision calculations on SIMD computers

To implement a high-precision multi-dimensional integration scheme on the SING processor, we have investigated a library of quadruple-precision (QP) floating-point operations. Since the SING processor, like most conventional CPUs, does not have a QP FPU, QP operations must be emulated. Two possible ways to do the emulation are (i) to use integer operations and (ii) to use double-precision (DP) operations. The former method is more efficient in terms of memory storage, while the latter is easier and faster in most cases. A major obstacle to implementing the former method on the SING processor is that the processor has no branch instructions, which that method uses repeatedly. We have therefore implemented the latter method, emulating QP with DP operations. In this method, one QP word is expressed as the sum of two DP words. Conventionally, one QP addition/subtraction requires 20 DP operations and one QP multiplication requires 23 DP operations. After optimization for the SING processor, our implementation of QP emulation requires 21 operations for addition/subtraction and 41 operations for multiplication. By combining these basic operations, we have implemented QP division, which requires 199 operations. The resulting four basic operations will be used as building blocks for further study of high-precision numerical algorithms on the SING processor.
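
The representation of one QP word as the sum of two DP words can be sketched with the standard branch-free "double-double" building blocks (Knuth's two-sum and Dekker's splitting). This is a generic textbook formulation under the assumption of IEEE 754 binary64 arithmetic, not the laboratory's SING implementation; the function names are hypothetical, and operation counts on the SING processor differ as stated above.

```python
def two_sum(a, b):
    """Error-free addition without branches: a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def fast_two_sum(a, b):
    """Error-free addition, valid when |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e

def split(a):
    """Dekker's split of a double into two non-overlapping halves."""
    t = 134217729.0 * a          # 2**27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo

def two_prod(a, b):
    """Error-free product: a * b == p + e (barring overflow/underflow)."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def dd_add(x, y):
    """Add two (hi, lo) pairs; uses only DP additions and subtractions."""
    s, e = two_sum(x[0], y[0])
    t, f = two_sum(x[1], y[1])
    e += t
    s, e = fast_two_sum(s, e)
    e += f
    return fast_two_sum(s, e)

def dd_mul(x, y):
    """Multiply two (hi, lo) pairs, dropping the lo * lo term."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return fast_two_sum(p, e)
```

None of these routines contains a conditional, which is why this style of emulation suits a processor without branch instructions. As written, `dd_add` performs 20 DP additions and subtractions.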

Refereed Journal Papers

[hitoshi-01:2007]	Hitoshi Oi. Local Variable Access Behavior of a Hardware-Translation Based Java Virtual Machine. Journal of Systems and Software, 2008.
	Accepted for publication. http://dx.doi.org/10.1016/j.jss.2008.03.057
[nakasato-01:2007]	T.K. Suzuki, N. Nakasato, H. Baumgardt, A. Ibukiyama, J. J. Makino, and T. Ebisuzaki. Evolution of Collisionally Merged Massive Stars. Astrophysical Journal, 668:435-448, 2007.
	We investigate the evolution of collisionally merged stars with masses of ~100 M☉ which might be formed in dense star clusters. We assumed that massive stars with several tens of M☉ collide typically ~1 Myr after the formation of the cluster and performed hydrodynamical simulations of several collision events. Our simulations show that after the collisions, merged stars have extended envelopes and their radii are larger than those in the thermal equilibrium states and that their interiors are He-rich because of the stellar evolution of the progenitor stars. We also found that if the mass ratio of merging stars is far from unity, the interior of the merger product is not well mixed and the elemental abundance is not homogeneous. We then followed the evolution of these collision products by a one-dimensional stellar evolution code. After an initial contraction on the Kelvin-Helmholtz (thermal adjustment) timescale (~10^3-10^4 yr), the evolution of the merged stars traces that of single homogeneous stars with corresponding masses and abundances, while the initial contraction phase shows variations which depend on the mass ratio of the merged stars. We infer that, once runaway collisions have set in, subsequent collisions of the merged stars take place before mass loss by stellar winds becomes significant. Hence, stellar mass loss does not inhibit the formation of massive stars with masses of ~1000 M☉.
[sedukhin-01:2007]	A.S. Zekri and S.G. Sedukhin. Level-3 BLAS and LU Factorization on a Matrix Processor. IPSJ Trans. on Advanced Computing Systems, 49(SIG 2 (ASC 21)):37-52, 2008.
	As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as co-processors to general-purpose processors in order to handle the different workloads in scientific, engineering, and signal processing applications. In this paper, we propose a manycore matrix processor model consisting of a scalar unit augmented with b × b simple cores tightly connected in a 2D torus matrix unit to accelerate matrix-based kernels. Data load/store is overlapped with computing using a decoupled data access unit that moves b × b blocks of data between memory and the two scalar and matrix processing units. The operation of the matrix unit is mainly processing fine-grained b × b matrix multiply-add (MMA) operations. We formulate the data alignment operations including matrix transposition and skewing as MMA operations in order to overlap them with data load/store. Two fundamental linear algebra algorithms are designed and analytically evaluated on the proposed matrix processor: the Level-3 BLAS kernel, GEMM, and the LU factorization with partial pivoting, the main step in solving linear systems of equations. For the GEMM kernel, the maximum speed of computing measured in FLOPs/cycle is approached for different matrix sizes, n, and block sizes, b. The speed of the LU factorization for relatively large values of n ranges from around 50-90% of the maximum speed depending on the model parameters. Overall, the analytical results show the merits of using the matrix unit for accelerating the matrix-based applications.

Refereed Proceeding Papers

[hitoshi-02:2007]	Takayuki Hatori and Hitoshi Oi. Implementation and Analysis of Large Receive Offload in a Virtualized System. In Proceedings of the Virtualization Performance: Analysis, Characterization, and Tools (VPACT'08), 2008.
	Austin, TX
[hitoshi-03:2007]	Hitoshi Oi. Hardware Support for a Wireless Sensor Network Virtual Machine. In Proceedings of International Conference on MOBILe Wireless MiddleWARE, Operating Systems and Applications (Mobilware 2008), February 2008.
	Innsbruck, Austria, February 2008.
[nakasato-02:2007]	N. Nakasato, J. Makino, Y. Matsubara, and S. Ebisuzaki. A Compiler for High Performance Adaptive Precision Computing. In Proceedings of SACSIS 2008, 2008.
	We propose and implement a compiler program for high performance adaptive precision computing. Conventional numerical simulations have been done with floating-point double precision operations. However, recently available computing techniques such as SIMD computers or FPGA computers offer us much better performance than DP operations on conventional CPUs. To take advantage of a SIMD computer, we should program it with a special programming language or libraries. Also, each SIMD computer has its own programming language or special techniques that we need to cope with. The proposed compiler defines a new language for easily using SIMD and FPGA computers. From an input source program defined in this language, our compiler generates the required source code with a given numerical precision. We use core routines of two existing compute-intensive applications to evaluate the expected performance of the generated code for a SIMD computer (GRAPE-DR). With one application using double precision operations and the other using quadruple precision operations, we obtain performances of 7.7 × 10¹⁰ and 7.6 × 10⁹ operations per second, respectively.
[sedukhin-02:2007]	A.S. Zekri and S.G. Sedukhin. Evaluating the Performance of Basic Linear Algebra Subroutines on a Torus Array Processor. In N.A., editor, CIT '07: Proceedings of the 7th IEEE International Conference on Computer and Information Technology, pages 300-305, Washington, DC, USA, October 2007. University of Aizu, IEEE Computer Society.
	The basic linear algebra subroutines (BLAS) are standard operations to efficiently solve the linear algebra problems on high performance and parallel systems. In this paper, we study the implementation of some important BLAS operations on an N × N torus array processor. We show that the performance of the Level-3 BLAS represented by the n × n matrix multiply-add operation, n > N, approaches the theoretical peak as n increases since the degree of data reuse is high. In contrast, the performance of Level-1 and Level-2 BLAS operations is low as a result of low data reuse. Fortunately, many applications are based on intensive use of Level-3 BLAS with a small percentage of Level-1 and Level-2 BLAS.
[sedukhin-03:2007]	A.S. Zekri and S.G. Sedukhin. Performance Evaluation of Basic Linear Algebra Subroutines on a Matrix co-processor. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, PPAM 2007 - The 7th International Conference on Parallel Processing and Applied Mathematics, LNCS, vol. 4967, pages 1190-1199, Gdansk, Poland, September 2007. PPAM, Springer-Verlag.
	As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as coprocessors to general-purpose processors in order to handle the different workloads of scientific and signal processing applications. Many kernels in these applications lend themselves to the data-parallel architectures such as array processors. The basic linear algebra subroutines (BLAS) are standard operations to efficiently solve the linear algebra problems on high performance and parallel systems. In this paper, we implement and evaluate the performance of some important BLAS operations on a matrix co-processor. Our analytical model shows the performance of the Level-3 BLAS represented by the n × n matrix multiply-add operation approaches the theoretical peak as n increases since the degree of data reuse is high. However, the performance of Level-1 and Level-2 BLAS operations is low as a result of low data reuse. Fortunately, many applications are based on intensive use of Level-3 BLAS with a small percentage of Level-1 and Level-2 BLAS.
[sedukhin-04:2007]	A.S. Zekri and S.G. Sedukhin. Fine-grained Matrix Multiply-add on a Torus Array Processor. In N.A., editor, Proc. of the 22nd International Conference on Computers and Their Applications (CATA-2007), pages 44-51, Honolulu, USA, March 2007. The International Society for Computers and Their Applications, ISCA.
	In performing the n × n matrix multiply-add operation C=C+A × B on a fine-grain N × N torus array processor, n ≫ N, the matrices are partitioned into blocks of size N so that the whole result is obtained by a sequence of N × N matrix multiply-add operations. When the sizes of matrices are not exact multiples of the array size, the remaining parts may drastically affect the performance depending on the shape of the matrices. Previously, we represented the 3D index space of the N × N matrix multiply-add operation as a 3D torus. The projection method was used to obtain the optimal 2D data allocations to perform the operation on the N × N torus array processor in N multiply-add-roll steps. In this paper, we use the optimal data allocations to present two approaches to deal with the fine-grain blocking of the matrix multiply-add operation. The packing approach performs multiple vector scaling or vector reduction operations together by proper alignment of data inside the array processor and applying the suitable data allocation. The padding approach pads the remaining parts up to the block size N. The analytical experiments show a gained performance of the packing approach over the padding approach when the sizes of the remaining parts are small compared to N.
[sedukhin-05:2007]	K. Matsumoto, D. Vazhenin, and S. Sedukhin. Transitive Closure on the PlayStation 3. In N.A., editor, Proceedings of the 2nd International Workshop on Automatic Performance Tuning (iWAPT 2007), page 33, Tokyo, Japan, September 2007. University of Tokyo, University of Tokyo.
	The problem of finding all the shortest paths in a graph is one of the most important optimizations in operations research as it arises in many real-world applications like bioinformatics, network routing, CAD, etc. The Transitive Closure (TC), or all-pairs shortest paths, computes the length of a minimum-length path between all pairs of nodes in a directed n-node distance graph. The Floyd-Warshall (FW) algorithm is a classical algorithm to solve the TC problem with O(n³) {min, +} operations on O(n²) data. This algorithm involves nested code that exhibits a regular access pattern with significant data dependences. Porting the FW algorithm to different computing platforms usually demonstrates very limited performance compared with linear algebra problems. We present results of our experiments on porting the FW algorithm to the PlayStation 3 (PS3). The parallel algorithm we use is a blocked (tiled) FW algorithm from our previous work. The block size was selected as 64 × 64. The performance comparison of the FW algorithm running on different computing platforms, including our result on PS3, demonstrates an impressive improvement for the TC problem.
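
For reference, the Floyd-Warshall recurrence named in [sedukhin-05:2007] can be sketched in its plain, unblocked form as below. The three nested loops perform the O(n³) {min, +} operations on an n × n distance matrix. This is a generic sequential illustration with a hypothetical function name; the blocked 64 × 64 variant used on the PS3 is not reproduced here.

```python
INF = float("inf")

def floyd_warshall(dist):
    """Return all-pairs shortest path lengths for a dense distance matrix.

    dist[i][j] is the edge length from i to j, INF if there is no edge,
    and 0 on the diagonal. The input matrix is left unmodified.
    """
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):              # allow paths through node k
        dk = d[k]
        for i in range(n):
            dik = d[i][k]
            di = d[i]
            for j in range(n):
                cand = dik + dk[j]  # the "+" of {min, +}
                if cand < di[j]:    # the "min" of {min, +}
                    di[j] = cand
    return d
```

The loop over `k` carries a dependence from one iteration to the next, which is the data dependence the abstract refers to and the reason tiling is less straightforward than for matrix multiplication.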

Unrefereed Papers

[hitoshi-04:2007]	Hitoshi Oi. A Case Study: Performance Evaluation of a DRAM-Based Solid State Disk. In Proceedings of the Japan-China Joint Workshop on Frontier of Computer Science and Technology (FCST 2007), pages 57-60, 2007.
	Wuhan, China

Academic Activities

[hitoshi-05:2007]	Hitoshi Oi, 2006. Reviewer, Microprocessors and Microsystems
[hitoshi-06:2007]	Hitoshi Oi, 2006 to present. Member, Embedded Microprocessor Consortium
[hitoshi-07:2007]	Hitoshi Oi, 2005 to present. Member, IEEE/Computer Society
[hitoshi-08:2007]	Hitoshi Oi, since 2006. Liaison Chair for Asia, ACM International Conference on Computing Frontiers
[hitoshi-09:2007]	Hitoshi Oi, 2005 to present. Member, ACM
[hitoshi-10:2007]	Hitoshi Oi, since 2006. Program Committee Member, Japan-China Joint Workshop on Frontier of Computer Science and Technology (FCST)
[hitoshi-11:2007]	Hitoshi Oi, 2007. Reviewer, IEEE Computer Architecture Letters
[hitoshi-12:2007]	Hitoshi Oi, 2008. Reviewer, MOBILe Wireless MiddleWARE, Operating Systems and Applications (Mobilware 2008)
[hitoshi-13:2007]	Hitoshi Oi, 2007. Track Chair (High Performance Computing), session chair and reviewer, IEEE 7th International Conference on Computer and Information Technology (CIT2007)
[sedukhin-06:2007]	S. Sedukhin, Apr. 2007. IEEE CS, member
[sedukhin-07:2007]	S. Sedukhin, Apr. 2007. ACM, member
[sedukhin-08:2007]	S. Sedukhin, Apr. 2007. IEICE, member
[sedukhin-09:2007]	S. Sedukhin, Apr. 2007. IASTED, Technical Committee on Parallel Processing, member
[sedukhin-10:2007]	S. Sedukhin, Apr. 2007. International Journal of Parallel Processing Letters, Member of the Editorial Board
[sedukhin-11:2007]	S. Sedukhin, Apr. 2007. International Journal of Neural, Parallel & Scientific Computations, Member of the Editorial Board
[sedukhin-12:2007]	S. Sedukhin, Apr. 2007. International Journal of High Performance Systems Architecture, Member of the Editorial Board
[sedukhin-13:2007]	S. Sedukhin, June 2007. The 2007 High Performance Computing & Simulation Conference (HPC&S 2007), Program Committee Member
[sedukhin-14:2007]	S. Sedukhin, August 2007. The 11th Asia-Pacific Computer Systems Architecture Conference (ACSAC007), Korea, Steering and Program Committee Member

Patents

[hitoshi-14:2007]

Kazunori Masuyama (Kanazawa, JP), Yasushi Umezawa (Cupertino, CA, US), Jeremy J. Farrell (Campbell, CA, US), Sudheer Miryala (San Jose, CA, US), Takeshi Shimizu (Sunnyvale, CA, US), Hitoshi Oi (Boca Raton, FL, US), Patrick N. Conway (Los Altos, CA, US). FAULT CONTAINMENT AND ERROR HANDLING IN A PARTITIONED SYSTEM WITH SHARED RESOURCES, 2008.

Ph.D. and Other Theses

[hitoshi-15:2007]	Takayuki Hatori. Implementation and Analysis of Large Receive Offload in a Virtualized System, CSE, 2008.
[hitoshi-16:2007]	Takuya Sato. Simulation Study of a Routing Algorithm in a Wireless Sensor Network, CSE, 2008.
[sedukhin-16:2007]	Kazuya Matumoto. Graduation Thesis: Solving All-Pairs Shortest Path Problem on the PLAYSTATION 3, University of Aizu, 2007.
	Thesis Advisor: S. Sedukhin

Others

[hitoshi-17:2007]	Hitoshi Oi. Invited Talks: "Hardware Support for a Wireless Sensor Network Virtual Machine," at the Centre for High Performance Embedded Systems (CHiPES), Nanyang Technological University, September 25, 2007. "Virtual Machines for Resource-Constrained Platforms," at The "Politehnica" University of Timisoara, June 8, 2007.
[hitoshi-18:2007]	Hitoshi Oi. Visited Nanyang Technological University (Singapore) and The "Politehnica" University of Timisoara (Romania) for exchange program development.
[hitoshi-19:2007]	Hitoshi Oi. (in Japanese, no English title), January 2008. Fuji Xerox White Paper
[hitoshi-20:2007]	GigaExpress, a DRAM-based Solid State Disk, has been provided by courtesy of Fuji Xerox Corporation.