Annual Review 2011 > Division of Computer Engineering

Distributed Parallel Processing Laboratory

Stanislav G. Sedukhin

Professor

Hitoshi Oi

Assistant Professor

Naohito Nakasato

Assistant Professor

Romanenko Alexey

Visiting Researcher

Veles Oleksandr

Visiting Researcher

Current research at the Distributed Parallel Processing Laboratory (DPPL) encompasses:

[Stanislav G. Sedukhin]

[Hitoshi Oi]

[Naohito Nakasato]

[Veles, Oleksandr]

Refereed Journal Papers

[nakasato-01:2011]

H. Daisaka, N. Nakasato, J. Makino, F. Yuasa, and T. Ishikawa. GRAPE-MP: An SIMD Accelerator Board for Multi-precision Arithmetic. Procedia Computer Science, 4:878-887, 2011.

We describe the design and performance of the GRAPE-MP board, an SIMD accelerator board for quadruple-precision arithmetic operations. A GRAPE-MP board houses one GRAPE-MP processor chip and an FPGA chip that handles communication with the host computer. A GRAPE-MP chip has 6 processing elements (PEs) and operates at a clock frequency of 100 MHz. Each PE can perform one addition and one multiplication in every clock cycle. The architecture of the GRAPE-MP is similar to that of the GRAPE-DR. It is implemented using a structured ASIC chip from eASIC Corp. A GRAPE-MP processor board has a theoretical peak quadruple-precision performance of 1.2 Gflops. As a preliminary result, we present the performance of the GRAPE-MP board for two target applications. The performance of the numerical integration of a Feynman loop is 0.53 Gflops. The performance of an N-body simulation with the second-order leapfrog scheme is 0.505 Gflops for N = 1984, which is more than 10 times faster than the host computer.
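The peak figure quoted in this abstract follows from the chip parameters it states, and the two measured results can be compared against it. The short calculation below is only an illustrative check; the function and variable names are hypothetical and do not come from the paper.

```python
def peak_flops(num_pe: int, ops_per_cycle: int, clock_hz: int) -> int:
    """Theoretical peak in floating-point operations per second."""
    return num_pe * ops_per_cycle * clock_hz


# Parameters as stated in the abstract: 6 PEs, one addition plus one
# multiplication per PE per clock cycle, 100 MHz clock.
PEAK = peak_flops(num_pe=6, ops_per_cycle=2, clock_hz=100_000_000)

# Measured results quoted in the abstract, converted to flops.
FEYNMAN = 0.53e9
NBODY = 0.505e9

# Fraction of the theoretical peak reached by each application.
feynman_fraction = FEYNMAN / PEAK
nbody_fraction = NBODY / PEAK

print(f"peak         = {PEAK / 1e9:.1f} Gflops")
print(f"Feynman loop = {feynman_fraction:.0%} of peak")
print(f"N-body       = {nbody_fraction:.0%} of peak")
```

Under these numbers both applications reach somewhat less than half of the theoretical peak.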

[nakasato-02:2011]

K. Matsumoto, N. Nakasato, T. Sakai, H. Yahagi, and S. G. Sedukhin. Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems. Procedia Computer Science, 4:342-351, 2011.

This paper presents the results of our study on double-precision general matrix-matrix multiplication (DGEMM) for GPU-equipped systems. We applied further optimization to utilize the DGEMM stream kernel previously implemented for a Cypress GPU from AMD. We examined the effects of different memory access patterns on the performance of the DGEMM kernel by changing its layout function. The experimental results show that the GEMM kernel with the X-Morton layout function is superior to those with any other layout function in terms of performance and cache hit rate. Moreover, we have implemented a DGEMM routine for large matrices, where not all data can be allocated in GPU memory. Our DGEMM performance reaches up to 472 GFlop/s and 921 GFlop/s on a system using one GPU and two GPUs, respectively.
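A layout function maps a two-dimensional element or block coordinate to a position in linear memory. As a rough illustration of the idea, the sketch below implements the standard Z-order (Morton) mapping by bit interleaving. It is not the X-Morton variant evaluated in the paper, whose definition is not given in this abstract, and the name `z_morton_index` is hypothetical.

```python
def z_morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of row and col (Z-order curve).

    Column bits go to even positions and row bits to odd positions,
    so neighbouring coordinates tend to stay close in linear memory.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


# Linear positions of a 4x4 grid of blocks under the Z-order layout.
layout = [[z_morton_index(r, c) for c in range(4)] for r in range(4)]
for line in layout:
    print(line)
```

Each 2x2 group of neighbouring coordinates lands in four consecutive linear positions, which is the locality property that Morton-style layouts aim for.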

[veles-01:2011]

A. V. Shavrina, I. A. Mikulskaya, S. I. Kiforenko, V. A. Sheminova, A. A. Veles, and O. B. Blum. The Study of Ground-Level Ozone in Kiev and its Impact on Public Health. Kosmichna Nauka i Tekhnologiya (ISSN 1561-8889), 17(1):52-59, 2011.

Ground-level ozone in Kiev during the episode of high concentration in August 2000 is simulated with the urban air pollution model UAM-V.

Refereed Proceedings Papers

[hitoshi-01:2011]

Hitoshi Oi. Power-Performance Analysis of JVM Implementations. In Proceedings of the 5th International Conference on Information Technology and Multimedia (ICIMU), pages 1-7. IEEE/CS Conference Publishing Services, November 2011.

DOI: 10.1109/ICIMU.2011.6122743

[hitoshi-02:2011]

Hitoshi Oi and Kazuaki Takahashi. Performance Modeling of a Consolidated Java Application Server. In Proceedings of 2011 IEEE International Conference on High Performance Computing and Communications (HPCC2011), pages 834-838. IEEE Conference Proceedings, September 2011.

DOI: 10.1109/HPCC.2011.118

[nakasato-03:2011]

K. Matsumoto, N. Nakasato, and S. G. Sedukhin. Blocked All-Pairs Shortest Paths Algorithm for Hybrid CPU-GPU System. In 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), pages 145-152, 2011.

This paper presents a blocked algorithm for the all-pairs shortest paths (APSP) problem for a hybrid CPU-GPU system. In the blocked APSP algorithm, the amount of data communication between CPU (host) memory and GPU memory is minimized. When the problem size (the number of vertices in a graph) is large enough compared with the blocking factor, the blocked algorithm virtually requires the CPU and GPU to exchange two block matrices for each block computation on the GPU. We also estimate the memory/communication bandwidth required to utilize the GPU efficiently. On a system containing an Intel Westmere CPU (Core i7 970) and an AMD Cypress GPU (Radeon HD 5870), our implementation of the blocked APSP algorithm achieves a performance of up to 1 TFlop/s in single precision.
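For readers unfamiliar with blocked APSP, the following is a minimal CPU-only sketch of the well-known three-phase blocked Floyd-Warshall scheme. It is an assumption that this is close in spirit to the algorithm in the paper. It does not model the CPU-GPU data exchange, and the names `update_block` and `blocked_apsp` are hypothetical.

```python
INF = float("inf")


def update_block(dist, i0, j0, k0, bs):
    """Relax block (i0, j0) through the intermediate vertices k0..k0+bs-1."""
    n = len(dist)
    for k in range(k0, min(k0 + bs, n)):
        for i in range(i0, min(i0 + bs, n)):
            for j in range(j0, min(j0 + bs, n)):
                candidate = dist[i][k] + dist[k][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate


def blocked_apsp(dist, bs):
    """In-place blocked Floyd-Warshall on an n x n distance matrix."""
    n = len(dist)
    num_blocks = (n + bs - 1) // bs
    for kb in range(num_blocks):
        k0 = kb * bs
        # Phase 1: the diagonal block depends only on itself.
        update_block(dist, k0, k0, k0, bs)
        # Phase 2: blocks sharing a row or column with the diagonal block.
        for b in range(num_blocks):
            if b != kb:
                update_block(dist, k0, b * bs, k0, bs)
                update_block(dist, b * bs, k0, k0, bs)
        # Phase 3: all remaining blocks, each using one row-panel block
        # and one column-panel block computed in phase 2.
        for ib in range(num_blocks):
            for jb in range(num_blocks):
                if ib != kb and jb != kb:
                    update_block(dist, ib * bs, jb * bs, k0, bs)
    return dist


# Small directed example graph: 0->1 (5), 1->2 (3), 2->3 (1), 0->3 (10), 3->0 (2).
graph = [
    [0, 5, INF, 10],
    [INF, 0, 3, INF],
    [INF, INF, 0, 1],
    [2, INF, INF, 0],
]
result = blocked_apsp([row[:] for row in graph], bs=2)
```

In phase 3 each block update reads only two other blocks, which is consistent with the abstract's remark that a block computation needs two block matrices to be exchanged.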

Unrefereed Papers

[hitoshi-03:2011]

Sho Niboshi and Hitoshi Oi. Performance Analysis of SPECjEnterprise2010. In IPSJ SIG Technical Report, number IPSJ-EVA1103600, pages 1-2. Information Processing Society of Japan, 2011.

[veles-02:2011]

Peter Berczik, Keigo Nitadori, Shiyan Zhong, Rainer Spurzem, Tsuyoshi Hamada, Xiaowei Wang, Ingo Berentzen, Alexander Veles, and Wei Ge. High performance massively parallel direct N-body simulations on large GPU clusters. In Proceedings of the International Conference on High Performance Computing 2011, Kyiv, Ukraine, 2011.

Academic Activities

[hitoshi-04:2011]

Hitoshi Oi, Since 2005.

Professional Member, ACM

[hitoshi-05:2011]

Hitoshi Oi, Since 2005.

Member, IEEE/Computer Society

[hitoshi-06:2011]

Hitoshi Oi, Since 2009.

Academic member of the T-Engine Forum (representative for the University of Aizu). http://www.t-engine.org/

[hitoshi-07:2011]

Hitoshi Oi, Since 2006.

Academic Member, EEMBC

[hitoshi-08:2011]

Hitoshi Oi, Since 2009.

Senior Member, IACSIT

[hitoshi-09:2011]

Hitoshi Oi, March 2012.

Hosted 37th meeting of Information Processing Society of Japan (IPSJ), Special Interest Group on System Evaluation (SIGEVA) at the University of Aizu, on March 30, 2012.

[hitoshi-10:2011]

Hitoshi Oi, Since 2011.

Program committee member and chair of the special session on Network on Chip and Multi-core Technologies (NMT2011). (The conference was postponed to 2012 due to the earthquake.)

[hitoshi-11:2011]

Hitoshi Oi, Since 2011.

Program Committee member, The 10th IEEE International Symposium on Parallel and Distributed Processing with Applications.

Ph.D., Master and Graduation Theses

[nakasato-04:2011]

Yuta Suzuki. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: N. Nakasato

[nakasato-05:2011]

Takafumi Suzuki. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: N. Nakasato

[nakasato-06:2011]

Kou Kaimijima. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: N. Nakasato

[nakasato-07:2011]

Kousuke Nakamura. Graduation thesis, School of Computer Science and Engineering, 2012.

Thesis Adviser: N. Nakasato

[nakasato-08:2011]

Kazuhiro Seiwa. Graduation Thesis: GPU Acceleration of Numerical Simulation of Fluid by the Lattice Boltzmann Method, University of Aizu, 2012.

Thesis Adviser: N. Nakasato

[nakasato-09:2011]

Tsuyoshi Watanabe. Graduation Thesis: Fluid Simulations in Curved Pipes using Smoothed Particle Hydrodynamics on GPU, University of Aizu, 2012.

Thesis Adviser: N. Nakasato

Others

[hitoshi-12:2011]

Hitoshi Oi.

Journal reviewer for Microprocessors and Microsystems (Elsevier) and International Journal of High Performance Systems Architecture (Inderscience Enterprises).