Annual Review 2010 > Division of Computer Engineering

Distributed Pararell Processing Laboratory

Stanislav G. Sedukhin

Professor

Hitoshi Oi

Assistant Professor

Naohito Nakasato

Assistant Professor

Marcin Paprzycki

Visiting Researcher

Current research at the Distributed Parallel Processing Laboratory (DPPL) encompasses:

[Stanislav G. Sedukhin]

[Hitoshi Oi]

[Naohito Nakasaoto]

Refereed Journal Papers

[nakasato-01:2010]

N. Nakasato. Implementation of a Parallel Tree Method on a GPU. Journal of Computational Science, page in press, 2011.

The kd-tree is a fundamental tool in computer science. Among other applications, the application of kd-tree search (by the tree method) to the fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity of these problems is reduced from O(N2) for a brute force method to O(N log N) for the tree method, where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We present a detailed description of how we have implemented the tree method on a Cypress GPU. An optimization that we found important is localized particle ordering to effectively utilize cache memory. We present a number of test results and performance measurements. Our results show that the execution of the tree traversal in a force calculation on a GPU is practical and efficient.

[nakasato-02:2010]

C. Kobayashi and N. Nakasato. CHEMODYNAMICAL SIMULATIONS OF THE MILKY WAY GALAXY. Astrophysical Journal, 729(1):in press, 2011.

We present chemodynamical simulations of a Milky-Way-type galaxy using a selfconsistent hydrodynamical code that includes supernova feedback and chemical enrichment, and predict the spatial distribution of elements from oxygen to zinc. In the simulated galaxy, the kinematical and chemical properties of the bulge, disk, and halo are consistent with the observations. The bulge formed from the assembly of subgalaxies at z 3, and has higher [&A/Fe] ratios because of the small contribution from Type Ia supernovae. The disk formed with a constant star formation over 13 Gyr, and shows a decreasing trend of [&A/Fe] and increasing trends of [(Na,Al,Cu,Mn)/Fe] against [Fe/H]. However, the thick disk stars tend to have higher [&A/Fe] and lower [Mn/Fe] than thin disk stars. We also predict the frequency distribution of elemental abundance ratios as functions of time and location, which can be directly compared with galactic archeology projects such as HERMES.

Refereed Proceedings Papers

[hitoshi-01:2010]

Hitoshi Oi and Sho Niboshi. Workload Analysis of SPECmail2009. In Proceedings of International Symposium on Information Technology 2010 (ITSim 2010), pages 672-677, June 2010.

SPECmail2009 is a new benchmark suite for measuring the performance of corporate mail servers. It assumes the IMAP4 protocol for retrieving messages and a large fraction of transactions are for manipulating hierarchical messages and mail folders from the users connected via high-speed networks. In this paper, we present a performance analysis of SPECmail2009, including disk I/O behavior, quality of service metrics, and CPU time breakdown. The IMAP4's capability of directly manipulating the messages on the server makes SPECmail2009 a highly I/O-bound workload. Our experiments show that a combination of a small but fast access SCSI drive for indexing messages and larger but slower SATA drives for message files can be an option for building a cost-effective mail storage. Both of multiprocessing and multithreading are quite effective for SPECmail2009 especially in the user mode execution. Keywords: Mail server, performance analysis, disk I/O.

[hitoshi-02:2010]

Sho Niboshi and Hitoshi Oi. Application of Fuzzy Control Theory in Resource Management of a Consolidated Server. In Proccedings of Annual International Conference on Cloud Computing & Virtualization 2010 (CCV2010), pages 100-107, May 2010.

A virtualized system incorporates multiple systems into a single physical computer as virtual domains. A lot of data centers and server systems have been organized using virtualization technology to merge several computer systems. On the shared system, resource manager is the key affecting the performance. However, the resource management in current systems does not provide accurate resource allocation, because it only utilizes information from virtual machines and disregards the state of running applications. The paper demonstrates the CPU resource controller taking the state of application as inputs to produce the minimum resource retaining application performance in acceptable level. In particular, it employs two-layered controller. The rst layer controller makes resource request based on the relationship between the state and resource demand of each application, modeled by fuzzy control theory. This approach is efcient to represent resource allocation model since fuzzy control theory deals imprecise and uncertain problems. The second layer con- troller adjusts the requests to the system capacity and builds the layout of resource capacity based on the relative Quality of Service performances between applications. For the separation of resource, common resource controller imposes a hard limit on the amount of resource a given domain can consume. The controller allocates resource with most effective capacity conguration. Under certain specied conditions, the controller does not set the capacities and allows domains to use the free time if the resource is idle. This results in eliminating unused resources and achieves relative high resource usage. Finally, the resource controller is evaluated with a virtualized system, and its advantages over conventional resource allocation methods are shown.

[nakasato-03:2010]

N. Nakasato. A fast GEMM implementation on a Cypress GPU. In 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), pages 1-8, 2010.

We present benchmark results of optimized dense matrix multiplication kernels for Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show 73% and 87% of the theoretical performance of the GPU, respectively. Currently, our SGEMM and DGEMM kernels are fastest with one GPU chip to our knowledge. Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s. This performance in DDP more than 200 times faster than the performance results in DDP on single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with main focus on the SGEMM implementation since all GEMM kernels share common programming and optimization techniques. While a conventional wisdom of GPU programming recommends us to heavily use shared memory on GPUs, we show that texture cache is very effective on the Cypress architecture.

[sedukhin-01:2010]

Stanislav G. Sedukhin, Ahmed S. Zekri, and Toshiaki Myiazak. Orbital Algorithms and Unified Array Processor for Computing 2D Separable Transforms. In Parallel Processing Workshops, International Conference on, pages 127-134, Los Alamitos, CA, USA, September 2010. IEEE Computer Society.

The two-dimensional (2D) forward/inverse discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), discrete Walsh-Hadamard transform (DWHT), play a fundamental role in many practical applications. Due to the separability property, all these transforms can be uniquely defined as a triple matrix product with one matrix transposition. Based on a systematic approach to represent and schedule different forms of the n Based on a systematic approach to represent and schedule different forms of the n ? n n matrix-matrix multiply-add (MMA) operation in 3D index space, we design new orbital highly-parallel/scalable algorithms and present an efficient n highly-parallel/scalable algorithms and present an efficient n ? n unified array n unified array processor for computing any n processor for computing any n ? n forward/inverse discrete separable transform in the n forward/inverse discrete separable transform in the minimal 2n time-steps. Unlike traditional 2D systolic array processing, all n2 registerstored elements of initial/intermediate matrices are processed simultaneously by all n2 processing elements of the unified array processor at each time-step. Hence the proposed array processor is appropriate for applications with naturally arranged multidimensional data such as still images, video frames, 2D data from a matrix sensor, etc. Ultimately, we introduce a novel formulation and a highly-parallel implementation of the frequently required matrix data alignment and manipulation by using MMA operations on the same array processor so that no additional circuitry is needed.

[sedukhin-02:2010]

K. Matsumoto and S. G. Sedukhin. Matrix Multiply-Add in Min-plus Algebra on a Short-Vector SIMD Processor of Cell/B.E. In Networking and Computing (ICNC), 2010 First International Conference on, pages 272-274, Los Alamitos, CA, USA, November 2010. IEEE Computer Society.

It is well-known that the all-pairs shortest paths problem has a similar algorithmic characteristic to the classical matrix-matrix multiply-add (MMA) problem, one of the differences between the two problems is in the underlying algebra: the matrix multiplyadd uses linear (+, uses linear (+, ?)-algebra whereas the all-pairs shortest paths problem uses (min, )-algebra whereas the all-pairs shortest paths problem uses (min, +)-algebra. This paper presents an implementation of 64+)-algebra. This paper presents an implementation of 64?64 matrix multiply-add kernel 64 matrix multiply-add kernel in (min, +)-algebra on a short-vector SIMD processor, the so-called Synergistic Processing Element (SPE), of the Cell Broadband Engine (Cell/B.E.). Our implementation for the shortest paths problem adopts an existing fast algorithm of matrix multiply-add with a reduction of the number of required registers. The MMA implementation in (min, +)-algebra achieves the speed of 8.502 Gflop/s, which is about three times as low as that of the (+, times as low as that of the (+, ?)-algebra MMA and is very close to the theoretical )-algebra MMA and is very close to the theoretical estimation based on the required number of instructions.

[sedukhin-03:2010]

A. A. Ravankar and S. G. Sedukhin. Mesh-of-Tori: A Novel Interconnection Network for Frontal Plane Cellular Processors. In Networking and Computing (ICNC), 2010 First International Conference on, pages 281-284, Los Alamitos, CA, USA, November 2010. IEEE Computer Society.

In this paper we propose a novel Mesh-of-Tori cellular interconnection network for scalable and massively parallel array processors with frontal plane I/O. The unit (called m-Cell) in this topology is the smallest double (for 2D case) or triple (for 3D) folded torus, which forms the basic tile. The Cells can be repeated and fused to form macroCells, or divided to form smaller Cells, without destroying the homogeneity of the entire structure, to give a highly scalable and cellular Mesh-of-Tori topology. The key features of the proposed interconnection network are (1) an excellent up and down scalability due to regularity and modularity, (2) no end-around connections, and (3) capability to map 2D streaming data from frontal plane I/O (stacked layer of sensors) to the processing elements. We also provide solutions for stream data manipulation through frontal plane I/O, on the proposed cellular topology.

Unrefereed Papers

[hitoshi-03:2010]

Kazuaki Takahashi and Hitoshi Oi. Measurement of Virtualization Overhead in a Java Application Server. In IPSJ SIG Technical Report, volume Vol. 2010-EVA-32. IPSJ SIGEVA, 2010.

In this technical report, we present our measurement results of virtualization overhead in SPECjAppServer2004, which is a benchmark suite of a 3-tier J2EE server system. By breaking down the CPU time for each transaction, we predict the total CPU utilization in two different consolidated configurations under three transaction mixes. While predicted CPU utilization draws similar curves to the throughput over the system scaling factor, utilization of each virtual machine exhibits differences between predictions and measurements, especially while two of servers are placed in the same virtual machine.

[hitoshi-04:2010]

Kazuaki Takahashi and Hitoshi Oi. Workload Analysis of a Consolidated Java Application Server. In IPSJ Tech. Report, volume Vol. 2010-EVA33, 2010.

With SPECjAppserver2004 as the target workload, we are currently developing a performance analysis model of a muti-tier application server consolidated by virtualization. In this work-in-progress report, we will present the measurement results including the CPU utilization on each domain, and analysis of per-core workload by comparing the results we reported previously.

[hitoshi-05:2010]

Kazuaki Takahashi and Hitoshi Oi. Measurement of SPECjEnterprise2010. In IPSJ Tech. Report, volume Vol.2011-EVA-34, 2011.

SPECjEnterprise2010 is a benchmark suite from Standard Performance Evaluation Corporation (SPEC) for evaluating the performance of Java EE servers. Similarly to its predecessor SPECjAppserver2004, it is designed with an automobile manufacturer as a model. In this report, we present a brief introduction to SPECjEnterprise2010, a comparison to SPECjAppserver2004, and the current status of our performance measurement of SPECjEnterprise2010.

Academic Activities

[hitoshi-06:2010]

Hitoshi Oi, since 2005. Member, EEE/Computer Society

[hitoshi-07:2010]

Hitoshi Oi, since 2006. Academic Member, EEMBC

[hitoshi-08:2010]

Hitoshi Oi, 2009. Senior Member, IACSIT

[hitoshi-09:2010]

Hitoshi Oi, 2010.

Member and officer, IPSJ SIGEVA

[hitoshi-10:2010]

Hitoshi Oi, 2009 to present.

Program Committee Member, Annual International Conference on Cloud Computing and Virtualization (CCV 2010)

[hitoshi-11:2010]

Hitoshi Oi, 2010.

Program Committee Member, Xen Summit

[hitoshi-12:2010]

Hitoshi Oi, Since 2005. Professional Member, ACM

[sedukhin-04:2010]

S. G. Sedukhin, May 2011.

The Eleventh International Conference on Pattern Recognition and Information Processing, Program Committee Member

[sedukhin-05:2010]

S. G. Sedukhin, Apr. 2010.

IEEE CS, member

[sedukhin-06:2010]

S. G. Sedukhin, Apr. 2010.

ACM, member

[sedukhin-07:2010]

S. G. Sedukhin, Apr. 2010.

IEICE, member

[sedukhin-08:2010]

S. G. Sedukhin, Apr. 2010.

International Journal of Parallel Processing Letters, Member of the Editorial Board

[sedukhin-09:2010]

S. G. Sedukhin, Apr. 2010.

International Journal of High Performance Systems Architecture, Member of the Editorial Board

[sedukhin-10:2010]

S. G. Sedukhin, June 2010.

The 2010 High Performance Computing and Simulation Conference (HPCS 2010), Program Committee Member

[sedukhin-11:2010]

S. G. Sedukhin, September 2010.

The 13-th International Conference on Network-Based Information Systems (NBiS2010), Program Committee Member

[sedukhin-12:2010]

S. G. Sedukhin, Dec. 2010.

The 11 International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Program Committee Member

[sedukhin-13:2010]

S. G. Sedukhin, January 2011.

IEEE Transactions on Circuits and Systems for Video Technology, Reviewer

[sedukhin-14:2010]

S. G. Sedukhin, July 2011.

The 2011 IEEE International Conference on High Performance Computing & Simulation (HPCS 2011), Program Committee Member

[sedukhin-15:2010]

S. G. Sedukhin, Apr. 2010.

International Journal of Neural, Parallel & Scientific Computations, Member of the Editorial Board

Ph.D., Master and Graduation Theses

[hitoshi-13:2010]

Sho Niboshi. Master Thesis: Adaptive Resource Management in a Virtualized System, Graduate School of Computer Science and Engineering, March 2010.

Thesis Adviser: Hitoshi Oi

[hitoshi-14:2010]

Kazuaki Takahashi. Graduation Thesis: Measurement of Virtualization Overhead in a Java Application Server, School of Computer Science and Engineering, March 2011.

Thesis Adviser: Hitoshi Oi how to reach a machine's performance limits. performance model of an integrated system in virtualization. model will predict distinct CPU load. performed using SPECjAppServer2004 as the workload, which is a benchmark suite of a 3-tier J2EE server system is obtained. thecongurations and results will be used for the performance model.

[nakasato-04:2010]

Mari Murakami. Graduation Thesis: Simulations of Merging White Dwarf Stars by Smoothed Particle Hydrodynamics Method, School of Computer Science and Engineering, March 2011.

Thesis Adviser: N.Nakasato

[sedukhin-16:2010]

Syoudai Yokoyama. Porting Matrix Inversion and 2D DFT Algorithms to Cell/B.E., Graduate School of Computer Science and Engineering, March 2011. Master

Thesis Adviser: S. G. Sedukhin

[sedukhin-17:2010]

Tomoya Sakai. Fast Matrix Multiplication on Multicore Processors, School of Computer Science and Engineering, March 2011. Graduation

Thesis Adviser: S. G. Sedukhin

Others

[hitoshi-15:2010]

Hitoshi Oi, December 2010.

Organized a special lecture by Dr. Joâo Pedro Pedroso from University of Porto, Portugal, titled "An introduction to the Traveling Salesman Problem,"