Distributed Parallel Processing Laboratory, Annual Review 2009, The University of Aizu

Distributed Parallel Processing Laboratory


Stanislav Sedukhin Professor	Hitoshi Oi Assistant Professor	Naohito Nakasato Assistant Professor	Ahmed Sherif Abdalla Zekri Visiting Researcher

Current research at the Distributed Parallel Processing Laboratory (DPPL) encompasses:

[Stanislav G. Sedukhin]

Design and evaluation of fine-grained massively-parallel scalable algorithms and architectures for forthcoming micro/macro-electronics technologies;

Collaboration with the Computer Organization Lab. and the Adaptive System Lab. to design scalable array processors in hardware;

Design and evaluation of coarse-grained parallel numerical and graph algorithms for the Cell/B.E. processor.

[Hitoshi Oi]

Adaptive Resource Management in a Virtualized System

Workload analysis of server applications using industry-standard and open-source benchmark programs, such as SPEC and OSDL DBT.

Design and analysis of Java and wireless sensor network virtual machines.

System-level virtualization: performance analysis and modeling of consolidated systems

Performance analysis of I/O subsystems for server workloads

[Naohito Nakasato]

A DSL compiler for Many-core Accelerators

High Precision Computing on Many-core Accelerators

Application of GPU to Numerical Simulations in Astronomy

Refereed Journal Papers

[sedukhin-01:2009]	Kazuya Matsumoto and Stanislav G. Sedukhin. A Solution of the All-Pairs Shortest Paths Problem on the Cell Broadband Engine Processor. IEICE TRANSACTIONS on Information and Systems, Vol.E92-D(6):1225-1231, 2009.
	The All-Pairs Shortest Paths (APSP) problem is a graph problem that can be solved by a program with three nested loops. The Cell Broadband Engine (Cell/B.E.) is a heterogeneous multi-core processor that offers high single-precision floating-point performance. In this paper, a solution of the APSP problem on the Cell/B.E. is presented. To maximize the performance of the Cell/B.E., a blocked algorithm for the APSP problem is used. The blocked algorithm enables reuse of data in registers and utilizes the memory hierarchy. We also describe several optimization techniques for an effective implementation of the APSP solution on the Cell/B.E. The Cell/B.E. achieves 8.45 Gflop/s for the APSP problem using one SPE and 50.6 Gflop/s using six SPEs.
[sedukhin-02:2009]	Stanislav G. Sedukhin, Toshiaki Miyazaki, and Kenichi Kuroda. Orbital Systolic Algorithms and Array Processors for Solution of the Algebraic Path Problem. IEICE TRANSACTIONS on Information and Systems, Vol.E93-D(3):534-541, 2010.
	The algebraic path problem (APP) is a general framework which unifies several solution procedures for a number of well-known matrix and graph problems. In this paper, we present a new 3-dimensional (3-D) orbital algebraic path algorithm and corresponding 2-D toroidal array processors which solve the n × n APP in the theoretically minimal number of 3n time-steps. The coordinated time-space scheduling of the computing and data movement in this 3-D algorithm is based on the modular function, which preserves the main technological advantages of systolic processing: simplicity, regularity, locality of communications, pipelining, etc. Our design of the 2-D systolic array processors is based on a classical 3-D → 2-D space transformation. We have also shown how data manipulation (copying and alignment) can be effectively implemented in these array processors in a massively-parallel fashion by using a matrix-matrix multiply-add operation.
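
As an illustration of the statement in [sedukhin-01:2009] that APSP can be solved with three nested loops, the following is a minimal Python sketch. It assumes the textbook Floyd-Warshall formulation on a dense weight matrix; it is not the blocked Cell/B.E. implementation described in the paper, and the function name `apsp` and the sample graph are hypothetical.

```python
import math

INF = math.inf  # marks "no direct edge" in the weight matrix


def apsp(weights):
    """All-pairs shortest path lengths for a dense n x n weight matrix.

    Illustrative Floyd-Warshall sketch (an assumption of this example),
    not the blocked algorithm used on the Cell/B.E.
    """
    n = len(weights)
    d = [row[:] for row in weights]  # work on a copy
    for k in range(n):          # intermediate vertex
        for i in range(n):      # source vertex
            for j in range(n):  # destination vertex
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


# Hypothetical 4-vertex directed graph used only for demonstration.
example = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shortest = apsp(example)
```

The comparison inside the innermost loop is the kind of branching step that a blocked, register-reusing implementation aims to execute as efficiently as possible.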

Refereed Proceeding Papers

[hitoshi-01:2009]	Hitoshi Oi. A Comparative Study of JVM Implementations with SPECjvm2008. In Proceedings of the 2nd International Conference on Computer Engineering and Applications (ICCEA 2010), pages 351-357, March 2010.
	SPECjvm2008 is a new benchmark program suite for measuring client-side Java runtime environments. It replaces JVM98, which has been used for the same purpose for more than ten years. It consists of 38 benchmark programs grouped into eleven categories and has a wide variety of workloads, from computation-intensive kernels to XML file processors. In this paper, we compare two proprietary Java Virtual Machines (JVMs), HotSpot of Sun Microsystems and JRockit of Oracle, using SPECjvm2008 on three platforms that have CPUs with the same microarchitecture but different clock speeds and cache hierarchies. The wide variation of the SPECjvm2008 benchmark categories, together with the differences in hardware configurations of the platforms, reveals the strong and weak points of each JVM implementation. In the composite SPECjvm2008 performance metrics, JRockit performs 19 to 27% better than HotSpot. This results from JRockit outperforming HotSpot in nine out of eleven categories. However, JRockit is quite weak in JVM initialization, as revealed by the executions of startup.helloworld; the relative performance of JRockit can be as low as 19% of HotSpot. Another remarkable result is that JRockit runs scimark.monte_carlo much faster (up to 285% of HotSpot), which affects the performance metrics of three categories. The relatively higher performance of JRockit on non-startup benchmarks is likely due to the differences in the number of x86 instructions executed in the JVMs, with exceptions in the compiler.* benchmarks. In the startup.* benchmarks, the performance differences should also be due to the numbers of x86 instructions executed, but their effects vary widely from benchmark to benchmark. Keywords: Java Virtual Machine, Workload Analysis, Performance Evaluation.
[hitoshi-02:2009]	Fumio Nakajima and Hitoshi Oi. Optimizations of Large Receive Offload in Xen. In Proceedings of The 8th IEEE International Symposium on Network Computing and Applications (IEEE NCA09), pages 314-318, July 2009.
	Xen provides logically independent computing environments (domains), and I/O devices can be multiplexed so that each domain behaves as if it had its own instances of the I/O devices. These benefits come with a performance overhead, and the network interface is one of the most typical cases. Previously, we ported large receive offload (LRO) into the physical and virtual network interfaces of Xen and evaluated its effectiveness. In this paper, two optimizations are attempted to further improve the network performance of Xen. First, copying packets at the bridge within the driver domain is eliminated. The aggregated packets are flushed to the upper layer in the network stack when the kernel polls the network device driver. Our second optimization is to increase the number of aggregated packets by waiting for every other polling before flushing the packets. Compared to the original LRO, the first optimization reduces the packet handling overhead in the driver domain from 13.4 to 13.0 (clock cycles per transferred byte). However, it also increases the overhead in the guest domain from 7.1 to 7.7, and the overall improvement in throughput is negligible. The second optimization reduces the overhead in the driver and guest domains from 13.4 to 3.3 and from 7.1 to 5.9, respectively. The receive throughput is improved from 577 Mbps to 748 Mbps.
[nakasato-01:2009]	N. Nakasato and J. Makino. A Compiler for High Performance Computing With Many-Core Accelerators. In Workshop on Parallel Programming on Accelerator Clusters, 2009.
	We introduce a newly developed compiler for high performance computing using many-core accelerators. The high peak performance of such accelerators attracts researchers who are always demanding faster computers. However, it is difficult to create an efficient implementation of an existing serial program for such accelerators, even in the case of massively parallel problems. While existing parallel programming tools force us to program every detail of an implementation, from loop-level parallelism to 4-vector SIMD operations, our novel approach is that, given a compute-intensive problem expressed as a nested loop, the compiler only asks us to define a compute kernel inside the innermost loop. We observe that input variables appearing in the kernel can be classified into two types: those invariant during the loop and those updated in each iteration. The compiler lets us specify the type of each input so that it can pick a predefined optimal way to process them. The compiler successfully generates the fastest code ever for many-particle simulations, with a performance of 500 GFLOPS (single precision) on the RV770 GPU. Another successful application is the evaluation of a multi-dimensional integral. It runs at a speed of 4 GFLOPS (quadruple precision) on both GRAPE-DR and GPU.
[sedukhin-03:2009]	Shodai Yokoyama, Kazuya Matsumoto, and Stanislav G. Sedukhin. Matrix Inversion on the Cell/B.E. Processor. In 10th IEEE International Conference on High Performance Computing and Communications, pages 148-153, Los Alamitos, CA, USA, June 2009. IEEE Computer Society.
	Matrix inversion arises in a number of problems of practical importance. This paper introduces and evaluates a block algorithm for high performance matrix inversion on the Cell Broadband Engine (Cell/B.E.) processor. The Cell/B.E. is a heterogeneous multi-core processor on a single chip, jointly developed by Sony, Toshiba and IBM, which has very fast single-precision floating-point arithmetic. The discussed matrix inversion algorithm is a combination of the block Algebraic Path Problem algorithm and the well-known block matrix inversion algorithm based on the LU decomposition. For relatively big matrices, this combined block algorithm spends most of its time computing matrix-matrix multiplication of blocks and achieves 149.4 Gflop/s on the Cell/B.E. when the PPE and six SPEs of the PlayStation3 are used, or 93.4% of the aggregated double (PPE) and single (SPEs) precision peak performance, which is 160.0 Gflop/s.
[sedukhin-04:2009]	Stanislav G. Sedukhin and Toshiaki Miyazaki. Rapid*Closure: Algebraic Extensions of a Scalar Multiply-add Operation. In 25th ISCA International Conference on Computers and Their Applications, pages 19-24, Honolulu, USA, March 2010. ISCA.
	One of the outstanding characteristics of scalar fused multiply-add (FMA) and multiply-accumulate (MAC) operations is that they halve the number of operations required for an (n × n) matrix-matrix multiplication, from 2n^3 multiplications and additions to exactly n^3 scalar FMA/MAC operations. Existing advanced processors with FMA/MAC units greatly reduce the solution time of many scientific, engineering, and multimedia problems which are based on linear algebra (matrix) transforms. In this paper we show that there are other forms of matrix-matrix multiply-add in different algebraic semirings, which are used intensively in many real-world problems outside linear algebra. These problems suffer from the absence of corresponding hardware support and exhibit a relatively low computing speed because one or two logical (branching) operations are introduced in the underlying kernels. The proposed fusion of scalar multiply-add operations for different semirings will eliminate branching and lead to performance equal to that of linear algebra. The implementation of the suggested fused operations is simple and will only slightly increase the complexity of a conventional FMA/MAC unit. Our experiments on the Sony/Toshiba/IBM Cell/B.E. processor demonstrate that adding the algebraic multiply-add scalar extensions to the existing FMA unit can remarkably (3× to 4×) increase the performance of many important problems.
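
To make the idea of a matrix multiply-add over different semirings in [sedukhin-04:2009] concrete, here is a minimal Python sketch. It only shows the software-level pattern c = c (+) a (x) b with interchangeable operations; it does not model the proposed hardware extension. The names `SEMIRINGS` and `matmul_add`, and the choice of the three semirings shown, are assumptions made for this example.

```python
import operator

# Hypothetical table of (addition, multiplication) pairs, one per semiring.
SEMIRINGS = {
    "linear": (operator.add, operator.mul),   # ordinary linear algebra
    "min_plus": (min, operator.add),          # shortest-path lengths
    "or_and": (operator.or_, operator.and_),  # Boolean reachability
}


def matmul_add(c, a, b, semiring):
    """Return C (+) A (x) B, with (+) and (x) taken from the chosen semiring."""
    add, mul = SEMIRINGS[semiring]
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # leave the caller's C untouched
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # One scalar multiply-add step in the selected semiring.
                out[i][j] = add(out[i][j], mul(a[i][k], b[k][j]))
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
inf = float("inf")

linear = matmul_add([[0, 0], [0, 0]], a, b, "linear")
min_plus = matmul_add([[inf, inf], [inf, inf]], a, b, "min_plus")
```

In the (min, +) case, `min` hides a comparison that compiles to a branch or select on hardware without a dedicated fused operation, which is the overhead the abstract refers to.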

Unrefereed Papers

[hitoshi-03:2009]	Sho Niboshi and Hitoshi Oi. Application of Fuzzy Control Theory to Resource Management in a Virtualized System. In IPSJ SIG Technical Report, volume Vol.2009-EVA-30. IPSJ SIGEVA, 2009.
	In recent years, companies have chosen virtualization technology for the integration of computer systems. A virtualized system makes effective use of computer resources by reducing the number of computers and the financial costs of administration. In a virtualized system, all resources must be controlled by a resource manager, which allocates shared resources to each domain. Since this control method affects the entire virtualized system, the allocation method needs to be more effective. However, the resource manager in current systems only utilizes information from the operating system on each domain and lacks the running state of the applications running on it. We are developing a resource controller that adapts to dynamic workloads and application states on a virtualized system. The relationship between the applications' running states and their required resources is often vague, complex and empirical. To represent such a relationship, we use fuzzy control theory to maximize the total system performance. This technical report gives a brief description of our controller and its experimental results.
[nakasato-02:2009]	N. Nakasato, T. Ishikawa, J. Makino, and F. Yuasa. Fast Quad-Precision Operations on Many-core Accelerators (in Japanese). In Proceedings of SWoPP 2009, pages 1-8, August 2009.

Academic Activities

[hitoshi-04:2009]	Hitoshi Oi, 2009 to present. Program Committee Member, Annual International Conference on Cloud Computing and Virtualization (CCV 2010)
[hitoshi-05:2009]	Hitoshi Oi, Since 2005. Professional Member, ACM
[hitoshi-06:2009]	Hitoshi Oi, Since 2005. Member, IEEE/Computer Society
[hitoshi-07:2008]	Hitoshi Oi, Since 2005. Academic Member, EEMBC
[hitoshi-08:2009]	Hitoshi Oi, 2009. Publicity Chair, SocialNet 2009, December 12-14, 2009, Chengdu, China
[hitoshi-09:2009]	Hitoshi Oi, 2009. Senior Member, IACSIT
[hitoshi-10:2009]	Hitoshi Oi, 2010. Program Committee Member, Xen Summit
[sedukhin-05:2009]	S. Sedukhin, Apr. 2009. IEEE CS, member
[sedukhin-06:2009]	S. Sedukhin, Apr. 2009. ACM, member
[sedukhin-07:2009]	S. Sedukhin, Apr. 2009. IEICE, member
[sedukhin-08:2009]	S. Sedukhin, Apr. 2009. International Journal of Neural, Parallel & Scientific Computations, Member of the Editorial Board
[sedukhin-09:2009]	S. Sedukhin, Apr. 2009. International Journal of Parallel Processing Letters, Member of the Editorial Board
[sedukhin-10:2009]	S. Sedukhin, Apr. 2009. International Journal of High Performance Systems Architecture, Member of the Editorial Board
[sedukhin-11:2009]	S. Sedukhin, June 2010. The 2010 High Performance Computing and Simulation Conference (HPCS 2010), Program Committee Member
[sedukhin-12:2009]	S. Sedukhin, September 2010. The 13th International Conference on Network-Based Information Systems (NBiS-2010), Program Committee Member
[sedukhin-13:2009]	S. Sedukhin, November 2009. Mathematical and Computer Modelling, an International Journal, Reviewer
[sedukhin-14:2009]	S. Sedukhin, February 2009. The Eighth IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2009), Austria, Program Committee Member
[sedukhin-15:2009]	S. Sedukhin, September 2009. The 4th International Symposium on Embedded MultiCore Systems-on-Chip (MCSoC-09), Program Committee Member
[sedukhin-16:2009]	S. Sedukhin, March 2010. 25th ISCA International Conference on Computers and Their Applications, Program Committee Member

Ph.D, Master and Graduation Theses

[hitoshi-11:2009]	Sho Niboshi. Master Thesis: Adaptive Resource Management in a Virtualized System, CSE, 2010. Thesis Advisor: Hitoshi Oi
[nakasato-04:2009]	Koui Watanabe. Graduation Thesis: Simulation of Collision Between Elastic Material, University of Aizu, 2010. Thesis Advisor: N. Nakasato
[nakasato-05:2009]	Wataru Horie. Graduation Thesis: An Interactive Galaxy Simulation Program Using GPU, University of Aizu, 2010. Thesis Advisor: N. Nakasato
[nakasato-06:2009]	Hiroki Ishikawa. Graduation Thesis: Gomoku Program Using Monte Carlo Methods, University of Aizu, 2010. Thesis Advisor: N. Nakasato
[nakasato-07:2009]	Yoshiyuki Abe. Graduation Thesis: A Consultation Algorithm for Shogi Using the Bonanza Library, University of Aizu, 2010. Thesis Advisor: N. Nakasato
[sedukhin-17:2009]	Kazuya Matsumoto. Master Thesis: A Solution of the Algebraic Path Problem on the Multicore CELL Processor, University of Aizu, 2009.
	Research Adviser: S. Sedukhin
[sedukhin-18:2009]	Yusuke Kobayashi. Master Thesis: Orbital Algorithms for 3-Dimensional Separable Transforms, University of Aizu, 2009.
	Thesis Adviser: S. Sedukhin