Professor 
Assistant Professor 
Assistant Professor 
Visiting Researcher 
Current research at the Distributed Parallel Processing Laboratory (DPPL) encompasses: 
[Stanislav G. Sedukhin] 
[Hitoshi Oi] 
[Nakasato, Naohito] 
[sedukhin01:2009] 
Kazuya Matsumoto and Stanislav G. Sedukhin. A Solution of the
All-Pairs Shortest Paths Problem on the Cell Broadband Engine Processor.
IEICE Transactions on Information and Systems, Vol.E92-D(6):1225-1231, 2009. 
The All-Pairs Shortest Paths (APSP) problem is a graph problem which can be solved
by a three-nested loop program. The Cell Broadband Engine (Cell/B.E.) is a heterogeneous
multicore processor that offers high single-precision floating-point performance.
In this paper, a solution of the APSP problem on the Cell/B.E. is presented.
To maximize the performance of the Cell/B.E., a blocked algorithm for the APSP
problem is used. The blocked algorithm enables reuse of data in registers and utilizes
the memory hierarchy. We also describe several optimization techniques for an effective
implementation of the APSP problem on the Cell/B.E. The Cell/B.E. achieves a
performance of 8.45 Gflop/s for the APSP problem by using one SPE and 50.6 Gflop/s
by using six SPEs. 
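The three-nested loop program the abstract refers to is the classic Floyd-Warshall kernel. The minimal scalar sketch below illustrates that kernel only; it does not reproduce the paper's blocked, Cell/B.E.-tuned implementation.

```python
# Classic three-nested-loop APSP kernel (Floyd-Warshall), operating on an
# adjacency matrix of edge weights. INF marks "no edge".
INF = float("inf")

def floyd_warshall(dist):
    """In-place all-pairs shortest paths.

    dist[i][j] is the weight of edge i -> j (INF if absent, 0 on the
    diagonal). On return, dist[i][j] is the shortest path length."""
    n = len(dist)
    for k in range(n):            # intermediate vertex
        for i in range(n):        # source
            for j in range(n):    # destination
                # Relax the path i -> k -> j. This (min, +) update is the
                # semiring analogue of a scalar multiply-add.
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

Blocking this kernel, as the paper does, amounts to applying the same update at the granularity of submatrices so that each block is reused from registers and local store rather than re-fetched from memory.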

[sedukhin02:2009] 
Stanislav G. Sedukhin, Toshiaki Miyazaki, and Kenichi Kuroda.
Orbital Systolic Algorithms and Array Processors for Solution of the Algebraic
Path Problem. IEICE Transactions on Information and Systems,
Vol.E93-D(3):534-541, 2010. 
The algebraic path problem (APP) is a general framework which unifies several solution
procedures for a number of well-known matrix and graph problems. In this
paper, we present a new 3-dimensional (3D) orbital algebraic path algorithm and
corresponding 2D toroidal array processors which solve the n × n APP in the theoretically
minimal number of 3n time-steps. The coordinated time-space scheduling
of the computing and data movement in this 3D algorithm is based on the modular
function which preserves the main technological advantages of systolic processing:
simplicity, regularity, locality of communications, pipelining, etc. Our design of the
2D systolic array processors is based on a classical 3D-to-2D space transformation.
We have also shown how data manipulation (copying and alignment) can be effectively
implemented in these array processors in a massively parallel fashion by using
a matrix-matrix multiply-add operation. 
[hitoshi01:2009] 
Hitoshi Oi. A Comparative Study of JVM Implementations with
SPECjvm2008. In Proceedings of the 2nd International Conference on Computer
Engineering and Applications (ICCEA 2010), pages 351-357, March
2010. 
SPECjvm2008 is a new benchmark program suite for measuring client-side
Java runtime environments. It replaces JVM98, which has been used for the same
purpose for more than ten years. It consists of 38 benchmark programs grouped into
eleven categories and has a wide variety of workloads, from computation-intensive kernels
to XML file processors. In this paper, we compare two proprietary Java Virtual
Machines (JVMs), HotSpot of Sun Microsystems and JRockit of Oracle, using
SPECjvm2008 on three platforms that have CPUs with the same microarchitecture but
different clock speeds and cache hierarchies. The wide variation of the SPECjvm2008
benchmark categories, together with the differences in hardware configurations of the
platforms, reveals the strong and weak points of each JVM implementation. In the composite
SPECjvm2008 performance metrics, JRockit performs 19 to 27% better than
HotSpot. This results from JRockit outperforming HotSpot in nine out of eleven
categories. However, JRockit is quite weak in JVM initialization, as revealed by
the executions of startup.helloworld; the relative performance of JRockit can be as low
as 19% of HotSpot. Another remarkable result is that JRockit runs scimark.monte carlo
much faster (up to 285% of HotSpot), which affects the performance metrics of three
categories. The relatively higher performance of JRockit on non-startup benchmarks
is likely due to differences in the number of x86 instructions executed in the JVMs, with
exceptions in the compiler.* benchmarks. In the startup.* benchmarks, the performance differences
should also be due to the numbers of x86 instructions executed, but their
effects vary widely from benchmark to benchmark. Keywords: Java Virtual Machine,
Workload Analysis, Performance Evaluation. 

[hitoshi02:2009] 
Fumio Nakajima and Hitoshi Oi. Optimizations of Large Receive Offload
in Xen. In Proceedings of the 8th IEEE International Symposium on
Network Computing and Applications (IEEE NCA09), pages 314-318, July
2009. 
Xen provides us with logically independent computing environments (domains), and
I/O devices can be multiplexed so that each domain behaves as if it had its own instances
of the I/O devices. These benefits come with a performance overhead, and the network interface
is one of the most typical cases. Previously, we ported large receive offload (LRO)
into the physical and virtual network interfaces of Xen and evaluated its effectiveness.
In this paper, two optimizations are attempted to further improve the network
performance of Xen. First, copying packets at the bridge within the driver domain
is eliminated. The aggregated packets are flushed to the upper layer in the network
stack when the kernel polls the network device driver. Our second optimization is to
increase the number of aggregated packets by waiting for every other polling before
flushing the packets. Compared to the original LRO, the first optimization reduces
the packet handling overhead in the driver domain from 13.4 to 13.0 (clock cycles per
transferred byte). However, it also increases the overhead in the guest domain from 7.1
to 7.7, and the overall improvement in throughput is negligible. The second optimization
reduces the overhead in the driver and guest domains from 13.4 to 3.3 and from 7.1
to 5.9, respectively. The receive throughput is improved from 577 Mbps to 748 Mbps. 

[nakasato01:2009] 
N. Nakasato and J. Makino. A Compiler for High Performance Computing
with Many-Core Accelerators. In Workshop on Parallel Programming
on Accelerator Clusters, 2009. 
We introduce a newly developed compiler for high performance computing using many-core
accelerators. The high peak performance of such accelerators attracts researchers
who are always demanding faster computers. However, it is difficult to create an efficient
implementation of an existing serial program for such accelerators, even in the
case of massively parallel problems. While existing parallel programming tools force us
to program every detail of an implementation, from loop-level parallelism to 4-vector
SIMD operations, our novel approach is that, given a compute-intensive problem expressed
as a nested loop, the compiler only asks us to define a compute kernel inside the
innermost loop. We observe that the input variables appearing in the kernel are classified
into two types: those invariant during the loop and those updated in each iteration. The
compiler lets us specify the type of each input so that it can pick a predefined optimal
way to process them. The compiler successfully generates the fastest code ever for
many-particle simulations, with a performance of 500 GFLOPS (single precision) on
the RV770 GPU. Another successful application is the evaluation of a multidimensional
integral. It runs at a speed of 4 GFLOPS (quadruple precision) on both GRAPE-DR
and GPU. 

[sedukhin03:2009] 
Shodai Yokoyama, Kazuya Matsumoto, and Stanislav G. Sedukhin.
Matrix Inversion on the Cell/B.E. Processor. In 10th IEEE International Conference
on High Performance Computing and Communications, pages 148-153,
Los Alamitos, CA, USA, June 2009. IEEE Computer Society. 
The problem of inverting matrices is one that occurs in some problems of practical
importance. This paper introduces and evaluates a block algorithm for high performance
matrix inversion on the Cell Broadband Engine (Cell/B.E.) processor. The
Cell/B.E. is a heterogeneous multicore processor on a single chip, jointly developed by
Sony, Toshiba and IBM, which has a very high speed of single-precision floating-point
arithmetic. The discussed matrix inversion algorithm is a combination of the
block Algebraic Path Problem algorithm and the well-known block matrix inversion
algorithm based on the LU decomposition. For relatively big matrices, this combined
block algorithm spends most of its time computing matrix-matrix multiplications of
blocks and achieves 149.4 Gflop/s on the Cell/B.E. when the PPE and six SPEs of a PlayStation 3
are used, or 93.4% of the aggregated double (PPE) and single (SPEs) precision
peak performance, which is 160.0 Gflop/s. 
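For reference, the LU-based inversion scheme that the abstract builds on can be sketched in unblocked, pure-Python form. This is only the textbook idea (Doolittle LU without pivoting, then one triangular solve per unit vector); the paper's contribution is a blocked variant tuned for the Cell/B.E., which this does not attempt to reproduce.

```python
# Unblocked sketch of matrix inversion via LU decomposition (no pivoting).
def lu_decompose(a):
    """Doolittle LU factorization in place: A = L*U, L unit lower-triangular.
    The multipliers of L are stored below the diagonal of U."""
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                 # L multiplier l_ik
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]   # update trailing submatrix
    return a

def invert(a):
    """Invert A by solving L*U*x = e_i for each unit vector e_i."""
    n = len(a)
    lu = lu_decompose([row[:] for row in a])   # work on a copy
    inv_cols = []
    for col in range(n):
        # forward substitution: L*y = e_col (unit diagonal, so no division)
        y = [0.0] * n
        for i in range(n):
            y[i] = (1.0 if i == col else 0.0) - sum(
                lu[i][j] * y[j] for j in range(i))
        # back substitution: U*x = y
        x = [0.0] * n
        for i in reversed(range(n)):
            x[i] = (y[i] - sum(lu[i][j] * x[j]
                               for j in range(i + 1, n))) / lu[i][i]
        inv_cols.append(x)                     # x is column `col` of A^-1
    return [[inv_cols[j][i] for j in range(n)] for i in range(n)]
```

In the blocked setting, the same factor/solve steps are applied to submatrices, which turns most of the work into the block matrix-matrix multiply-adds the abstract mentions.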

[sedukhin04:2009] 
Stanislav G. Sedukhin and Toshiaki Miyazaki. Rapid*Closure: Algebraic
Extensions of a Scalar Multiply-Add Operation. In 25th ISCA International
Conference on Computers and Their Applications, pages 19-24,
Honolulu, USA, March 2010. ISCA. 
One of the outstanding characteristics of scalar fused multiply-add (FMA) and
multiply-accumulate (MAC) operations is in halving the number of required
operations of an (n × n) matrix-matrix multiplication: from 2n³ multiplications and
additions to exactly n³ scalar FMA/MAC operations. The existing advanced processors
with FMA/MAC units greatly reduce the time of solution of many scientific,
engineering, and multimedia problems which are based on linear algebra (matrix)
transforms. In this paper we show that there are other forms of matrix-matrix
multiply-add in different algebraic semirings which are intensively used in many
real-world problems other than linear algebra. These problems suffer from the absence of
corresponding hardware support and exhibit a relatively low speed of computing due
to introducing one or two logical (branching) operations in the underlying kernels. The
proposed fusion of scalar multiply-add operations for different semirings will eliminate
branching and lead to performance equal to that of linear algebra. The implementation
of the suggested fused operations is simple and will only slightly increase the complexity
of a conventional FMA/MAC unit. Our experiments on the Sony/Toshiba/IBM
Cell/B.E. processor demonstrate that adding the algebraic multiply-add scalar extensions
to the existing FMA unit can remarkably (3× to 4×) increase the performance of many
important problems. 
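The semiring idea can be illustrated with one matrix multiply-add kernel instantiated over several (add, multiply) pairs. The sketch below is only an illustration of the abstraction; the particular semiring names and dictionary layout are my own, not notation from the paper.

```python
# One triple-loop kernel, parameterized by the semiring's two scalar
# operations. Only the fused "multiply-add" step changes between problems,
# which is why a generalized FMA unit can serve all of them.
SEMIRINGS = {
    "plus-times": (lambda a, b: a + b, lambda a, b: a * b),  # linear algebra
    "min-plus":   (min,                lambda a, b: a + b),  # shortest paths
    "max-min":    (max,                min),                 # capacity paths
}

def matmul_add(A, B, C, semiring):
    """C = C (+) A (x) B over the chosen semiring, for square matrices."""
    add, mul = SEMIRINGS[semiring]
    n = len(A)
    for i in range(n):
        for j in range(n):
            acc = C[i][j]
            for k in range(n):
                # the scalar multiply-add for this semiring
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C
```

Note that in software the `min`/`max` variants involve comparisons (branches) in the inner loop, which is exactly the overhead the proposed fused hardware operations would remove.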
[hitoshi03:2009] 
Sho Niboshi and Hitoshi Oi. Application of Fuzzy Control Theory to
Resource Management in a Virtualized System. In IPSJ SIG Technical
Report, Vol.2009-EVA-30. IPSJ SIG EVA, 2009. 
In recent years, companies have chosen virtualization technology for the integration of
computer systems. A virtualized system makes effective use of computer resources
by reducing the number of computers and the financial costs of administration. In
a virtualized system, all resources must be controlled by a resource manager,
which allocates shared resources to each domain. Since this control method affects
the entire virtualized system, the allocation method needs to be more effective.
However, the resource manager in current systems only utilizes the information
from the operating system on each domain and lacks the running state of the
application running on it. We are developing a resource controller that adapts to
dynamic workloads and application states on a virtualized system. The relationships
between the applications' running states and their required resources are often
vague, complex and empirical. To represent such relationships, we use fuzzy
control theory to maximize the total system performance. This technical report
gives a brief description of our controller and its experimental results. 

[nakasato02:2009] 
N. Nakasato, T. Ishikawa, J. Makino, and F. Yuasa. Fast Quadruple
Precision Operations on Many-core Accelerators (in Japanese). In Proceedings
of SWoPP 2009, pages 1-8, August 2009. 
[hitoshi04:2009] 
Hitoshi Oi, 2009 to present. Program Committee Member, Annual International Conference on Cloud Computing and Virtualization (CCV 2010) 
[hitoshi05:2009] 
Hitoshi Oi, Since 2005. Professional Member, ACM 
[hitoshi06:2009] 
Hitoshi Oi, Since 2005. Member, IEEE/Computer Society 
[hitoshi07:2008] 
Hitoshi Oi, Since 2005. Academic Member, EEMBC 
[hitoshi08:2009] 
Hitoshi Oi, 2009. SocialNet 2009, December 12-14, 2009, Chengdu, China, Publicity Chair 
[hitoshi09:2009] 
Hitoshi Oi, 2009. Senior Member. IACSIT 
[hitoshi10:2009] 
Hitoshi Oi, 2010. Program Committee Member, Xen Summit 
[sedukhin05:2009] 
S. Sedukhin, Apr. 2009. IEEE CS, member 
[sedukhin06:2009] 
S. Sedukhin, Apr. 2009. ACM, member 
[sedukhin07:2009] 
S. Sedukhin, Apr. 2009. IEICE, member 
[sedukhin08:2009] 
S. Sedukhin, Apr 2009. International Journal of Neural, Parallel & Scientific Computations, Member of the Editorial Board 
[sedukhin09:2009] 
S. Sedukhin, Apr 2009. International Journal of Parallel Processing Letters, Member of the Editorial Board 
[sedukhin10:2009] 
S. Sedukhin, Apr. 2009. International Journal of High Performance Systems Architecture, Member of the Editorial Board 
[sedukhin11:2009] 
S. Sedukhin, June 2010. The 2010 High Performance Computing and Simulation Conference (HPCS 2010), Program Committee Member 
[sedukhin12:2009] 
S. Sedukhin, September 2010. The 13th International Conference on NetworkBased Information Systems (NBiS 2010), Program Committee Member 
[sedukhin13:2009] 
S. Sedukhin, November 2009. Mathematical and Computer Modelling, an International Journal, Reviewer 
[sedukhin14:2009] 
S. Sedukhin, February 2009. The Eighth IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2009), Austria, Program Committee Member 
[sedukhin15:2009] 
S. Sedukhin, September 2009. The 4th International Symposium on Embedded MultiCore SystemsonChip (MCSoC 09), Program Committee Member 
[sedukhin16:2009] 
S. Sedukhin, March 2010. 25th ISCA International Conference on Computers and Their Applications, Program Committee Member 
[hitoshi11:2009] 
Sho Niboshi. Master Thesis: Adaptive Resource Management in a
Virtualized System, CSE, 2010. Thesis Advisor: Hitoshi Oi. 
[nakasato04:2009] 
Koui Watanabe. Graduation Thesis: Simulation of Collision Between
Elastic Material, University of Aizu, 2010. Thesis Advisor: N. Nakasato. 
[nakasato05:2009] 
Wataru Horie. Graduation Thesis: An Interactive Galaxy Simulation
Program Using GPU, University of Aizu, 2010. Thesis Advisor: N. Nakasato. 
[nakasato06:2009] 
Hiroki Ishikawa. Graduation Thesis: Gomoku Program Using Monte
Carlo Methods, University of Aizu, 2010. Thesis Advisor: N. Nakasato. 
[nakasato07:2009] 
Yoshiyuki Abe. Graduation Thesis: A Consultation Algorithm for
Shogi Using the Bonanza Library, University of Aizu, 2010. Thesis Advisor: N. Nakasato. 
[sedukhin17:2009] 
Kazuya Matsumoto. Master Thesis: A Solution of the Algebraic
Path Problem on the Multicore CELL Processor, University of Aizu, 2009. 
Thesis Advisor: S. Sedukhin. 

[sedukhin18:2009] 
Yusuke Kobayashi. Master Thesis: Orbital Algorithms for 3
Dimensional Separable Transforms, University of Aizu, 2009. 
Thesis Advisor: S. Sedukhin. 