Professor 
Assistant Professor 
Assistant Professor 
Visiting Researcher 
Current research at the Distributed Parallel Processing Laboratory (DPPL) encompasses: 
[Stanislav G. Sedukhin] 
[Hitoshi Oi] 
[Nakasaoto, Naohito] such as GRAPEDR and GPU. In our programming model, we explicitly specify input and operations between input variables to obtain output results. Precisely, we divide the inputs into two types: one type is a local variable that is invariant during the inner loop and another type is a broadcast variable that is different in each iteration. Our compiler consists of three stages; (a) frontend, (b) optimization on an internal representation and (c) backend that generates a target code. This year, we have significantly enhanced the backend part of the compiler. So far, we have developed two backends for GRAPEDR and RV770 GPU. Both backends are flexible to generate singleprecision (SP), doubleprecision (DP) and quadrupleprecision (QP) code. With GRAPEDR backend, we obtain the measured computing speeds of a sample application are 2.27  6.64 GFLOPS depending on the size of the problem. With RV770 GPU backend, our compiler generates highly optimized code. With SP operations, our code shows 1 TFLOPS on RV770 running at 750 MHz. This significant performance result is fastest in the world as far as we know. Although we can use manycore SIMD computers to do QP emulations very effectively, a performance penalty of a factor of 20 is very huge. That is we require about 20 DP add/mul operations to emulate one QP addition. In this project, we have investigated a possibility of designing native QP circuit. In a pilot study that we have done in 20072008, we have found that it was possible to implement 128 bit QP units in HDL with a modest resource. In this year, we have further extended our first effort. Together with collaborators at NAOJ and KEK, we have designed optimized implementation of QP add and mul units in HDL along with required emulation software. In addition, we have combining the two units to design a simple programmable processor. The processor consists of a QP add unit, a QP mul unit and an unit for inverse of square root that is used to computer reciprocal and square root. In addition, it has two register files and a temporary register. Based on this processor design, we have packed six processors that are working in SIMD operations. We estimate that a performance of the new SIMD processor running at 100 MHz is more than 1 Gflops. It will outperform the emulation code running on a recently multicore CPU ( 400 Mflops). As an effort to effectively utilize recently expanding computing power of GPU, we have investigated optimized computation scheme for GPU. This year, we have concentrated on manybody particle simulations that are widely used techniques in numerical astronomy. We have obtained significantly improved performance of O(N2) force calculation algorithm (where N is a number of particles) from 200 Gflops by a simple code to 1000 Gflops by a highly optimized code with the loopunrolling techniques. Practically, the force calculation algorithm with O(N2) computational complexity is only useful for N < 100, 000. We have also implement more efficient O(NlogN) force calculation algorithm on GPU. Our code is very effective even with N > 100, 000 such that the force calculation for million particles takes less than 1 seconds. We have successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N 2.87×106 particles only took 5.79 hours. We obtained the sustained computing speed of 21.8 Gflops and the cost per Gflops of $41.6/Gflops that is two and half times better than the previous record in 2006. 
[hitoshi01:2008] 
Hitoshi Oi. Instruction Folding in a Hardware TranslationBased Java
Virtual Machine. The Journal of InstructionLevel Parallelism, Vol. 10, pp. 116,
June 2008. 
Bytecode hardwaretranslation improves the performance of a Java Virtual Machine
(JVM) with small hardware resource and complexity overhead. Instruction folding is a
technique to further improve the performance of a JVM by reducing the redundancy in
its stackbased operations. However, the variable instruction length of the Java bytecode
makes the folding logic complex. In this paper, we propose a folding scheme with reduced
hardware complexity and evaluate its performance. For eleven benchmark cases, the
proposed scheme folded 7.1% to 36.8% of the bytecodes which correspond to 74.0% to
99.7% of the PicoJavaII’s folding performance. 

[hitoshi02:2008] 
Hitoshi Oi. Local Variable Access Behavior of a HardwareTranslation
Based Java Virtual Machine. Journal of Systems and Software, 81(11), pp.
20592068, 2008. 
Hardware bytecode translation is a technique to improve the performance of the Java
Virtual Machine (JVM), especially on the portable devices for which the overhead of
dynamic compilation is significant. However, since the translation is done on a single
bytecode basis, a naive implementation of the JVM generates frequent memory accesses
for local variables which can be not only a performance bottleneck but also an obstacle
for instruction folding. A solution to this problem is to add a small register file to the
data path of the microprocessor which is dedicated for storing local variables. However,
the effectiveness of such a local variable register file depends on the size and the local
variable access behavior of the applications. In this paper, we analyze the local variable
access behavior of various Java applications. In particular, we will investigate the
fraction of local variable accesses that are covered by the register file of a varying size,
which determines the chip area overhead and the operation speed. We also evaluate the
effectiveness of the sliding register window for parameter passing in context of JVM
and onthefly optimization of local variable to register file mapping . With two types
of exceptions, a 16entry register file achieves coverages of up to 98%. The first type
of exception is represented by the SAXON XSLT processor for which the effect of cold
miss is significant. Adding the sliding window feature to the register file for parameter
passing turns 6.2 to 13.3% of total accesses from miss to hit to the register file for the
SAXON with XSLTMark. The second type of exception is represented by the FFT,
which accesses more than 16 local variables for most of method invocations. In this
case, onthefly profiling is effective. The hit ratio of a 16entry register file for the FFT
is increased from 44 to 83% by an array of 8bit counters. 
[hitoshi03:2008] 
Fumio Nakajima and Hitoshi Oi. Optimizations of Large Receive Offload
in Xen. In Proceedings of The 8th IEEE International Symposium on
Network Computing and Applications (IEEE NCA09), pp. 314318, 2009. 
Xen provides us with logically independent computing environments (domains) and
I/O devices can be multiplexed so that each domain considers as if it has own instances
of I/O devices. These benefits come with the performance overhead and
network interface is one of most typical cases. Previously, we ported the large receive
offload (LRO) into the physical and virtual network interfaces of Xen and evaluated
its effectiveness. In this paper, two optimizations are attempted to further improve
the network performance of Xen. First, copying packets at the bridge within the
driver domain is eliminated. The aggregated packets are flushed to the upper layer
in the network stack when the kernel polls the network device driver. Our second
optimization is to increase the number of aggregated packets by waiting for every
other polling before flushing the packets. Compared to the original LRO, the first
optimization reduces the packet handling overhead in the driver domain from 13.4
to 13.0 (clock cycles per transferred byte). However, it also increases the overhead in
the guest domain from 7.1 to 7.7 and the overall improvement in throughput is negligible.
The second optimization reduces the overhead in driver and guest domains
from 13.4 to 3.3 and from 7.1 to 5.9, respectively. The receive throughput is improved
from 577Mbps to 748Mbps. 

[hitoshi04:2008] 
Hitoshi Oi. A Preliminary Workload Analysis of SPECjvm2008. In
Proceedings of 2009 International Conference on Computer Engineering and
Technology (ICCET 2009), pp. 1319, 2009. 
SPECjvm2008 is a new benchmark program suite for measuring clientside Java
runtime environment. It replaces JVM98, which has been used for the same purpose
for more than ten years. It consists of 38 benchmark programs grouped into eleven
categories and has wide variety of workloads from computationintensive kernels to
XML file processors. In this paper, we present the results of running SPECjvm2008 on
three machines that have CPUs with the same microarchitecture and different cache
sizes and clock speeds. The result of measurements include instruction and data
cache reference and miss rates, and the effect of the multithreading. We relate these
workload parameters to the SPECjvm2008 performance metrics. Roughly speaking,
an L2 cache of 1MB sufficiently lows the cache miss rates of SPECjvm2008 and
compared to the singlecore, 1.5 to 2 times speedups are achieved by dualcore
executions. 

[hitoshi05:2008] 
Hitoshi Oi and Fumio Nakajima. Performance Analysis of Large Receive
Offload in a Xen Virtualized System. In Proceedings of 2009 International
Conference on Computer Engineering and Technology (ICCET 2009),
pp. 475480, 2009. 
Systemlevel virtualization provides us various advantages including independent and
isolated computing environments on which multiple operating systems can be executed,
and improved resource utilization. These benefits come with performance
overheads and network operation is one of most typical cases. In this paper, we first
present the experimental results of network performance with LRO under varying
message sizes and maximum transmission unit (MTU). We analyze the effectiveness
of large receive offload (LRO) in a consolidated execution environment based on Xen.
When SPECjbb, a CPUintensive workload, and network receiver are executed on
separated domains, LRO in physical and virtual NICs improve the throughput up to
8 and 14%, respectively. 

[hitoshi06:2008] 
Takayuki Hatori and Hitoshi Oi. Implementation and Analysis of Large
Receive Offload in a Virtualized System. In Proceedings of the Virtualization
Performance: Analysis, Characterization, and Tools (VPACT’08), 2008. 
System level virtualization has advantages such as easy server consolidation, dynamic
reconfiguration, and low power consumption. However, there exists network overhead.
In native environment, software level Large Receive Offload (LRO) is effective
to improve network receive performance. In this paper, we present the performance
study of the LRO implemented in a Xen virtualized environment. We implemented
LRO in both the physical and the virtual interface and measured the performance in
these cases. The receive performance improved 22% with LRO implementation in the
virtual interface, whereas no performance improvement appeared with an LRO implemented
physical interface. We analyze the performance improvement and describe
effectiveness of LRO. 

[nakasato01:2008] 
N. Nakasato and T. Ishikawa. An Implementation of High Precision
Floatingpoint Operation Units on FPGA. In IEICE Technical Report,
volume 108, pages 7–11, 5 2008. 
Kinds of numerical application programs require highprecision (better than doubleprecision)
floatingpoint (FP) operations. An example of such applications is a numerical
evaluation of Feynmann loop integrals. To evaluate those integrals with a desired
accuracy on conventional CPUs, one utilize emulation techniques using doubleprecision
operations or 64bit integer operations. However, a perfomance penalty of
emulation techniques is as large as 4  5 % of DP operations. To tackle those problems,
we are currently working on implementation of quadrupleprecision floatingpoint
operation units on a FPGA device. We report a current status of our project and
propose a computation unit that is capable of highprecision numerical integration. 

[sedukhin01:2008] 
A.S. Zekri and S.G. Sedukhin. 2D Separable Transforms on a Matrix
Processor. In Frederick C. Harris Jr., editor, Proc. of the the 21st International
Conference on Computers and Their Applications in Industry and
Engineering (CAINE2008), pages 106–111, Honolulu, USA, November 2008.
The International Society for Computers and Their Applications, ISCA. 
2D separable transforms play a fundamental role in the field of digital signal and
image processing. Based on efficient methods to perform both matrixmatrix multiplication
and matrix transposition on a matrix processor, we design unifying algorithms
to perform all 2D separable transforms and their inverses. For computing a specific
transform, the coefficient matrix or its transpose is initially calculated and the same
unifying algorithms are applied. Our analytical evaluation of the algorithms for the
2D Discrete Cosine Transform (DCT) and its inverse (IDCT) showed that the proposed
matrix processor can be satisfy the realtime processing requirement for the
MPEG2 codec standard using low memory bandwidth. 

[sedukhin02:2008] 
S.G. Sedukhin, T. Miyazaki, and K. Kuroda. 3D Toroidal Array
Processor for Multidimensional DSP Transforms. In N.A., editor, Technical
Digest of the International 3D System Integration Conference (3DSIC)
2008, pages 401–410, National Institute of Informatics, May 2008. ASET,
3DSIC. 
We present the forward and inverse 3D separable transforms in the form of
chained matrixmatrix multiplyadd and transposition operations. Then, a new orbital
systoliclike implementation of the 3D transforms and its inverse on a 3D
toroidal array processor are proposed. In this implementation, a required transposition
is avoided, but each 1D transform becomes data dependend from other
coexisting 1D transforms. The main difference of our orbital implementation is in
dimensionality of I/O: we design an array processor under assumption when all 3
D data is immediately available for the finegrain massivelyparallel computing and
during processing we do not destroy an integrity (spacetime locality) of processed
data. 

[sedukhin03:2008] 
A.S. Zekri and S.G. Sedukhin. Computationally Efficient Parallel
MatrixMatrix Multiplication on the Torus. In Volume. 4759, editor, Lecture
Notes in Computer Science, pages 219–226, Honolulu, USA, November 2008.
DOI: 10.1007/978354077704519, Springer. 
In this paper, we represent the computation space of the (n×n)matrix multiplication
problem C = C +A×B as a 3D torus. All possible timeminimal scheduling vectors
needed to activate the computations inside the corresponding 3D index points at
each step of computing are determined. Using the projection method to allocate the
scheduled computations to the processing elements, the resulting array processor
that minimizes the computing time is a 2D torus with n?n processing elements. For
each optimal time scheduling function, three optimal array allocations are obtained
from projection. All the resulting allocations of all the optimal scheduling vectors can
be classified into three groups. In one group, matrix C remains and both matrices A
and B are shifted between neighbor processors. The wellknown Cannon 痴algorithm
belongs to this group. In another group, matrix A remains and both matrices B and
C are shifted. In the third group, matrix B remains while both matrices A and C
are shifted. The obtained array processor allocations need n computeshift steps to
multiply n × n dense matrices. 

[sedukhin04:2008] 
T. Miyazaki, Y. Nomoto, Y. Sato, and S.G. Sedukhin. Array processor
featuring an effective FIFObased data stream management. In N.A.,
editor, Computer and Information Technology, (CIT 2008) 8th IEEE International
Conference on, pages 255–260, New York, Apr. 2008. IEEE, IEEE
Computer Soc. Press. 
In array processors, data I/O management is the key to realizing highspeed matrix
operations that are often required in signal and image processing. In this paper, we
propose an array processor utilizing an effective data I/O mechanism featuring external
FIFOs. The FIFOs are used to buffer initial matrix data and partially processed
results. Therefore, if all required data are stored in the FIFOs, matrix operations,
including the algorithm to solve the Algebraic Path Problem (APP), can be performed
without any data I/Os. In addition, we can eliminate register files from the
processing elements (PEs) if we construct the PE array by controlling the external
FIFOs systematically and transferring the data from (to) the FIFOs to (from) the
PE array. This enables us to simplify each PE structure and realize a large array
processor with limited hardware resources. Here, the FIFOs themselves can be easily
realized using conventional discrete FIFO or memory chips. 
[hitoshi07:2008] 
Hitoshi Oi, 2005 to present. Member, IEEE/Computer Society 
[hitoshi08:2008] 
Hitoshi Oi, 2005 to preent. Member, ACM 
[hitoshi09:2008] 
Hitoshi Oi, 2006 to present. Liaison Chair for Asia, ACM International Conference on Computing Frontiers 
[hitoshi10:2008] 
Hitoshi Oi, 2006 to present. Reviewer, Microprocessors and Microsystems 
[hitoshi11:2007] 
Hitoshi Oi, 2008. Program Committee member, International Conference on Computer Design (ICCD) 
[sedukhin05:2008] 
S. Sedukhin, Apr. 2008. Member of the Editorial Board, International Journal of High Performance Systems Architecture, InderScience Publisher 
[sedukhin06:2008] 
S. Sedukhin, Apr. 2008. IEEE CS, Member 
[sedukhin07:2008] 
S. Sedukhin, Apr. 2008. ACM, Member 
[sedukhin08:2008] 
S. Sedukhin, May 2008. The 5th Internationsl Conference on Neural Networks and Artificial Intelligence (ICNNAI2008), Belarus, Program Committee Member, International Neuro Network Society 
[sedukhin09:2008] 
S. Sedukhin, June 2008. The 2008 High Performance Computing and Simulation Conference (HPCS 2007), Program Committee Member 
[sedukhin10:2007] 
S. Sedukhin, Apr. 2008. Member, IEICE 
[sedukhin11:2008] 
S. Sedukhin, Apr. 2008. Member of the Editorial Board, International Journal of Neural, Parallel & Scientific Computations, Dynamic Pub. 
[sedukhin12:200] 
S. Sedukhin, Apr. 2008. Technical Committee on Parallel Processing, Member, IASTED 
[sedukhin13:2008] 
S. Sedukhin, Apr. 2008. Member of the Editorial Board, International Journal of Parallel Processing Letters, Parallel & Scientific Computations, World Scientific Pub. 
[sedukhin14:2008] 
S. Sedukhin, August 2007. The 13th IEEE AsiaPacific Computer Systems Architecture Conference (ACSAC008), Taiwan, Stearing Committee Member 
[sedukhin15:2008] 
S. Sedukhin, February 2009. The Eighth IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2009), Austria, Program Committee Member 
[sedukhin16:2008] 
T. Miyazaki, K. Kuroda, S. Sedukhin, Array Processor JP2008
119884, May 2008. 
[hitoshi12:2008] 
Fumio Nakajima. Graduation Thesis: Optimization and Performance
Analysis of Large Receive Offload in the Xen Virtualized System, 2008. Thesis Advisor: Hitoshi Oi, 
[hitoshi13:2008] 
Masato Chiba. Master Thesis: Benchmark for Image Registraion,
2008. Thesis Advisor: Hitoshi Oi, 
[nakasato02:2008] 
Kazuhiro Hosoda. Graduation Thesis: Acceleration of Solid Body
Simulations using GPU, University of Aizu, 2008. Thesis Advisor: N. Nakasato, 
[nakasato03:2008] 
Masanori Sato. Graduation Thesis: Speedup of the TreeMethod
Solution to the Nbody Problem, University of Aizu, 2008. Thesis Advisor: N. Nakasato, 
[nakasato04:2008] 
Kazuhiro Hosoda. Graduation Thesis: Acceleration of Solid Body
Simulations using GPU, University of Aizu, 2008. Thesis Advisor: Nakasato, N 
[nakasato05:2008] 
Kazuki Fujiwara. Graduation Thesis: Fast Simulation of Gravitational
Nbody Problem on GPU, University of Aizu, 2008. Thesis Advisor: N. Nakasato, 
[nakasato06:2008] 
Hiromitsu Tominaga. Graduation Thesis: Japanese Chess Program
by the Monte Carlo Method on Playstation 3, University of Aizu, 2008 Thesis Advisor: N. Nakasato, 
[nakasato07:2008] 
Junki Hoshi. Graduation Thesis: Shogi System using the Monte
Carlo Method on the Multicore CPU, University of Aizu, 2008. Thesis Advisor: N. Nakasato, 
[sedukhin17:2008] 
Ahmed Sherif Zekri. Ph.D. Thesis: Design and Evaluation of Dataparallel
Algorithms on a Matrix Processor, 2008. 
Research Adviser: S. Sedukhin, 

[sedukhin18:2008] 
Shoudai Yokoyama. Graduation thesis: Matrix Inversion on the
Cell/B.E. Processor, 2008. 
Thesis Adviser: Sedukhin, S. 
[hitoshi14:2008] 
Hitoshi Oi, August 2008. Participated Open Campus Summer Session. 
[hitoshi15:2008] 
Hitoshi Oi, May to July 2008. Hosted Dr. Abdel Ejnioui, Department of Information Technology, University of South Florida as a visiting researcher in the Operating Systems Laborary, from May 08 to July 08. 