Professor |
Assistant Professor |
Assistant Professor |
The Operating Systems Laboratory mainly conducts research in the following areas:
Biology Inspired Cellular Computing
Performance Analysis and Modeling of Consolidated Systems
The objectives of this project are (1) to understand the behavior of each virtual machine (VM) and of the virtual machine monitor (VMM) when the workload of a VM is varied, (2) to relate this behavior to the usage of various computing resources, such as CPU time, memory, and storage devices, and (3) to develop a mathematical model of performance interference between VMs in the consolidated system.
Analysis and Optimization of Network Performance in Xen
Virtual Machine for Sensor Network Nodes
One possible approach to writing programs for sensor network nodes concisely is to use a virtual machine (Maté). In this approach, the program is represented by the instructions of the virtual machine, which are much denser than the native instructions of the microcontroller on the node. However, while the program can be much smaller than the equivalent native code, executing it can be costly because the (dense) virtual instructions must be interpreted. It has therefore been pointed out that this approach is feasible when the node is frequently reprogrammed but the program is infrequently executed.
We have been working on this topic with the group led by Dr. Chris Bleakley, Lecturer in the School of Computer Science and Informatics at University College Dublin (UCD), Ireland. Hitoshi Oi visited UCD in May 2006 for a kick-off meeting and then stayed at UCD from June to September 2006 to collaborate with his group on this research topic. In 2008, Dr. Abdel Ejnioui from the Information Technology Dept., University of South Florida Polytechnic, joined this project and stayed at Aizu as a visiting scholar from May to July. We have proposed an architecture for a wireless sensor network virtual machine that uses hardware acceleration techniques to reduce the execution inefficiencies of the virtual machine. We plan to pursue this topic further by designing detailed models of the proposed virtual machine architecture.
[1] R. J. Creasy, "The Origin of the VM/370 Time-Sharing System," IBM Journal of Research and Development, Vol. 25, No. 5, pp. 483-490, 1981.
[2] Paul Barham et al., "Xen and the Art of Virtualization," Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP), pp. 164-177, October 2003.
[3] "VMware: Virtualization via Hypervisor, Virtual Machine & Server Consolidation - VMware," http://www.vmware.com/
[4] "Bounding the Resource Savings of Utility Computing Models,"
[5] Leonid Grossman, "Large Receive Offload Implementation in Neterion 10GbE Ethernet Driver," in Proceedings of the Ottawa Linux Symposium, pp. 195-200, July 2005.
[6] Jan-Bernd Themann, "[RFC 0/1] lro: Generic Large Receive Offload for TCP traffic," Linux Kernel Mailing List archive, http://lkml.org/lkml/2007/7/20/250.
Development of Compiler Software for SIMD Computers
We have proposed and developed compiler software that makes the high computing performance of SIMD computers easy to use. In our compiler system, we abstract SIMD computers as parallel processing units with their own local memory and a broadcast memory. These processing units are controlled by a host computer (CPU). We define our programming language in accordance with this computing model. Currently, our compiler converts an input source code into (I) double precision (DP) operations on the SING processor, (II) quadruple precision (QP) operations on the SING processor, and (III) a computing pipeline for an FPGA board as a special type of SIMD computer. We use core routines of two existing compute-intensive applications to evaluate the expected performance of the generated code for the SING processor. With one application using DP operations (a kernel routine of a multi-dimensional Monte Carlo code) and the other using QP operations (an evaluation of the integrand of a Feynman loop integral), we obtain performances of 7.7 × 10^10 DP operations and 7.6 × 10^9 QP operations per second, respectively.
Study on Multi-Precision Calculations on SIMD Computers
|
[hitoshi-01:2007] |
Hitoshi Oi. Local Variable Access Behavior of a Hardware-Translation
Based Java Virtual Machine. Journal of Systems and Software, 2008. |
Accepted for publication. http://dx.doi.org/10.1016/j.jss.2008.03.057 |
|
[nakasato-01:2007] |
T.K. Suzuki, N. Nakasato, H. Baumgardt, A. Ibukiyama, J. Makino,
and T. Ebisuzaki. Evolution of Collisionally Merged Massive Stars.
Astrophysical Journal, 668:435-448, 2007. |
We investigate the evolution of collisionally merged stars with mass of ~100 M☉ which
might be formed in dense star clusters. We assumed that massive stars with several
tens of M☉ collide typically ~1 Myr after the formation of the cluster and performed
hydrodynamical simulations of several collision events. Our simulations show that after
the collisions, merged stars have extended envelopes and their radii are larger than
those in the thermal equilibrium states and that their interiors are He-rich because of
the stellar evolution of the progenitor stars. We also found that if the mass-ratio of
merging stars is far from unity, the interior of the merger product is not well mixed
and the elemental abundance is not homogeneous. We then followed the evolution of
these collision products by a one dimensional stellar evolution code. After an initial
contraction on the Kelvin-Helmholtz (thermal adjustment) timescale (~10^3-10^4 yr),
the evolution of the merged stars traces that of single homogeneous stars with corresponding
masses and abundances, while the initial contraction phase shows variations
which depend on the mass ratio of the merged stars. We infer that, once runaway
collisions have set in, subsequent collisions of the merged stars take place before mass
loss by stellar winds becomes significant. Hence, stellar mass loss does not inhibit the
formation of massive stars with mass of ~1000 M☉. |
|
[sedukhin-01:2007] |
A.S. Zekri and S.G. Sedukhin. Level-3 BLAS and LU Factorization
on a Matrix Processor. IPSJ Trans. on Advanced Computing Systems, 49(SIG
2 (ASC 21)):37-52, 2008. |
As increasing clock frequency approaches its physical limits, a good approach to enhance
performance is to increase parallelism by integrating more cores as co-processors
to general-purpose processors in order to handle the different workloads in scientific,
engineering, and signal processing applications. In this paper, we propose a manycore
matrix processor model consisting of a scalar unit augmented with b × b simple
cores tightly connected in a 2D torus matrix unit to accelerate matrix-based kernels.
Data load/store is overlapped with computing using a decoupled data access unit that
moves b × b blocks of data between memory and the two scalar and matrix processing
units. The matrix unit mainly processes fine-grained b × b matrix
multiply-add (MMA) operations. We formulate the data alignment operations including
matrix transposition and skewing as MMA operations in order to overlap them with
data load/store. Two fundamental linear algebra algorithms are designed and analytically
evaluated on the proposed matrix processor: the Level-3 BLAS kernel, GEMM, and the LU factorization with partial pivoting, the main step in solving linear systems
of equations. For the GEMM kernel, the maximum speed of computing measured in
FLOPs/cycle is approached for different matrix sizes, n, and block sizes, b. The speed
of the LU factorization for relatively large values of n ranges from around 50-90% of the
maximum speed depending on the model parameters. Overall, the analytical results
show the merits of using the matrix unit for accelerating the matrix-based applications. |
[hitoshi-02:2007] |
Takayuki Hatori and Hitoshi Oi. Implementation and Analysis of Large
Receive Offload in a Virtualized System. In Proceedings of the Virtualization
Performance: Analysis, Characterization, and Tools (VPACT'08), 2008. |
Austin, TX |
|
[hitoshi-03:2007] |
Hitoshi Oi. Hardware Support for a Wireless Sensor Network Virtual
Machine. In Proceedings of the International Conference on MOBILe Wireless
MiddleWARE, Operating Systems and Applications (Mobilware 2008),
February 2008. |
Innsbruck, Austria |
|
[nakasato-02:2007] |
N. Nakasato, J. Makino, Y. Matsubara, and S. Ebisuzaki. A Compiler
for High Performance Adaptive Precision Computing. In Proceedings of
SACSIS 2008, 2008. |
We propose and implement a compiler program for high performance adaptive precision
computing. Conventional numerical simulations have been done with
floating-point double precision (DP) operations. However, recently available computing techniques
such as SIMD computers or FPGA computers offer us much better performance
than DP operations on conventional CPUs. To take advantage of a SIMD computer,
we should program it with a special programming language or libraries. Also,
each SIMD computer has its own programming language or special techniques that
we need to cope with. The proposed compiler defines a new language for easily
using SIMD and FPGA computers. From an input source program defined in this
language, our compiler generates required source codes with given numerical precision.
We use core routines of two existing compute intensive applications to evaluate
expected performance of the generated code for a SIMD computer (GRAPE-DR).
With one application using double precision operations and the other using quadruple
precision operations, we obtain performances of 7.7 × 10^10 and 7.6 × 10^9
operations per second, respectively. |
|
[sedukhin-02:2007] |
A.S. Zekri and S.G. Sedukhin. Evaluating the Performance of Basic
Linear Algebra Subroutines on a Torus Array Processor. In CIT
'07: Proceedings of the 7th IEEE International Conference on Computer and Information Technology, pages 300-305, Washington, DC, USA, October
2007. University of Aizu, IEEE Computer Society. |
The basic linear algebra subroutines (BLAS) are standard operations to efficiently
solve the linear algebra problems on high performance and parallel systems. In
this paper, we study the implementation of some important BLAS operations on an
N × N torus array processor. We show that the performance of the Level-3 BLAS,
represented by the n × n matrix multiply-add operation with n > N, approaches the
theoretical peak as n increases because the degree of data reuse is high. In contrast, the
performance of Level-1 and Level-2 BLAS operations is low as a result of low data
reuse. Fortunately, many applications are based on intensive use of Level-3 BLAS
with only a small percentage of Level-1 and Level-2 BLAS. |
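The data-reuse argument in this abstract can be illustrated with a rough operation count. The sketch below is a back-of-envelope illustration, not the analytical model used in the paper: the function name `flops_per_element` and the choice of AXPY, GEMV, and GEMM as representatives of Levels 1, 2, and 3 are assumptions made here, and only distinct operand elements are counted.

```python
def flops_per_element(level, n):
    """Floating-point operations per distinct operand element.

    Hypothetical helper for illustration only. Representative kernels:
      level 1: AXPY  y = a*x + y     (vectors of length n)
      level 2: GEMV  y = A*x + y     (A is n x n)
      level 3: GEMM  C = C + A*B     (all matrices n x n)
    """
    if level == 1:
        flops, elements = 2 * n, 2 * n              # x and y
    elif level == 2:
        flops, elements = 2 * n * n, n * n + 2 * n  # A, x and y
    elif level == 3:
        flops, elements = 2 * n ** 3, 3 * n * n     # A, B and C
    else:
        raise ValueError("level must be 1, 2, or 3")
    return flops / elements


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, flops_per_element(level, 300))
```

Under these counting assumptions the ratio stays constant for Level 1, stays below 2 for Level 2, and grows linearly with n for Level 3, which is why only Level-3 kernels can keep an array of processing elements busy from a limited memory bandwidth.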
|
[sedukhin-03:2007] |
A.S. Zekri and S.G. Sedukhin. Performance Evaluation of Basic Linear
Algebra Subroutines on a Matrix co-processor. In R. Wyrzykowski,
J. Dongarra, K. Karczewski, and J. Wasniewski, editors, PPAM 2007 - The
7th International Conference on Parallel Processing and Applied Mathematics,
LNCS, vol. 4967, pages 1190-1199, Gdansk, Poland, September 2007.
PPAM, Springer-Verlag. |
As increasing clock frequency approaches its physical limits, a good approach to
enhance performance is to increase parallelism by integrating more cores as coprocessors
to general-purpose processors in order to handle the different workloads
of scientific and signal processing applications. Many kernels in these applications
lend themselves to the data-parallel architectures such as array processors. The basic
linear algebra subroutines (BLAS) are standard operations to efficiently solve
the linear algebra problems on high performance and parallel systems. In this paper,
we implement and evaluate the performance of some important BLAS operations on
a matrix co-processor. Our analytical model shows that the performance of the Level-3
BLAS, represented by the n × n matrix multiply-add operation, approaches the theoretical
peak as n increases since the degree of data reuse is high. However, the
performance of Level-1 and Level-2 BLAS operations is low as a result of low data
reuse. Fortunately, many applications are based on intensive use of Level-3 BLAS
with small percentage of Level-1 and Level-2 BLAS. |
|
[sedukhin-04:2007] |
A.S. Zekri and S.G. Sedukhin. Fine-grained Matrix Multiply-add
on a Torus Array Processor. In Proc. of the 22nd International
Conference on Computers and Their Applications (CATA-2007),
pages 44-51, Honolulu, USA, March 2007. The International Society for
Computers and Their Applications, ISCA. |
In performing the n × n matrix multiply-add operation C = C + A × B on a fine-grain
N × N torus array processor, n ≫ N, the matrices are partitioned into blocks of size
N so that the whole result is obtained by a sequence of N × N matrix multiply-add
operations. When the sizes of matrices are not exact multiples of the array size, the remaining parts may drastically affect the performance depending on the
shape of the matrices. Previously, we represented the 3D index space of the N × N
matrix multiply-add operation as a 3D torus. The projection method was used
to obtain the optimal 2D data allocations to perform the operation on the N × N
torus array processor in N multiply-add-roll steps. In this paper, we use the optimal
data allocations to present two approaches to deal with the fine-grain blocking of
the matrix multiply-add operation. The packing approach performs multiple vector
scaling or vector reduction operations together by proper alignment of data inside
the array processor and applying the suitable data allocation. The padding approach
pads the remaining parts up to the block size N. The analytical experiments show
that the packing approach outperforms the padding approach when the
sizes of the remaining parts are small compared to N. |
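The padding approach described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it runs sequentially, it does not model the torus data allocation or the multiply-add-roll steps, and the helper names (`round_up`, `pad`, `block_mma`, `padded_mma`) are hypothetical. The packing approach is not shown.

```python
def round_up(x, N):
    """Smallest multiple of N that is >= x."""
    return -(-x // N) * N


def pad(m, rows, cols):
    """Zero-padded copy of matrix m (list of lists) with shape rows x cols."""
    padded = [row + [0] * (cols - len(row)) for row in m]
    padded += [[0] * cols for _ in range(rows - len(m))]
    return padded


def block_mma(C, A, B, i0, j0, k0, N):
    """One N x N block step: C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..]."""
    for i in range(i0, i0 + N):
        for j in range(j0, j0 + N):
            acc = C[i][j]
            for k in range(k0, k0 + N):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def padded_mma(C, A, B, N):
    """Compute C + A*B as a sequence of N x N block multiply-adds.

    Matrices whose sizes are not multiples of N are zero-padded up to the
    block size. Returns the cropped result and the number of block steps.
    """
    m, p, n = len(A), len(B), len(B[0])
    M, P, Q = round_up(m, N), round_up(p, N), round_up(n, N)
    Ap, Bp, Cp = pad(A, M, P), pad(B, P, Q), pad(C, M, Q)
    steps = 0
    for i0 in range(0, M, N):
        for j0 in range(0, Q, N):
            for k0 in range(0, P, N):
                block_mma(Cp, Ap, Bp, i0, j0, k0, N)
                steps += 1
    return [row[:n] for row in Cp[:m]], steps


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [[1, 0, 1], [0, 1, 1]]
    C = [[0] * 3 for _ in range(3)]
    print(padded_mma(C, A, B, 2))
```

The step count makes the cost of padding visible: a 3 × 3 result computed with N = 2 uses as many block steps as a 4 × 4 result would, so the work spent on zero-filled parts grows as the remaining parts shrink relative to N.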
|
[sedukhin-05:2007] |
K. Matsumoto, D. Vazhenin, and S. Sedukhin. Transitive Closure on
the PlayStation 3. In Proceedings of the 2nd International
Workshop on Automatic Performance Tuning (iWAPT 2007), page 33,
Tokyo, Japan, September 2007. University of Tokyo. |
The problem of finding all the shortest paths in a graph is one of the most important
optimizations in operations research as it arises in many real-world applications
like bioinformatics, network routing, CAD, etc. The Transitive Closure (TC),
or all-pairs shortest paths, computes the length of a minimum-length path between
all pairs of nodes in a directed n-node distance graph. The Floyd-Warshall (FW)
algorithm is a classical algorithm to solve the TC problem with O(n^3) {min, +} operations
on O(n^2) data. This algorithm involves nested code that exhibits a regular
access pattern with significant data dependences. Porting the FW-algorithm
to different computing platforms usually yields very limited performance
compared with linear algebra problems. We present results of our experiments on
porting the FW-algorithm to the PlayStation 3 (PS3). The parallel algorithm we use is
a blocked (tiled) FW-algorithm from our previous work. The block size was selected
as 64 × 64. The performance comparison of the FW-algorithm running on different
computing platforms, including our result on the PS3, demonstrates an impressive improvement
for the TC problem. |
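The {min, +} recurrence and the blocked (tiled) variant mentioned in this abstract can be sketched as follows. This is a sequential, pure-Python illustration of the standard three-phase tiling, assuming non-negative edge weights; it is not the PS3 implementation, and the function names are hypothetical. The default block size of 64 mirrors the value stated in the abstract.

```python
import math

INF = math.inf  # marks "no edge" in the distance matrix


def floyd_warshall(dist):
    """Classic O(n^3) Floyd-Warshall over {min, +}; returns a new matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            for j in range(n):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d


def _relax_block(d, i0, j0, k0, b, n):
    """Relax block (i0, j0) through intermediate nodes k0 .. k0 + b - 1."""
    for k in range(k0, min(k0 + b, n)):
        for i in range(i0, min(i0 + b, n)):
            dik = d[i][k]
            for j in range(j0, min(j0 + b, n)):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]


def blocked_floyd_warshall(dist, b=64):
    """Tiled Floyd-Warshall with b x b blocks; returns a new matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    starts = range(0, n, b)
    for k0 in starts:
        # Phase 1: the diagonal block depends only on itself.
        _relax_block(d, k0, k0, k0, b, n)
        # Phase 2: blocks in the same block row and block column.
        for x0 in starts:
            if x0 != k0:
                _relax_block(d, k0, x0, k0, b, n)
                _relax_block(d, x0, k0, k0, b, n)
        # Phase 3: all remaining blocks use the phase-2 results.
        for i0 in starts:
            for j0 in starts:
                if i0 != k0 and j0 != k0:
                    _relax_block(d, i0, j0, k0, b, n)
    return d


if __name__ == "__main__":
    graph = [[0, 3, INF, 7],
             [8, 0, 2, INF],
             [5, INF, 0, 1],
             [2, INF, INF, 0]]
    print(blocked_floyd_warshall(graph, b=2))
```

Phase 3 has the same structure as a blocked matrix multiply with (min, +) in place of (+, ×), which is where most of the work and most of the available parallelism lie.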
[hitoshi-04:2007] |
Hitoshi Oi. A Case Study: Performance Evaluation of a DRAM-Based
Solid State Disk. In Proceedings of the Japan-China Joint Workshop on Frontier
of Computer Science and Technology (FCST 2007), pages 57-60, 2007. |
Wuhan, China |
[hitoshi-05:2007] |
Hitoshi Oi, 2006. Reviewer, Microprocessors and Microsystems |
[hitoshi-06:2007] |
Hitoshi Oi, 2006 to present. Member, Embedded Microprocessor Consortium |
[hitoshi-07:2007] |
Hitoshi Oi, 2005 to present. Member, IEEE/Computer Society |
[hitoshi-08:2007] |
Hitoshi Oi, since 2006. Liaison Chair for Asia, ACM International Conference on Computing Frontiers |
[hitoshi-09:2007] |
Hitoshi Oi, 2005 to present. Member, ACM |
[hitoshi-10:2007] |
Hitoshi Oi, since 2006. Program Committee Member, Japan-China Joint Workshop on Frontier of Computer Science and Technology (FCST) |
[hitoshi-11:2007] |
Hitoshi Oi, 2007. Reviewer, IEEE Computer Architecture Letters |
[hitoshi-12:2007] |
Hitoshi Oi, 2008. Reviewer, MOBILe Wireless MiddleWARE, Operating Systems and Applications (Mobilware 2008) |
[hitoshi-13:2007] |
Hitoshi Oi, 2007. Track Chair (High Performance Computing), session chair and reviewer, IEEE 7th International Conference on Computer and Information Technology (CIT2007) |
[sedukhin-06:2007] |
S. Sedukhin, Apr. 2007. IEEE CS, member |
[sedukhin-07:2007] |
S. Sedukhin, Apr. 2007. ACM, member |
[sedukhin-08:2007] |
S. Sedukhin, Apr. 2007. IEICE, member |
[sedukhin-09:2007] |
S. Sedukhin, Apr. 2007. IASTED, Technical Committee on Parallel Processing, member |
[sedukhin-10:2007] |
S. Sedukhin, Apr. 2007. International Journal of Parallel Processing Letters, Member of the Editorial Board |
[sedukhin-11:2007] |
S. Sedukhin, Apr. 2007. International Journal of Neural, Parallel & Scientific Computations, Member of the Editorial Board |
[sedukhin-12:2007] |
S. Sedukhin, Apr. 2007. International Journal of High Performance Systems Architecture, Member of the Editorial Board |
[sedukhin-13:2007] |
S. Sedukhin, June 2007. The 2007 High Performance Computing & Simulation Conference (HPC&S 2007), Program Committee Member |
[sedukhin-14:2007] |
S. Sedukhin, August 2007. The 11th Asia-Pacific Computer Systems Architecture Conference (ACSAC'07), Korea, Steering and Program Committee Member |
[hitoshi-14:2007] |
Kazunori Masuyama (Kanazawa, JP), Yasushi Umezawa (Cupertino,
CA, US), Jeremy J. Farrell (Campbell, CA, US), Sudheer Miryala (San Jose,
CA, US), Takeshi Shimizu (Sunnyvale, CA, US), Hitoshi Oi (Boca Raton, FL,
US), and Patrick N. Conway (Los Altos, CA, US). Fault Containment and
Error Handling in a Partitioned System with Shared Resources,
2008. |
[hitoshi-15:2007] |
Takayuki Hatori. Implementation and Analysis of Large Receive Offload
in a Virtualized System, CSE, 2008. |
[hitoshi-16:2007] |
Takuya Sato. Simulation Study of a Routing Algorithm in a Wireless
Sensor Network, CSE, 2008. |
[sedukhin-16:2007] |
Kazuya Matumoto. Graduation Thesis: Solving All-Pairs Shortest
Path Problem on the PLAYSTATION 3, University of Aizu, 2007. |
Thesis Advisor: S. Sedukhin |
[hitoshi-17:2007] |
Hitoshi Oi. Invited Talks: "Hardware Support for a Wireless Sensor Network Virtual Machine," at the Centre for High Performance Embedded Systems (CHiPES), Nanyang Technological University, September 25, 2007. "Virtual Machines for Resource-Constrained Platforms," at The "Politehnica" University of Timisoara, June 8, 2007. |
[hitoshi-18:2007] |
Hitoshi Oi. Visited Nanyang Technological University (Singapore) and The "Politehnica" University of Timisoara (Romania) for exchange program development. |
[hitoshi-19:2007] |
Hitoshi Oi. (in Japanese, no English title), January 2008. Fuji Xerox White Paper |
[hitoshi-20:2007] |
GigaExpress, a DRAM-based Solid State Disk, has been provided by courtesy
of Fuji Xerox Corporation. |