Professor |
Assistant Professor |
Visiting Researcher |
Toroidal computer architecture and algorithms
We have shown that the matrix multiply-add operation can be used effectively for multidimensional signal processing, such as the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT), and the Discrete Hadamard Transform (DHT). These 2D/3D transforms can be represented as three/four matrix multiply-add operations and implemented on a 2D/3D torus array processor with high efficiency and speed. A design of the toroidal 3D computer for multidimensional DSP transforms is currently under consideration. The model of a multidimensional computer was used in the Cmpware CMPDK multiprocessor development environment. We have implemented a data-driven modeling technique to simultaneously produce simulation models and programming tools (an assembler and a disassembler) for a new architecture. The technical specification of the functional unit for the scalar fused multiply-add operation in different algebras was implemented and tested in cooperation with Zuken Ltd. (Japan).
Embedded Processor Design and Evaluation
We are particularly interested in the Java Virtual Machine (JVM) for embedded platforms. So far, we have investigated two sources of execution inefficiency that are inherent to the architecture of embedded JVMs, local variable accesses and redundant stack operations, and published the results at LCTES05 and Computing Frontiers 2006. We represent the University of Aizu as an academic member of the Embedded Microprocessor Benchmark Consortium (EEMBC), and have been using their GrinderBench (Java benchmark programs for embedded processors) in our research on this topic.
Virtual Machine for Sensor Network Nodes
One possible approach to writing concise programs for sensor network nodes is to use a virtual machine such as Maté. In this approach, the program is represented by virtual machine instructions, which are much denser than the native instructions of the microcontroller on the node. However, while the program can be much smaller than its native-instruction equivalent, executing programs in virtual instructions can be costly due to the interpretation of the (dense) virtual instructions. It has therefore been pointed out that this approach is feasible when the node is frequently reprogrammed but the program is infrequently executed. We have been working on this topic with the group led by Dr. Chris Bleakley, Lecturer in the School of Computer Science and Informatics at University College Dublin (UCD), Ireland. Hitoshi Oi visited UCD in May 2006 for a kick-off meeting and then stayed at UCD from June to September 2006 to collaborate with his group on this research topic. We have proposed an architecture for a wireless sensor network virtual machine that utilizes hardware acceleration techniques to reduce the execution inefficiencies of the virtual machine. We plan to pursue this topic further by designing detailed models of the proposed virtual machine architecture.
Performance Analysis of Server Workload
In particular, at the Operating Systems Laboratory we are currently working with the Open Source Development Labs' Database Test suites (OSDL DBT), which are open-source implementations of the TPC benchmark programs. In addition to the behavioral analysis of the workload, we have also been investigating and evaluating hardware components of modern computers, such as multi-core processors and solid-state disks (SSDs), under such workloads. Possible outcomes of this project include analytical models of server systems, server simulation programs with a graphical user interface that can be used for classroom teaching in computer systems courses, and optimizations of hardware and software components (such as the memory hierarchy and disk systems) for performance improvement.
System Level Virtualization
These advantages of system level virtualization are, however, not free. Various system operations, such as dispatching interrupts to the appropriate virtual machines or protecting a VM from other VMs' failures, incur system resource overhead, including wasted CPU cycles and memory space. One of our objectives in this topic is to identify the hardware and software components that can become performance bottlenecks in a virtualized system; we analyze each such component and propose solutions for it. Modern processors are equipped with hardware support for system level virtualization (e.g. Intel VT and AMD-V), and the industry is now working toward the virtualization of I/O devices (e.g. PCI-IOV). Evaluating the effectiveness of these hardware support techniques for virtualization is another objective of our research in this area. |
[hitoshi-01:2006] |
Hitoshi Oi. Instruction Folding in a Hardware-Translation Based
Java Virtual Machine. In Proceedings of the ACM International Conference on
Computing Frontiers, pages 138–145. ACM/SIGMICRO, May 2006. |
Bytecode hardware-translation improves the performance of a Java Virtual Machine
(JVM) with small hardware resource and complexity overhead. Instruction
folding is a technique to further improve the performance of a JVM by reducing
the redundancy in the stack-based instruction execution. However, the variable
instruction length of the Java bytecode makes the folding logic complex. In this
paper, we propose a folding scheme with reduced hardware complexity and evaluate
its performance. For seven benchmark cases, the proposed scheme folded 6.6%
to 37.1% of the bytecodes, which corresponds to 84.2% to 102% of the PicoJava-II's
performance. |
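The idea behind instruction folding, collapsing a stack-based load/load/op/store pattern into a single register-style operation, can be illustrated with a minimal sketch. The opcode names and tuple encoding below are hypothetical, for illustration only; they are not the paper's hardware folding scheme or the actual Java bytecode format.

```python
# Hypothetical bytecode model: each instruction is a (mnemonic, operand)
# tuple.  fold() collapses the pattern [load a, load b, op, store c] into
# a single register-style instruction (op, dest, src1, src2), removing
# the intermediate stack pushes and pops.
def fold(bytecodes):
    folded, i = [], 0
    while i < len(bytecodes):
        w = bytecodes[i:i + 4]
        if (len(w) == 4 and w[0][0] == "load" and w[1][0] == "load"
                and w[2][0] in ("add", "mul") and w[3][0] == "store"):
            folded.append((w[2][0], w[3][1], w[0][1], w[1][1]))
            i += 4                      # four bytecodes folded into one
        else:
            folded.append(bytecodes[i])
            i += 1
    return folded

# c = a + b: four stack bytecodes fold into one operation
prog = [("load", "a"), ("load", "b"), ("add", None), ("store", "c")]
folded = fold(prog)
```

A hardware folding unit does this pattern matching in the decode stage; the complexity the paper addresses comes from Java bytecodes having variable lengths, which this fixed-width sketch sidesteps.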
|
[sedukhin-01:2006] |
A.S. Zekri and S.G. Sedukhin. Fine-grained Matrix Multiply-Add
on a Torus Array Processor. In Bidyut Gupta, editor, Proceedings
of the ISCA 22nd International Conference on Computers and Their Applications,
pages 44–51, Honolulu, Hawaii, March 2007. ISCA. |
In performing the n×n matrix multiply-add operation C = C + A×B on a fine-grain
N×N torus array processor, n ≥ N, the matrices are partitioned into blocks
of size N so that the whole result is obtained by a sequence of N×N matrix
multiply-add operations. When the sizes of the matrices are not exact multiples of
the array size, the remaining parts may drastically affect the performance, depending
on the shape of the matrices. Previously, we represented the 3D index
space of the N×N matrix multiply-add operation as a 3D torus. The projection
method was used to obtain the optimal 2D data allocations to perform the operation
on the N×N torus array processor in N multiply-add-roll steps. In this
paper, we use the optimal data allocations to present two approaches to deal with
the fine-grain blocking of the matrix multiply-add operation. The packing approach
performs multiple vector scaling or vector reduction operations together
by properly aligning the data inside the array processor and applying the suitable
data allocation. The padding approach pads the remaining parts up to the block
size N. Analytical experiments show a performance gain of the packing
approach over the padding approach when the sizes of the remaining parts are
small compared to N. |
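The padding approach can be sketched in a few lines of NumPy: pad all operands with zeros up to the next multiple of the block size N, then run the blocked multiply-add. The function name is illustrative and the sequential triple loop merely stands in for the torus array's N×N multiply-add steps; this is a functional model, not the array-processor implementation.

```python
import numpy as np

def padded_mma(C, A, B, N):
    """C + A @ B for n x n matrices by zero-padding all operands up to a
    multiple of the block size N and multiplying block by block."""
    n = A.shape[0]
    m = -(-n // N) * N                  # n rounded up to a multiple of N
    Ap = np.zeros((m, m)); Ap[:n, :n] = A
    Bp = np.zeros((m, m)); Bp[:n, :n] = B
    Cp = np.zeros((m, m)); Cp[:n, :n] = C
    for i in range(0, m, N):
        for j in range(0, m, N):
            for k in range(0, m, N):
                # one N x N matrix multiply-add per step
                Cp[i:i+N, j:j+N] += Ap[i:i+N, k:k+N] @ Bp[k:k+N, j:j+N]
    return Cp[:n, :n]

rng = np.random.default_rng(0)
A, B, C = (rng.random((5, 5)) for _ in range(3))
R = padded_mma(C, A, B, 4)              # n = 5 is not a multiple of N = 4
```

The wasted work on the zero blocks is exactly why the packing approach wins when the remaining parts are small relative to N.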
|
[sedukhin-02:2006] |
A.S. Zekri and S.G. Sedukhin. Matrix Transpose on 2D Torus
Array Processor. In Sixth International Conference on Computer and Information Technology (CIT 2006), pages 45–46, Seoul, Korea,
Sept. 2006. IEEE Computer Society. |
Previously, we represented the index space of the (n×n)-matrix multiply-add
problem C = C + A×B as a 3D torus, where A, B, and C are rolled along the
corresponding axes of the index space. All optimal 2D data allocations (resulting
from projection) to solve the problem on the n×n torus array processor in n
multiply-add-roll steps were obtained. In this paper, we formulate the operations
needed for aligning both the data before computing and the results after computing
as matrix multiply-add problems. These alignment operations are combined
with the optimal data allocations that solve the matrix multiply-add problem
to propose new algorithms to transpose an n×n matrix on the n×n torus array
processor in O(n) multiply-add-roll steps. Using the proposed algorithms, we
showed different approaches to solve the transposed matrix multiply-add problem,
C = C + A^T×B^T, on the 2D torus array processor. |
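One algebraic fact relevant to the transposed multiply-add problem is the identity A^T B^T = (BA)^T, which lets a transposed product be computed as one ordinary multiply followed by a single transpose. The NumPy check below verifies the identity numerically; it is a sanity-check sketch, not the paper's torus-array algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.random((6, 6)) for _ in range(3))

# A^T B^T = (B A)^T, so C + A^T B^T needs only one ordinary
# multiply-add followed by a transpose of the product term
lhs = C + A.T @ B.T
rhs = C + (B @ A).T
assert np.allclose(lhs, rhs)
```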
|
[sedukhin-03:2006] |
A.S. Zekri and S.G. Sedukhin. The general matrix multiply-add
operation on 2D torus. In Proc. of the 20th International
Parallel and Distributed Processing Symposium (IPDPS 2006), pages 125–131, Rhodes Island, Greece, April 2006. IEEE Computer Society. |
In this paper, the computation space of the (n×n)-matrix multiply-add problem
C = C + A×B is represented as a 3D n×n×n torus. All possible time-scheduling
functions to activate the computations inside the 3D torus are determined. To
maximize efficiency when solving a single problem, we mapped the computation
points onto the 2D n×n toroidal array processor. All optimal 2D data allocations
that solve the problem in n multiply-add-roll steps are obtained. The well-known
Cannon's algorithm is one of the resulting allocations. We used the optimal
data allocations to describe all variants of the GEMM operation on the
2D toroidal array processor. By controlling the data flow, the transposition operation
is avoided in 75% of the GEMM variants, and only one explicit
matrix transpose is needed for the remaining 25%. Ultimately, we described four
versions of the GEMM operation covering the possible layouts of the initially
loaded data in the array processor. |
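Cannon's algorithm, mentioned above as one of the resulting allocations, can be modeled on a virtual n×n torus with NumPy rolls standing in for the inter-PE shifts. This is an illustrative functional sketch under the assumption of one element per processing element, not the paper's derivation.

```python
import numpy as np

def cannon_mma(C, A, B):
    """C = C + A @ B on a virtual n x n torus (one element per PE):
    after the initial skew, each of the n multiply-add-roll steps
    multiplies the locally held pair and rolls A left and B up."""
    n = A.shape[0]
    A, B, C = A.copy(), B.copy(), C.copy()
    for i in range(n):                  # skew: row i of A left by i
        A[i] = np.roll(A[i], -i)
    for j in range(n):                  # skew: column j of B up by j
        B[:, j] = np.roll(B[:, j], -j)
    for _ in range(n):                  # n multiply-add-roll steps
        C += A * B                      # elementwise: one MAC per PE
        A = np.roll(A, -1, axis=1)      # roll A one position left
        B = np.roll(B, -1, axis=0)      # roll B one position up
    return C

rng = np.random.default_rng(1)
A, B, C = (rng.random((4, 4)) for _ in range(3))
R = cannon_mma(C, A, B)
```

After the skew, PE (i, j) holds A[i, (i+j) mod n] and B[(i+j) mod n, j], so the n rolls sweep the full inner-product index for every (i, j) pair.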
|
[hitoshi-02:2006] |
Hitoshi Oi and C. J. Bleakley. Towards a Low Power Virtual
Machine for Wireless Sensor Network Motes. In Proceedings of the Japan-China Joint Workshop on Frontier of Computer Science and Technology
(FCST 2006). IEEE, November 2006. |
Virtual Machines (VMs) have been proposed as an efficient programming model
for Wireless Sensor Network (WSN) devices. However, the processing overhead
required for VM execution has a significant impact on the power consumption and
battery lifetime of these devices. This paper analyses the sources of power consumption in
the Maté VM for WSNs. The paper proposes a generalised processor
architecture allowing for hardware acceleration of VM execution. The paper proposes
a number of hardware accelerators for Maté VM execution and assesses
their effectiveness. |
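The interpretation overhead this work targets comes from the decode-and-dispatch loop that every virtual instruction passes through. A toy stack-machine interpreter makes the cost visible: each dense virtual instruction triggers a fetch, a decode, and a branch before any useful work happens. The opcodes below are made up for illustration; they are not the Maté instruction set.

```python
# Toy dispatch loop for a stack-based VM.  Every virtual instruction
# costs a fetch, a decode, and a dispatch branch on top of the actual
# operation -- the overhead that hardware acceleration aims to remove.
PUSH, ADD, MUL, HALT = range(4)

def interpret(code):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1          # fetch + decode
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif op == HALT:
            return stack.pop()

# (2 + 3) * 4 encoded in nine "virtual" bytes
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
```

The density advantage is also visible here: the whole expression fits in nine bytes of virtual code, at the price of running the dispatch loop nine times.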
[hitoshi-03:2006] |
Hitoshi Oi. The University of Aizu Competitive Research Fund,
2006-2007. |
Title: Design Investigation of the Embedded Microprocessors, Amount: ¥817,000 |
[hitoshi-04:2006] |
Hitoshi Oi, Feb. 2006. Professional Member, IEEE/CS |
[hitoshi-05:2006] |
Hitoshi Oi, Apr. 2005. Professional Member, ACM |
[hitoshi-06:2006] |
Hitoshi Oi, Jan. 2006. Academic Member, representative for Aizu University, EEMBC |
[hitoshi-07:2006] |
Hitoshi Oi, Jul. 2006. Reviewer for Microprocessors and Microsystems, Elsevier |
[hitoshi-08:2006] |
Hitoshi Oi, May 2006. ACM International Conference on Computing Frontiers, Liaison Chair for Asia |
[hitoshi-09:2006] |
Hitoshi Oi, November 2006. Program Committee Member for the Japan-China Joint Workshop on Frontier of Computer Science and Technology (FCST 2006) |
[hitoshi-10:2006] |
Reviewer for the IEEE 2006 International Workshop on Signal Processing Systems |
[sedukhin-04:2006] |
S. Sedukhin, Apr. 2006. IEEE CS, member |
[sedukhin-05:2006] |
S. Sedukhin, Apr. 2006. ACM, member |
[sedukhin-06:2006] |
S. Sedukhin, Apr. 2006. IEICE, member |
[sedukhin-07:2006] |
S. Sedukhin, Apr. 2006. IASTED Technical Committee on Parallel Processing, member |
[sedukhin-08:2006] |
S. Sedukhin, Apr. 2006. International Journal of Parallel Processing Letters, Member of the Editorial Board |
[sedukhin-09:2006] |
S. Sedukhin, Apr. 2006. International Journal of High Performance Systems Architecture, Member of the Editorial Board |
[sedukhin-10:2006] |
S. Sedukhin, Sept. 2006. The 11th Asia-Pacific Computer Systems Architecture Conference (ACSAC 2006), Korea, Steering Committee Member |
[sedukhin-11:2006] |
S. Sedukhin, August 2006. The 8th Workshop on High Performance Scientific and Engineering Computing, Program Committee Member |
[sedukhin-12:2006] |
S. Sedukhin, Apr. 2006. International Journal of Neural, Parallel & Scientific Computations, Member of the Editorial Board |
[sedukhin-13:2006] |
S. Sedukhin, May 2006. The 2006 High Performance Computing & Simulation Conference (HPC&S 2006), Program Committee Member |
[sedukhin-14:2006] |
S. Sedukhin, Feb. 2006. IASTED International Conference on Parallel and Distributed Computing and Networks, Program Committee Member |
[hitoshi-11:2006] |
Masato Chiba. Graduation Thesis: Performance Evaluation of
Dual Core Processors under On-Line Transaction Processing Workload,
University of Aizu, 2007. |
Thesis Adviser: Hitoshi Oi |
|
[sedukhin-15:2006] |
Yuusuke Kobayashi. Graduation Thesis: An Evaluation of
Four Algorithms for the Algebraic Path Problem, University of Aizu,
2006. |
Thesis Advisor: Sedukhin, S. |
|
[sedukhin-16:2006] |
Ben Hachimori. Graduation Thesis: Parallelization of the Raytracing
Algorithm with FastMATH Processor, University of Aizu, 2006. |
Thesis Advisor: Sedukhin, S. |