/ N. N. Mirenkov / Professor
/ S. G. Sedukhin / Associate Professor
/ Shietung Peng / Assistant Professor
/ Sergey V. Ten / Assistant Professor
Our general research program covers high-performance computing and advanced information infrastructure. We are undertaking research in parallel and distributed algorithms/architectures and in visual (multimedia) languages and tools. We lead the joint research projects ``Massively Parallel Computation: Synchronous vs. Asynchronous Designs'', ``Parallel Scientific Computations'', and ``Fine-grain Parallelism for Cellular Automata'', and take part in the projects ``Aizu Supercomputer'' and ``Efficient Algorithmic Design Techniques for Parallel Computing''.
We are also developing a top-down education project related to distributed parallel processing. In addition, together with the Center for Language Research and students, we have started the project ``Advanced Computing and Communication Multimedia English-Japanese Glossary''. We are studying the problem of parallel program portability and developing the supporting tools. Our theoretical work concerns efficient algorithms for parallel numerical computation, including matrix multiplication, the Fourier transform, and singular value decomposition, as well as fault-tolerant routing on interconnection networks such as hypercubes, star graphs, and tori.
As one of our results we would like to mention the S4CAD system. S4CAD (System of Systolic Structures Synthesis) is a high-level interactive software tool for the design, analysis, and modeling of systolic VLSI-oriented array processors. Starting with a localized data dependence graph of an algorithm, S4CAD automatically produces all permissible array processors, among which the designer may choose the optimal one. The World Wide Web page of S4CAD is available at http://www.u-aizu.ac.jp/labs/sw-dpp/S4CAD/s4cad.html.
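To illustrate the kind of step S4CAD automates, the following is a minimal sketch (with names and the particular schedule chosen by us, not taken from the tool) of the classical space-time mapping behind systolic synthesis: each index point $p$ of a localized dependence graph is assigned a firing time by a linear schedule and a processing element by a projection along one axis.

```python
# Sketch of linear space-time mapping for systolic synthesis (our
# illustration): time via a linear schedule tau(p) = lambda . p,
# processor allocation via projection (dropping one index axis).

def schedule(p, lam):
    """Linear timing function tau(p) = lambda . p."""
    return sum(a * b for a, b in zip(lam, p))

def allocate(p, drop_axis):
    """Project the index point onto a processor array by dropping one axis."""
    return tuple(x for i, x in enumerate(p) if i != drop_axis)

# 3x3x3 index space of matrix multiplication c[i][j] += a[i][k] * b[k][j]
points = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
lam = (1, 1, 1)          # one valid schedule: t = i + j + k
mapping = {p: (schedule(p, lam), allocate(p, 2)) for p in points}

# A design is permissible only if index points mapped to the same PE
# never share a firing time; this projection along k is conflict-free.
for pe in {allocate(p, 2) for p in points}:
    times = [t for p, (t, q) in mapping.items() if q == pe]
    assert len(times) == len(set(times))
```

Enumerating all valid schedule vectors and projection directions, rather than fixing one as above, yields the set of permissible designs among which a designer chooses.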
An important question in our research is the transparency of parallel programs and the accessibility of massively parallel computers to large user populations. We are developing a multimedia technology for the interactive specification of application algorithms. This technology is related to the visualization, animation, and sonification of data processing methods. In fact, it is a 'filmification' of application algorithms. Each film is based on a rather large piece of knowledge about a class of algorithms. Our approach is close to the idea of algorithmic skeletons (higher-order functions). However, skeletons leave a big gap between this formal machinery and the human, a multi-channel being. So, we use algorithmic multimedia skeletons based on both mathematical and physical abstractions.
Each film is a series of frames (computational steps) displaying one or more parameterized sets of nodes and/or moving objects in multi-dimensional space-time. Each frame highlights a subset of these nodes and/or moving objects. The films have their own sets of figures, sounds, and colors, and their own ``shape'' of movement. Users recognize them through a combination of all these features. Each film defines a partial order of scanning of the nodes or objects (and possibly of colors and sounds). In fact, the partial orders of scanning are defined via the positions of nodes and objects, the trajectories of movement, the external configuration patterns of surrounding nodes, and/or the variable values (internal configurations) on the nodes.
As a rule, computation specified on different nodes (objects) of a frame is considered to be performed in parallel, while computation specified in different frames is performed sequentially. So it is possible to say: the shorter the film, the better.
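This execution model can be rendered as a toy sketch (the representation and names are ours, not the system's): a film is a list of frames, all node operations inside one frame are logically parallel, and the frames themselves run in sequence.

```python
# Toy model of film execution: within a frame, every node operation reads
# the state as it was at the start of the frame (parallel semantics);
# frames are applied one after another (sequential semantics).

def run_film(film, state):
    for frame in film:                         # frames: sequential steps
        updates = {node: op(state) for node, op in frame.items()}
        state.update(updates)                  # nodes in one frame: parallel
    return state

# Two-frame example: frame 1 doubles a and b "in parallel",
# frame 2 then sums the results into c.
film = [
    {"a": lambda s: 2 * s["a"], "b": lambda s: 2 * s["b"]},
    {"c": lambda s: s["a"] + s["b"]},
]
result = run_film(film, {"a": 1, "b": 2, "c": 0})
# result["c"] == 6
```

Note that all updates of a frame are computed before any is applied, which is what makes the within-frame operations order-independent, matching the ``parallel within a frame, sequential across frames'' rule.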
The user defines the specification by creating his new film. The corresponding program (sequential or parallel) is generated automatically. Each film itself is considered a special-purpose application for which a parallel program is developed in advance as an item of the system library. The specification of a problem is considered input data for such a parallel program. However, this data can introduce a composition of several films, an irregularity of computation, special data sizes, etc. This means that, to achieve efficiency, it is impossible to have only one variant of the film implementation. So, each film is supported by at least a few parallel programs from the system library. Different programs correspond to different conditions defined by the specification. In other words, a film has a special decision-making space. This space takes into account the levels of problem granularity, irregularity, the dynamics of the data decomposition, and other algorithm features. Because each film reflects certain knowledge about data processing, we select the best possible solutions for implementing this data processing under different conditions, on the basis of the best hand-made algorithms. These solutions ``approximate'' the possible decision-making.
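The selection within such a decision-making space can be pictured as a simple dispatch (variant names and thresholds below are purely hypothetical, for illustration only): the specification yields a few condition flags, and one of several pre-built library variants is chosen.

```python
# Hypothetical sketch of a film's decision-making space: coarse features
# extracted from the specification select one library implementation.

def pick_variant(spec):
    """Choose a library implementation from specification features."""
    if spec["irregular"]:
        return "dynamic_decomposition_variant"   # irregular computation
    if spec["size"] < 1024:
        return "fine_grain_variant"              # small problem sizes
    return "block_decomposition_variant"         # large regular problems

assert pick_variant({"irregular": False, "size": 10**6}) == "block_decomposition_variant"
```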
Refereed Journal Papers
In this paper, we present a new parallel NC algorithm for finding a smallest mixed dominating set in trees. The algorithm is based on tree compression techniques that have traditionally been used to evaluate linear arithmetic expressions in parallel. The model of parallel computation used is the CRCW P-RAM (Concurrent Read Concurrent Write Parallel RAM), where write conflicts are resolved in a non-deterministic fashion. The algorithm requires $O(n)$ processors and runs in $O(\log n)$ time.
To solve the algebraic path problem, which includes transitive closure, the shortest paths problem, finding the minimum-cost spanning tree of a graph, and matrix inversion, three functionally equivalent algorithms are introduced. These algorithms differ in the regularity of their computations. For each of the algorithms, we have systematically derived and evaluated various designs of systolic array processors, which differ in their dimension, the number and layout of processing elements, the total computation time, etc. Specific properties of the operations performed for different problems in the underlying algebra may be an issue for further design improvement.
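The unifying point of the algebraic path problem is that these seemingly different problems are one generic closure computation over different semirings. A sequential sketch (the systolic derivations in the paper are another matter) of the Floyd-Warshall/Kleene scheme, parameterized by the semiring operations:

```python
# Generic algebraic path computation: the same triple loop, instantiated
# with different semiring operations (oplus, otimes).

def algebraic_paths(a, oplus, otimes):
    """In-place closure of matrix a over a semiring."""
    n = len(a)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                a[i][j] = oplus(a[i][j], otimes(a[i][k], a[k][j]))
    return a

INF = float("inf")
# Shortest paths: the (min, +) semiring.
d = [[0, 3, INF], [INF, 0, 1], [2, INF, 0]]
algebraic_paths(d, min, lambda x, y: x + y)
# Transitive closure: the (or, and) semiring, same code.
t = [[True, True, False], [False, True, True], [True, False, True]]
algebraic_paths(t, lambda x, y: x or y, lambda x, y: x and y)
```

The specific properties of each operation pair (idempotence of `min`, absorption in Boolean algebra, etc.) are exactly what the abstract refers to as a possible source of further, problem-specific design improvement.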
A visual language paradigm for the interactive specification of application algorithms is proposed for consideration. The approach is based on a set of computational schemes (shapes of computation) presented by color figures, pictures, and animation films with sound accompaniment. Each film is related to a series of frames (computational steps) and reflects some knowledge about data processing. Each frame ``brightens up'' a substructure of data for which operations should be specified. As a rule, this substructure is a set of points and/or moving objects in a multi-dimensional space-time. A user embeds his algorithm into the computational schemes by making them more precise. In fact, he defines the specification by creating his new film. The corresponding program (sequential or parallel) is generated automatically.
An approach supporting parallel program transparency for the VIM language paradigm is considered. The basis of the approach is that for each computational scheme (``shape'' of computation) presented by an animation film, a ``virtual'' parallel program is developed. Application specifications are considered input data for the ``virtual'' programs. Such a ``virtual'' program is based on a set of hand-made library items covering certain fields of data processing. Each item is a scalable parallel program. Only one of these parallel programs is used at run time; which one depends on the specification and the system environment.
An approach supporting a new programming philosophy and a way of developing the corresponding technology are presented. The idea behind the approach is to use animation films as communication units for computer-human dialog. These films have their own sets of figures, sounds, and colors, and their own ``shape'' of movement. Users recognize the units through a combination of all these features. Each film is related to a series of frames (computational steps) and reflects some knowledge about data processing. Each frame highlights a substructure of data for which operations should be specified. As a rule, this substructure is a set of points and/or moving objects in a multi-dimensional space-time. To specify a problem, a user employs modification of the system films, composition of these films, viewing a film being created, etc. The corresponding program (sequential or parallel) is generated automatically.
The set-to-set routing problem, which tolerates certain node failures, is considered. We first present an interesting theorem for the set-to-set routing problem in general graphs. Then, efficient algorithms are given for the set-to-set routing problem in hypercubes and star graphs. Hypercubes and star graphs have been accepted as attractive interconnection networks. They have logarithmic and sublogarithmic node degree and diameter, respectively. They both possess rich structure and symmetry properties as well as many desirable fault tolerance characteristics. For $n$-dimensional hypercubes and star graphs, our algorithms run in optimal $O(n^{2})$ time. The algorithms find node disjoint paths of length at most $2n-2$ for hypercubes and of length at most $\lfloor \frac{3(n-1)}{2} \rfloor + 5$ for star graphs.
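For readers unfamiliar with hypercube routing, a short sketch of the standard building block that such algorithms rest on (not the set-to-set algorithm of the paper itself): nodes of $H_n$ are $n$-bit labels, and a route is obtained by correcting the differing bits one dimension at a time, so any two nodes are at most $n$ hops apart.

```python
# Dimension-order (bit-fixing) routing in the n-dimensional hypercube:
# cross, in increasing dimension order, every dimension where the source
# and target labels disagree. Path length equals the Hamming distance.

def bit_fixing_route(s, t, n):
    """Greedy dimension-order path from node s to node t in H_n."""
    path = [s]
    for d in range(n):
        if (s ^ t) & (1 << d):     # dimensions where s and t disagree
            s ^= 1 << d            # cross that dimension
            path.append(s)
    return path

p = bit_fixing_route(0b000, 0b101, 3)
# p == [0b000, 0b001, 0b101]; length = Hamming distance = 2
```

The difficulty in the disjoint-path results above is precisely that many such greedy routes collide; the papers' contribution is constructing routes that avoid each other (and faults) without losing much length.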
In this paper, we give optimal time algorithms for the following node disjoint path problems in $n$-dimensional star graphs $G_n$: (1) Given an arbitrary node $s$ and a set of arbitrary distinct nodes $T=\{t_1,\ldots, t_{n-1}\}$, $s\not\in T$, in $G_n$, find $n-1$ node disjoint paths connecting $s$ and each node of $T$; and (2) Given arbitrary $k= \lceil \frac{n-1}{2}\rceil$ pairs of distinct nodes $(s_1,t_1),\ldots,(s_k,t_k)$ in $G_n$, find $k$ node disjoint paths, each path connecting one pair of nodes. For problem (1), our algorithm finds the $n-1$ node disjoint paths of length at most $d(G_n)+4$ in $O(n^2)$ optimal time, where $d(G_n)=\lfloor\frac{3(n-1)}{2}\rfloor$ is the diameter of $G_n$. The length of the paths given in this paper improves the previous result of $5n-2$. For problem (2), our algorithm finds the $k$ node disjoint paths of length at most $d(G_n)+5$ in $O(n^2)$ optimal time. Our results improve the previous ones of time $O(n^4\log n)$ and length $4(n-2)$, respectively.
In this paper, we introduce a general fault-tolerant routing problem, {\it cluster fault tolerant routing}, which is a natural extension of the well studied node fault tolerant routing problem. A cluster is a connected subgraph of a graph $G$, and a cluster is faulty if all nodes in it are faulty. {\it Cluster fault tolerant} (abbreviated as CFT in what follows) routing studies the number of fault clusters and the size (diameter) of the clusters that an interconnection network can tolerate in fault tolerant routing problems. In this paper, we investigate $k$-pairwise CFT routing in the $n$-dimensional hypercube $H_n$. We give an $O(n^2\log n)$ time algorithm which finds the required routing paths for $k$-pairwise CFT routing in $H_n$. Our algorithm implies an $O(n^2\log n)$ algorithm for the $k$-pairwise node disjoint path problem in $H_n$, which significantly improves the previous time complexity of $O(n^3\log n)$.
In this paper, we study the fault tolerant properties of $n$-dimensional hypercubes $H_n$ for node-to-set and set-to-set routing problems on a general fault-tolerant routing model, {\it cluster fault tolerant routing}, which is a natural extension of the well studied node fault tolerant routing. A cluster of a graph $G$ is a connected subgraph of $G$, and a cluster is called faulty if all nodes in the cluster are faulty. For node-to-set and set-to-set routing in $H_n$, where $k$ ($2\leq k\leq n$) fault-free node disjoint paths are needed, we show that the maximum number of fault clusters of diameter at most 1 that can be tolerated is $n-k$. We give $O(kn)$ optimal time algorithms which find $k$ fault-free node disjoint paths of length at most $n+3$ for node-to-set and $k$ fault-free node disjoint paths of length at most $2n$ for set-to-set cluster fault tolerant routing problems in $H_n$, respectively.
In this paper, we give an efficient algorithm for the following $k$-pairwise node disjoint path problem in $n$-dimensional hypercubes $H_n$: Given $k=\lceil \frac{n}{2}\rceil$ pairs of $2k$ distinct nodes $(s_1,t_1),\ldots,(s_k,t_k)$ in $H_n$, $n\geq 4$, find $k$ node disjoint paths, each path connecting one pair of nodes. Our algorithm finds the $k$ node disjoint paths in $O(n^2\log^* n)$ time, which improves the previous result of $O(n^2\log n)$. The length of the paths constructed by our algorithm is at most $n+\lceil \log n\rceil +1$, which improves the previous result of $2n$ as well.
In this paper, we give an $O(r^2)$ time algorithm for constructing a fault-free routing path of optimal length between any two non-faulty nodes of an $r$-dimensional torus with $2r-1$ faulty nodes. The fastest previously known algorithm for this problem takes $O(r^3)$ time. We show that the Rabin diameter of an $r$-dimensional torus is its diameter plus one. We also describe a cluster fault tolerant (CFT) routing model and give an efficient algorithm for node-to-node CFT routing.
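As background for torus routing (this is the elementary fault-free case, a simplification of ours, not the paper's fault-tolerant algorithm): in each dimension of a $k$-ary torus one moves in whichever wraparound direction is shorter, which yields a shortest route.

```python
# Dimension-order shortest routing on an r-dimensional k-ary torus:
# in each dimension, go forward or exploit the wraparound link,
# whichever covers the cyclic distance in fewer hops.

def torus_steps(s, t, k):
    """Signed per-dimension moves for a shortest route on a k-ary torus."""
    moves = []
    for a, b in zip(s, t):
        d = (b - a) % k
        moves.append(d if d <= k - d else d - k)   # forward, or wrap back
    return moves

# On a 5-ary 2-D torus, (0, 1) -> (4, 3): one step backward through the
# wraparound in dimension 0 and two steps forward in dimension 1.
m = torus_steps((0, 1), (4, 3), 5)
assert sum(abs(x) for x in m) == 3
```

Faulty nodes break exactly this greedy choice, and the paper's contribution is recovering an optimal-length fault-free path despite up to $2r-1$ faults, in $O(r^2)$ time.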
A systematic approach to the formal synthesis of a set of systolic array processors for matrix-formulated algorithms is presented. This approach is the theoretical background for the visual software tool S4CAD. Starting with the source algorithm specification in the form of a system of linear equations, the formal approach and the S4CAD tool allow us to obtain and examine the set of permissible array processors. As an example, the design of systolic array processors for the transitive closure algorithm is shown.
The S4CAD software tool for systolic design is presented. The tool uses the advanced media technologies of MS Windows, which make it more comfortable for the user to evaluate and choose an optimal solution while observing requirements on computing time, the number of processing elements, the connection topology, etc. A number of basic algorithms of linear algebra and graph theory were considered and added to the S4CAD library.
This paper presents an approach to, and examples of, polygonal approximation of surfaces defined by implicit functions. The algorithm has been designed to be scalable and to provide a high degree of parallelization. The algorithm was first implemented on a transputer network with a toroidal architecture; later, a version suitable for execution under the PVM system was implemented.
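The core idea of such approximation schemes can be shown in a minimal two-dimensional analogue (our illustration, not the paper's parallel code): sample the implicit function on a grid and place a vertex, by linear interpolation, on every grid edge whose endpoints have opposite signs.

```python
# 2-D sketch of polygonal approximation of an implicit shape f = 0:
# sample f on a uniform grid and interpolate a zero crossing on each
# grid edge where the sampled values change sign.

def implicit_curve_points(f, lo, hi, n):
    """Approximate {f(x, y) = 0} by zero crossings on grid edges."""
    h = (hi - lo) / n
    xs = [lo + i * h for i in range(n + 1)]
    pts = []
    for i in range(n + 1):
        for j in range(n):
            # one vertical and one horizontal grid edge per (i, j)
            for (p, q) in [((xs[i], xs[j]), (xs[i], xs[j + 1])),
                           ((xs[j], xs[i]), (xs[j + 1], xs[i]))]:
                fp, fq = f(*p), f(*q)
                if fp * fq < 0:                     # sign change on the edge
                    t = fp / (fp - fq)              # linear interpolation
                    pts.append((p[0] + t * (q[0] - p[0]),
                                p[1] + t * (q[1] - p[1])))
    return pts

# Unit circle: every recovered point lies close to radius 1.
pts = implicit_curve_points(lambda x, y: x * x + y * y - 1, -1.5, 1.5, 30)
assert all(abs((x * x + y * y) ** 0.5 - 1) < 0.01 for x, y in pts)
```

The per-cell independence of the sign tests is what makes this family of algorithms easy to distribute: each processor of the network can handle its own block of grid cells.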
This paper suggests a way to analyze Petri Nets by checking the ordering relations between places and transitions. The method is based on unfolding the original net into an equivalent acyclic description. We improve on the previously known cutoff criterion for truncating the unfolding. No restrictions are imposed on the class of general PNs. The new criterion significantly reduces the size of the unfolding obtained from a PN. The implementability conditions are formulated in such a way that they can be checked by analyzing the ordering relations between signal transitions rather than by traversing states. This avoids the state explosion problem for highly parallel specifications.
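For orientation, a toy sketch (ours) of the basic Petri net token-game semantics that unfolding analysis builds on: a transition is enabled when all its input places hold tokens, and firing it moves tokens from inputs to outputs.

```python
# Minimal Petri net firing rule: markings are place -> token counts;
# a transition with preset `pre` and postset `post` fires by consuming
# one token from each input place and producing one in each output place.

def enabled(marking, pre):
    return all(marking.get(p, 0) >= 1 for p in pre)

def fire(marking, pre, post):
    m = dict(marking)
    for p in pre:
        m[p] -= 1
    for p in post:
        m[p] = m.get(p, 0) + 1
    return m

# Net: t1 consumes p1 and produces p2; t2 consumes p2 and produces p3.
m0 = {"p1": 1}
assert enabled(m0, ["p1"]) and not enabled(m0, ["p2"])
m1 = fire(m0, ["p1"], ["p2"])
m2 = fire(m1, ["p2"], ["p3"])
```

Exploring all reachable markings this way is exactly the state traversal that explodes for highly parallel nets; the unfolding approach of the paper sidesteps it by working with ordering relations on an acyclic copy of the net instead.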
In this study, we conducted an empirical static analysis of the NASA and CLAPACK benchmarks. Different metrics were measured and their dispersion analyzed. All the programs are written in C and compiled with two different compilers, cc and gcc; for each compiler we used both the no-optimization and full-optimization modes. The goal of this study is to better understand program structure and program requirements in order to derive hints about architectural support or compile-time optimization algorithms.
N. Mirenkov, Q-P. Gu, S. Peng, and S. Sedukhin, editors. The First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis. IEEE Computer Society Press, 1995.
Editor of Proceedings.