/ Harvey Abramson / Professor
/ David S.L. Wei / Associate Professor
/ Sy-Yen Kuo / Visiting Professor
This year was the third year for the Language Processing Systems Laboratory. As such, the achievement consisted in continuing existing lines of research by each of the lab members (as exemplified in the publication and achievement list) and also continuing two group projects, namely Towards a Programming Language Designer's Workbench based on Logic Programming and Action Semantics, and Granularity Optimization and Scheduling on Massively Parallel Computer Systems, which are stated in what follows.
Towards a Programming Language Designer's Workbench based on Logic Programming and Action Semantics
Language development usually proceeds in the following stages: design, specification, prototyping, and implementation (with no strict ordering). We call it a programming language life cycle. Many programming languages have been developed based on the life cycle. However, their specifications are usually written in English: such specifications can be very misleading for both implementors and users since the designers' original intention may be misunderstood. Even when a programming language is formally specified (normally based on denotational, operational, or axiomatic semantics frameworks), it is still difficult to pass on the designer's intention to others because it rarely gives hints on how to implement the language. With this in mind, we are developing a Programming Language Designer's Workbench based on a recently developed framework, action semantics. The workbench allows a language designer to design and implement a language in one setting. Thus, the language designers can actually confirm that the designed language is in fact the one he intended. This consistency check is indeed important because it eliminates potential design errors and possible misunderstanding among designers and implementors.
The project has been carried out in three differrent directions: (1) developing theories necessary to realize programming language designer's workbench; (2) designing methodology related to actually building up the general-purpose compiler generator based on action semantics framework; (3) studying semantics of many constructs of programming languages (imperative, functional, logic, object-oriented, etc.) to see how action semantics specifications affect the understanding of semantics of programming languages as well as the design of new languages.
Granularity Optimization and Scheduling on Massively Parallel Computer Systems
Despite recent advances in massively parallel processing (MPP), resource management remains a critical issue for achieving high performance on MPP architectures. Resource management involves code and data partitioning, scheduling and control of program execution. How to manage these activities to achieve the fastest parallel execution for a single application is an important and challenging problem that so far has not been satisfactorily resolved. CASS, which stands for Clustering And Scheduling System, is a compiler optimization tool that provides facilities for automatic partitioning and scheduling of parallel programs on distributed memory architectures. Central to the design of CASS is grain-size optimization. The high communication overhead in current MPP architectures imposes a minimum grain-size (i.e., computation-to-communication ratio) below which performance degrades significantly. Using task clustering techniques, CASS restructures a parallel program to one whose grain-size matches that of the target architecture, while minimizing overall parallel execution time.
The completed work of this research in the past fiscal year includes:
Refereed Journal Papers
Abstract: This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures. Because of the high communication overhead in existing parallel machines, a crucial step in scheduling is {\em task clustering}, the process of coalescing fine grain tasks into single coarser ones so that the overall execution time is minimized. The task clusting problem is {\em NP}-hard, even when the number of processors is unbounded and task duplication is allowed. A simple greedy algorithm is presented for this problem which, for a task graph with arbitrary granularity, produces a schedule whose makespan is at most twice optimal. Indeed, the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the makespan of the schedule is at most 5/3 times optimal. For a task graph with $n$ tasks and $e$ inter-task communication constraints, the algorithm runs in $O(n(n \lg n + e))$ time, which is $n$ times faster than the current best known algorithm for this problem. Similar algorithms are develpoed that produce: (1) optimal schedlues for coarse grain graphs; (2) 2-optimal schedules for trees with {\em no} task duplication; and (3) optimal schedules for coarse grain trees with {\em no} task duplication.
Abstract: The scheduling problem for dynamic tree-structured task graphs is studied and is shown to be inherently more difficult than the static case. It is shown that any online scheduling algorithm, deterministic or randomized, has competitive ratio $\Omega((1/g)/log_{d}(1/g))$ for trees with granularity $g$ and degree at most $d$. On the other hand, it is known that static trees with arbitrary granularity can be scheduled to within twice the optimal schedule. It is also shown that this lower bound is tight: there is a $deterministic$ online tree scheduling algorithm that has competitive ratio $\Omega((1/g)/ log_{d}(1/g))$. Thus, randomization does $not$ help in online scheduling of dynamic trees.
Abstract: We show that any deterministic oblivious routing scheme for the permutation routing on a $d$-ary de Bruijn network of $N=d^{n}$ nodes, in the worst case, will take $\Omega(\sqrt{N})$ steps under the {\it single-port} model. This lower bound improves the results shown in [BH82] and [KKT90] provided $d$ is not a constant. A matching upper bound is also given. We also present an efficient general sorting algorithm which runs in $O((\log d) \cdot d \cdot n^{2})$ time for directed de Bruijn network with $d^{n}$ nodes, degree $d$ and diameter $n$.
Abstract: We consider the problem of distributed ranking on a {\it Finite Projective Plane} which has been widely accepted as a communication structure for distributed processing. Distributed ranking is to compute the rank of the sole key in each of $n$ sites of the communication structure with no central control. It is harder than a consensus protocol, a vital and well studied problem in distributed processing, in that the later one computes for only one function (e.g. maximum) while the former one actually performs $n$ functions. The currently best known decentralized consensus protocols on the finite projective plane of $n$ sites uses $O(n\sqrt{n})$ messages, and requires two rounds of message exchange. In this paper we present a two-round ranking scheme which also requires only $O(n\sqrt{n})$ messages on the finite projective plane of $n$ sites. Our algorithm is optimal in that the lower bound of messages needed for achieving a consensus is $\Omega(n\sqrt{n})$.
Abstract: This paper presents efficient implementations of backtrack and branch-and-bound searches on {\it reconfigurable meshes}. For a search tree with $N$ nodes and height $h$, a backtracking algorithm is described that runs in time $O(\frac{N}{P}\sqrt{P}+h)$ on a reconfigurable mesh with $\sqrt{P}\times \sqrt{P}$ processors. In addition, a branch-and-bound algorithm is described that runs in the same time bound on the same reconfigurable mesh.
Abstract: We consider the problems of distributed ranking and sorting on a Coterie, a communication structure which has proven to be a good candidate as underline interconnection network for distributed processing. Ranking and sorting problems are harder than a consensus one, a vital and well studied problem in distributed processing, in that the later one computes for only one function (e.g. summation), while the former one actually performs $n$ functions, as ranking is to rank the key in each of $n$ sites. The currently best known decentralized consensus protocols on a coterie uses $O(n\sqrt{n})$ messages, and requires two rounds of message exchange. In this paper we show that both ranking and sorting can be done on a coterie with the same message complexity although the problems we investigate are much harder. We first present a two-round ranking algorithm which requires only $O(n\sqrt{n})$ messages. Then using this ranking algorithm, we obtain a sorting algorithm which also uses only $O(n\sqrt{n})$ messages, but requires two more rounds of message exchange. Our schemes are optimal in the sense that the lower bound of messages needed for achieving a consensus is $\Omega(n\sqrt{n})$.
Abstract: In this paper, a parallel architecture is proposed to support the operations described in the ITU-T Recommendation I.432 (B-ISDN user-network interface-physical layer specification). It is rather difficult to perform these operations on a bit serial architecture at a high rate. This paper demonstrates how these tasks can be achieved by means of parallelism. First, we describe the user-network interfaces in general and their physical layer properties. Then a parallel architecture is proposed with a general translation method which converts the serial operation into the parallel one. The application of the parallel architecture on each function is also depicted and the system has been realized in hardware using CMOS technology.
Abstract: In this paper we show that using sampling techniques, in a finite projective plane, the selection can be done in such a way that the message complexity is independent of the cardinality of the set (file). For example, given a file $F$ of size $n$ and an integer $k(1\leq k\leq n)$, on a $p$-site finite projective plane, our selection algorithm can find the $k$th smallest key from $F$ using $O(p^{3/2}\log p)$ messages and communication delay $O(\tau\log p)$. The property that the number of messages needed and communication delay are independent of the size of the file makes our distributed selection schemes extremely attractive in the applications of very large database systems. Making use of our selection algorithms, we also develop an optimal enumeration sorting scheme which can sort a distributed file of size $n$ on a $p$-site finite projective plane in $O(n)$ messages and with communication delay $O(\tau \frac{n}{p})$ provided $n$ is polynomial in $p$ and $n=\Omega(p^{5/2}\log p)$, which is optimal in the sense of both message complexity and communication delay. Our algorithms are fully distributed without any {\it a priori} central control.
Abstract: Cayley graphs have attracted a number of researchers in designing interconnection networks. A Cayley graph is based on permutation group which serves as the vertex set while the edges are determined by the generators of the given permutation group. Cayley graphs include a large number of families of graphs like star graphs, hypercubes, pancake graphs and circulant graphs. We shall show in this paper that the proposed network called degree four Cayley graph which appeared in the January 1996 issue of IEEE Transactions on Parallel and Distributed Systems is structurally isomorphic to a wrapped butterfly.