Gary Grewal
Associate Professor
School of Computer Science
University of Guelph, Guelph, ON   N1G 2W1
Office: MacLachlan 218
Phone: 1-519-824-4120 ext. 52630

Interdisciplinary Research Interests

Parallel FPGA Placement

To benefit from the current era of multicore and many-core hardware, the FPGA community has begun to parallelize its existing CAD software. Although this has improved performance, the resulting algorithms have often failed to scale well in terms of both runtime and Quality-of-Result (QoR). This is because parallel software development is hard, and scalability requires scalable programming practices to be employed right from the start. Working with Dr. Areibi in the School of Engineering, and graduate students Christian Fobel and Ryan Pattison, we focus on placement, one of the most important, yet time-consuming, problems in the FPGA CAD flow. We have developed two scalable, parallel placement algorithms. The first is based on simulated annealing, which remains the dominant approach in FPGA flows. The second is based on a near-linear analytic model and seeks to build on the success of ASIC placers. Key to both implementations is the acknowledgement that a parallel algorithm cannot be evaluated apart from the architecture it is implemented on. Therefore, we employ a parallel performance metric, called isoefficiency, to determine scalability with respect to problem size and the number of available processors. Using this metric as a guide, both placement algorithms were developed using only structured parallel patterns, each with a known isoefficiency function. The end result is two highly scalable placement algorithms. In both cases, speedups are guaranteed to increase linearly with the number of available processors, provided that the problem size also increases at a particular rate. Most importantly, neither algorithm shows any degradation in solution quality. Overall, when run on an NVIDIA GTX 980 GPU, both algorithms achieve speedups of almost 80x compared with serial VPR, while improving on QoR.
With even larger benchmarks and more cores available on future GPUs, we believe both algorithms can achieve even larger speedups with little, if any, degradation in QoR. Moreover, we believe that this example of using scalable, structured parallel patterns may serve as an impetus for solving other CAD problems that can benefit from scalable parallelism.
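
The isoefficiency idea can be sketched concretely. In the sketch below, the overhead function is a hypothetical stand-in (the real overhead terms follow from the algorithms' structured parallel patterns); it shows how the metric dictates the rate at which problem size must grow with processor count to hold efficiency constant.

```python
import math

def overhead(p):
    """Hypothetical total parallel overhead, T_o(p) = c * p * log2(p)
    (e.g. the cost of global reductions). Illustrative only."""
    return 50.0 * p * math.log2(p)

def efficiency(W, p):
    """E = T_serial / (p * T_parallel) = W / (W + T_o(p))."""
    return W / (W + overhead(p))

def isoefficiency_problem_size(p, target_E=0.8):
    """Smallest problem size W that sustains efficiency target_E on p
    processors.  From E = W / (W + T_o):  W = (E / (1 - E)) * T_o(p)."""
    return (target_E / (1 - target_E)) * overhead(p)

for p in (2, 8, 32, 128):
    W = isoefficiency_problem_size(p)
    print(f"p={p:3d}  required W={W:9.0f}  E={efficiency(W, p):.2f}")
```

If the required problem size W grows only modestly with p, as here, the algorithm is considered scalable: larger machines deliver proportionally larger speedups as long as benchmarks grow at the rate the isoefficiency function prescribes.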

VLSI Routing

I also direct research in the area of routing, one of the most tedious, yet important, steps in the FPGA/VLSI design flow. Working with graduate student M. Xu, we developed a new algorithm for generating Steiner trees to connect nets (sets of pins that must carry the same electrical signal), the primary sub-problem in FPGA/VLSI routing. The results for this algorithm are excellent, and it has the lowest worst-case runtime complexity for this Steiner-tree problem. We also developed a parallel version of the algorithm to run on distributed-memory clusters. This work resulted in a best-paper nomination at the International Conference on Computer Design, 2008.

In practice, many well-known routing algorithms consider the inclusion of Hanan points when constructing a Steiner tree for a net. However, the number of Hanan points that may need to be considered can be very large. To compensate, we have proposed several clustering strategies for first reducing the number of Hanan points that need to be considered. The results show that by judiciously clustering Hanan points a priori, an average reduction in runtime of 30% can be achieved when constructing Steiner trees, with little loss in quality.

Recently, I was involved in joint work to develop a model for predicting the routability of a netlist prior to placement. The main contribution of this work is that it uses a mixture of single-, double-, and quad-length as well as long lines (in Xilinx terminology) to complete a connection, with the objective of minimizing switch count. This allows routability prediction very early in the design process, so the designer can get a good idea of whether a design is likely to route successfully, and can avoid full place-and-route when it is clearly not routable on the target FPGA.
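
As a concrete illustration of the Hanan-point explosion mentioned above, the sketch below enumerates the Hanan points of a small net. The Hanan grid is formed by intersecting horizontal and vertical lines through every pin, so the number of candidate points grows quadratically with the pin count, which is what motivates clustering them a priori. The function name is illustrative.

```python
def hanan_points(pins):
    """Candidate Steiner points for a net: intersections of the
    horizontal and vertical lines through each pin, excluding the
    pins themselves."""
    xs = sorted({x for x, _ in pins})
    ys = sorted({y for _, y in pins})
    pin_set = set(pins)
    return [(x, y) for x in xs for y in ys if (x, y) not in pin_set]

pins = [(0, 0), (4, 1), (2, 5)]
print(hanan_points(pins))  # 3 pins already yield 6 candidate points
```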

Gender Wage Gap

A real gender wage gap still persists in the Province of Ontario, Canada. The current wage gap is 26%, meaning that, on average, for every $1.00 a male earns, a female earns $0.74 for equivalent work. This economic disparity has led economists and politicians to ask: what can public policy do to reduce or eliminate the gender wage gap? However, for legislation to be effective, it must be based on a proper understanding of where the gender wage gap comes from. Working with Drs. Antonie and Plesca, and graduate student Andrew D'Angelo, we employ an interdisciplinary approach to analyzing the gender wage gap in Ontario's public sector. In particular, we have created an enhanced Sunshine List database containing substantially more information than what is provided by Ontario's Ministry of Finance. Moreover, we clean the data and format it to allow for more efficient manipulation, extraction, and analysis. We also employ a two-stage approach to assigning gender to an individual based solely on their first name. Experimental results show that this approach achieves almost 90% accuracy when tested on manually labeled data extracted from the Sunshine List. Currently, the enhanced database is being used in Pay Equity studies and will continue to be used in future studies related to this topic.
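
The two-stage idea can be sketched as follows. The lookup table and the fallback suffix heuristic here are illustrative assumptions, not the project's actual data or rules; the real first stage would consult a large name-frequency table (e.g. census-derived).

```python
# Illustrative stage-1 table; the real system uses a much larger
# name-frequency dataset.
NAME_TABLE = {"john": "M", "mary": "F", "david": "M", "susan": "F"}

def assign_gender(first_name):
    name = first_name.strip().lower()
    # Stage 1: exact lookup in the name table.
    if name in NAME_TABLE:
        return NAME_TABLE[name]
    # Stage 2: fallback heuristic for names missing from the table
    # (a crude, purely illustrative rule).
    if name.endswith(("a", "e")):
        return "F"
    return "U"  # unknown

print(assign_gender("Mary"))   # stage-1 table hit
print(assign_gender("Tanya"))  # stage-2 fallback
```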

Measuring Completeness of Online Privacy Policies

Website privacy policies are often long, difficult to understand, and incomplete. Consequently, users tend not to read them, putting their privacy at risk. Working with Dr. Dara and graduate student Niha Guntamukkala, we developed an automated approach for helping users evaluate online privacy policies based on completeness. Here, completeness refers to the presence of 8 sections in an online privacy policy that have been recognized as helpful in establishing its transparency. Given a new online privacy policy, the proposed system employs a machine-learning approach to predict a completeness score for the policy. This score can then be used to assess the risk to the user's privacy.
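
The scoring idea can be illustrated with a simple keyword baseline. The actual system uses a trained classifier, and the section keywords below are illustrative stand-ins for the 8 recognized sections; only the scoring arithmetic is shown.

```python
# Illustrative keywords per section; the real system learns these
# distinctions from labeled policies.
SECTIONS = {
    "collection": ["collect", "gather"],
    "use": ["use", "purpose"],
    "sharing": ["share", "third party", "third-party"],
    "security": ["secure", "protect"],
    "retention": ["retain", "retention"],
    "access": ["access", "correct"],
    "cookies": ["cookie", "tracking"],
    "contact": ["contact", "inquiries"],
}

def completeness_score(policy_text):
    """Fraction of the 8 sections detected, plus the list of sections found."""
    text = policy_text.lower()
    present = [s for s, kws in SECTIONS.items()
               if any(k in text for k in kws)]
    return len(present) / len(SECTIONS), present

score, found = completeness_score(
    "We collect your data, use it for marketing, and protect it with cookies.")
print(score, found)
```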

Pilot - A New, Simpler Approach to Cluster Programming

Working with Dr. Gardner and his Ph.D. student John Carter, we developed a novel approach to programming high-performance clusters, called Pilot, which seeks to bring the power of formal methods, in a transparent way, to the novice parallel programmer. Unlike most ad-hoc designs, whose "principles" may not go past "usability", Pilot has a theoretical basis in a formal process algebra, Communicating Sequential Processes (CSP). This makes the design inherently sound, prevents the user from falling into typical problems (most deadlocks, races, protocol errors, and mistyped communication), provides methods for detecting the problems they do fall into (such as circular waiting), and provides a basis for formal analysis. The first prototype implementation was completed in 2008. After much testing and feedback from SHARCNET personnel, it was decided to make Pilot available to the High-Performance Computing (HPC) community through SHARCNET. A second SHARCNET grant was obtained, and an undergraduate student was hired to upgrade the original Pilot prototype into an industrial-strength implementation, complete with documentation and a support website. Several Pilot workshops have been given. Currently, a graduate student is working on extending the Pilot paradigm to the highly irregular, heterogeneous Cell architecture.
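
The CSP-style process-and-channel model that Pilot provides can be mimicked in miniature with threads and queues. Pilot itself is a C library layered on MPI, so everything below, names included, is an illustrative analogy rather than the Pilot API.

```python
import threading
import queue

def worker(chan_in, chan_out):
    # A Pilot-style process: block on a channel read, compute, then
    # write the result on another channel.
    x = chan_in.get()
    chan_out.put(x * x)

to_worker, from_worker = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(to_worker, from_worker))
t.start()
to_worker.put(7)             # analogous to writing on a Pilot channel
result = from_worker.get()   # analogous to a blocking channel read
t.join()
print(result)
```

Because processes interact only through explicit channel reads and writes, the communication structure is visible and analyzable, which is what makes CSP-based reasoning (e.g. deadlock detection) possible.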

Retargetable Code Generation for Embedded Processors

We have developed a new methodology for compiling code for embedded processors that produces very high-quality code for a variety of processors. A new methodology was required because of the severe space limitations, real-time requirements, and irregular, parallel instruction sets of programmable embedded processors. In particular, the code generator can be applied to processors that are still being designed, and can serve as a tool for estimating the code performance of non-existent processors during their design. Our most important contributions have been the evolutionary optimization algorithms at the heart of this methodology. We have developed a new type of genetic algorithm, called the Enhanced Genetic Algorithm (EGA), which has been incorporated into almost every major optimization module in our code generator. Briefly, the EGA is designed to quickly solve pure constraint-satisfaction problems. Unlike traditional GAs, the EGA promotes the mutation operator into the primary tool for directly attacking the causes of unsatisfied constraints. While this causes rapid convergence, a new type of elitism maintains diversity in the population. This allows a variety of different solutions to be generated very quickly, and allows the EGA to handle larger problems than other approaches. By employing fast genetic algorithms to produce pools of feasible solutions, then using machine-specific criteria to select the best solutions, high-quality code can be generated for processors with extremely non-orthogonal instruction sets. We used the EGA to develop new code-generation strategies for a variety of tasks, including operation scheduling and assigning data objects to multiple memory banks. The former work on scheduling was later extended with the help of a graduate student to include VLIW architectures, while the latter work on memory assignment was improved with the help of Andrew Morton from the University of Waterloo.
Many of the strategies we have developed use the EGA hierarchically, with one algorithm controlling another, or hybridize it with other well-known meta-heuristics. Our most important contribution to code generation, however, is a new strategy for mapping reference code to the target machine. Rather than deal with the target machine at every stage of the compilation, machine-independent algorithms are used to optimize code, called reference code, for an idealized abstraction of the true target machine. This code is then mapped to the real instruction set by two EGAs: one perturbs the original schedule to find a number of alternative (parallel) instruction sequences, and the other evolves feasible register assignments, if possible. This novel approach has resulted in a code generator capable of producing very high-quality code for a variety of commercial processors and Application-Specific Instruction-Set Processors (ASIPs).
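
The EGA's central idea, mutation aimed directly at the genes causing unsatisfied constraints, can be sketched on a stand-in constraint-satisfaction problem (N-queens). The population machinery, elitism, and crossover are omitted here; only the targeted-mutation step is shown, and all parameters are illustrative.

```python
import random

def conflicts(board):
    """Columns whose queen attacks another (board[i] = row of the
    queen in column i)."""
    n, bad = len(board), set()
    for i in range(n):
        for j in range(i + 1, n):
            if board[i] == board[j] or abs(board[i] - board[j]) == j - i:
                bad.update((i, j))
    return bad

def targeted_mutation(board):
    """EGA-style mutation: pick a gene involved in a violated
    constraint and move it to the value that minimizes its conflicts."""
    n = len(board)
    col = random.choice(sorted(conflicts(board)))
    board[col] = min(range(n), key=lambda r: len(
        conflicts(board[:col] + [r] + board[col + 1:])))
    return board

def solve(n=8, seed=3, max_steps=20000):
    random.seed(seed)
    board = [random.randrange(n) for _ in range(n)]
    for _ in range(max_steps):
        if not conflicts(board):
            break
        targeted_mutation(board)
    return board

print(solve())
```

Because every mutation attacks a currently violated constraint rather than a random gene, the search converges toward feasibility quickly; in the full EGA, elitism over a population keeps this pressure from collapsing diversity.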

High-Level Synthesis: Scheduling, Allocation, and Binding

The algorithms and strategies we have developed for code generation are flexible and can be applied to optimization problems in other areas. For example, we have applied many of the same techniques to sub-problems in high-level synthesis, the process of mapping a behavioural-level description of a digital system to a structural implementation at the register-transfer level. Perhaps our most important contribution is a comprehensive model, based on the hierarchical application of two genetic algorithms, that optimizes the scheduling of operations, the allocation of functional units, and the binding of operations to units. This work has since been extended, with the help of another student, to include the assignment of variables to registers. As well, performance improvements have been made by replacing both genetic algorithms with different types and combinations of meta-heuristics, such as simulated annealing and tabu search.
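
The scheduling sub-problem can be illustrated with a minimal resource-constrained list scheduler. The dataflow graph and the one-multiplier/one-adder resource set below are illustrative, and the genetic machinery described above is not shown; this is only the baseline problem the model optimizes.

```python
def list_schedule(ops, deps, units):
    """Resource-constrained list scheduling (single-cycle operations).
    ops:   {op name: functional-unit type}
    deps:  {op name: [predecessor ops]}
    units: {functional-unit type: available count}
    Returns {op name: control step}."""
    schedule, step = {}, 0
    while len(schedule) < len(ops):
        step += 1
        used = {u: 0 for u in units}
        for op, utype in ops.items():
            ready = op not in schedule and all(
                p in schedule and schedule[p] < step
                for p in deps.get(op, []))
            if ready and used[utype] < units[utype]:
                schedule[op] = step
                used[utype] += 1
    return schedule

# Two multiplies feeding an add, with one multiplier and one adder:
ops = {"m1": "mul", "m2": "mul", "a1": "add"}
deps = {"a1": ["m1", "m2"]}
print(list_schedule(ops, deps, {"mul": 1, "add": 1}))
```

With only one multiplier, m1 and m2 serialize into steps 1 and 2, and the dependent add lands in step 3; adding a second multiplier would shorten the schedule, which is exactly the scheduling/allocation trade-off the hierarchical model explores.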

We have also developed models, based on Integer Linear Programming (ILP), for solving many of the interdependent problems related to synthesis. These models address issues such as scheduling, module allocation and binding, register binding, chaining, pipelining, and interconnect minimization.
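
As a sketch of what such a model looks like, a typical time-indexed formulation of the scheduling piece, under the usual simplifying assumptions (binary variable x_{o,t} = 1 iff single-cycle operation o starts in control step t; M_k units of resource type k; precedence edges E; latency L), is:

```latex
\begin{aligned}
\text{minimize } & L \\
\text{s.t. } & \sum_{t} x_{o,t} = 1 && \forall o && \text{(each operation scheduled exactly once)} \\
 & \sum_{o \in O_k} x_{o,t} \le M_k && \forall k,\, t && \text{(resource bounds)} \\
 & \sum_{t} t\, x_{o,t} \ge \sum_{t} t\, x_{p,t} + 1 && \forall (p, o) \in E && \text{(precedence)} \\
 & \sum_{t} t\, x_{o,t} \le L && \forall o && \text{(latency bound)}
\end{aligned}
```

This is a generic textbook-style formulation given for illustration; the actual models extend it with binding, chaining, pipelining, and interconnect terms.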