|
Navigation
Research |
Automatic Performance Tuning of Numerical KernelsLarge-scale simulations in computational engineering and science often spend a great deal of time in a few computational methods kernels, such as dense or sparse matrix-vector products, relaxation on a structured or unstructured mesh, or the computation of forces between pairs of attracting or repelling particles. There has been a great deal of work in generating high performance libraries for these applications, including dense and sparse linear algebra, multigrid methods, and n-body techniques. One idea established in these application-level libraries is to organize the computations around a set of numerical kernels, with the assumption that these kernels will be highly optimized on each of the hardware platforms of interest. The best known example of this approach is the BLAS (the Basic Linear Algebra routines), which are used in building LAPACK, ScaLAPACK, and other libraries; the BLAS are implemented by hardware vendors and are highly tuned to the memory hierarchy of each machine. However, this approach is limited by the growing number of kernels, the large number of machines, the increasing depth of memory hierarchies and complexity of processors, and by the difficulty of performance tuning each kernel on each machine. For linear algebra alone, the latest BLAS standard proposal contains hundreds of numerical kernels, and there are many other kernels arising from multigrid and particle methods that are not covered by that proposal. The great majority of these are susceptible to large speedups when machine-specific tuning is performed. However, the hand tuning takes weeks or months of a skilled engineer's time, and this work must be repeated for each microarchitecture, i.e., each time the memory system or functional unit organization changes, even if the instruction set is unchanged. Some vendors produce their kernels in C or Fortran, so the tuning may have to be redone with new compiler releases. We propose to automate the process of architecture-dependent tuning of numerical kernels, replacing the current hand tuning process with a semi-automated search procedure. Prototypes of this approach exist for dense matrix-multiplication (Atlas and our own PHiPAC), FFTs (FFTW), and sparse matrix-vector multiplication (our own Spacity). These results show that we can frequently do as well as or even better than hand-tuned vendor code on the kernels attempted. These systems use a handwritten "search directed code generator (SDCG)" to produce many different implementations of a single kernel, which are all run on each architecture, and the fastest one selected. We will extend this approach to a much wider range of numerical kernels by combing compiler technology with algorithm-specific transformation rules to automate the production of these SDCGs. Ultimately, we expect our technology to be useful in conventional compilers, provided that appropriate abstract data types or annotations are used to sidestep very difficult or "impossible" dependency analysis needed to justify the desired code transformations. We also believe that this work will stimulate research into new high-level numerical methods and architectures, both of which are limited by the lack of highly tuned kernels for new kernels. |