Loop nests and data layout transformations for temporal and spatial data locality

A significant source of enhancing application performance and of reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. Such objective classically translates into optimizing spatial and temporal data locality especially for nested loops. On the other hand, many existing optimization methods focus on cache locality without paying special attention to TLBs, although TLB misses can take from tens to hundreds of processor cycles. Hence, a program that exhibits good cache locality can yield a very poor performance.

Data Sequence Locality

Temporal locality is classically improved either by transforming loops such that constant data are accessed in the innermost loop for some data references, or by loop blocking. But more generally, temporal locality can be improved anyway by reducing the number of iterations that separates two successive accesses to a given reused data. We consider all these objectives by using a new approach : optimizing temporal locality for reused data sequences of minimum sizes. This approach has similarities with loop blocking except that it is not as machine-dependent and often yields better performance results.

A web-based interface has been developped allowing any user to transform through our method a fortran loop nest with affine loop bounds and array references into an equivalent temporally-optimized version. This tool can be accessed here. Notice that it is a beta version still in development. Some loops example can be downloaded from these following links that can then be processed with our tool:

 

Data layout transformations

We improve spatial locality by general data layout transformations. Our approach only considers active data, i.e. array elements that are actually accessed by a loop, in order to prevent unuseful memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for a same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process does not only consider the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.

Related publications