Methods order to hide the latency caused by

to Improve GPGPU






GPGPU’s support the concurrent execution of hundreds and thousands of threads.
However, the massive multi-threading of GPGPUs often results in serious cache
contention, as the cache lines brought by one thread can easily be removed by
other threads in the small shared cache. The massive threads often compete in
the small sized first level data (L1D) cache, which leads to severe cache
thrashing problem and throttle the GPGPU performance. The first reference paper
‘Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling’
by Li-Jhan Chen , Hsiang-Yun Cheng , Po-Han Wang , Chia-Lin Yang emphasizes on
a software-hardware cooperative approach that make use of the spatial locality
among different thread blocks to better utilize the valuable cache capacity and
the second one  ‘Applying Victim Cache in
High Performance GPGPU Computing’ by Fengfeng Fan , Jianfei Wang , Li Jiang ,
Xiaoyao Liang, and Naifeng Jing  apply
victim cache design into GPGPUs to reduce L1D cache thrashing problem for
better data locality and system performance.




Purpose Graphic Processing Units (GPGPUs) are becoming increasingly popular for
various applications, such as physical simulation, image processing and cloud
computing, due to their great computing capability. Modern GPGPUs allow
thousands of threads to be executed simultaneously. With vast multi-threading
capabilities, GPGPUs can offer better performance and power efficiency than
with usage of CPUs. To execute an application on a GPGPU device, the
application is divided into kernels that are executed by a large number of
concurrent threads. These threads are grouped together into cooperative thread
arrays (CTAs) that are the basic units for resource allocation. After the CTAs
are introduced onto streaming multiprocessors (SMs), every 32 threads within a
CTA are bundled together, namely a warp, as basic unit for execution.
Oftentimes during the runtime, multiple warps can be available for scheduling.
The schedulers are responsible for picking up the warp that is the most
suitable for execution in order to hide the latency caused by data dependencies
and long memory accesses. The high thread-level parallelism (TLP) in GPGPU
applications enables faster computing capability compared with the conventional

the high theoretically achievable thread-level parallelism capabilities, GPGPU
cores suffer from serious cache contention. In each stream multiprocessor (SM)
of modern GPGPUs, thousands of threads share the same small L1 cache, resulting
in extremely small per-thread cache capacity. For example, the per-thread L1
cache capacity in NVIDIA’s Fermi is only about 32 bytes. Thus, with limited
cache size, data in the cache can easily be evicted before being reused,
leading to serious performance loss.

mapping between Thread Blocks (TBs) and SMs determines how much locality can be
utilized inside each SM, while the warp scheduling within a single SM can only
leverage the available locality distributed by the thread block scheduler. With
larger design space, a locality-aware thread block scheduler can provide
tremendous opportunities to increase cache hit rate and improve the performance.
A naive heuristic method, which assumes that scheduling each pair of
consecutive TBs to the same SM can exploit spatial locality. This solution only
works for row-major applications.

the first paper, a generic thread block scheduler to cover different types of
access pattern is explained. The key challenge in this is how to accurately
quantify the block-level spatial locality. To tackle this challenge a
software-hardware cooperative approach to capture the amount of shared cache
lines between different TBs is embraced. Through dispatching TBs with spatial
locality to the same SM, the cache hit rate of each SM can be improved.

possible solution explained in second paper is by changing the traditional
victim cache structure to fit the massive threads execution in GPGPUs. Then,
given the limited area and power budget in GPGPUs, further to leverage the
unallocated registers to meet the victim cache storage requirement. The
unallocated registers can be known at compile time, and are generally
sufficient to our needs. The experiment results demonstrate that the victim
cache can greatly reduce conflict misses and the data cache hit rate can be
increased largely.  The proposed victim
cache design can achieve impressive performance improvement with relatively
small hardware cost.




Cache Locality Aware Thread Block Scheduling


software-hardware cooperative approach to increase cache hit rate and improve
performance by exploiting the spatial locality between different TBs. The
compiler is modified to extract address calculation codes from kernel programs
and it generates a separate address calculation binary. The small, in-order CPU
in modern GPU’s thread block scheduler is put to use to run the locality-aware
block scheduling algorithm, which is basically composed of two modules. The
address range calculation module analyzes the memory access range of each TB
when it is inserted into the block queue. To maximize the SM’s inter-block
spatial locality, the thread block dispatching module selects the candidate TB
that should be scheduled next, and the selection process runs concurrently with
thread block execution on SMs to hide the timing overhead.

capture the working sets of each TB, the compiler is modified to extract the
address calculation code, so that the thread block scheduler can compute the
memory access range based on the address calculation binary. A GPGPU program is
made up of one or more kernels, and each kernel is an array of threads which
run the same program code on different data. The mapping between thread IDs and
data can be obtained through simple arithmetic’s, since threads operate on
structural data often, such as 1D or 2D arrays, in regular GPGPU programs. The
compiler can easily extract the address calculation code, i.e., the code that
is utilized to compute the index of the data array (line 4-7), from a kernel
function, and use the address calculation code and the base address of the data
array to generate the address calculation binary. At run-time, the thread block
scheduler can use the binary and the thread ID to compute the memory addresses
accessed by an arbitrary thread.

address of the start point, i.e., the upper-left address, can be computed by
the memory address of the first thread in the block, and the width/height can
be computed by the address difference between the first and the last thread in
the block. These access range parameters, including the start point, width, and
height, are stored in the block queue when the block is inserted. In order to
compute the inter-block locality, the coordination of cache lines to represent
the access range rectangles is used. The memory addresses of the data arrays
can be transformed into the corresponding cache line addresses. For example, if
a cache line is 128 bytes, memory addresses from (0,0) to (127,0) belong to the
cache line (0,0) and memory addresses (128,1) to (255,1) belong to the cache
line (1,1). Through the address transformation, the start point, width, and
height of each TB can be represented in cache line coordination.

a TB is completed on a SM m, the thread block scheduler allocates the candidate
TB to the SM. The selection process for the next candidate TB is then triggered
and run concurrently with TB execution to hide the timing overhead. We select
the candidate TB based on the overall interblock locality (Lall), i.e., the summation
of inter-block locality between the candidate TB and all the running TBs on the
SM. The inter-block locality of any two TBs (Lpair) is defined as the summation
of the overlapped data access range in all data arrays between them. For each
data array, the overlapped area is computed by the following steps:

1)     distancex
and distancey are the differences in x-axis and y-axis between the start points
of two TBs.

2)     If
distancex > width or distancey > height, there is no overlapped area
between the two TBs, indicating that there is no spatial locality between these
two TBs.

3)     Otherwise,
the overlapped area is (width ? distancex) ?
(height ? distancey), which is equal to the number of shared cache lines
between these two TBs.

on the estimation of inter-block locality, the TB with maximum Lall is
selected. If no TB in the queue has inter-block locality on this SM, the TB
with minimum Lall if being scheduled on other SMs is selected, so that the
degradation of inter-block locality on other SMs can be avoided.

TB scheduler only incurs small hardware and timing overhead. To avoid duplicate
calculation of memory access range, 4 bytes is added per block queue entry for
storing related parameters. In the TB scheduler, we add one register per SM to
store the candidate TB, so that the candidate TB can be directly dispatched
once a TB is completed on the SM. Two methods to reduce the timing overhead of
selecting candidate TBs are used. First, at the initial phase with less than 8
TBs being executed on the SM, we group consecutive TBs together and directly
dispatch them to the same SM. After the initial phase, the TB selection process
is executed concurrently with TB execution to hide the timing overhead.


Victim Cache

equivalent structure of victim cache to L1D cache is efficient to alleviate the
conflict misses from the massive threads in GPGPU applications. Meanwhile, It
requires a storage of the same capacity. However, given area and power
limitations in GPGPUs, it is not wise to introduce such a victim cache. To
further improve the victim cache storage, instead of adding a new data array
for victim cache, unallocated registers in RF for the victim cache data storage
is used. The RF is not fully used due to the resource allocation in GPGPUs. In
fact, since the TLP in GPGPUs is limited by several factors, e.g. the maximum
number of threads and thread blocks allowed in a SM, and the RF and shared
memory capacity, it is very likely that the RF is not fully used with many
registers unallocated that can be known before execution. In addition, because
the registers and cache blocks in GPGPUs are both on a warp basis with a 1024
bits width, it becomes a nature fit to use the unallocated registers to hold
the victim cache data. To find these unallocated registers, a bitmap to mark
the status of each register in the RF per SM is employed. For example, we only
need a bitmap with 1024 bits assuming a 128KB RF with each warp operand of 1024
bits. Status 1 means the corresponding register is allocated and 0 means it is
unallocated throughout the execution and therefore can be used as a victim
cache line. The bitmap can be initialized before each kernel launch because the
register requirement does not change during the kernel execution. The data flow
of our improved victim cache can be similar to the regular victim cache. One of
the differences is that when the victim cache hits, we need to access RF to get
the victim data. Instead of a 1024 bits storage, we only require an index of 10
bits to record the register position and use it to access the RF to get the
victim data. Another difference is that we have to check the bitmap for
unallocated registers. If not available, some of the victim cache blocks should
be marked as invalid because there is no storage to use. However, as the number
of unallocated registers is usually large enough, the invalidation of victim
cache blocks can be rare. In summary, our proposed victim cache design can well
alleviate the conflict misses that suffer the GPGPU L1D cache. Moreover, we
introduce limited hardware supports. For example, for a victim cache of 16KB,
we need 2.5Kbit tag storage. For a RF of 1024 registers, we need a bitmap with
1024 bits to record the register availability and 1280 bits to record the
register indexes for victim storage. At the same time, some associated logics
are needed to support the pipeline changes and manage the data flow. However,
given the performance gained from the victim cache design the investment can be





victim cache design in GPGPUs. It works with the on-chip L1D cache to keep more
data on-chip for fast accesses. To reduce the storage requirement for victim
data, we further leverage the unallocated registers found in RF to serve as
data array of the victim cache, and therefore the hardware cost can be greatly
reduced. The experiment results show that using our approach, the on-chip cache
hit rate can be greatly increased, which leads to a large performance
improvement with only small changes to the baseline GPGPU design.

locality-aware thread block scheduler to improve GPGPU performance by reducing
cache misses. We utilize a software-hardware cooperative method to capture the
spatial locality among different TBs at runtime, and increase cache hit rates
by dispatching TBs with numerous shared cache lines to the same SM. Results
show that our approach can effectively reduce L1 miss rate (average 3%, up to
13%) and significantly improve performance (average 9%, up to 34%) over the
state-of-the-art approach. This work opens up new research directions for
leveraging GPGPU software and hardware characteristics to precisely capture
cache locality and optimize cache utilization.

these methods can be used to significantly improve upon the existing
performance of GPGPU in multi-thread handling. Depending upon the specific
application the performance of these methods may vary. But Cache Locality Aware
Thread Block Scheduling is slightly better approach compared to victim cache




‘Improving GPGPU Performance via Cache
Locality Aware Thread Block Scheduling’ by Li-Jhan Chen , Hsiang-Yun Cheng ,
Po-Han Wang , Chia-Lin Yang


 ‘Applying Victim Cache in High Performance
GPGPU Computing’ by Fengfeng Fan , Jianfei Wang , Li Jiang , Xiaoyao Liang, and
Naifeng Jing