Since its introduction more than 7 years ago, the CUDA Unified Memory programming model has been very popular among developers. Unified Memory provides a simple interface for prototyping GPU applications without manually migrating memory between host and device.

Starting with the NVIDIA Pascal GPU architecture, which made it easier to scale to larger problem sizes, applications that use Unified Memory can work with all available CPU and GPU memory in the system. For more information about getting started with GPU computing using Unified Memory, see An Even Easier Introduction to CUDA.

Do you want to seamlessly run applications with large datasets while keeping memory management simple? Unified Memory can be used to make virtual memory allocations larger than the available GPU memory. When oversubscription occurs, the GPU automatically evicts memory pages to system memory to make room for the actively used virtual memory addresses.

However, application performance is highly dependent on memory access patterns, data residency, and the system you run on. Over the past few years, we have published several posts on GPU memory oversubscription with Unified Memory, showing how programming techniques such as prefetching and memory usage hints help applications achieve higher performance.

In this post, we dive into the performance characteristics of a microbenchmark that stresses different memory access patterns in oversubscription scenarios. It helps you break down and understand the performance aspects of Unified Memory: when it works well, when it doesn't, and what you can do about it. As you will see from our results, performance can vary by a factor of 100 depending on the platform, the oversubscription factor, and the memory hints used. We hope this post gives you a clearer idea of when and how to use Unified Memory in your applications.

Benchmark Settings and Access Modes

To evaluate Unified Memory oversubscription performance, we use a simple program that allocates and reads memory. A large chunk of contiguous memory is allocated with cudaMallocManaged, that memory is then accessed on the GPU, and the effective kernel memory bandwidth is measured. Different Unified Memory performance hints such as cudaMemPrefetchAsync and cudaMemAdvise modify the allocated memory; we discuss their impact on performance later in this post.
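
The following is a minimal sketch, not the benchmark's actual code, of how such a measurement can be made. The helper name, launch parameters, and the use of CUDA events are assumptions; read_thread is the grid stride kernel shown later in this post.

// Minimal measurement sketch (illustrative names): time a read kernel with CUDA
// events and compute the effective read bandwidth in GB/s.
float measure_read_bandwidth(float *uvm_alloc_ptr, size_t allocation_size,
                             int num_blocks, int block_size)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    read_thread<float><<<num_blocks, block_size>>>(uvm_alloc_ptr, allocation_size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // bytes read divided by elapsed seconds, reported in GB/s
    return (allocation_size / (ms * 1e-3f)) / 1e9f;
}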

We define a parameter called "oversubscription factor" which controls the fraction of available GPU memory allocated to the test.

A value of 1.0 means that all available memory on the GPU is allocated.

A value less than 1.0 indicates that the GPU is not oversubscribed.

Values greater than 1.0 can be interpreted as oversubscription of a given GPU. For example, an oversubscription factor of 1.5 for a GPU with 32 GB of memory means that 48 GB of memory is allocated using Unified Memory (see the short sketch after this list).
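
As a concrete illustration, here is one hedged way to derive the allocation size from the oversubscription factor; cudaMemGetInfo is a real CUDA call, but the variable names are assumptions and the actual benchmark may size the allocation differently.

// Derive the allocation size from the oversubscription factor (illustrative).
size_t free_bytes = 0, total_bytes = 0;
cudaMemGetInfo(&free_bytes, &total_bytes);   // query currently available GPU memory

double oversubscription_factor = 1.5;        // 1.5x the available GPU memory
size_t allocation_size = (size_t)(free_bytes * oversubscription_factor);

float *uvm_alloc_ptr = nullptr;
cudaMallocManaged(&uvm_alloc_ptr, allocation_size);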

We tested three memory access kernels in the microbenchmark: grid stride, block stride, and random-per-warp. Grid stride and block stride are the most common sequential access patterns in CUDA applications. However, unstructured or random access is also very popular in emerging CUDA workloads such as graph applications, hash tables, and embeddings in recommender systems, so we decided to test all three.

Grid stride

In each loop iteration, each thread block accesses elements in an adjacent memory region, then jumps ahead by the grid stride (blockDim.x * gridDim.x).

Figure 1. Grid stride access pattern

template <typename data_type>
__global__ void read_thread(data_type *ptr, const size_t size)
{
    size_t n = size / sizeof(data_type);
    data_type accum = 0;

    for (size_t tid = threadIdx.x + blockIdx.x * blockDim.x; tid < n; tid += blockDim.x * gridDim.x)
        accum += ptr[tid];

    if (threadIdx.x == 0)
        ptr[0] = accum;
}

Block stride

Each thread block accesses a large contiguous chunk of memory, whose size is determined from the total memory allocated. At any given time, the blocks resident on the SMs can access many different memory pages because the memory region assigned to each block is large.

Figure 2. Block stride access pattern

template <typename data_type>
__global__ void read_thread_blockCont(data_type *ptr, const size_t size)
{
    size_t n = size / sizeof(data_type);
    data_type accum = 0;

    size_t elements_per_block = ((n + (gridDim.x - 1)) / gridDim.x) + 1;
    size_t startIdx = elements_per_block * blockIdx.x;

    for (size_t rid = threadIdx.x; rid < elements_per_block; rid += blockDim.x) {
        if ((rid + startIdx) < n)
            accum += ptr[rid + startIdx];
    }

    if (threadIdx.x == 0)
        ptr[0] = accum;
}

Random warp

In this access pattern, each warp selects a random page on every loop iteration and then accesses a contiguous 128B (32 elements of 4B) region of it. As a result, each warp of every thread block accesses random pages. The loop count of each warp is determined by the total number of warps and the total memory allocated.

Figure 3. Random warp access pattern: each loop iteration of a warp selects a random page and accesses a random 128B region within that page
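
The benchmark's exact random warp kernel is in the GitHub repo; the following is only a hedged sketch of the pattern, assuming 2MB page granularity, an allocation of at least one page, and cuRAND for page selection (all assumptions, not details from the post).

// Sketch of a random warp read: every warp picks a random page per iteration and
// reads a contiguous 128B (32 x 4B) region from it. All lanes of a warp use the
// same cuRAND subsequence (their warp ID), so they compute the same page/offset.
#include <curand_kernel.h>

template <typename data_type>
__global__ void read_thread_randomWarp(data_type *ptr, const size_t size)
{
    constexpr size_t page_bytes = 2 * 1024 * 1024;            // assumed page granularity
    const size_t n_pages = size / page_bytes;                 // assumes size >= 2MB
    const size_t elems_per_page = page_bytes / sizeof(data_type);

    const int lane = threadIdx.x % warpSize;
    const size_t warp_id = (threadIdx.x + blockIdx.x * (size_t)blockDim.x) / warpSize;
    const size_t total_warps = ((size_t)gridDim.x * blockDim.x) / warpSize;
    const size_t iters = (size / (warpSize * sizeof(data_type))) / total_warps;

    curandState state;
    curand_init(1234ULL, warp_id, 0, &state);

    data_type accum = 0;
    for (size_t i = 0; i < iters; i++) {
        size_t page = curand(&state) % n_pages;                            // random page
        size_t offset = (curand(&state) % (elems_per_page / warpSize)) * warpSize;
        accum += ptr[page * elems_per_page + offset + lane];               // 128B per warp
    }

    if (threadIdx.x == 0)
        ptr[0] = accum;
}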

Kernels are launched with thread block and grid parameters chosen to achieve 100% occupancy, so all blocks of the kernel are always resident on the GPU.
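
As an illustration of that launch configuration, here is one possible way to size the grid so that every block is resident. The occupancy API call is real CUDA, but this specific code is an assumption rather than the benchmark's implementation, and uvm_alloc_ptr/allocation_size are the illustrative names used elsewhere in this post.

// Size the grid so all launched blocks fit on the GPU at once (100% residency).
int device = 0, num_sms = 0, blocks_per_sm = 0;
const int block_size = 128;   // thread block size used in this post

cudaGetDevice(&device);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                              read_thread<float>, block_size, 0);

int grid_size = num_sms * blocks_per_sm;   // every launched block stays resident
read_thread<float><<<grid_size, block_size>>>((float *)uvm_alloc_ptr, allocation_size);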

Hardware setup

The benchmarks in this post were run on one of three hardware setups: a V100 GPU in an x86-based system, an A100 GPU in a DGX A100 (x86, PCIe Gen4), and a V100 GPU paired with a Power9 CPU over NVLink.

We investigate different memory residency techniques to improve oversubscription performance for these access patterns. Fundamentally, we try to eliminate Unified Memory page faults and find the data partitioning strategy that yields the best read bandwidth for the benchmark. In this post, we discuss the following memory modes:

On-demand migration

Zero-copy

Data partitioning between CPU and GPU

In the following sections, we dive into the performance analysis and an explanation of each optimization. We also discuss which workloads are well suited to Unified Memory for handling oversubscription.

Baseline implementation: On-demand migration

In this test case, memory is allocated with cudaMallocManaged and the pages are then populated in system (CPU) memory as follows:

cudaMallocManaged(&uvm_alloc_ptr, allocation_size); // all the pages are initialized on CPU

for (int i = 0; i < num_elements; i++)
    uvm_alloc_ptr[i] = 0.0f;

Then, execute the GPU kernel and measure the performance of the kernel:

read_thread<float><<<num_blocks, block_size>>>((float*)uvm_alloc_ptr, allocation_size);

We used one of the three access patterns described in the previous section. This is the easiest way to use Unified Memory for oversubscription, since no hints from the programmer are required.

During the kernel invocation, the GPU tries to access virtual memory addresses that are resident on the host. This triggers page fault events, causing the corresponding memory pages to be migrated to GPU memory over the CPU-GPU interconnect. Kernel performance is determined by the resulting page fault pattern and the speed of the CPU-GPU interconnect.

The page fault pattern is dynamic, as it depends on the scheduling of blocks and warps on the streaming multiprocessors and, in turn, on when memory load instructions are issued by the GPU threads.

Figure 4. NVIDIA Nsight Systems timeline view of the grid stride `read_thread` kernel execution. The HtoD and DtoH transfers shown on the memory line are due to page fault migrations to the GPU and evictions from the GPU.

Figure 5 shows how page faults are serviced on non-oversubscribed and oversubscribed GPUs. Under oversubscription, memory pages are first evicted from GPU memory to system memory, and then the requested memory is migrated from the CPU to the GPU.

Figure 5 Page fault service and data eviction mechanism.

Figure 6 shows the memory bandwidth obtained with the different access patterns on the V100, the A100, and the V100 paired with a Power9 CPU.

Figure 6 Read bandwidth for baseline memory allocation

Sequential access analysis

The differences in page-fault-driven memory read bandwidth across access patterns and platforms can be explained by the following factors:

Impact of access pattern: Traditionally, the grid stride access pattern has been known to achieve maximum memory bandwidth when accessing GPU-resident memory. Here, however, the block stride access pattern achieves higher memory bandwidth because of the page fault traffic it generates. It is also worth noting that the default system memory page size on Power9 CPUs is 64 KB, compared to 4 KB on x86 systems. This allows the Unified Memory fault migration to move larger chunks of memory from the CPU to the GPU when a page fault event is triggered.

Sensitivity to GPU architecture and interconnect: The DGX A100 has a faster PCIe Gen4 interconnect between the CPU and GPU, which may appear to explain why the A100 achieves higher bandwidth. However, the interconnect bandwidth is not saturated; the main factor is that the A100 GPU, with its 108 streaming multiprocessors, can generate more page faults because more thread blocks are active on the GPU. The Power9 results confirm this understanding: although the theoretical peak bandwidth of the NVLink connection between CPU and GPU is 75 GB/s, the achieved read bandwidth is lower than on the A100.

Tip: During the experiments for this post, we found that the streaming grid stride and block stride kernel access patterns are not sensitive to thread block size or intra-block synchronization. However, to get better performance with the other optimizations discussed here, we used 128 threads per block with intra-block synchronization on each unrolled loop iteration. This ensures that all warps of a block make effective use of the SM's address translation unit. To understand the kernel design for intra-block synchronization, see the source code published with this post, and try the synchronized and unsynchronized variants with different block sizes.
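
To make the tip concrete, here is a hedged sketch (an assumption, not the repo's exact kernel) of a grid stride read with a block-wide synchronization per loop iteration; every thread executes the same number of iterations so that __syncthreads is safe.

// Grid stride read with intra-block synchronization on every iteration.
template <typename data_type>
__global__ void read_thread_sync(data_type *ptr, const size_t size)
{
    const size_t n = size / sizeof(data_type);
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    const size_t iters = (n + stride - 1) / stride;   // same count for all threads

    data_type accum = 0;
    size_t idx = threadIdx.x + blockIdx.x * (size_t)blockDim.x;

    for (size_t i = 0; i < iters; i++, idx += stride) {
        if (idx < n)
            accum += ptr[idx];
        __syncthreads();   // keep all warps of the block in lockstep
    }

    if (threadIdx.x == 0)
        ptr[0] = accum;
}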

Random access analysis

In the oversubscribed regime on the x86 platforms, the random warp access pattern achieves only a few hundred KB/s of read bandwidth because of the many page faults and the resulting memory migration from CPU to GPU. Because the access is random, only a fraction of the migrated memory is actually used, and the migrated memory may eventually be evicted back to the CPU to make room for other memory fragments.

On Power9 systems, however, access counters are enabled, so the GPU initially accesses the memory remotely from its CPU mapping and not every accessed memory fragment is immediately migrated to the GPU. This results in consistent memory read bandwidth and less memory thrashing than on the x86 systems.

Optimization 1: Direct access to system memory (zero copy)

In addition to migrating memory pages from system memory to GPU memory over the interconnect, you can also access pinned system memory directly from the GPU. This method of memory allocation is also known as zero-copy memory.

Pinned system memory can be allocated with the CUDA API call cudaMallocHost, or from the Unified Memory interface by setting the preferred location of the virtual address range to the CPU:

cudaMemAdvise(uvm_alloc_ptr, allocation_size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

cudaMemAdvise(uvm_alloc_ptr, allocation_size, cudaMemAdviseSetAccessedBy, current_gpu_device);
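
Alternatively, pinned system memory can be allocated directly with cudaMallocHost; the pointer name below is illustrative.

float *zero_copy_ptr = nullptr;
cudaMallocHost((void **)&zero_copy_ptr, allocation_size);   // pinned host memory, directly accessible from the GPU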

Figure 7. NVIDIA Nsight Systems timeline view of the grid stride `read_thread` kernel directly accessing pinned system memory. There are no page fault events or memory transfers in either direction.

Figure 8 Data access path of zero-copy memory

Figure 9 shows the memory bandwidth achieved by the read kernels. On the x86 platform, the A100 GPU can achieve higher bandwidth than the V100 because of the faster PCIe Gen4 interconnect between the CPU and GPU on the DGX A100. Similarly, the Power9 system achieves peak bandwidth close to the interconnect bandwidth with the grid stride access pattern. The grid stride bandwidth on the A100 GPU decreases with increasing oversubscription because GPU MMU address translation misses add latency to the load instructions.

Figure 9 Memory read bandwidth of zero-copy memory

For all tested systems, random warp access yields a constant bandwidth of 3-4 GB/s in the oversubscribed regime. This is much better than the fault-driven scenario presented earlier.

Takeaways

It is clear from the data that the zero-copy approach achieves higher bandwidth than the baseline. Pinned system memory is advantageous when you want to avoid the overhead of unmapping and mapping memory between the CPU and GPU. If the application uses the allocated data only once, direct access through zero-copy memory is the better option. However, if the application reuses data, faulting on and migrating the data to the GPU can deliver higher aggregate bandwidth, depending on the access pattern and the amount of reuse.

Optimization 2: Direct memory access with data partitioning between CPU and GPU

With the fault-driven migration explained earlier, the GPU MMU incurs additional stall overhead until the required memory range is available on the GPU. To overcome this overhead, you can distribute the memory between the CPU and GPU and map the CPU-resident portion into the GPU's address space for fault-free memory access.

There are several ways to distribute memory between the CPU and GPU:

An explicit cudaMemPrefetchAsync API call with the SetAccessedBy Unified Memory hint set for the memory allocation (a sketch of this approach follows the list).

Manual hybrid memory distribution between CPU and GPU, with manual prefetching and the SetPreferredLocation and SetAccessedBy hints.
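
A hedged sketch of the first approach (a) as we read it; the variable names (uvm_alloc_ptr, allocation_size, gpu_memory_size, current_gpu_device) are assumptions carried over from the earlier snippets, not the benchmark's code.

// Option (a), illustrative: map the whole allocation to the GPU with SetAccessedBy,
// then bulk-prefetch the portion that fits into GPU memory. The remainder stays in
// system memory but can still be read from the GPU without faulting.
cudaMemAdvise(uvm_alloc_ptr, allocation_size, cudaMemAdviseSetAccessedBy, current_gpu_device);

size_t prefetch_bytes = (allocation_size < gpu_memory_size) ? allocation_size : gpu_memory_size;
cudaMemPrefetchAsync(uvm_alloc_ptr, prefetch_bytes, current_gpu_device);
cudaDeviceSynchronize();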

We find that the two approaches perform similarly for most combinations of access patterns and architectures, with only a few exceptions. In this section, we mainly discuss the manual page distribution. You can find the code for both in the unified-memory-oversubscription GitHub repo.

Figure 10 Memory access paths for pages allocated to GPU and CPU memory

In the hybrid memory distribution, a subset of the memory pages is pinned to the CPU, and that memory is explicitly mapped to the GPU device by setting the SetAccessedBy hint with the cudaMemAdvise API call. In our test case, we map the excess memory pages to the CPU in a round-robin fashion, where the fraction mapped to the CPU depends on the degree of GPU oversubscription. For example, with an oversubscription factor of 1.5, every third page is mapped to the CPU; with an oversubscription factor of 2.0, every other page is mapped to the CPU.

In our experiments, the memory page size was set to 2 MB, which is the largest page size that the GPU MMU can operate on.
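
Below is a hedged sketch of that manual round-robin distribution; it assumes the allocation size is a multiple of the 2MB page size, an oversubscription factor greater than 1.0, and the same illustrative variable names used earlier, so it is not the benchmark's exact code.

// Manual page distribution (illustrative): pin every Nth 2MB page to the CPU and
// map it to the GPU; prefetch the remaining pages into GPU memory.
constexpr size_t page_size = 2 * 1024 * 1024;                 // 2MB GPU pages
size_t num_pages = allocation_size / page_size;               // assumes exact multiple

// Excess fraction mapped to the CPU: factor 1.5 -> every 3rd page, 2.0 -> every 2nd page.
int cpu_page_stride = (int)(oversubscription_factor / (oversubscription_factor - 1.0) + 0.5);

for (size_t page = 0; page < num_pages; page++) {
    char *page_ptr = (char *)uvm_alloc_ptr + page * page_size;

    if (page % cpu_page_stride == 0) {
        // Pin this page to system memory and map it for direct (fault-free) GPU access.
        cudaMemAdvise(page_ptr, page_size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
        cudaMemAdvise(page_ptr, page_size, cudaMemAdviseSetAccessedBy, current_gpu_device);
    } else {
        // Keep this page resident in GPU memory.
        cudaMemPrefetchAsync(page_ptr, page_size, current_gpu_device);
    }
}
cudaDeviceSynchronize();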

Figure 11. 2MB pages spread across CPU and GPU. The Y-axis uses a logarithmic scale.

For oversubscription factors less than 1.0, all memory pages reside on the GPU, and you see higher bandwidth than in the oversubscribed cases. For oversubscription factors greater than 1.0, factors such as the base HBM memory bandwidth and the CPU-GPU interconnect speed determine the final memory read bandwidth.

Tip: When testing on the Power9 system, we encountered interesting behavior with explicit bulk memory prefetching (option a). Because access counters are enabled on Power9 systems, the prefetched memory is not always kept resident on the GPU, and the Unified Memory driver can initiate migration of data from the CPU to the GPU. This causes evictions from the GPU, and the cycle continues for the lifetime of the kernel. The process negatively affects the streaming block stride and grid stride kernels, which achieve lower bandwidth than with the manual page distribution.

Solution: Single GPU Oversubscription

Of the three memory allocation strategies for GPU oversubscription using Unified Memory, the best choice for a given application depends on its memory access pattern and its reuse of GPU-resident memory.

When choosing between fault-driven migration and pinned system memory (zero-copy) allocation, the latter consistently performs better across all platforms and GPUs for this benchmark. If GPU residency of memory sub-regions benefits overall application performance, then distributing memory pages between the GPU and CPU is the better allocation strategy.

Try Unified Memory Optimization

In this post, we reviewed a benchmark with some common access patterns and analyzed performance on various platforms, from x86 to Power9, and on V100 and A100 GPUs. You can use this data as a reference for making projections and to consider whether using Unified Memory in your code would be beneficial. We also covered several data distribution patterns and Unified Memory modes, which can sometimes provide significant performance benefits. For more information, see the unified-memory-oversubscription microbenchmark source code on GitHub.

In previous posts, we demonstrated that Unified Memory-based oversubscription is especially effective for big data analytics and large deep learning models. Try oversubscription with Unified Memory in your code, and let us know how it helps your application performance.

About the author

Chirayu Garg is a Senior Artificial Intelligence Development Technology Engineer at NVIDIA. He works on accelerating deep learning and machine learning applications on GPUs. Previously, he developed video and image processing algorithms for NVIDIA's game streaming service.

