Optimizing Performance with Intel Resource Director Technology (RDT)

Intel® Resource Director Technology (RDT) provides resource allocation (control) capabilities that govern how applications use shared resources such as the last-level cache (LLC) and memory bandwidth. These capabilities include Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA).

RDT introduces an intermediate construct called a Class of Service (CLOS), which acts as a resource control tag into which threads can be grouped. Each CLOS has an associated capacity bitmask (CBM) that indicates how much of the cache that CLOS can use. RDT uses CLOS to configure the L3/L2 cache size and the memory throttle (delay). As a matter of RTX64 policy for extensibility, CLOS 0 is configured as the highest-priority CLOS, followed by CLOS 1, and so on. In the RTX64 context, CLOS is distinct from thread priority but can affect a thread’s behavior.

Available RDT Modes

As its baseline setup for Intel® RDT, RTX64 partitions L3/L2 cache space between Windows cores and RTSS cores, which removes cache contention from Windows and other system activity. RTX64 configures Windows cores with maximum memory throttle and RTSS cores with zero memory throttle. To further differentiate performance among RTSS threads running in parallel, RTX64 provides two RDT modes for CAT and MBA: Flat performance mode and Priority-based CLOS performance mode.

NOTE: Windows performance may be degraded when performance optimization with Intel® RDT is enabled. If the degradation is unacceptable for your Windows workloads, disable performance optimization with Intel® RDT in the RTX64 Control Panel.

Cache Allocation Technology (CAT) Modes

Mode: Description
Flat performance mode: All RTSS logical processors are configured equally, each with access to all RTSS L3/L2 cache.
Priority-based CLOS performance mode: The L3/L2 cache available to each RTSS logical processor is based on the CLOS of the running thread (see Understanding Class of Service below). This improves the performance of threads with a higher-priority CLOS by reducing L3/L2 cache contention from threads with a lower-priority CLOS.

Memory Bandwidth Allocation (MBA) Modes

Mode: Description
Flat performance mode: All RTSS cores are configured with minimum memory delay.
Priority-based CLOS performance mode: The memory delay of each RTSS logical processor is based on the CLOS of the running thread (see Understanding Class of Service below). This protects a bandwidth-intensive thread with a higher-priority CLOS from performance degradation by throttling threads that may be over-utilizing memory bandwidth relative to their priority.

Since MBA uses a programmable rate controller between the cores and the last-level shared caches and memory controller, bandwidth to these caches may also be reduced.

NOTE: We recommend you only throttle bandwidth-intense threads that do not use the off-core caches effectively.

You can configure the CAT/MBA modes through the RTX64 Control Panel. For more information, see Configuring Intel® Resource Director Technology (RDT) Settings.

Understanding Class of Service (CLOS)

When the Subsystem is configured to use Priority-based CLOS performance mode, the Class of Service (CLOS) of an RTSS thread is based on the thread’s priority. CLOS and priority are inversely mapped: the RTSS priority range (0~127) is divided by the number of CLOS available to RTSS processors/cores, and higher priorities map to lower CLOS numbers. Use the real-time function RtGetRDTCapability to determine the number of CLOS. In Flat performance mode, every thread’s CLOS defaults to 0.
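The inverse mapping can be sketched as a small calculation. The split below is an assumption: it divides the 128 RTSS priorities into contiguous bands of nearly equal size and gives any leftover priorities to the highest-priority CLOS values. That happens to reproduce the seven-CLOS example table in this topic, but it is not taken from the RTX64 implementation, and the function names are hypothetical. It does not call any RTX64 API.

```python
RTSS_PRIORITY_COUNT = 128  # RTSS priorities run from 0 to 127


def clos_priority_ranges(num_clos):
    """Return {clos: (lowest_priority, highest_priority)} for RTSS CLOS values.

    Assumption: priorities are split into num_clos contiguous bands of nearly
    equal size, and leftover priorities go to the highest-priority CLOS values.
    This matches the seven-CLOS example table; the real algorithm may differ.
    """
    base, extra = divmod(RTSS_PRIORITY_COUNT, num_clos)
    ranges = {}
    high = RTSS_PRIORITY_COUNT - 1
    for clos in range(num_clos):
        # The first `extra` CLOS values each take one additional priority.
        size = base + 1 if clos < extra else base
        low = high - size + 1
        ranges[clos] = (low, high)
        high = low - 1
    return ranges


def clos_for_priority(priority, num_clos):
    """Look up the CLOS whose band contains the given RTSS priority."""
    for clos, (low, high) in clos_priority_ranges(num_clos).items():
        if low <= priority <= high:
            return clos
    raise ValueError("priority must be in the range 0 to 127")


if __name__ == "__main__":
    for clos, (low, high) in clos_priority_ranges(7).items():
        print(f"CLOS {clos}: priorities {low}~{high}")
```

With a single CLOS, as in Flat performance mode, the same calculation places every priority in CLOS 0.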

Windows and RTSS CLOS use non-overlapping L3 cache Capacity Bitmasks (CBMs), so they do not compete for space in the last-level cache (LLC). The RTSS CLOS use overlapping bitmasks among themselves, so they are not completely separated from one another. Overlap is used because it often yields higher throughput when threads run concurrently, while still preserving relative priorities.

In the example below, seven CLOS (0 through 6) are available to RTSS, and CLOS 7 is assigned to Windows cores. The table shows the RTSS priority range mapped to each CLOS, along with its L3 cache Capacity Bitmask (CBM) and MBA delay.

CLOS  L3 CBM         MBA Delay  RTSS Priority Range
0     001,1111,1111  0          109~127
1     000,0111,1111  20         90~108
2     000,0001,1111  40         72~89
3     000,0000,1111  50         54~71
4     000,0000,0111  60         36~53
5     000,0000,0011  70         18~35
6     000,0000,0001  80         0~17
7     110,0000,0000  90         Windows cores

NOTE: The number of CLOS, the number of CBM bits, and the range of MBA delay values depend on the processor family. The above example is based on a Skylake i9-7900X system. If Flat performance mode is configured for both CAT and MBA in the RTX64 Control Panel, the number of CLOS available to RTSS is one.
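To make the bitmask layout in the example table easier to check, here is a minimal sketch that parses the L3 CBM strings from the table and tests two properties: the Windows mask shares no bits with any RTSS mask, and each lower-priority RTSS mask is contained in the next higher-priority one. The helper names are hypothetical, and the code only does arithmetic on the table values; it does not call any RTX64 API.

```python
# L3 CBM strings copied from the example table (CLOS 7 is the Windows CLOS).
EXAMPLE_CBM = {
    0: "001,1111,1111",
    1: "000,0111,1111",
    2: "000,0001,1111",
    3: "000,0000,1111",
    4: "000,0000,0111",
    5: "000,0000,0011",
    6: "000,0000,0001",
    7: "110,0000,0000",
}
WINDOWS_CLOS = 7


def parse_cbm(text):
    """Convert a comma-grouped binary mask such as '001,1111,1111' to an int."""
    return int(text.replace(",", ""), 2)


def bit_count(mask):
    """Number of set bits in the mask, i.e. cache portions the CLOS may use."""
    return bin(mask).count("1")


masks = {clos: parse_cbm(text) for clos, text in EXAMPLE_CBM.items()}
rtss_masks = [masks[clos] for clos in sorted(masks) if clos != WINDOWS_CLOS]

# Windows shares no cache bits with any RTSS CLOS.
windows_isolated = all(masks[WINDOWS_CLOS] & m == 0 for m in rtss_masks)

# Each lower-priority RTSS mask is a subset of the next higher-priority mask.
rtss_nested = all(low & high == low for high, low in zip(rtss_masks, rtss_masks[1:]))

if __name__ == "__main__":
    total_bits = len(EXAMPLE_CBM[0].replace(",", ""))
    for clos in sorted(masks):
        print(f"CLOS {clos}: 0x{masks[clos]:03X}, "
              f"{bit_count(masks[clos])} of {total_bits} bits")
    print("Windows isolated from RTSS:", windows_isolated)
    print("RTSS masks nested:", rtss_nested)
```

Both checks hold for the table values: the RTSS masks overlap with one another but never with the Windows mask.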

The CLOS of an RTSS thread can be overridden with the real-time function RtSetThreadCLOS, available from the RTX64 SDK.

NOTE: The CLOS of an RTSS thread does not affect the thread scheduler’s selection of the current thread, which is based on thread priority. The CLOS of a thread affects its performance only while it is scheduled as the current thread.