RDTPerformance Sample

Description

This sample provides an example of how to use RTX64-supported Intel® Resource Director Technology (RDT) to optimize the performance of RTSS threads with high performance requirements.

This sample program creates one timer on each RTSS processor within the process affinity mask. Each timer handler is a memory-intensive routine that reads, modifies, and writes back each double-precision floating-point element within a separate block of memory, thereby creating Last Level Cache (LLC) and memory bus contention among timer threads running in parallel.
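
The read-modify-write pass described above can be pictured with a minimal sketch. It is not taken from RDTPerformance.c: the function name `rmw_block`, the `+ 1.0` update, and the use of `volatile` to keep the compiler from eliding the memory traffic are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from RDTPerformance.c: visits every element of
   one block once, reading it, modifying it, and writing it back. */
void rmw_block(volatile double *block, size_t count)
{
    assert(block != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        double value = block[i];   /* read */
        value += 1.0;              /* modify (arbitrary update for the sketch) */
        block[i] = value;          /* write back */
    }
}
```

When one such pass runs on each RTSS processor at the same time, each over its own block, the blocks compete for the shared L3 cache and the memory bus, which is the contention the sample measures.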

By assigning a higher-priority Class of Service (CLOS) to a particular timer thread, one can improve its performance by reducing the LLC and memory bus contention from timer threads with a lower-priority CLOS.

The program uses an architectural performance monitoring MSR (model-specific register) to count LLC misses. Optionally, and if available, the program also uses the RDT monitoring features Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM) to monitor L3 Cache occupancy, L3 Total external bandwidth, and L3 Local external bandwidth (see Monitor resource below).

NOTE: This sample can only be built in the RTSSRelease and RTSSDebug configurations.

NOTE: This sample may hang when debugged on machines with a high number of cores.

Source Files

RDTPerformance.c

Usage

RDTPerformance.rtss <Number of double elements> <Timer period (us)> <Session (seconds)> <Monitor resource> <CLOS array>

Number of double elements: The number of elements to be read, modified, and written back. Default: 100000.

Timer period (us): The timer period in microseconds. Default: 1000.

Session (seconds): The number of seconds for the sample session. Default: 30.

Monitor resource: The resource type to monitor: 0 = No monitoring; 1 = L3 Cache occupancy; 2 = L3 Total external bandwidth; 3 = L3 Local external bandwidth. Default: 2 (L3 Total external bandwidth).

CLOS array: The CLOS values, one per timer thread, that override each timer thread’s default CLOS.

NOTE: Parameters can only be omitted from the end of the list, working backward toward the first parameter.
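
The positional parameters and their documented defaults can be sketched as follows. This is an illustration only and not the parsing code in RDTPerformance.c: the `SampleConfig` type, the `parse_sample_args` function, and the `MAX_CLOS_ENTRIES` bound are hypothetical names, and no validation of the values is attempted.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CLOS_ENTRIES 64   /* assumed upper bound for this sketch */

typedef struct {
    unsigned long elements;      /* Number of double elements */
    unsigned long period_us;     /* Timer period (us) */
    unsigned long session_s;     /* Session (seconds) */
    unsigned long monitor;       /* Monitor resource, 0 to 3 */
    int clos[MAX_CLOS_ENTRIES];  /* CLOS array, one entry per timer thread */
    int clos_count;              /* number of CLOS entries supplied */
} SampleConfig;

/* argv[0] is the program name; positional values follow in usage order.
   Any parameter left off the end of the list keeps its default. */
void parse_sample_args(int argc, char **argv, SampleConfig *cfg)
{
    assert(cfg != NULL);

    /* Defaults as documented in the Usage section. */
    cfg->elements   = 100000;
    cfg->period_us  = 1000;
    cfg->session_s  = 30;
    cfg->monitor    = 2;
    cfg->clos_count = 0;

    if (argc > 1) cfg->elements  = strtoul(argv[1], NULL, 10);
    if (argc > 2) cfg->period_us = strtoul(argv[2], NULL, 10);
    if (argc > 3) cfg->session_s = strtoul(argv[3], NULL, 10);
    if (argc > 4) cfg->monitor   = strtoul(argv[4], NULL, 10);

    /* Everything after the fourth value is treated as the CLOS array. */
    for (int i = 5; i < argc && cfg->clos_count < MAX_CLOS_ENTRIES; ++i)
        cfg->clos[cfg->clos_count++] = (int)strtol(argv[i], NULL, 10);
}
```

Because the values are positional, a CLOS array can only be given when all four preceding parameters are also present, which is why omission works only from the end of the list.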

Examples

The following examples were run on a Skylake i9-7900X system in which the L1 and L2 (1 MB) caches were private to each core and the L3 cache (13 MB) was shared among 10 cores. The system was configured with 5 cores for Windows and 5 cores for RTSS. The RDT CAT and MBA modes were configured through the RTX64 Control Panel as follows:

Cache Allocation Technology (CAT) – Priority-based CLOS performance mode
Memory Bandwidth Allocation (MBA) – Flat performance mode

In the following examples, the acronym RMW stands for Read, Modify, Write.

Example 1: Performance among Timer Threads in the Same CLOS

In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data. Each timer thread is set with the same CLOS, so all threads share the same L3 cache space and can contend for it.

Program Parameters

2000000 double elements
1000-microsecond timer period
30 seconds per session
Monitor L3 Total external bandwidth
CLOS set to 0, 0, 0, 0, 0 respectively for each timer thread

In this scenario, we used the following command line syntax:

Rtssrun RDTPerformance.rtss 2000000 1000 30 2 0 0 0 0 0

Output

RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 RTSS processors, starting at 5
Number of CLOS: 7
L3 CAT capability: Yes
L2 CAT capability: No
MBA capability: Yes

-----------------------------------------------------------

		
RTSS processor #5, Timer priority: 127, CLOS: 0

		
|     RMW Time Slice 	|	 Count	|
|     279 ->    319	|	     1	|
|     319 ->    359	|	  2418	|
|     359 ->    399	|	 27572	|
|     399 ->    439	|	     2	|
|     439 ->    479	|	     1	|
|     479 ->    519	|	     0	|
|     519 ->    559	|	     2	|
|     559 ->    599	|	     2	|
|     599 ->    639	|	     0	|
|     639 ->    679	|	     1	|

		
Total iteration: 29999
Minimum RMW Time: 279 us (occurred at iteration: 3)
Maximum RMW Time: 671 us (occurred at iteration: 23674)
Average RMW Time: 362 us
Last level cache miss: 0x00000000138c04a5 (Cache Line)
Total memory bandwidth: 0x0000000500f96000 (Bytes)

-----------------------------------------------------------

		
RTSS processor #6, Timer priority: 102, CLOS: 0

		
|     RMW Time Slice 	|	 Count	|
|     349 ->    375	|	 29971	|
|     375 ->    401	|	    22	|
|     401 ->    427	|	     1	|
|     427 ->    453	|	     0	|
|     453 ->    479	|	     0	|
|     479 ->    505	|	     0	|
|     505 ->    531	|	     1	|
|     531 ->    557	|	     1	|
|     557 ->    583	|	     1	|
|     583 ->    609	|	     1	|

		
Total iteration: 29998
Minimum RMW Time: 349 us (occurred at iteration: 2)
Maximum RMW Time: 600 us (occurred at iteration: 23673)
Average RMW Time: 360 us
Last level cache miss: 0x0000000013feb4ca (Cache Line)
Total memory bandwidth: 0x00000003d5360000 (Bytes)

...

Example 2: Performance among Timer Threads using Different CLOS

In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data. Each timer thread is set with a different CLOS, so the threads use different portions of the L3 cache.

Program Parameters

2000000 double elements
1000-microsecond timer period
30 seconds per session
Monitor L3 Total external bandwidth
CLOS set to 0, 1, 2, 3, 4 respectively for each timer thread

In this scenario, we used the following command line syntax:

Rtssrun RDTPerformance.rtss 2000000 1000 30 2 0 1 2 3 4

Output

RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 RTSS processors, starting at 5
Number of CLOS: 7
L3 CAT capability: Yes
L2 CAT capability: No
MBA capability: Yes

-----------------------------------------------------------

		
RTSS processor #5, Timer priority: 127, CLOS: 0

		
|     RMW Time Slice 	|	 Count	|
|     283 ->    309	|	 29936	|
|     309 ->    335	|	    34	|
|     335 ->    361	|	     0	|
|     361 ->    387	|	     0	|
|     387 ->    413	|	     4	|
|     413 ->    439	|	     0	|
|     439 ->    465	|	     0	|
|     465 ->    491	|	    10	|
|     491 ->    517	|	     6	|
|     517 ->    543	|	     9	|

		
Total iteration: 29999
Minimum RMW Time: 283 us (occurred at iteration: 3)
Maximum RMW Time: 539 us (occurred at iteration: 20577)
Average RMW Time: 294 us
Last level cache miss: 0x00000000045dcc53 (Cache Line)
Total memory bandwidth: 0x00000001ff54a000 (Bytes)

-----------------------------------------------------------

		
RTSS processor #6, Timer priority: 102, CLOS: 1

		
|     RMW Time Slice 	|	 Count	|
|     314 ->    340	|	 29962	|
|     340 ->    366	|	    13	|
|     366 ->    392	|	     0	|
|     392 ->    418	|	     2	|
|     418 ->    444	|	     6	|
|     444 ->    470	|	     0	|
|     470 ->    496	|	     0	|
|     496 ->    522	|	     5	|
|     522 ->    548	|	     4	|
|     548 ->    574	|	     6	|

		
Total iteration: 29998
Minimum RMW Time: 314 us (occurred at iteration: 17710)
Maximum RMW Time: 569 us (occurred at iteration: 129)
Average RMW Time: 324 us
Last level cache miss: 0x000000000ad809e0 (Cache Line)
Total memory bandwidth: 0x0000000285e88000 (Bytes)

...

Based on the above test results, the average RMW time of the timer thread on RTSS processor 5 is reduced from 362 us to 294 us, an approximate 18.8% performance gain obtained by reducing L3 cache contention from other timer threads.

NOTE: The LLC miss count of the timer thread on RTSS processor 5 is reduced from 0x138c04a5 to 0x45dcc53 (a decrease of about 78%). L3 Total memory bandwidth is reduced from 0x500f96000 to 0x1ff54a000 (a decrease of about 60%).
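
The percentages quoted above can be reproduced from the printed counters. The following is a small sketch in standard C with hypothetical helper names; it only converts the hexadecimal values from the output and applies (before - after) / before.

```c
#include <assert.h>
#include <stdlib.h>

/* Parses a counter printed in hexadecimal, such as "0x00000000138c04a5".
   strtoull with base 16 accepts the optional "0x" prefix. */
unsigned long long parse_hex_counter(const char *text)
{
    assert(text != NULL);
    return strtoull(text, NULL, 16);
}

/* Percentage by which `after` is lower than `before`. */
double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

For example, `percent_reduction(362.0, 294.0)` gives the average RMW time improvement, and passing the parsed LLC miss or memory bandwidth counters from Examples 1 and 2 gives the corresponding decreases.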

Example 3: Performance with Intel® RDT disabled

This example requires Intel® RDT performance optimization to be disabled in the RTX64 Control Panel.

In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data.

Program Parameters

2000000 double elements
1000-microsecond timer period
30 seconds per session
Monitor L3 Total external bandwidth

In this scenario, we used the following command line syntax:

Rtssrun RDTPerformance.rtss 2000000 1000 30 2

Output

RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 RTSS processors, starting at 5
Number of CLOS: 0
L3 CAT capability: No
L2 CAT capability: No
MBA capability: No
There is no performance difference through CLOS!

-----------------------------------------------------------

		
RTSS processor #5, Timer priority: 127, CLOS: -1

		
|     RMW Time Slice 	|	 Count	|
|     307 ->    359	|	 29855	|
|     359 ->    411	|	    71	|
|     411 ->    463	|	    21	|
|     463 ->    515	|	    10	|
|     515 ->    567	|	    12	|
|     567 ->    619	|	    12	|
|     619 ->    671	|	     4	|
|     671 ->    723	|	     8	|
|     723 ->    775	|	     5	|
|     775 ->    827	|	     1	|

		
Total iteration: 29999
Minimum RMW Time: 307 us (occurred at iteration: 20827)
Maximum RMW Time: 824 us (occurred at iteration: 16713)
Average RMW Time: 318 us
Last level cache miss: 0x00000000084ffd32 (Cache Line)
Total memory bandwidth: 0x00000002855e6000 (Bytes)

-----------------------------------------------------------

		
RTSS processor #6, Timer priority: 102, CLOS: -1

		
|     RMW Time Slice 	|	 Count	|
|     305 ->    352	|	 29813	|
|     352 ->    399	|	   103	|
|     399 ->    446	|	    20	|
|     446 ->    493	|	    14	|
|     493 ->    540	|	    10	|
|     540 ->    587	|	    16	|
|     587 ->    634	|	     6	|
|     634 ->    681	|	     7	|
|     681 ->    728	|	     3	|
|     728 ->    775	|	     6	|

		
Total iteration: 29998
Minimum RMW Time: 305 us (occurred at iteration: 11629)
Maximum RMW Time: 774 us (occurred at iteration: 7805)
Average RMW Time: 316 us
Last level cache miss: 0x00000000082c0d52 (Cache Line)
Total memory bandwidth: 0x00000002f06e8000 (Bytes)

...

Summary

| RTSS processor 5       | Example 1: Each timer thread is set with the same CLOS | Example 2: Each timer thread is set with a different CLOS | Example 3: Intel® RDT is disabled |
| Total iteration        | 29999                                                  | 29999                                                     | 29999                             |
| Minimum RMW Time       | 279 us                                                 | 283 us                                                    | 307 us                            |
| Maximum RMW Time       | 671 us                                                 | 539 us                                                    | 824 us                            |
| Average RMW Time       | 362 us                                                 | 294 us                                                    | 318 us                            |
| Last level cache miss  | 0x138c04a5 (Cache Line)                                | 0x45dcc53 (Cache Line)                                    | 0x84ffd32 (Cache Line)            |
| Total memory bandwidth | 0x500f96000 (Bytes)                                    | 0x1ff54a000 (Bytes)                                       | 0x2855e6000 (Bytes)               |

In Examples 1 and 3, the timer threads are not differentiated in performance. The difference between the two is that in Example 1 the L3 cache space is partitioned between Windows cores and RTSS cores, which removes L3 cache contention between them. As a result, the maximum RMW time is lower in Example 1 than in Example 3. The average RMW time, however, is higher in Example 1 because the RTSS cores have less L3 cache space, which is evident in the higher LLC miss count and total memory bandwidth from RTSS cores.

NOTE: The performance gained by enabling Intel® RDT is much more significant in real-time response latency. This gain can be measured using the SRTM sample.

APIs Referenced

RTAPI
