RDTPerformance Sample

Description

This sample provides an example of how to use RTX64-supported Intel® Resource Director Technology (RDT) to optimize the performance of RTSS threads with high performance requirements.

This sample program creates one timer on each RTSS processor within the process affinity mask. Each timer handler is a memory-intensive routine that reads, modifies, and writes back each double floating-point element within a separate block of memory, thereby creating Last Level Cache (LLC) and memory bus contention among parallel running timer threads.
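The read-modify-write pass that each timer handler performs can be sketched as follows. This is a minimal illustration only, not the sample's actual handler: the function name rmw_block and the "add 1.0" modification are assumptions, and no RTAPI timer calls are shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the body of one timer handler: touch every
 * double in the thread's private block so that the block is pulled
 * through the cache hierarchy on each timer period. */
static void rmw_block(volatile double *block, size_t count)
{
    assert(block != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        double value = block[i];   /* read   */
        value += 1.0;              /* modify */
        block[i] = value;          /* write  */
    }
}
```

Because each thread walks its own block, the threads share no data; any slowdown between them comes from contention for the shared L3 cache and the memory bus.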

By assigning a higher-priority Class of Service (CLOS) to a particular timer thread, you can improve its performance by reducing the LLC and memory bus contention from timer threads with a lower-priority CLOS.

The program uses the architectural performance monitoring MSRs (model-specific registers) to monitor the LLC miss count. Optionally, and if available, the program also uses the RDT monitoring features Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM) to monitor L3 Cache occupancy, L3 Total external bandwidth, and L3 Local external bandwidth (see Monitor resource below).

NOTE: This sample can only be built in the RTSSRelease and RTSSDebug configurations.

NOTE: This sample may hang when debugged on machines with a high number of cores.

Source Files

Usage

RDTPerformance.rtss <Number of double elements> <Timer period (us)> <Session (seconds)> <Monitor resource> <CLOS array>

Number of double elements: The number of elements to be read, modified, and written back. Default: 100000.

Timer period (us): The timer period in microseconds. Default: 1000.

Session (seconds): The number of seconds for the sample session. Default: 30.

Monitor resource: The resource type to monitor: 0 = No monitoring; 1 = L3 Cache occupancy; 2 = L3 Total external bandwidth; 3 = L3 Local external bandwidth. Default: 2 (L3 Total external bandwidth).

CLOS array: The CLOS value by which to overwrite each timer thread’s default CLOS.

NOTE: Parameters can only be omitted from the end of the list. To specify a parameter, you must also specify every parameter that precedes it.
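The positional rule above can be illustrated with a small parsing sketch. The defaults are the ones documented in this section; the type rdt_args, the function parse_args, and the MAX_CLOS_ENTRIES limit are hypothetical names introduced for illustration and are not taken from the sample's source.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CLOS_ENTRIES 64   /* illustrative bound, not from the sample */

typedef struct {
    unsigned long elements;   /* Number of double elements, default 100000 */
    unsigned long period_us;  /* Timer period (us), default 1000 */
    unsigned long session_s;  /* Session (seconds), default 30 */
    unsigned long monitor;    /* Monitor resource, default 2 */
    int clos[MAX_CLOS_ENTRIES];
    int clos_count;           /* 0 means "keep each thread's default CLOS" */
} rdt_args;

/* Each argument is identified only by its position, so a parameter can be
 * left at its default only if every parameter after it is omitted too. */
static rdt_args parse_args(int argc, const char *const argv[])
{
    rdt_args a = { 100000UL, 1000UL, 30UL, 2UL, {0}, 0 };
    assert(argc >= 1);
    if (argc > 1) a.elements  = strtoul(argv[1], NULL, 10);
    if (argc > 2) a.period_us = strtoul(argv[2], NULL, 10);
    if (argc > 3) a.session_s = strtoul(argv[3], NULL, 10);
    if (argc > 4) a.monitor   = strtoul(argv[4], NULL, 10);
    for (int i = 5; i < argc && a.clos_count < MAX_CLOS_ENTRIES; ++i)
        a.clos[a.clos_count++] = (int)strtol(argv[i], NULL, 10);
    return a;
}
```

With this scheme, passing only "2000000 1000" leaves Session at 30 and Monitor resource at 2, while supplying a CLOS array requires all four preceding parameters to be present.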

Examples

The following examples were run on a Skylake i9-7900X system in which the L1 and L2 (1 MB) caches were private to each core and the L3 cache (13 MB) was shared among the 10 cores. The system was configured with 5 cores for Windows and 5 cores for RTSS. The RDT CAT and MBA modes were configured through the RTX64 Control Panel as follows:

In the following examples, the acronym RMW stands for Read, Modify, Write.


Example 1: Performance among Timer Threads in the Same CLOS

In this example, each timer thread reads, modifies, and writes about 3 MB of double floating-point data. All timer threads are assigned the same CLOS, so they share the same region of the L3 cache and can contend for it.

Program Parameters

In this scenario, we used the following command line syntax:

Rtssrun RDTPerformance.rtss 2000000 1000 30 2 0 0 0 0 0

Output
RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 RTSS processors, starting at 5
Number of CLOS: 7
L3 CAT capability: Yes
L2 CAT capability: No
MBA capability: Yes
-----------------------------------------------------------
		
RTSS processor #5, Timer priority: 127, CLOS: 0
		
|     RMW Time Slice 	|	 Count	|
|     279 ->    319	|	     1	|
|     319 ->    359	|	  2418	|
|     359 ->    399	|	 27572	|
|     399 ->    439	|	     2	|
|     439 ->    479	|	     1	|
|     479 ->    519	|	     0	|
|     519 ->    559	|	     2	|
|     559 ->    599	|	     2	|
|     599 ->    639	|	     0	|
|     639 ->    679	|	     1	|
		
Total iteration: 29999
Minimum RMW Time: 279 us (occurred at iteration: 3)
Maximum RMW Time: 671 us (occurred at iteration: 23674)
Average RMW Time: 362 us
Last level cache miss: 0x00000000138c04a5 (Cache Line)
Total memory bandwidth: 0x0000000500f96000 (Bytes)
-----------------------------------------------------------
		
RTSS processor #6, Timer priority: 102, CLOS: 0
		
|     RMW Time Slice 	|	 Count	|
|     349 ->    375	|	 29971	|
|     375 ->    401	|	    22	|
|     401 ->    427	|	     1	|
|     427 ->    453	|	     0	|
|     453 ->    479	|	     0	|
|     479 ->    505	|	     0	|
|     505 ->    531	|	     1	|
|     531 ->    557	|	     1	|
|     557 ->    583	|	     1	|
|     583 ->    609	|	     1	|
		
Total iteration: 29998
Minimum RMW Time: 349 us (occurred at iteration: 2)
Maximum RMW Time: 600 us (occurred at iteration: 23673)
Average RMW Time: 360 us
Last level cache miss: 0x0000000013feb4ca (Cache Line)
Total memory bandwidth: 0x00000003d5360000 (Bytes)
...
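
To help read the RMW Time Slice tables, the sketch below shows one way to place a measured RMW time into one of ten equal-width slices. The width rule used here, (max - min) / 10 + 1 with integer division, is an assumption inferred from the printed tables (it reproduces the 40 us slices for the 279 to 671 us range above); it is not taken from the sample's source, and the function names are hypothetical.

```c
#include <assert.h>

#define SLICE_COUNT 10   /* the tables in the output have ten rows */

/* Assumed width rule: wide enough that max_us falls inside the last slice. */
static unsigned slice_width(unsigned min_us, unsigned max_us)
{
    assert(max_us >= min_us);
    return (max_us - min_us) / SLICE_COUNT + 1;
}

/* Zero-based row of the table that a measured time t_us falls into. */
static unsigned slice_index(unsigned t_us, unsigned min_us, unsigned max_us)
{
    assert(t_us >= min_us && t_us <= max_us);
    return (t_us - min_us) / slice_width(min_us, max_us);
}
```

Under this assumption, the average time of 362 us on RTSS processor 5 falls in the third row (359 -> 399), which is also the row with the highest count.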

 


Example 2: Performance among Timer Threads using Different CLOS

In this example, each timer thread reads, modifies, and writes about 3 MB of double floating-point data. Each timer thread is assigned a different CLOS, so each uses a different region of the L3 cache.

Program Parameters

In this scenario, we used the following command line syntax:

Rtssrun RDTPerformance.rtss 2000000 1000 30 2 0 1 2 3 4

Output
RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 RTSS processors, starting at 5
Number of CLOS: 7
L3 CAT capability: Yes
L2 CAT capability: No
MBA capability: Yes
-----------------------------------------------------------
		
RTSS processor #5, Timer priority: 127, CLOS: 0
		
|     RMW Time Slice 	|	 Count	|
|     283 ->    309	|	 29936	|
|     309 ->    335	|	    34	|
|     335 ->    361	|	     0	|
|     361 ->    387	|	     0	|
|     387 ->    413	|	     4	|
|     413 ->    439	|	     0	|
|     439 ->    465	|	     0	|
|     465 ->    491	|	    10	|
|     491 ->    517	|	     6	|
|     517 ->    543	|	     9	|
		
Total iteration: 29999
Minimum RMW Time: 283 us (occurred at iteration: 3)
Maximum RMW Time: 539 us (occurred at iteration: 20577)
Average RMW Time: 294 us
Last level cache miss: 0x00000000045dcc53 (Cache Line)
Total memory bandwidth: 0x00000001ff54a000 (Bytes)
-----------------------------------------------------------
		
RTSS processor #6, Timer priority: 102, CLOS: 1
		
|     RMW Time Slice 	|	 Count	|
|     314 ->    340	|	 29962	|
|     340 ->    366	|	    13	|
|     366 ->    392	|	     0	|
|     392 ->    418	|	     2	|
|     418 ->    444	|	     6	|
|     444 ->    470	|	     0	|
|     470 ->    496	|	     0	|
|     496 ->    522	|	     5	|
|     522 ->    548	|	     4	|
|     548 ->    574	|	     6	|
		
Total iteration: 29998
Minimum RMW Time: 314 us (occurred at iteration: 17710)
Maximum RMW Time: 569 us (occurred at iteration: 129)
Average RMW Time: 324 us
Last level cache miss: 0x000000000ad809e0 (Cache Line)
Total memory bandwidth: 0x0000000285e88000 (Bytes)
...

Based on the above test results, the average RMW time of the timer thread on RTSS processor 5 is reduced from 362 us to 294 us, an approximate 18.8% performance gain obtained by reducing L3 cache contention from other timer threads.

NOTE: The LLC miss count of the timer thread on RTSS processor 5 is reduced from 0x138c04a5 to 0x45dcc53 (a decrease of about 78%). Its L3 Total memory bandwidth is reduced from 0x500f96000 to 0x1ff54a000 (a decrease of about 60%).
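The percentages quoted above can be checked directly from the counters in the two outputs. The helper below is a small worked calculation; the name percent_decrease is introduced here for illustration and is not part of the sample.

```c
#include <assert.h>
#include <stdint.h>

/* Relative reduction from 'before' to 'after', in percent. */
static double percent_decrease(uint64_t before, uint64_t after)
{
    assert(before > 0 && after <= before);
    return 100.0 * (double)(before - after) / (double)before;
}
```

Calling percent_decrease(362, 294) gives roughly 18.8 for the average RMW time, percent_decrease(0x138c04a5, 0x45dcc53) gives roughly 78 for the LLC miss count, and percent_decrease(0x500f96000, 0x1ff54a000) gives roughly 60 for the total memory bandwidth.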

 


Example 3: Performance with Intel® RDT disabled

This example requires Intel® RDT performance optimization to be disabled in the RTX64 Control Panel.

In this example, each timer thread reads, modifies, and writes about 3 MB of double floating-point data.

Program Parameters

In this scenario, we used the following command line syntax:

Rtssrun RDTPerformance.rtss 2000000 1000 30 2

Output
RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 RTSS processors, starting at 5
Number of CLOS: 0
L3 CAT capability: No
L2 CAT capability: No
MBA capability: No
There is no performance difference through CLOS!
-----------------------------------------------------------
		
RTSS processor #5, Timer priority: 127, CLOS: -1
		
|     RMW Time Slice 	|	 Count	|
|     307 ->    359	|	 29855	|
|     359 ->    411	|	    71	|
|     411 ->    463	|	    21	|
|     463 ->    515	|	    10	|
|     515 ->    567	|	    12	|
|     567 ->    619	|	    12	|
|     619 ->    671	|	     4	|
|     671 ->    723	|	     8	|
|     723 ->    775	|	     5	|
|     775 ->    827	|	     1	|
		
Total iteration: 29999
Minimum RMW Time: 307 us (occurred at iteration: 20827)
Maximum RMW Time: 824 us (occurred at iteration: 16713)
Average RMW Time: 318 us
Last level cache miss: 0x00000000084ffd32 (Cache Line)
Total memory bandwidth: 0x00000002855e6000 (Bytes)
-----------------------------------------------------------
		
RTSS processor #6, Timer priority: 102, CLOS: -1
		
|     RMW Time Slice 	|	 Count	|
|     305 ->    352	|	 29813	|
|     352 ->    399	|	   103	|
|     399 ->    446	|	    20	|
|     446 ->    493	|	    14	|
|     493 ->    540	|	    10	|
|     540 ->    587	|	    16	|
|     587 ->    634	|	     6	|
|     634 ->    681	|	     7	|
|     681 ->    728	|	     3	|
|     728 ->    775	|	     6	|
		
Total iteration: 29998
Minimum RMW Time: 305 us (occurred at iteration: 11629)
Maximum RMW Time: 774 us (occurred at iteration: 7805)
Average RMW Time: 316 us
Last level cache miss: 0x00000000082c0d52 (Cache Line)
Total memory bandwidth: 0x00000002f06e8000 (Bytes)
...

 


Summary

Example 1: Each timer thread is set with the same CLOS

Example 2: Each timer thread is set with a different CLOS

Example 3: Intel® RDT is disabled

                          Example 1        Example 2        Example 3
Total iteration           29999            29999            29999
Minimum RMW Time          279 us           283 us           307 us
Maximum RMW Time          671 us           539 us           824 us
Average RMW Time          362 us           294 us           318 us
Last level cache miss     0x138c04a5       0x45dcc53        0x84ffd32      (Cache Line)
Total memory bandwidth    0x500f96000      0x1ff54a000      0x2855e6000    (Bytes)

 

In Examples 1 and 3, there is no differentiation in performance among timer threads. The difference is that in Example 1 the L3 cache space is partitioned between Windows cores and RTSS cores, which removes L3 cache contention between them. As a result, the maximum RMW time is lower in Example 1 (671 us) than in Example 3 (824 us). The average RMW time, however, is higher in Example 1 (362 us) than in Example 3 (318 us) because the RTSS cores have less L3 cache space, which is evident in the higher LLC miss count and total memory bandwidth from the RTSS cores.
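The "Total memory bandwidth" figures are byte totals, so comparing them is easier after converting to an average rate. The sketch below assumes that each counter covers the whole nominal 30-second session; the helper name mib_per_second is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average transfer rate in MiB/s for a byte total accumulated over a session. */
static double mib_per_second(uint64_t total_bytes, unsigned session_seconds)
{
    assert(session_seconds > 0);
    return (double)total_bytes / (1024.0 * 1024.0) / (double)session_seconds;
}
```

Under that assumption, RTSS processor 5 averages roughly 683 MiB/s in Example 1 (0x500f96000 bytes) and roughly 344 MiB/s in Example 3 (0x2855e6000 bytes), which matches the observation that the smaller L3 share in Example 1 pushes more traffic out to memory.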

NOTE: The performance gained by enabling Intel® RDT is much more significant in real-time response latency. This gain can be measured using the SRTM sample.

APIs Referenced

RTAPI