RDTPerformance Sample

Description

This sample shows how to use the Intel® Resource Director Technology (RDT) support in eRTOS to improve the performance of specific threads with demanding performance requirements.

This sample program creates one timer on each processor within the process affinity mask. Each timer handler is a memory-intensive routine that reads, modifies, and writes back each double-precision floating-point element within its own block of memory, thereby creating Last Level Cache (LLC) and memory bus contention among the concurrently running timer threads.
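
As an illustration of the kind of routine described above, the following is a minimal sketch of a timed read-modify-write pass over a block of doubles. It is not taken from RDTPerformance.cpp: the function name, the arithmetic applied to each element, and the use of std::chrono instead of eRTOS timer APIs are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the sample source): performs one
// read-modify-write pass over the block and returns the elapsed
// time in microseconds.
std::int64_t TimedReadModifyWrite(std::vector<double>& block)
{
    assert(!block.empty());
    const auto start = std::chrono::steady_clock::now();
    for (double& element : block) {
        // Read the element, modify it, and write it back.
        element = element * 1.000001 + 1.0;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

When several threads run a loop like this at the same time, each over its own block, the blocks compete for the shared L3 cache and the memory bus, which is the contention the sample is designed to expose.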

By assigning a higher-priority Class of Service (CLOS) to a particular timer thread, you can improve its performance by reducing the LLC and memory bus contention caused by timer threads with a lower-priority CLOS.

The program uses the architectural performance monitoring model-specific registers (MSRs) to count LLC misses. Optionally, and if available, the program also uses the RDT monitoring features, Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM), to monitor L3 Cache occupancy, L3 Total external bandwidth, and L3 Local external bandwidth (see Monitor resource below).

Note: This sample can only be built in the eRTOSRelease and eRTOSDebug configurations.

Source Files

File	Description
RDTPerformance.cpp

Usage

RDTPerformance.ertos <Number of double elements> <Timer period (us)> <Session (seconds)> <Monitor resource> <CLOS array>

Number of double elements: The number of elements to be read, modified, and written back. Default: 100000.

Timer period (us): The timer period in microseconds. Default: 1000.

Session (seconds): The number of seconds for the sample session. Default: 30.

Monitor resource: The resource type to monitor: 0 = No monitoring; 1 = L3 Cache occupancy; 2 = L3 Total external bandwidth; 3 = L3 Local external bandwidth. Default: 2 (L3 Total external bandwidth).

CLOS array: The CLOS values, one per timer thread, that override each timer thread’s default CLOS.

Note: Parameters can only be omitted from the end of the list toward the beginning. You cannot omit a parameter and still supply one that follows it.
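
The positional parameters and defaults listed above can be pictured with the following sketch. It is an assumption about how such parsing could work, not the sample's actual code: the SampleOptions structure and ParseOptions function are hypothetical names, and only the defaults come from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the documented parameters and defaults.
struct SampleOptions {
    unsigned long elementCount = 100000;  // Number of double elements
    unsigned long timerPeriodUs = 1000;   // Timer period (us)
    unsigned long sessionSeconds = 30;    // Session (seconds)
    unsigned long monitorResource = 2;    // 0..3, 2 = L3 Total external bandwidth
    std::vector<int> closPerThread;       // empty: keep each thread's default CLOS
};

// Parses positional arguments (program name excluded). Any trailing
// arguments that are missing keep their default values.
SampleOptions ParseOptions(const std::vector<std::string>& args)
{
    SampleOptions options;
    if (args.size() > 0) options.elementCount = std::stoul(args[0]);
    if (args.size() > 1) options.timerPeriodUs = std::stoul(args[1]);
    if (args.size() > 2) options.sessionSeconds = std::stoul(args[2]);
    if (args.size() > 3) options.monitorResource = std::stoul(args[3]);
    assert(options.monitorResource <= 3);
    // Everything after the fourth argument is the CLOS array.
    for (std::size_t i = 4; i < args.size(); ++i) {
        options.closPerThread.push_back(std::stoi(args[i]));
    }
    return options;
}
```

Because the arguments are positional, supplying a CLOS array requires supplying all four parameters before it, which is why parameters can only be dropped from the end of the list.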

Running the Sample

To run the sample:

Boot the system in its Windows Boot Configuration.
Navigate to <InstallDrive>\MaxRT\eRTOS\.
Right-click AutoStart.bat and select Edit.
Find or write the Run command(s) for RDTPerformance.

Note: For more information on Run commands, see Run.

If using pre-written Run commands, remove the comment characters (::) to enable the Run command(s).
Re-boot the system from a GRUB bootable USB drive or hard drive.
Select the desired GRUB boot configuration. See GRUB Boot Configurations for more information.
Upon system boot, the sample(s) will run automatically after the eRTOS Kernel startup.
Sample output will be displayed on the screen when the program ends.
Re-boot the system in its Windows Boot Configuration.
Navigate to <InstallDrive>\MaxRT\eRTOS\.
Open the RtLogFile.txt log file to view sample output.

Examples

The following examples were run on a Skylake i9-7900X system in which the L1 and L2 (1 MB) caches were private to each core and the L3 cache (13 MB) was shared among the 10 cores. The system was configured with 5 cores for Windows and 5 cores for Process. The RDT CAT and MBA modes were configured as follows:

Cache Allocation Technology (CAT) – Priority-based CLOS performance mode
Memory Bandwidth Allocation (MBA) – Flat performance mode

In the following examples, the acronym RMW stands for Read, Modify, Write.

Example 1: Performance among Timer Threads in the Same CLOS

Each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data. All timer threads are set to the same CLOS, so they share the same region of the L3 cache and can contend for it.

Program Parameters

2000000 double elements
1000-microsecond timer period
30 seconds per session
Monitor L3 Total external bandwidth
CLOS set to 0, 0, 0, 0, 0 respectively for each timer thread

In this scenario, we used the following command line syntax:

Run RDTPerformance.ertos 2000000 1000 30 2 0 0 0 0 0

Output

RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 Process processors, starting at 5
Number of CLOS: 7
L3 CAT capability: Yes
L2 CAT capability: No
MBA capability: Yes

-----------------------------------------------------------

		
Process processor #5, Timer priority: 127, CLOS: 0

		
|     RMW Time Slice 	|	 Count	|
|     279 ->    319	|	     1	|
|     319 ->    359	|	  2418	|
|     359 ->    399	|	 27572	|
|     399 ->    439	|	     2	|
|     439 ->    479	|	     1	|
|     479 ->    519	|	     0	|
|     519 ->    559	|	     2	|
|     559 ->    599	|	     2	|
|     599 ->    639	|	     0	|
|     639 ->    679	|	     1	|

		
Total iteration: 29999
Minimum RMW Time: 279 us (occurred at iteration: 3)
Maximum RMW Time: 671 us (occurred at iteration: 23674)
Average RMW Time: 362 us
Last level cache miss: 0x00000000138c04a5 (Cache Line)
Total memory bandwidth: 0x0000000500f96000 (Bytes)

-----------------------------------------------------------

		
Process processor #6, Timer priority: 102, CLOS: 0

		
|     RMW Time Slice 	|	 Count	|
|     349 ->    375	|	 29971	|
|     375 ->    401	|	    22	|
|     401 ->    427	|	     1	|
|     427 ->    453	|	     0	|
|     453 ->    479	|	     0	|
|     479 ->    505	|	     0	|
|     505 ->    531	|	     1	|
|     531 ->    557	|	     1	|
|     557 ->    583	|	     1	|
|     583 ->    609	|	     1	|

		
Total iteration: 29998
Minimum RMW Time: 349 us (occurred at iteration: 2)
Maximum RMW Time: 600 us (occurred at iteration: 23673)
Average RMW Time: 360 us
Last level cache miss: 0x0000000013feb4ca (Cache Line)
Total memory bandwidth: 0x00000003d5360000 (Bytes)

...
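
The RMW Time Slice tables in the output above split the range between the minimum and maximum RMW times into ten slices. The boundaries shown are consistent with a slice width of (max - min) / 10 + 1 using integer division (for example, (671 - 279) / 10 + 1 = 40), but that formula is inferred from the output and is not confirmed by the sample source. The sketch below, with hypothetical names, shows how such a histogram could be computed under that assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kSliceCount = 10;

// Assumed slice width: integer division plus one, so that the
// maximum value always falls inside the last slice.
int SliceWidth(int minUs, int maxUs)
{
    assert(minUs <= maxUs);
    return (maxUs - minUs) / kSliceCount + 1;
}

// Counts how many RMW times fall into each of the ten slices.
std::array<std::uint32_t, kSliceCount> CountSlices(
    const std::vector<int>& rmwTimesUs, int minUs, int maxUs)
{
    const int width = SliceWidth(minUs, maxUs);
    std::array<std::uint32_t, kSliceCount> counts{};
    for (int t : rmwTimesUs) {
        assert(t >= minUs && t <= maxUs);
        ++counts[(t - minUs) / width];
    }
    return counts;
}
```

With a minimum of 279 us and a maximum of 671 us, this gives slices starting at 279, 319, 359, and so on, matching the first table of Example 1.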

Example 2: Performance among Timer Threads using Different CLOS

In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data. Each timer thread is set to a different CLOS, so each thread uses a different region of the L3 cache.

Program Parameters

2000000 double elements
1000-microsecond timer period
30 seconds per session
Monitor L3 Total external bandwidth
CLOS set to 0, 1, 2, 3, 4 respectively for each timer thread

In this scenario, we used the following command line syntax:

Run RDTPerformance.ertos 2000000 1000 30 2 0 1 2 3 4

Output

RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 Process processors, starting at 5
Number of CLOS: 7
L3 CAT capability: Yes
L2 CAT capability: No
MBA capability: Yes

-----------------------------------------------------------

		
Process processor #5, Timer priority: 127, CLOS: 0

		
|     RMW Time Slice 	|	 Count	|
|     283 ->    309	|	 29936	|
|     309 ->    335	|	    34	|
|     335 ->    361	|	     0	|
|     361 ->    387	|	     0	|
|     387 ->    413	|	     4	|
|     413 ->    439	|	     0	|
|     439 ->    465	|	     0	|
|     465 ->    491	|	    10	|
|     491 ->    517	|	     6	|
|     517 ->    543	|	     9	|

		
Total iteration: 29999
Minimum RMW Time: 283 us (occurred at iteration: 3)
Maximum RMW Time: 539 us (occurred at iteration: 20577)
Average RMW Time: 294 us
Last level cache miss: 0x00000000045dcc53 (Cache Line)
Total memory bandwidth: 0x00000001ff54a000 (Bytes)

-----------------------------------------------------------

		
Process processor #6, Timer priority: 102, CLOS: 1

		
|     RMW Time Slice 	|	 Count	|
|     314 ->    340	|	 29962	|
|     340 ->    366	|	    13	|
|     366 ->    392	|	     0	|
|     392 ->    418	|	     2	|
|     418 ->    444	|	     6	|
|     444 ->    470	|	     0	|
|     470 ->    496	|	     0	|
|     496 ->    522	|	     5	|
|     522 ->    548	|	     4	|
|     548 ->    574	|	     6	|

		
Total iteration: 29998
Minimum RMW Time: 314 us (occurred at iteration: 17710)
Maximum RMW Time: 569 us (occurred at iteration: 129)
Average RMW Time: 324 us
Last level cache miss: 0x000000000ad809e0 (Cache Line)
Total memory bandwidth: 0x0000000285e88000 (Bytes)

...

Based on the above test results, the average RMW time of the timer thread on Process processor 5 is reduced from 362 us to 294 us, an improvement of approximately 18.8% that results from reducing L3 cache contention from the other timer threads.

Note: The LLC miss count of the timer thread on Process processor 5 is reduced from 0x138c04a5 to 0x45dcc53 (a decrease of about 78%). Its L3 Total memory bandwidth is reduced from 0x500f96000 to 0x1ff54a000 (a decrease of about 60%).
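
The percentages quoted above follow from the values printed in the Example 1 and Example 2 output. The helper below is a small illustrative sketch (the function name is hypothetical) that reproduces the arithmetic from the raw decimal and hexadecimal figures.

```cpp
#include <cassert>
#include <cstdint>

// Returns the relative decrease from `before` to `after`, in percent.
double PercentDecrease(std::uint64_t before, std::uint64_t after)
{
    assert(before > 0);
    return 100.0 *
           (static_cast<double>(before) - static_cast<double>(after)) /
           static_cast<double>(before);
}
```

For example, PercentDecrease(362, 294) evaluates to about 18.8, PercentDecrease(0x138c04a5, 0x45dcc53) to about 77.7, and PercentDecrease(0x500f96000, 0x1ff54a000) to about 60.1.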

Example 3: Performance with Intel® RDT disabled

In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data.

Program Parameters

2000000 double elements
1000-microsecond timer period
30 seconds per session
Monitor L3 Total external bandwidth

In this scenario, we used the following command line syntax:

Run RDTPerformance.ertos 2000000 1000 30 2

Output

RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
5 Process processors, starting at 5
Number of CLOS: 0
L3 CAT capability: No
L2 CAT capability: No
MBA capability: No
There is no performance difference through CLOS!

-----------------------------------------------------------

		
Process processor #5, Timer priority: 127, CLOS: -1

		
|     RMW Time Slice 	|	 Count	|
|     307 ->    359	|	 29855	|
|     359 ->    411	|	    71	|
|     411 ->    463	|	    21	|
|     463 ->    515	|	    10	|
|     515 ->    567	|	    12	|
|     567 ->    619	|	    12	|
|     619 ->    671	|	     4	|
|     671 ->    723	|	     8	|
|     723 ->    775	|	     5	|
|     775 ->    827	|	     1	|

		
Total iteration: 29999
Minimum RMW Time: 307 us (occurred at iteration: 20827)
Maximum RMW Time: 824 us (occurred at iteration: 16713)
Average RMW Time: 318 us
Last level cache miss: 0x00000000084ffd32 (Cache Line)
Total memory bandwidth: 0x00000002855e6000 (Bytes)

-----------------------------------------------------------

		
Process processor #6, Timer priority: 102, CLOS: -1

		
|     RMW Time Slice 	|	 Count	|
|     305 ->    352	|	 29813	|
|     352 ->    399	|	   103	|
|     399 ->    446	|	    20	|
|     446 ->    493	|	    14	|
|     493 ->    540	|	    10	|
|     540 ->    587	|	    16	|
|     587 ->    634	|	     6	|
|     634 ->    681	|	     7	|
|     681 ->    728	|	     3	|
|     728 ->    775	|	     6	|

		
Total iteration: 29998
Minimum RMW Time: 305 us (occurred at iteration: 11629)
Maximum RMW Time: 774 us (occurred at iteration: 7805)
Average RMW Time: 316 us
Last level cache miss: 0x00000000082c0d52 (Cache Line)
Total memory bandwidth: 0x00000002f06e8000 (Bytes)

...

Summary

Metric	Example 1: Each timer thread is set with the same CLOS	Example 2: Each timer thread is set with a different CLOS	Example 3: Intel® RDT is disabled
Total iteration	29999	29999	29999
Minimum RMW Time	279 us	283 us	307 us
Maximum RMW Time	671 us	539 us	824 us
Average RMW Time	362 us	294 us	318 us
Last level cache miss	0x138c04a5 (Cache Line)	0x45dcc53 (Cache Line)	0x84ffd32 (Cache Line)
Total memory bandwidth	0x500f96000 (Bytes)	0x1ff54a000 (Bytes)	0x2855e6000 (Bytes)

In Examples 1 and 3, no timer thread is favored over the others, so performance is similar across the timer threads. The difference between the two is that in Example 1 the L3 cache space is partitioned between the Windows cores and the Process cores, which removes L3 cache contention between them. As a result, the maximum RMW time in Example 1 is smaller than in Example 3. The average RMW time in Example 1 is larger than in Example 3 because the Process cores have a smaller share of the L3 cache, as shown by their higher LLC miss count and total memory bandwidth.
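
To compare the total memory bandwidth figures in the table more easily, the byte totals can be converted to an average rate over the session. The sketch below uses a hypothetical helper and assumes the counter covers the full 30-second session, which is an approximation.

```cpp
#include <cassert>
#include <cstdint>

// Converts a byte total accumulated over a session into an average
// rate in megabytes per second (1 MB = 1,000,000 bytes).
double AverageMegabytesPerSecond(std::uint64_t totalBytes, unsigned sessionSeconds)
{
    assert(sessionSeconds > 0);
    return static_cast<double>(totalBytes) / sessionSeconds / 1.0e6;
}
```

Under that assumption, the Process processor 5 totals correspond to roughly 716 MB/s in Example 1 (0x500f96000 bytes), 286 MB/s in Example 2 (0x1ff54a000 bytes), and 361 MB/s in Example 3 (0x2855e6000 bytes).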

Note: The performance gain from enabling Intel® RDT is much more significant in real-time response latency. This gain can be measured using the SRTM sample.

APIs Referenced
