RDTPerformance Sample

Description

This sample provides an example of how to use eRTOS-supported Intel® Resource Director Technology (RDT) to optimize the performance of particular threads with high-performance requirements.

This sample program creates one timer on each processor within the process affinity mask. Each timer handler is a memory-intensive routine that reads, modifies, and writes back each double-precision floating-point element within a separate block of memory, thereby creating Last Level Cache (LLC) and memory bus contention among timer threads running in parallel.
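The read-modify-write routine described above can be sketched in portable C++ as follows. This is a minimal illustration, not the code in RDTPerformance.cpp: the function names RmwPass and TimedRmwPassMicroseconds are hypothetical, the increment of 1.0 is an arbitrary modification, and std::chrono stands in for whatever clock the sample actually uses. The eRTOS timer setup is intentionally left out.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// One read-modify-write (RMW) pass: load each double, change it, store it back.
// Adding 1.0 is an arbitrary modification chosen for illustration only.
inline void RmwPass(std::vector<double>& block)
{
    for (double& element : block) {
        element = element + 1.0;
    }
}

// Hypothetical helper: time a single RMW pass in microseconds using the
// standard steady clock (an assumption; the sample may use another clock).
inline long long TimedRmwPassMicroseconds(std::vector<double>& block)
{
    const auto start = std::chrono::steady_clock::now();
    RmwPass(block);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Running one such pass per timer period on each processor, each over its own block, is what would generate the LLC and memory bus pressure the sample measures.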

By assigning a higher-priority Class of Service (CLOS) to a particular timer thread, one can improve its performance by reducing the LLC and memory bus contention caused by timer threads in a lower-priority CLOS.

The program uses an architectural performance-monitoring model-specific register (MSR) to count LLC misses. Optionally, and if available, the program also uses the RDT monitoring features Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM) to monitor L3 cache occupancy, L3 total external bandwidth, and L3 local external bandwidth (see Monitor resource below).

Note: This sample can only be built in the eRTOSRelease and eRTOSDebug configurations.

Source Files

File Description
  • RDTPerformance.cpp

    Usage

    RDTPerformance.ertos <Number of double elements> <Timer period (us)> <Session (seconds)> <Monitor resource> <CLOS array>

    Number of double elements: The number of elements to be read, modified, and written back. Default: 100000.

    Timer period (us): The timer period in microseconds. Default: 1000.

    Session (seconds): The number of seconds for the sample session. Default: 30.

    Monitor resource: The resource type to monitor: 0 = No monitoring; 1 = L3 Cache occupancy; 2 = L3 Total external bandwidth; 3 = L3 Local external bandwidth. Default: 2 (L3 Total external bandwidth).

    CLOS array: The CLOS values, one per timer thread, that override each timer thread’s default CLOS.

    Note: Parameters can only be omitted from the end of the list, working backward toward the first parameter. An omitted parameter takes its default value.
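The positional parameters and their defaults can be sketched as follows. This is an assumption-laden illustration of trailing-parameter omission, not the sample's own parser: the SampleOptions struct and ParseOptions function are hypothetical names, and the defaults are taken from the Usage list above. An empty CLOS array is assumed to mean "keep each thread's default CLOS".

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical container for the positional parameters listed under Usage.
struct SampleOptions {
    unsigned long elementCount = 100000;  // Number of double elements
    unsigned long timerPeriodUs = 1000;   // Timer period (us)
    unsigned long sessionSeconds = 30;    // Session (seconds)
    int monitorResource = 2;              // 2 = L3 Total external bandwidth
    std::vector<int> closArray;           // empty: keep default CLOS (assumed)
};

// Parse positional arguments in order. Any argument missing from the end of
// the list keeps its default, which models "omit only from the end".
inline SampleOptions ParseOptions(const std::vector<std::string>& args)
{
    SampleOptions options;
    std::size_t i = 0;
    if (i < args.size()) options.elementCount = std::strtoul(args[i++].c_str(), nullptr, 10);
    if (i < args.size()) options.timerPeriodUs = std::strtoul(args[i++].c_str(), nullptr, 10);
    if (i < args.size()) options.sessionSeconds = std::strtoul(args[i++].c_str(), nullptr, 10);
    if (i < args.size()) options.monitorResource = std::atoi(args[i++].c_str());
    while (i < args.size()) {
        options.closArray.push_back(std::atoi(args[i++].c_str()));
    }
    return options;
}
```

Because the parameters are purely positional, a value such as the monitor resource cannot be supplied without also supplying every parameter before it.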

    Running the Sample

    To run the sample:

    1. Boot the system in its Windows Boot Configuration.
    2. Navigate to <InstallDrive>\MaxRT\eRTOS\.
    3. Right-click AutoStart.bat and select Edit.
    4. Find or write the Run command(s) for RDTPerformance.

    Note: For more information on Run commands, see Run.

    5. If using pre-written Run commands, remove the comment characters (::) to enable the Run command(s).
    6. Reboot the system from a GRUB bootable USB drive or hard drive.
    7. Select the desired GRUB boot configuration. See GRUB Boot Configurations for more information.
    8. Upon system boot, the sample(s) will run automatically after the eRTOS Kernel startup.
    9. Sample output will be displayed on the screen when the program ends.
    10. Reboot the system in its Windows Boot Configuration.
    11. Navigate to <InstallDrive>\MaxRT\eRTOS\.
    12. Open the RtLogFile.txt log file to view sample output.

    Examples

    The following examples were run on a Skylake i9-7900X system where the L1 and L2 (1 MB) caches were private to each core and the L3 cache (13 MB) was shared among 10 cores. The system was configured with 5 cores for Windows and 5 cores for Process. The RDT CAT and MBA modes were configured as follows:

    In the following examples, the acronym RMW stands for Read, Modify, Write.


    Example 1: Performance among Timer Threads in the Same CLOS

    In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data. Each timer thread is set to the same CLOS, so all timer threads share the same L3 cache space and may contend for it.

    Program Parameters

    In this scenario, we used the following command line syntax:

    Run RDTPerformance.ertos 2000000 1000 30 2 0 0 0 0 0

    Output
    RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
    Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
    5 Process processors, starting at 5
    Number of CLOS: 7
    L3 CAT capability: Yes
    L2 CAT capability: No
    MBA capability: Yes
    -----------------------------------------------------------
    		
    Process processor #5, Timer priority: 127, CLOS: 0
    		
    |     RMW Time Slice 	|	 Count	|
    |     279 ->    319	|	     1	|
    |     319 ->    359	|	  2418	|
    |     359 ->    399	|	 27572	|
    |     399 ->    439	|	     2	|
    |     439 ->    479	|	     1	|
    |     479 ->    519	|	     0	|
    |     519 ->    559	|	     2	|
    |     559 ->    599	|	     2	|
    |     599 ->    639	|	     0	|
    |     639 ->    679	|	     1	|
    		
    Total iteration: 29999
    Minimum RMW Time: 279 us (occurred at iteration: 3)
    Maximum RMW Time: 671 us (occurred at iteration: 23674)
    Average RMW Time: 362 us
    Last level cache miss: 0x00000000138c04a5 (Cache Line)
    Total memory bandwidth: 0x0000000500f96000 (Bytes)
    -----------------------------------------------------------
    		
    Process processor #6, Timer priority: 102, CLOS: 0
    		
    |     RMW Time Slice 	|	 Count	|
    |     349 ->    375	|	 29971	|
    |     375 ->    401	|	    22	|
    |     401 ->    427	|	     1	|
    |     427 ->    453	|	     0	|
    |     453 ->    479	|	     0	|
    |     479 ->    505	|	     0	|
    |     505 ->    531	|	     1	|
    |     531 ->    557	|	     1	|
    |     557 ->    583	|	     1	|
    |     583 ->    609	|	     1	|
    		
    Total iteration: 29998
    Minimum RMW Time: 349 us (occurred at iteration: 2)
    Maximum RMW Time: 600 us (occurred at iteration: 23673)
    Average RMW Time: 360 us
    Last level cache miss: 0x0000000013feb4ca (Cache Line)
    Total memory bandwidth: 0x00000003d5360000 (Bytes)
    ...
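The RMW Time Slice tables in the output divide the range between the minimum and maximum RMW time into 10 slices. The slice widths shown (for example, 40 us for a 279 to 671 us range, and 26 us for a 349 to 600 us range) are consistent with a width of (maximum - minimum) / 10 + 1 using integer division. The sketch below uses that inferred rule; it is an assumption drawn from the printed output, not the sample's confirmed implementation, and the names RmwHistogram and BuildHistogram are hypothetical. Treating each slice as including its lower bound is also an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical result type: 10 slices starting at the minimum observed time.
struct RmwHistogram {
    long long minimum = 0;                  // lower bound of the first slice (us)
    long long sliceWidth = 1;               // width of every slice (us)
    std::array<std::size_t, 10> counts{};   // samples per slice
};

// Bucket RMW times (in microseconds) into 10 equal-width slices.
// Width rule inferred from the sample output: (max - min) / 10 + 1.
inline RmwHistogram BuildHistogram(const std::vector<long long>& timesUs)
{
    RmwHistogram histogram;
    if (timesUs.empty()) {
        return histogram;
    }
    const auto range = std::minmax_element(timesUs.begin(), timesUs.end());
    histogram.minimum = *range.first;
    histogram.sliceWidth = (*range.second - *range.first) / 10 + 1;
    for (long long t : timesUs) {
        // The "+ 1" in the width guarantees the maximum lands in slice 9.
        const auto slice = static_cast<std::size_t>((t - histogram.minimum) / histogram.sliceWidth);
        ++histogram.counts[slice];
    }
    return histogram;
}
```

With a minimum of 279 us and a maximum of 671 us, this rule yields a slice width of 40 us, matching the first table above.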

     


    Example 2: Performance among Timer Threads using Different CLOS

    In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data. Each timer thread is set to a different CLOS, so each timer thread uses a different L3 cache space.

    Program Parameters

    In this scenario, we used the following command line syntax:

    Run RDTPerformance.ertos 2000000 1000 30 2 0 1 2 3 4

    Output
    RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
    Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
    5 Process processors, starting at 5
    Number of CLOS: 7
    L3 CAT capability: Yes
    L2 CAT capability: No
    MBA capability: Yes
    -----------------------------------------------------------
    		
    Process processor #5, Timer priority: 127, CLOS: 0
    		
    |     RMW Time Slice 	|	 Count	|
    |     283 ->    309	|	 29936	|
    |     309 ->    335	|	    34	|
    |     335 ->    361	|	     0	|
    |     361 ->    387	|	     0	|
    |     387 ->    413	|	     4	|
    |     413 ->    439	|	     0	|
    |     439 ->    465	|	     0	|
    |     465 ->    491	|	    10	|
    |     491 ->    517	|	     6	|
    |     517 ->    543	|	     9	|
    		
    Total iteration: 29999
    Minimum RMW Time: 283 us (occurred at iteration: 3)
    Maximum RMW Time: 539 us (occurred at iteration: 20577)
    Average RMW Time: 294 us
    Last level cache miss: 0x00000000045dcc53 (Cache Line)
    Total memory bandwidth: 0x00000001ff54a000 (Bytes)
    -----------------------------------------------------------
    		
    Process processor #6, Timer priority: 102, CLOS: 1
    		
    |     RMW Time Slice 	|	 Count	|
    |     314 ->    340	|	 29962	|
    |     340 ->    366	|	    13	|
    |     366 ->    392	|	     0	|
    |     392 ->    418	|	     2	|
    |     418 ->    444	|	     6	|
    |     444 ->    470	|	     0	|
    |     470 ->    496	|	     0	|
    |     496 ->    522	|	     5	|
    |     522 ->    548	|	     4	|
    |     548 ->    574	|	     6	|
    		
    Total iteration: 29998
    Minimum RMW Time: 314 us (occurred at iteration: 17710)
    Maximum RMW Time: 569 us (occurred at iteration: 129)
    Average RMW Time: 324 us
    Last level cache miss: 0x000000000ad809e0 (Cache Line)
    Total memory bandwidth: 0x0000000285e88000 (Bytes)
    ...

    Based on the above test results, the average RMW time of the timer thread on Process processor 5 is reduced from 362 us to 294 us, resulting in an approximate 18.8% performance gain by reducing L3 cache contention from other timer threads.

    Note: The LLC miss count of the timer thread on Process processor 5 is reduced from 0x138c04a5 to 0x45dcc53 (a decrease of about 78%). Its L3 total memory bandwidth is reduced from 0x500f96000 to 0x1ff54a000 (a decrease of about 60%).
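The percentages quoted above can be reproduced from the printed values. The sketch below is only a worked check of that arithmetic; PercentDecrease and the constant names are hypothetical, and the constants are copied from the Example 1 and Example 2 output for Process processor 5.

```cpp
#include <cstdint>

// Values copied from the output for Process processor 5.
constexpr double kAverageRmwSameClosUs = 362.0;        // Example 1
constexpr double kAverageRmwDifferentClosUs = 294.0;   // Example 2
constexpr std::uint64_t kLlcMissSameClos = 0x138c04a5ULL;
constexpr std::uint64_t kLlcMissDifferentClos = 0x45dcc53ULL;
constexpr std::uint64_t kBandwidthSameClos = 0x500f96000ULL;
constexpr std::uint64_t kBandwidthDifferentClos = 0x1ff54a000ULL;

// Relative decrease from "before" to "after", expressed as a percentage.
inline double PercentDecrease(double before, double after)
{
    return (before - after) / before * 100.0;
}

// Average RMW time: (362 - 294) / 362, approximately 18.8%.
inline double RmwTimeDecrease()
{
    return PercentDecrease(kAverageRmwSameClosUs, kAverageRmwDifferentClosUs);
}

// LLC misses: approximately 78%.
inline double LlcMissDecrease()
{
    return PercentDecrease(static_cast<double>(kLlcMissSameClos),
                           static_cast<double>(kLlcMissDifferentClos));
}

// Total memory bandwidth: approximately 60%.
inline double BandwidthDecrease()
{
    return PercentDecrease(static_cast<double>(kBandwidthSameClos),
                           static_cast<double>(kBandwidthDifferentClos));
}
```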

     


    Example 3: Performance with Intel® RDT disabled

    In this example, each timer thread reads, modifies, and writes about 3 MB of double-precision floating-point data.

    Program Parameters

    In this scenario, we used the following command line syntax:

    Run RDTPerformance.ertos 2000000 1000 30 2

    Output
    RDTPerformance - Sample of performance optimization using Intel RDT (CAT/MBA)
    Number of double elements: 2000000, Timer period: 1000 (us), Session: 30 (seconds), Monitor resource: 2
    5 Process processors, starting at 5
    Number of CLOS: 0
    L3 CAT capability: No
    L2 CAT capability: No
    MBA capability: No
    There is no performance difference through CLOS!
    -----------------------------------------------------------
    		
    Process processor #5, Timer priority: 127, CLOS: -1
    		
    |     RMW Time Slice 	|	 Count	|
    |     307 ->    359	|	 29855	|
    |     359 ->    411	|	    71	|
    |     411 ->    463	|	    21	|
    |     463 ->    515	|	    10	|
    |     515 ->    567	|	    12	|
    |     567 ->    619	|	    12	|
    |     619 ->    671	|	     4	|
    |     671 ->    723	|	     8	|
    |     723 ->    775	|	     5	|
    |     775 ->    827	|	     1	|
    		
    Total iteration: 29999
    Minimum RMW Time: 307 us (occurred at iteration: 20827)
    Maximum RMW Time: 824 us (occurred at iteration: 16713)
    Average RMW Time: 318 us
    Last level cache miss: 0x00000000084ffd32 (Cache Line)
    Total memory bandwidth: 0x00000002855e6000 (Bytes)
    -----------------------------------------------------------
    		
    Process processor #6, Timer priority: 102, CLOS: -1
    		
    |     RMW Time Slice 	|	 Count	|
    |     305 ->    352	|	 29813	|
    |     352 ->    399	|	   103	|
    |     399 ->    446	|	    20	|
    |     446 ->    493	|	    14	|
    |     493 ->    540	|	    10	|
    |     540 ->    587	|	    16	|
    |     587 ->    634	|	     6	|
    |     634 ->    681	|	     7	|
    |     681 ->    728	|	     3	|
    |     728 ->    775	|	     6	|
    		
    Total iteration: 29998
    Minimum RMW Time: 305 us (occurred at iteration: 11629)
    Maximum RMW Time: 774 us (occurred at iteration: 7805)
    Average RMW Time: 316 us
    Last level cache miss: 0x00000000082c0d52 (Cache Line)
    Total memory bandwidth: 0x00000002f06e8000 (Bytes)
    ...

     


    Summary

     

     

    Example 1: Each timer thread is set with the same CLOS
    Example 2: Each timer thread is set with a different CLOS
    Example 3: Intel® RDT is disabled

    Results for the timer thread on Process processor 5:

    Metric                                Example 1      Example 2      Example 3
    Total iteration                       29999          29999          29999
    Minimum RMW Time                      279 us         283 us         307 us
    Maximum RMW Time                      671 us         539 us         824 us
    Average RMW Time                      362 us         294 us         318 us
    Last level cache miss (Cache Line)    0x138c04a5     0x45dcc53      0x84ffd32
    Total memory bandwidth (Bytes)        0x500f96000    0x1ff54a000    0x2855e6000

     

    In Examples 1 and 3, the timer threads are not differentiated from one another in performance. The difference between the two examples is that in Example 1 the L3 cache space is partitioned between Windows cores and Process cores, which removes L3 cache contention between them. As a result, the maximum RMW time in Example 1 (671 us) is smaller than in Example 3 (824 us). The average RMW time in Example 1 (362 us) is larger than in Example 3 (318 us) because the Process cores have less L3 cache space, as shown by the larger LLC miss count and total memory bandwidth from Process cores.

    Note: The performance gained by enabling Intel® RDT is much more significant in real-time response latency. This gain can be measured using the SRTM sample.

    APIs Referenced