AMX_IPClatency Sample

Description

This program measures the thread-switch latency between two threads that use Intel® AMX instructions. The main thread creates two threads (T1, T2) with different priorities and two events for synchronization between them. Each thread creates an AMX tile configuration, initializes two source matrices, and then enters a loop.
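As a rough illustration of the tile configuration step, the sketch below fills the 64-byte configuration block that AMX expects. The field layout follows the documented AMX format, but the names TileConfig and init_tile_config are hypothetical, and the 16-row by 64-byte tile shape for tmm1, tmm2, and tmm3 is an assumption rather than something taken from AMX_IPClatency.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte tile configuration block (AMX palette 1 uses tiles 0-7). */
typedef struct {
    uint8_t  palette_id;    /* byte 0: palette 1 selects the 8-tile layout */
    uint8_t  start_row;     /* byte 1: restart row for interrupted loads/stores */
    uint8_t  reserved[14];  /* bytes 2-15: must be zero */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row, one entry per tile */
    uint8_t  rows[16];      /* bytes 48-63: row count, one entry per tile */
} TileConfig;

_Static_assert(sizeof(TileConfig) == 64, "tile configuration must be 64 bytes");

/* Hypothetical helper: describe tmm1 (destination) and tmm2/tmm3 (sources)
   as full-size tiles. The 16 x 64-byte shape is an assumption. */
static void init_tile_config(TileConfig *cfg)
{
    assert(cfg != NULL);
    memset(cfg, 0, sizeof(*cfg));   /* unused tiles and reserved bytes stay zero */
    cfg->palette_id = 1;
    for (int t = 1; t <= 3; ++t) {
        cfg->colsb[t] = 64;         /* 64 bytes per row */
        cfg->rows[t]  = 16;         /* 16 rows */
    }
}
```

On AMX-capable hardware, a block filled this way would typically be passed to the _tile_loadconfig intrinsic from <immintrin.h>. The sketch stops short of that call so that it also compiles on machines without AMX.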

In each loop iteration, both threads zero a destination matrix, load the source matrices into two tiles (tmm2, tmm3), and load the destination matrix into tile tmm1. The T2 thread records the start time (p1) and signals the T1 thread to run. A thread switch occurs if T1 has affinity to the same core as T2. If T1 has affinity to a different core and is the highest-priority runnable thread on that core, T1 also runs immediately. Once T1 runs, it records the switch end time (p2), computes the dot-product of bytes in tmm2 and tmm3, accumulates the result in tmm1, and stores tmm1 to memory. The results are validated against the expected data.
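The expected data for the validation step can be produced by a scalar model of the tile operation. The following is a minimal sketch that assumes signed bytes on both sources and 16-row by 64-byte tiles. The source does not state which AMX dot-product variant the sample uses, and the names ref_dot_product_bytes and tiles_match are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    TILE_ROWS  = 16,              /* rows per tile (assumed) */
    TILE_COLSB = 64,              /* bytes per tile row (assumed) */
    DST_COLS   = TILE_COLSB / 4   /* 32-bit results per destination row */
};

/* Scalar model of a signed byte dot-product on tiles:
   dst[m][n] += sum over k and i of a[m][4k+i] * b[k][4n+i], with i in 0..3.
   All matrices are stored row-major in flat arrays. */
static void ref_dot_product_bytes(int32_t *dst, const int8_t *a, const int8_t *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (int m = 0; m < TILE_ROWS; ++m)
        for (int k = 0; k < TILE_ROWS; ++k)       /* row of b, 4-byte group of a */
            for (int n = 0; n < DST_COLS; ++n)
                for (int i = 0; i < 4; ++i)
                    dst[m * DST_COLS + n] +=
                        (int32_t)a[m * TILE_COLSB + 4 * k + i] *
                        (int32_t)b[k * TILE_COLSB + 4 * n + i];
}

/* Returns 1 when the stored tile equals the expected data, 0 otherwise. */
static int tiles_match(const int32_t *actual, const int32_t *expected)
{
    return memcmp(actual, expected,
                  sizeof(int32_t) * TILE_ROWS * DST_COLS) == 0;
}
```

With every byte of the first source set to 1 and every byte of the second set to 2, each destination element accumulates 64 products of 2, giving 128. A fixed pattern like this makes a corrupted tile easy to spot after a thread switch.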

When thread T2 is switched back in or continues to run, it computes the dot-product of bytes in tmm2 and tmm3, stores the result from tmm1 to memory, and validates it against the expected data. During a thread switch, the OS saves and restores the thread’s tile configuration (TILECFG) and tile register contents (TILEDATA) with the XSAVES/XRSTORS instructions. If the OS switches TILECFG or TILEDATA incorrectly, the program raises a general protection fault during execution, or the matrix multiplication results fail the validation check.

The measured latency (p2-p1) is the time taken to switch threads that use AMX, including the time spent in RtSetEvent, RtWaitForSingleObject, XSAVES/XRSTORS, and related scheduling work.

Source Files

File Description
AMX_IPClatency.c  

Usage

Run AMX_IPClatency.ertos <Loop Count> <T1 Ideal Core #> <T2 Ideal Core #>

Loop Count: Integer specifying the number of times the loop is to be performed. The default value is 10000.

T1 Ideal Core #: Integer specifying the ideal core number for thread T1. The default value is the core number where this program is running.

T2 Ideal Core #: Integer specifying the ideal core number for thread T2. The default value is the core number where this program is running.
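The defaults described above could be applied along the following lines. This is only a sketch: SampleArgs and parse_sample_args are hypothetical names, the current core number is passed in by the caller because how the sample queries it is not shown here, and range validation is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the three command-line parameters. */
typedef struct {
    long loop_count;   /* number of loop iterations */
    int  t1_core;      /* ideal core for thread T1 */
    int  t2_core;      /* ideal core for thread T2 */
} SampleArgs;

/* Apply the documented defaults, then override them with any arguments
   that are present. current_core is supplied by the caller. */
static SampleArgs parse_sample_args(int argc, char *argv[], int current_core)
{
    assert(argv != NULL);
    SampleArgs args = { 10000, current_core, current_core };
    if (argc > 1) args.loop_count = strtol(argv[1], NULL, 10);
    if (argc > 2) args.t1_core = (int)strtol(argv[2], NULL, 10);
    if (argc > 3) args.t2_core = (int)strtol(argv[3], NULL, 10);
    return args;
}
```

With this approach, omitting trailing arguments leaves the remaining defaults in place, so a command line that gives only a loop count keeps both threads on the current core.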

Example

In this example, the sample runs on a 4th Gen Intel® Xeon® Scalable system with the following command line:

run AMX_IPClatency.ertos 10000 0 0

Output
AMX_IPClatency runs. Samples: 10000, T1 ideal core #: 0, T2 ideal core #: 0
T1 thread run on core #: 0
Main thread enters SUSPEND!!!
T2 thread run on core #: 0
T1 thread end on core #: 0
T2 thread end on core #: 0
**** AMX_IPClatency: PASS ****
samples = 10000
min is 6 (100ns)
max is 130 (100ns) (occurred at loop 4503)
ave is 6 (100ns).

HISTOGRAM
0  -     6 (100ns):       0 ***
6  -     7 (100ns):    9899
7  -     8 (100ns):       4
8  -     9 (100ns):       1
9  -    15 (100ns):       0 ***
15  -    16 (100ns):       5
16  -    17 (100ns):       1
17  -    18 (100ns):       2
18  -    19 (100ns):       0 ***
19  -    20 (100ns):       1
20  -    22 (100ns):       0 ***
22  -    23 (100ns):       1
23  -    24 (100ns):       0 ***
24  -    25 (100ns):       1
25  -    29 (100ns):       0 ***
29  -    30 (100ns):       1
30  -    31 (100ns):       1
31  -    33 (100ns):       0 ***
33  -    34 (100ns):       2
34  -    44 (100ns):       0 ***
44  -    45 (100ns):       1
45  -    50 (100ns):       0 ***
50  -    51 (100ns):       1
51  -    52 (100ns):       1
52  -    54 (100ns):       0 ***
54  -    55 (100ns):       1
55  -    56 (100ns):       0 ***
56  -    57 (100ns):       1
57  -    58 (100ns):       1
58  -    59 (100ns):       0 ***
59  -    60 (100ns):       1
60  -    61 (100ns):       1
61  -    63 (100ns):       0 ***
63  -    64 (100ns):       1
64  -    66 (100ns):       0 ***
66  -    67 (100ns):       1
67  -    68 (100ns):       3
68  -    72 (100ns):       0 ***
72  -    73 (100ns):       1
73  -    74 (100ns):       1
74  -    75 (100ns):       2
75  -    80 (100ns):       0 ***
80  -    81 (100ns):       1
81  -    84 (100ns):       0 ***
84  -    85 (100ns):       1
85  -    86 (100ns):       3
86  -    88 (100ns):       0 ***
88  -    89 (100ns):       1
89  -    90 (100ns):       0 ***
90  -    91 (100ns):       1
91  -    95 (100ns):       0 ***
95  -    96 (100ns):       1
96  -    97 (100ns):       1
97  -    98 (100ns):       5
98  -    99 (100ns):       1
99  -   100 (100ns):       6
100  -   101 (100ns):       4
101  -   102 (100ns):       1
102  -   103 (100ns):       0 ***
103  -   104 (100ns):       5
104  -   105 (100ns):       1
105  -   106 (100ns):       2
106  -   107 (100ns):       1
107  -   108 (100ns):       5
108  -   109 (100ns):       3
109  -   110 (100ns):       1
110  -   111 (100ns):       4
111  -   112 (100ns):       2
112  -   113 (100ns):       1
113  -   114 (100ns):       2
114  -   115 (100ns):       1
115  -   116 (100ns):       1
116  -   117 (100ns):       1
117  -   118 (100ns):       0 ***
118  -   119 (100ns):       1
119  -   120 (100ns):       1
120  -   121 (100ns):       2
121  -   123 (100ns):       0 ***
123  -   124 (100ns):       1
124  -   125 (100ns):       1
125  -   128 (100ns):       0 ***
128  -   129 (100ns):       1
129  -   130 (100ns):       0 ***
130  -   131 (100ns):       1
131  - 10000 (100ns):       0 ***                
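A summary of this shape can be derived from the recorded latencies with a few lines of C. The sketch below is inferred from the listing above, not taken from AMX_IPClatency.c: it assumes latencies are already in 100 ns units, uses one-unit bins, and merges each run of empty bins into a single line marked with ***. The names LatencyStats, summarize, and print_histogram are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { MAX_BINS = 1024 };   /* assumed upper bound, in 100 ns units */

typedef struct {
    uint32_t min, max, avg;
    size_t   max_index;     /* loop iteration at which the maximum occurred */
} LatencyStats;

/* Compute min, max, truncated average, and the position of the maximum. */
static LatencyStats summarize(const uint32_t *lat, size_t n)
{
    assert(lat != NULL && n > 0);
    LatencyStats s = { lat[0], lat[0], 0, 0 };
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (lat[i] < s.min) s.min = lat[i];
        if (lat[i] > s.max) { s.max = lat[i]; s.max_index = i; }
        sum += lat[i];
    }
    s.avg = (uint32_t)(sum / n);
    return s;
}

/* Print one line per occupied bin and one "***" line per run of empty bins.
   Values at or above MAX_BINS are counted in the last bin.
   Returns the number of lines printed. */
static int print_histogram(const uint32_t *lat, size_t n)
{
    static uint32_t counts[MAX_BINS];
    for (size_t b = 0; b < MAX_BINS; ++b) counts[b] = 0;
    for (size_t i = 0; i < n; ++i)
        counts[lat[i] < MAX_BINS ? lat[i] : MAX_BINS - 1]++;

    int lines = 0;
    for (size_t b = 0; b < MAX_BINS; ) {
        size_t end = b + 1;
        if (counts[b] == 0)                       /* extend over the empty run */
            while (end < MAX_BINS && counts[end] == 0) ++end;
        printf("%zu - %zu (100ns): %u%s\n", b, end,
               (unsigned)counts[b], counts[b] ? "" : " ***");
        ++lines;
        b = end;
    }
    return lines;
}
```

Merging empty ranges keeps the listing short when most samples fall in one bin and a few outliers are spread across a wide range, as in the run shown above.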

APIs Referenced