.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL config option and is
indicated by the following x86 /proc/cpuinfo flag bits:

=============================================== ================================
RDT (Resource Director Technology) Allocation   "rdt_a"
CAT (Cache Allocation Technology)               "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)              "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                      "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)               "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)               "mba"
SMBA (Slow Memory Bandwidth Allocation)         ""
BMEC (Bandwidth Monitoring Event Configuration) ""
=============================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
flag to /proc/cpuinfo should be avoided if user space can obtain information
about the feature from resctrl's info directory.
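
For example, a quick way to check which of these flags a CPU advertises is to
grep /proc/cpuinfo directly (a minimal sketch; extend the pattern with any of
the flag names from the table above, output shown is illustrative)::

        # grep -o 'rdt_a\|cat_l3\|cdp_l3\|cqm_llc\|mba' /proc/cpuinfo | sort -u
        cat_l3
        cqm_llc
        mba
        rdt_a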

To use the feature mount the file system::

        # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp":
        Enable code/data prioritization in L3 cache allocations.
"cdpl2":
        Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
        Enable the MBA Software Controller (mba_sc) to specify MBA
        bandwidth in MBps.

L2 and L3 CDP are controlled separately.
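
For example, to mount with both L3 CDP and the MBA software controller active
at the same time (a sketch, assuming the CPU supports both features)::

        # umount /sys/fs/resctrl
        # mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl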

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":
        The number of CLOSIDs which are valid for this
        resource. The kernel uses the smallest number of
        CLOSIDs of all enabled resources as the limit.
"cbm_mask":
        The bitmask which is valid for this resource.
        This mask is equivalent to 100%.
"min_cbm_bits":
        The minimum number of consecutive bits which
        must be set when writing a mask.

88"shareable_bits":
89 Bitmask of shareable resource with other executing
90 entities (e.g. I/O). User can use this when
91 setting up exclusive cache partitions. Note that
92 some platforms support devices that have their
93 own settings for cache use which can over-ride
94 these bits.
95"bit_usage":
96 Annotated capacity bitmasks showing how all
97 instances of the resource are used. The legend is:
98
99 "0":
100 Corresponding region is unused. When the system's
cba1aab8
RC
101 resources have been allocated and a "0" is found
102 in "bit_usage" it is a sign that resources are
103 wasted.
1cd7af50
CD
104
105 "H":
106 Corresponding region is used by hardware only
cba1aab8
RC
107 but available for software use. If a resource
108 has bits set in "shareable_bits" but not all
109 of these bits appear in the resource groups'
110 schematas then the bits appearing in
111 "shareable_bits" but no resource group will
112 be marked as "H".
1cd7af50
CD
113 "X":
114 Corresponding region is available for sharing and
cba1aab8
RC
115 used by hardware and software. These are the
116 bits that appear in "shareable_bits" as
117 well as a resource group's allocation.
1cd7af50
CD
118 "S":
119 Corresponding region is used by software
cba1aab8 120 and available for sharing.
1cd7af50
CD
121 "E":
122 Corresponding region is used exclusively by
cba1aab8 123 one resource group. No sharing allowed.
1cd7af50
CD
124 "P":
125 Corresponding region is pseudo-locked. No
e17e7330 126 sharing allowed.
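
As an illustration, the allocation parameters of an L3 cache resource can be
read back from the info directory like this (a sketch; the values shown are
for a hypothetical part with 16 CLOSIDs and a 20-bit capacity bitmask)::

        # cd /sys/fs/resctrl
        # grep . info/L3/num_closids info/L3/cbm_mask info/L3/min_cbm_bits
        info/L3/num_closids:16
        info/L3/cbm_mask:fffff
        info/L3/min_cbm_bits:1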

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
        The minimum memory bandwidth percentage which
        user can request.

"bandwidth_gran":
        The granularity in which the memory bandwidth
        percentage is allocated. The allocated
        b/w percentage is rounded off to the next
        control step available on the hardware. The
        available bandwidth control steps are:
        min_bandwidth + N * bandwidth_gran.

"delay_linear":
        Indicates if the delay scale is linear or
        non-linear. This field is purely informational.

"thread_throttle_mode":
        Indicator on Intel systems of how tasks running on threads
        of a physical core are throttled in cases where they
        request different memory bandwidth percentages:

        "max":
                the smallest percentage is applied
                to all threads
        "per-thread":
                bandwidth percentages are directly applied to
                the threads running on the core
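
The throttling mode in effect can simply be read back, e.g. (a sketch;
requires an Intel system with MBA support)::

        # cat /sys/fs/resctrl/info/MB/thread_throttle_mode
        max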

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
        The number of RMIDs available. This is the
        upper bound for how many "CTRL_MON" + "MON"
        groups can be created.

"mon_features":
        Lists the monitoring events if
        monitoring is enabled for the resource.
        Example::

                # cat /sys/fs/resctrl/info/L3_MON/mon_features
                llc_occupancy
                mbm_total_bytes
                mbm_local_bytes

        If the system supports Bandwidth Monitoring Event
        Configuration (BMEC), then the bandwidth events will
        be configurable. The output will be::

                # cat /sys/fs/resctrl/info/L3_MON/mon_features
                llc_occupancy
                mbm_total_bytes
                mbm_total_bytes_config
                mbm_local_bytes
                mbm_local_bytes_config

"mbm_total_bytes_config", "mbm_local_bytes_config":
        Read/write files containing the configuration for the mbm_total_bytes
        and mbm_local_bytes events, respectively, when the Bandwidth
        Monitoring Event Configuration (BMEC) feature is supported.
        The event configuration settings are domain specific and affect
        all the CPUs in the domain. When either event configuration is
        changed, the bandwidth counters for all RMIDs of both events
        (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
        domain. The next read for every RMID will report "Unavailable"
        and subsequent reads will report the valid value.

        Following are the types of events supported:

        ==== ========================================================
        Bits Description
        ==== ========================================================
        6    Dirty Victims from the QOS domain to all types of memory
        5    Reads to slow memory in the non-local NUMA domain
        4    Reads to slow memory in the local NUMA domain
        3    Non-temporal writes to non-local NUMA domain
        2    Non-temporal writes to local NUMA domain
        1    Reads to memory in the non-local NUMA domain
        0    Reads to memory in the local NUMA domain
        ==== ========================================================

        By default, the mbm_total_bytes configuration is set to 0x7f to count
        all the event types and the mbm_local_bytes configuration is set to
        0x15 to count all the local memory events.

        Examples:

        * To view the current configuration::

                # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
                0=0x7f;1=0x7f;2=0x7f;3=0x7f

                # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
                0=0x15;1=0x15;3=0x15;4=0x15

        * To change the mbm_total_bytes to count only reads on domain 0,
          the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
          (in hexadecimal 0x33)::

                # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

                # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
                0=0x33;1=0x7f;2=0x7f;3=0x7f

        * To change the mbm_local_bytes to count all the slow memory reads on
          domain 0 and 1, the bits 4 and 5 need to be set, which is 110000b
          in binary (in hexadecimal 0x30)::

                # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

                # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
                0=0x30;1=0x30;3=0x15;4=0x15

"max_threshold_occupancy":
        Read/write file provides the largest value (in
        bytes) at which a previously used LLC_occupancy
        counter can be considered for re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

        # echo L3:0=f7 > schemata
        bash: echo: write error: Invalid argument
        # cat info/last_cmd_status
        mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.
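
For example, the same CPU assignment can be written through either file (a
sketch, assuming a CTRL_MON group "p0" already exists)::

        # echo f0 > p0/cpus       # bitmask selecting CPUs 4-7
        # echo 4-7 > p0/cpus_list # the same CPUs expressed as a range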


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

"size":
        Mirrors the display of the "schemata" file to display the size in
        bytes of each allocation instead of the bits representing the
        allocation.

"mode":
        The "mode" of the resource group dictates the sharing of its
        allocations. A "shareable" resource group allows sharing of its
        allocations while an "exclusive" resource group does not. A
        cache pseudo-locked region is created by first writing
        "pseudo-locksetup" to the "mode" file before writing the cache
        pseudo-locked region's schemata to the resource group's "schemata"
        file. On successful pseudo-locked region creation the mode will
        automatically change to "pseudo-locked".

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see example section for more details on usage.

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based on
the kind of group. The number of CLOSIDs and RMIDs are limited by the
hardware and hence the creation of a "CTRL_MON" directory may fail if we
run out of either CLOSID or RMID and creation of a "MON" group may fail if
we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of its previous user. Hence
such RMIDs are placed on a limbo list and checked back if the cache
occupancy has gone down. If there is a time when the system has a lot of
limbo RMIDs which are not yet ready to be used, the user may see an -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
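
The threshold lives in the "L3_MON" info directory and can be inspected and
tuned, e.g. (a sketch; the value shown is only an example, the default is
derived from the cache size and the number of RMIDs)::

        # cat /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy
        540672
        # echo 1048576 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy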

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
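
For example, the IDs of all cache levels seen by CPU 0 can be listed like
this (a sketch; which index corresponds to which cache level varies
between parts)::

        # grep . /sys/devices/system/cpu/cpu0/cache/index*/id
        /sys/devices/system/cpu/cpu0/cache/index0/id:0
        /sys/devices/system/cpu/cpu0/cache/index1/id:0
        /sys/devices/system/cpu/cpu0/cache/index2/id:0
        /sys/devices/system/cpu/cpu0/cache/index3/id:0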

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
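
A quick way to sanity-check a candidate mask before writing it is to verify
that its set bits form one contiguous run, e.g. with bash arithmetic (a
sketch; 0x3e0 is the second quarter of a 20-bit mask)::

        # m=$((0x3e0)); lsb=$((m & -m))
        # [ $(( (m + lsb) & m )) -eq 0 ] && echo contiguous || echo non-contiguous
        contiguous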

Memory bandwidth Allocation and monitoring
==========================================

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
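
As a worked example, with min_bandwidth of 10 and bandwidth_gran of 10 the
valid control steps are 10, 20, ..., 100, so a request of 35 is rounded up
to the 40 step (a sketch, assuming a group "p0" on a single-domain system;
display formatting may differ)::

        # echo "MB:0=35" > p0/schemata
        # cat p0/schemata
        MB:0=40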

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
thread, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

        "actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or::

        L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.
::

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.
::

        MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
CXL.memory is the only supported "slow" memory device. With the
support of SMBA, the hardware enables bandwidth allocation on
the slow memory devices. If there are multiple such devices in
the system, the throttling logic groups all the slow sources
together and applies the limit on them as a whole.

The presence of SMBA (with CXL.memory) is independent of the presence of
slow memory devices. If there are no such devices on the system, then
configuring SMBA will have no impact on the performance of the system.

The bandwidth domain for slow memory is L3 cache. Its schemata file
is formatted as::

        SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.
::

        # cat schemata
        L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
        L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
        # echo "L3DATA:2=3c0;" > schemata
        # cat schemata
        L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
        L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify for which cache id you
wish to configure the bandwidth limit.

For example, to allocate a 2GB/s limit on cache id 1::

        # cat schemata
        MB:0=2048;1=2048;2=2048;3=2048
        L3:0=ffff;1=ffff;2=ffff;3=ffff

        # echo "MB:1=16" > schemata
        # cat schemata
        MB:0=2048;1=  16;2=2048;3=2048
        L3:0=ffff;1=ffff;2=ffff;3=ffff

Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without SMBA in
the above section.

For example, to allocate an 8GB/s limit on cache id 1::

        # cat schemata
        SMBA:0=2048;1=2048;2=2048;3=2048
        MB:0=2048;1=2048;2=2048;3=2048
        L3:0=ffff;1=ffff;2=ffff;3=ffff

        # echo "SMBA:1=64" > schemata
        # cat schemata
        SMBA:0=2048;1=  64;2=2048;3=2048
        MB:0=2048;1=2048;2=2048;3=2048
        L3:0=ffff;1=ffff;2=ffff;3=ffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
"locked" data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
        writing "1" to the pseudo_lock_measure file will trigger the latency
        measurement captured in the pseudo_lock_mem_latency tracepoint. See
        example below.
2:
        writing "2" to the pseudo_lock_measure file will trigger the L2 cache
        residency (cache hits and misses) measurement captured in the
        pseudo_lock_l2 tracepoint. See example below.
3:
        writing "3" to the pseudo_lock_measure file will trigger the L3 cache
        residency (cache hits and misses) measurement captured in the
        pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

        # :> /sys/kernel/debug/tracing/trace
        # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
        # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
        # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
        # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
        # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist

        # event histogram
        #
        # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
        #

        { latency:        456 } hitcount:          1
        { latency:         50 } hitcount:         83
        { latency:         36 } hitcount:         96
        { latency:         44 } hitcount:        174
        { latency:         48 } hitcount:        195
        { latency:         46 } hitcount:        262
        { latency:         42 } hitcount:        693
        { latency:         40 } hitcount:       3204
        { latency:         38 } hitcount:       3484

        Totals:
            Hits: 8192
            Entries: 9
            Dropped: 0

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

        # :> /sys/kernel/debug/tracing/trace
        # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
        # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
        # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
        # cat /sys/kernel/debug/tracing/trace

        # tracer: nop
        #
        #                              _-----=> irqs-off
        #                             / _----=> need-resched
        #                            | / _---=> hardirq/softirq
        #                            || / _--=> preempt-depth
        #                            ||| /     delay
        #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
        #              | |       |   ||||       |         |
         pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl
        # mkdir p0 p1
        # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
        # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can
enter the max b/w in MBps rather than the percentage values.
::

        # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
        # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MBps whereas on socket 1 they would use 500MBps.

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

        # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

        # mkdir p0
        # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

        # echo 1234 > p0/tasks
        # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

        # mkdir p1
        # echo "L3:0=7c00;1=fffff" > p1/schemata
        # echo 5678 > p1/tasks
        # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assuming min_bandwidth is 10 and bandwidth_gran
is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

        # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

        # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
::

        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

        # echo "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

        # mkdir p0
        # echo "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

        # echo F0 > p0/cpus

4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

        # mount -t resctrl resctrl /sys/fs/resctrl/
        # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

        # cat schemata
        L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

        # mkdir p0
        # echo 'L2:0=0x3;1=0x3' > p0/schemata
        # cat p0/mode
        shareable
        # echo exclusive > p0/mode
        -sh: echo: write error: Invalid argument
        # cat info/last_cmd_status
        schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

        # echo 'L2:0=0xfc;1=0xfc' > schemata
        # echo exclusive > p0/mode
        # grep . p0/*
        p0/cpus:0
        p0/mode:exclusive
        p0/schemata:L2:0=03;1=03
        p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group::

        # mkdir p1
        # grep . p1/*
        p1/cpus:0
        p1/mode:shareable
        p1/schemata:L2:0=fc;1=fc
        p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used::

        # cat info/L2/bit_usage
        0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group::

        # echo 'L2:0=0x1;1=0x1' > p1/schemata
        -sh: echo: write error: Invalid argument
        # cat info/last_cmd_status
        overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock and can be
passed to an application as an argument to mmap().
::

        # mount -t resctrl resctrl /sys/fs/resctrl/
        # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::

        # cat info/L2/bit_usage
        0=SSSSSSSS;1=SSSSSSSS
        # echo 'L2:1=0xfc' > schemata
        # cat info/L2/bit_usage
        0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

        # mkdir newlock
        # echo pseudo-locksetup > newlock/mode
        # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

        # cat newlock/mode
        pseudo-locked
        # cat info/L2/bit_usage
        0=SSSSSSSS;1=SSSSSSPP
        # ls -l /dev/pseudo_lock/newlock
        crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

::

  /*
   * Example code to access one page of pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * It is required that the application runs with affinity to only
   * cores associated with the pseudo-locked region. Here the cpu
   * is hardcoded for convenience of example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
          cpu_set_t cpuset;
          long page_size;
          void *mapping;
          int dev_fd;
          int ret;

          page_size = sysconf(_SC_PAGESIZE);

          CPU_ZERO(&cpuset);
          CPU_SET(cpuid, &cpuset);
          ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
          if (ret < 0) {
                  perror("sched_setaffinity");
                  exit(EXIT_FAILURE);
          }

          dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
          if (dev_fd < 0) {
                  perror("open");
                  exit(EXIT_FAILURE);
          }

          mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                         dev_fd, 0);
          if (mapping == MAP_FAILED) {
                  perror("mmap");
                  close(dev_fd);
                  exit(EXIT_FAILURE);
          }

          /* Application interacts with pseudo-locked memory @mapping */

          ret = munmap(mapping, page_size);
          if (ret < 0) {
                  perror("munmap");
                  close(dev_fd);
                  exit(EXIT_FAILURE);
          }

          close(dev_fd);
          exit(EXIT_SUCCESS);
  }

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) funlock

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If success read the directory structure.
 C) funlock

Example with bash::

        # Atomically read directory structure
        $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

        # Read directory contents and create new subdirectory

        $ cat create-dir.sh
        find /sys/fs/resctrl/ > output.txt
        mask=$(function-of output.txt)  # placeholder: derive a free mask from the listing
        mkdir /sys/fs/resctrl/newres/
        echo "$mask" > /sys/fs/resctrl/newres/schemata

        $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>

  void resctrl_take_shared_lock(int fd)
  {
          int ret;

          /* take shared lock on resctrl filesystem */
          ret = flock(fd, LOCK_SH);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
          int ret;

          /* take exclusive lock on resctrl filesystem */
          ret = flock(fd, LOCK_EX);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_release_lock(int fd)
  {
          int ret;

          /* release lock on resctrl filesystem */
          ret = flock(fd, LOCK_UN);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  int main(void)
  {
          int fd;

          fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
          if (fd == -1) {
                  perror("open");
                  exit(-1);
          }
          resctrl_take_shared_lock(fd);
          /* code to read directory contents */
          resctrl_release_lock(fd);

          resctrl_take_exclusive_lock(fd);
          /* code to read and write directory contents */
          resctrl_release_lock(fd);

          return 0;
  }

Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
----------------------
Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl
        # mkdir p0 p1
        # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
        # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
        # echo 5678 > p1/tasks
        # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

        # cd /sys/fs/resctrl/p1/mon_groups
        # mkdir m11 m12
        # echo 5678 > m11/tasks
        # echo 5679 > m12/tasks

fetch data (data shown in bytes)
::

        # cat m11/mon_data/mon_L3_00/llc_occupancy
        16234000
        # cat m11/mon_data/mon_L3_01/llc_occupancy
        14789000
        # cat m12/mon_data/mon_L3_00/llc_occupancy
        16789000

The parent ctrl_mon group shows the aggregated data.
::

        # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
        31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl
        # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

        # echo $$ > /sys/fs/resctrl/p1/tasks
        # <cmd>

Fetch the data::

        # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
        31789000
1300Example 3 (Monitor without CAT support or before creating CAT groups)
1cd7af50 1301---------------------------------------------------------------------
1640ae94
VS
1302
1303Assume a system like HSW has only CQM and no CAT support. In this case
1304the resctrl will still mount but cannot create CTRL_MON directories.
1305But user can create different MON groups within the root group thereby
1306able to monitor all tasks including kernel threads.
1307
1308This can also be used to profile jobs cache size footprint before being
1309able to allocate them to different allocation groups.
1cd7af50 1310::
1640ae94 1311
1cd7af50
CD
1312 # mount -t resctrl resctrl /sys/fs/resctrl
1313 # cd /sys/fs/resctrl
1314 # mkdir mon_groups/m01
1315 # mkdir mon_groups/m02
1640ae94 1316
1cd7af50
CD
1317 # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
1318 # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
1640ae94
VS
1319
1320Monitor the groups separately and also get per domain data. From the
1321below its apparent that the tasks are mostly doing work on
1322domain(socket) 0.
1cd7af50 1323::
1640ae94 1324
1cd7af50
CD
1325 # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
1326 31234000
1327 # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
1328 34555
1329 # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
1330 31234000
1331 # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
1332 32789
1640ae94
VS
1333
1334
Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
::

        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl
        # mkdir p1

Move the cpus 4-7 over to p1::

        # echo f0 > p1/cpus

View the llc occupancy snapshot::

        # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
        11234000

Intel RDT Errata
================

Intel MBM Counters May Report System Memory Bandwidth Incorrectly
-----------------------------------------------------------------

Errata SKX99 for Skylake server and BDF102 for Broadwell server.

Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
according to the assigned Resource Monitor ID (RMID) for that logical
core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
metrics, may report incorrect system bandwidth for certain RMID values.

Implication: Due to the errata, system memory bandwidth may not match
what is reported.

Workaround: MBM total and local readings are corrected according to the
following correction factor table:

+---------------+---------------+---------------+-----------------+
|core count     |rmid count     |rmid threshold |correction factor|
+---------------+---------------+---------------+-----------------+
|1              |8              |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|2              |16             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|3              |24             |15             |0.969650         |
+---------------+---------------+---------------+-----------------+
|4              |32             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|6              |48             |31             |0.969650         |
+---------------+---------------+---------------+-----------------+
|7              |56             |47             |1.142857         |
+---------------+---------------+---------------+-----------------+
|8              |64             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|9              |72             |63             |1.185115         |
+---------------+---------------+---------------+-----------------+
|10             |80             |63             |1.066553         |
+---------------+---------------+---------------+-----------------+
|11             |88             |79             |1.454545         |
+---------------+---------------+---------------+-----------------+
|12             |96             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|13             |104            |95             |1.230769         |
+---------------+---------------+---------------+-----------------+
|14             |112            |95             |1.142857         |
+---------------+---------------+---------------+-----------------+
|15             |120            |95             |1.066667         |
+---------------+---------------+---------------+-----------------+
|16             |128            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|17             |136            |127            |1.254863         |
+---------------+---------------+---------------+-----------------+
|18             |144            |127            |1.185255         |
+---------------+---------------+---------------+-----------------+
|19             |152            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|20             |160            |127            |1.066667         |
+---------------+---------------+---------------+-----------------+
|21             |168            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|22             |176            |159            |1.454334         |
+---------------+---------------+---------------+-----------------+
|23             |184            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|24             |192            |127            |0.969744         |
+---------------+---------------+---------------+-----------------+
|25             |200            |191            |1.280246         |
+---------------+---------------+---------------+-----------------+
|26             |208            |191            |1.230921         |
+---------------+---------------+---------------+-----------------+
|27             |216            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|28             |224            |191            |1.143118         |
+---------------+---------------+---------------+-----------------+

If rmid > rmid threshold, MBM total and local values should be multiplied
by the correction factor.
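
As an illustration of the arithmetic: on a 7-core part (rmid count 56, rmid
threshold 47), a raw MBM reading of 1000000 bytes for an RMID greater than
47 is reported as 1000000 * 1.142857, i.e. roughly 1142857 bytes, while
readings for RMIDs up to the threshold are left unscaled.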

See:

1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html

for further information.