.. _numaperf:

=======================
NUMA Memory Performance
=======================

NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share a node with a CPU, while other
memory is provided as memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes relative
to other nodes. The following diagram shows one such example of two compute
nodes with local memory and a memory-only node for each compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To aid applications matching memory targets with their initiators, the
kernel provides symlinks from each to the other. The following example
lists the relationship for the access class "0" memory initiators and
targets::

        # symlinks -v /sys/devices/system/node/nodeX/access0/targets/
        relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

        # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
        relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
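
As an illustration only, the links in an access class directory could be
enumerated from user space with a short Python sketch. The helper name
"linked_nodes" and its "root" parameter are hypothetical and are not part
of any kernel interface; only the directory layout is taken from the text
above::

        import os

        # Base path as used in the examples of this document.
        NODE_ROOT = "/sys/devices/system/node"


        def linked_nodes(node, kind, access_class=0, root=NODE_ROOT):
            """Return the sorted node numbers linked under
            nodeN/accessC/<kind>/, where kind is "targets" or "initiators".

            A missing directory yields an empty list.
            """
            if kind not in ("targets", "initiators"):
                raise ValueError("kind must be 'targets' or 'initiators'")
            path = os.path.join(root, f"node{node}",
                                f"access{access_class}", kind)
            try:
                names = os.listdir(path)
            except OSError:
                return []
            numbers = []
            for name in names:
                # The initiators directory may also hold attribute files,
                # so keep only entries named node<number>.
                if name.startswith("node") and name[4:].isdigit():
                    numbers.append(int(name[4:]))
            return sorted(numbers)

Under these assumptions, linked_nodes(0, "targets") would return the numbers
of the nodes that node0 links to as access class 0 targets, and an empty list
on a system that does not export the directory.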

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. The targets within an initiator's access class,
though, do not necessarily perform the same as each other.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
IO initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.

NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

        /sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when accessed from nodes that are linked
under this access class's initiators.

The performance characteristics the kernel provides for the local initiators
are exported as follows::

        # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
        /sys/devices/system/node/nodeY/access0/initiators/
        |-- read_bandwidth
        |-- read_latency
        |-- write_bandwidth
        `-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

Access class 1 takes the same form but only includes values for CPU to
memory activity.
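
The four attribute files can be read like any other sysfs file. The
following Python sketch is one possible way to collect them. The function
name "read_perf" and the "root" parameter are hypothetical, and the sketch
assumes each file holds a single decimal integer::

        import os

        # Base path as used in the examples of this document.
        NODE_ROOT = "/sys/devices/system/node"
        ATTRS = ("read_bandwidth", "read_latency",
                 "write_bandwidth", "write_latency")


        def read_perf(node, access_class=0, root=NODE_ROOT):
            """Return {attribute: value} for a memory target node.

            Bandwidth values are in MiB/second and latency values are in
            nanoseconds. Attributes that are missing or unparsable are
            left out of the result.
            """
            base = os.path.join(root, f"node{node}",
                                f"access{access_class}", "initiators")
            values = {}
            for attr in ATTRS:
                try:
                    with open(os.path.join(base, attr)) as f:
                        # Assumption: one decimal integer per file.
                        values[attr] = int(f.read().strip())
                except (OSError, ValueError):
                    continue
            return values

Passing access_class=1 reads the same attributes from the access class 1
directory. An empty result means the platform did not provide the
attributes for that node and class.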

NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory. The
system physical addresses that memory initiators are aware of are provided
by the last memory level in the hierarchy. The system meanwhile uses
higher performing memory to transparently cache access to progressively
slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering differs from CPU caches, where the cache level (e.g.
L1, L2, L3) uses the CPU-side view in which each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.
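
The lookup order described above can be pictured with a toy Python model.
This is purely illustrative: the function name "lookup" and the dictionaries
standing in for memory are hypothetical, and the model ignores cache fills,
evictions and write handling, all of which real hardware performs
transparently::

        def lookup(address, cache_levels, far_memory):
            """Return (where, value) for a read of a system address.

            cache_levels is a list of (name, contents) pairs ordered from
            near memory towards far memory. contents and far_memory map a
            system address to the value stored there.
            """
            for name, contents in cache_levels:
                if address in contents:
                    # Hit in this cache level: stop searching.
                    return name, contents[address]
            # Missed every cache level: far memory backs all addresses.
            return "far memory", far_memory[address]

With a near memory level that holds only address 0x1000, a read of 0x1000 is
served from near memory, while any other address falls through to the next
level and, if it misses there too, to far memory.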

An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

        /sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

        /sys/devices/system/node/nodeX/memory_side_cache/indexA/
        /sys/devices/system/node/nodeX/memory_side_cache/indexB/
        /sys/devices/system/node/nodeX/memory_side_cache/indexC/
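
A program that wants to know which cache levels a node reports could list
that directory. The Python sketch below is one way to do so. The name
"memory_side_cache_levels" and the "root" parameter are hypothetical, and
the sketch assumes that the index suffix is a decimal number::

        import os
        import re

        # Base path as used in the examples of this document.
        NODE_ROOT = "/sys/devices/system/node"


        def memory_side_cache_levels(node, root=NODE_ROOT):
            """Return the sorted cache level numbers reported for a node.

            The list is empty when memory_side_cache/ is absent or holds
            no index<number> entries.
            """
            path = os.path.join(root, f"node{node}", "memory_side_cache")
            try:
                names = os.listdir(path)
            except OSError:
                return []
            levels = []
            for name in names:
                # Assumption: level directories are named index<number>.
                match = re.fullmatch(r"index(\d+)", name)
                if match:
                    levels.append(int(match.group(1)))
            return sorted(levels)

An empty list covers both cases named above: the system has no memory-side
cache, or the kernel could not discover it.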

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

        # tree /sys/devices/system/node/node0/memory_side_cache/
        /sys/devices/system/node/node0/memory_side_cache/
        |-- index1
        |   |-- indexing
        |   |-- line_size
        |   |-- size
        |   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
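
The attribute encodings above can be decoded with a small Python sketch.
The function name "read_cache_level", the "root" parameter and the keys of
the returned dictionary are hypothetical, and the sketch assumes each file
holds one decimal integer::

        import os

        # Base path as used in the examples of this document.
        NODE_ROOT = "/sys/devices/system/node"


        def read_cache_level(node, level, root=NODE_ROOT):
            """Read and decode one memory-side cache level of a node.

            Raises OSError if an attribute file is missing and ValueError
            if a value is not a decimal integer.
            """
            base = os.path.join(root, f"node{node}", "memory_side_cache",
                                f"index{level}")
            raw = {}
            for attr in ("indexing", "line_size", "size", "write_policy"):
                with open(os.path.join(base, attr)) as f:
                    raw[attr] = int(f.read().strip())
            return {
                # 0 means direct-mapped, non-zero any other indexing.
                "direct_mapped": raw["indexing"] == 0,
                "line_size_bytes": raw["line_size"],
                "size_bytes": raw["size"],
                # 0 means write-back, non-zero write-through.
                "write_back": raw["write_policy"] == 0,
            }

Callers that only want to probe for a cache can catch OSError and treat it
as "no information available" for that node and level.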

See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27