.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache coherence mechanism that keeps
a cache line consistent across multiple CPUs' caches; an academic
definition can be found in [1]_. Consider a struct with a refcount
and a string::

  struct foo {
          refcount_t refcount;
          ...
          char name[16];
  } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) *share* one cache line, as below::

  +-----------+                       +-----------+
  |   CPU 0   |                       |   CPU 1   |
  +-----------+                       +-----------+
        |                                   |
        V                                   V
  +----------------------+          +----------------------+
  | A       B            | Cache 0  | A       B            | Cache 1
  +----------------------+          +----------------------+
             |                                 |
  -----------+---------------------------------+-------------------
             |                                 |
             +---------------+-----------------+
                             |
                 +----------------------+
     Main Memory | A       B            |
                 +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified. When many CPUs access 'foo' at
the same time, with 'refcount' being frequently bumped by only one
CPU and 'name' being read by the other CPUs, all those reading CPUs
have to reload the whole cache line over and over due to the
'sharing', even though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs.
* The concurrent accesses to the data include at least one write
  operation: write/write or write/read cases.

The sharing could be between totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
Back when a platform had only one or a few CPUs, hot data members
could be purposely put in the same cache line to make them cache-hot
and save cache lines and TLB entries, like a lock and the data
protected by it. But on recent large systems with hundreds of CPUs,
this may not work well when the lock is heavily contended, as the
lock owner CPU could write to the data while other CPUs are busy
spinning on the lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* A lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* Global data are put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* Data members of a big data structure randomly sit together without
  being noticed (a cache line is usually 64 bytes or more), as in the
  'mem_cgroup' struct.

The following 'Possible Mitigations' section provides real-world
examples.

False sharing can easily happen unless it is intentionally checked,
and it is valuable to run specific tools on performance-critical
workloads to detect false sharing that affects performance, and to
optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning. Once
hotspots are detected, tools like 'perf-c2c' and 'pahole' can be used
further to detect and pinpoint the data structures possibly involved
in false sharing. 'addr2line' is also good at decoding instruction
pointers when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing
hits, the decoded functions (with file and line number) accessing
those cache lines, and the in-line offset of the data. Simple
commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                 :    1658231
  Locked Load/Store Operations  :      89439
  Load Operations               :     623219
  Load Local HITM               :      92117
  Load Remote HITM              :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff81373b7b     0   231   129  5312    64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff81374718     0   226    97  3551    64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff812c29bf     0   170   136   555    64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff812c3ec5     0   175   108   632    64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%   0.00%   0.00%   0.00%   0x10     1     1  0xffffffff81372d0a     0   234   279  1051    64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts at cache line granularity.
Users can match the offsets in the perf-c2c output against pahole's
decoding to locate the exact data members. For global data, users can
search for the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption. Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
cold data path.

Cases of false sharing hurting performance are seen more frequently
as core counts increase. Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged. Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines, and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure, moving the interfering members to
  different cache lines. One downside is that it may introduce new
  false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For some global variable, use compare(read)-then-write instead of
  an unconditional write. For example, use::

      if (!test_bit(XXX))
              set_bit(XXX);

  instead of directly "set_bit(XXX);", and similarly for atomic_t data::

      if (atomic_read(XXX) == AAA)
              atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when
  possible, or reasonably increase the threshold for syncing per-cpu
  data to global data, to reduce or postpone the 'write' to that
  global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Of course, all mitigations should be carefully verified not to cause
side effects. To avoid introducing false sharing when coding, it's
better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields onto
  different cache lines

and it is better to add a comment stating the false sharing
consideration.

Note that sometimes, even after severe false sharing is detected and
solved, performance may still show no obvious improvement, as the
hotspot simply shifts to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes how data members share
cache lines.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/