.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache coherence mechanism that keeps
a cache line consistent across multiple CPUs' caches; an academic
definition can be found in [1]_. Consider a struct with a refcount
and a string::

  struct foo {
          refcount_t refcount;
          ...
          char name[16];
  } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) *share* one cache line, as below::

  +-----------+                       +-----------+
  |   CPU 0   |                       |   CPU 1   |
  +-----------+                       +-----------+
        |                                   |
        V                                   V
  +----------------------+          +----------------------+
  | A       B            | Cache 0  | A       B            | Cache 1
  +----------------------+          +----------------------+
             |                                 |
  -----------+---------------------------------+-------------------
             |                                 |
             +---------------+-----------------+
                             |
                 +----------------------+
     Main Memory | A       B            |
                 +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified. When many CPUs access 'foo' at
the same time, with 'refcount' being frequently bumped by only one
CPU and 'name' being read by the other CPUs, all those reading CPUs
have to reload the whole cache line over and over due to the
'sharing', even though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs.
* The concurrent accesses to the data include at least one write
  operation: write/write or write/read cases.

The sharing could be between totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
Back when a platform had only one or a few CPUs, hot data members
could be purposely put in the same cache line to make them cache-hot
and save cache lines and TLB entries, like a lock and the data
protected by it. But on recent large systems with hundreds of CPUs,
this may not work well when the lock is heavily contended, as the
lock owner CPU could write to the data while other CPUs are busy
spinning on the lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* A lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* Global data are put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* Data members of a big data structure randomly sit together without
  being noticed (a cache line is usually 64 bytes or more), as in the
  'mem_cgroup' struct.

The following 'Possible Mitigations' section provides real-world
examples.

False sharing can easily happen unless it is intentionally checked,
and it is valuable to run specific tools on performance-critical
workloads to detect false sharing that affects performance, and to
optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning. Once
hotspots are detected, tools like 'perf-c2c' and 'pahole' can be used
further to detect and pinpoint the data structures possibly involved
in false sharing. 'addr2line' is also good at decoding instruction
pointers when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing
hits, the decoded functions (with file and line number) accessing
those cache lines, and the in-line offset of the data. Simple
commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                 :    1658231
  Locked Load/Store Operations  :      89439
  Load Operations               :     623219
  Load Local HITM               :      92117
  Load Remote HITM              :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff81373b7b     0   231   129  5312    64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff81374718     0   226    97  3551    64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff812c29bf     0   170   136   555    64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%   0.00%   0.00%   0.00%    0x8     1     1  0xffffffff812c3ec5     0   175   108   632    64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%   0.00%   0.00%   0.00%   0x10     1     1  0xffffffff81372d0a     0   234   279  1051    64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts at cache line granularity.
Users can match the offsets in the perf-c2c output against pahole's
decoding to locate the exact data members. For global data, users can
search for the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption. Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
cold data path.

Cases of false sharing hurting performance are seen more frequently
as core counts increase. Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged. Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines, and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure, moving the interfering members to
  different cache lines. One downside is that it may introduce new
  false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For some global variable, use compare(read)-then-write instead of
  an unconditional write. For example, use::

      if (!test_bit(XXX))
              set_bit(XXX);

  instead of directly "set_bit(XXX);", and similarly for atomic_t data::

      if (atomic_read(XXX) == AAA)
              atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when
  possible, or reasonably increase the threshold for syncing per-cpu
  data to global data, to reduce or postpone the 'write' to that
  global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Of course, all mitigations should be carefully verified not to cause
side effects. To avoid introducing false sharing when coding, it's
better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields onto
  different cache lines

and it is better to add a comment stating the false sharing
consideration.

Note that sometimes, even after severe false sharing is detected and
solved, performance may still show no obvious improvement, as the
hotspot simply shifts to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes how data members share
cache lines.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/