January 24, 2017
This article was contributed by Jade Alglave,
Paul E. McKenney, Alan Stern, Luc Maranget, and Andrea Parri
Introduction
This document gives an example-driven intuitive introduction to the Linux
kernel memory model.
It is in no way a substitute for the
mathematical model,
in particular, these examples do not explore the many interesting
corner cases of that model.
However, these examples should prove useful for developers who
would like to write good concurrent Linux-kernel code, but who
don't mind giving up those last few percentage points of
performance by staying a bit back from the edge of memory-model
disaster.
This article assumes that you understand some of the material in the
A Formal Model of Linux-Kernel Memory Ordering
article, particularly the section entitled
Memory Models and The Role of Cycles.
This article is organized as follows:
- Communications Relations
- Ordering For Free
-
Local Execution-Based Ordering
- Release-Acquire Ordering
-
Full-Barrier Transitive Ordering
-
Restoring Sequential Consistency
- Special Cases
- Summary
These are followed by the inescapable
answers to the quick quizzes.
Before starting the examples, it is worthwhile reviewing the communications
relations.
These relations are as follows:
- rf: “read from”, where a read returns the value
stored by an earlier write.
This relation links the write to the read.
- co: “coherence”, where a write's value is
overwritten by another write.
This relation links the overwritten write to the overwriting write.
Note that the co relation is transitive: If write A
links to write B and write B links to write C, then
write A will also link to write C.
- fr: “from read”, where a write overwrites
the variable accessed by a read, but too late to prevent the read from
returning the overwritten value.
This relation links from the read to the write.
Note that the fr relation is extended by the co
relation, so that if fr links a read D to a
write E, and if co links write E to another
write F, then fr also links read D to
write F.
(More detail on fr may be found
here.)
Each of these relations is described in more detail in one of the
following sections.
Reads-From (rf)
Of these, the
rf relation is the only one that implies a temporal
constraint:
In the absence of value-speculation optimizations, the read must have executed
later than the write that supplied that read's value.
Quick Quiz 1:
So what will you do if compilers or CPUs start using value-speculation
optimizations?
Answer
A key point is that writing to a variable cannot result in an instantaneously
visible change in value, courtesy of the fact that the speed of light is
finite and that atoms are non-zero in size.
A change will instead propagate through the system, as fancifully depicted
below, with time advancing from left to right:
Quick Quiz 2:
If it is not possible to make a change instantaneously visible,
then how do sequentially consistent systems work?
Answer
The upshot is that if a read from a shared variable returns the value stored
by a write, then there will be an rf link from that write
to that read, and we know that the read must have executed after the write did.
Coherence (co)
Unlike the rf relation, the co relation does not
imply any sort of temporal constraint.
In fact, it is possible for an earlier write to overwrite a later write,
as shown below:
As can be seen in the figure, the write of the value 2
happened earlier in time, but it nevertheless overwrote the later
write of the value 1, so that the co relation
goes backwards in time.
This outcome is due to the fact that the written values take time to
propagate through the system, so that the system decides at some later
time which of the two values wins.
Quick Quiz 3:
Exactly how could the system “decide later” which write won?
Answer
Quick Quiz 4:
Wouldn't it be way simpler if the last write always won?
Answer
From-Read (fr)
Like the co relation, the fr relation does not
imply any sort of temporal constraint.
Again, due to the finite speed with which information propagates through
real systems, it is possible for a read to get a value from a write
whose value is already destined to be overwritten, as shown below:
In this example, the read returned the initialization value of zero
despite the write of the value 1 having already executed.
However, this new value had not yet propagated from CPU 0 to
CPU 3, so this later read, being unaware of the new value, returned
the old one.
The fr relation therefore goes backwards in time.
Relation Summary
In short, the rf relation implies a temporal constraint,
while the co and fr relations do not.
Therefore, as we will see in the remainder of this document,
as a rough rule of thumb, the more rf links your
code has, the less heavyweight ordering your code will need.
This rule of thumb is illustrated by the examples in the remainder
of this document.
This section describes types of ordering that are provided gratis by
most CPUs, even the infamous DEC Alpha.
But let's start with an example for which ordering is not
guaranteed:
Example Litmus Test #1
1 C C-LB+o-o+o-o+o-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 WRITE_ONCE(*b, 1);
12 }
13
14 P1(int *b, int *c)
15 {
16 int r1;
17
18 r1 = READ_ONCE(*b);
19 WRITE_ONCE(*c, 1);
20 }
21
22 P2(int *a, int *c)
23 {
24 int r1;
25
26 r1 = READ_ONCE(*c);
27 WRITE_ONCE(*a, 1);
28 }
29
30 exists
31 (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)
Because there is no natural ordering nor any sort of ordering constraint
on the CPU, the counter-intuitive outcome on the exists
clause on the last line can happen.
All that is needed is for the CPU to reorder the read and the write
from any of the three processes.
And the Linux-kernel memory model agrees with this assessment,
as confirmed by the Sometimes 1 7:
Outcome for Example Litmus Test #1 (strong model)
1 Test C-LB+o-o+o-o+o-o Allowed
2 States 8
3 0:r1=0; 1:r1=0; 2:r1=0;
4 0:r1=0; 1:r1=0; 2:r1=1;
5 0:r1=0; 1:r1=1; 2:r1=0;
6 0:r1=0; 1:r1=1; 2:r1=1;
7 0:r1=1; 1:r1=0; 2:r1=0;
8 0:r1=1; 1:r1=0; 2:r1=1;
9 0:r1=1; 1:r1=1; 2:r1=0;
10 0:r1=1; 1:r1=1; 2:r1=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)
15 Observation C-LB+o-o+o-o+o-o Sometimes 1 7
16 Hash=63d9717e69db4ea05bb2e6840d1d22d4
Quick Quiz 5:
Only the CPU?
Can't the compiler also reorder the reads and writes in
Example Litmus Test #1?
Answer
The key point of this unordered example is that even if your litmus test
has nothing but causal rf relations, it still needs at least some
ordering to prohibit the counter-intuitive cyclic outcome.
For example, if each process provides minimal
ordering from its read to its write, then the cyclic
outcome will be forbidden, as shown by the following litmus test:
Example Litmus Test #2
1 C C-LB+ldref-o+o-ctrl-o+o-dep-o.litmus
2
3 {
4 a=x0;
5 c=y0;
6 1:r2=b;
7 }
8
9 P0(int **a)
10 {
11 int *r1;
12
13 r1 = READ_ONCE(*a);
14 WRITE_ONCE(*r1, 1);
15 }
16
17 P1(int *b, int **c)
18 {
19 int r1;
20
21 r1 = READ_ONCE(*b);
22 if (r1)
23 WRITE_ONCE(*c, r2);
24 }
25
26 P2(int **a, int **c)
27 {
28 int *r1;
29
30 r1 = READ_ONCE(*c);
31 WRITE_ONCE(*a, r1);
32 }
33
34 exists
35 (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
In this litmus test,
P0() has an address dependency between its read and write,
P1() has a control dependency between its read and write, and
P2() has a data dependency between its read and write.
The initialization on lines 4-6 interacts with each process's
code to set up for the dependency in the next process.
Because there is an rf link between each adjacent pair of
processes and because each process maintains ordering between its
read and write, the counter-intuitive exists clause cannot
happen, as confirmed by the strong model:
Outcome for Example Litmus Test #2 (strong model)
1 Test C-LB+ldref-o+o-ctrl-o+o-dep-o Allowed
2 States 2
3 0:r1=x0; 1:r1=0; 2:r1=y0;
4 0:r1=y0; 1:r1=0; 2:r1=y0;
5 No
6 Witnesses
7 Positive: 0 Negative: 2
8 Condition exists (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
9 Observation C-LB+ldref-o+o-ctrl-o+o-dep-o Never 0 2
10 Hash=d8909b22b7196d3b91cd78e700c68cc6
Quick Quiz 6:
But wait!
DEC Alpha does not respect dependency ordering.
So wouldn't it fail to order P0() and P2() in
Example Litmus Test #2?
Answer
If any of the dependencies is removed, the cyclic outcome is allowed,
as shown in the following litmus test, which eliminates P1()'s
control dependency:
Example Litmus Test #3
1 C C-LB+ldref-o+o-o+o-dep-o.litmus
2
3 {
4 a=x0;
5 c=y0;
6 1:r2=b;
7 }
8
9 P0(int **a)
10 {
11 int *r1;
12
13 r1 = READ_ONCE(*a);
14 WRITE_ONCE(*r1, 1);
15 }
16
17 P1(int *b, int **c)
18 {
19 int r1;
20
21 r1 = READ_ONCE(*b);
22 WRITE_ONCE(*c, r2);
23 }
24
25 P2(int **a, int **c)
26 {
27 int *r1;
28
29 r1 = READ_ONCE(*c);
30 WRITE_ONCE(*a, r1);
31 }
32
33 exists
34 (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
The strong model confirms that the cycle is now allowed:
Outcome for Example Litmus Test #3 (strong model)
1 Test C-LB+ldref-o+o-o+o-dep-o Allowed
2 States 5
3 0:r1=b; 1:r1=0; 2:r1=b;
4 0:r1=b; 1:r1=1; 2:r1=b;
5 0:r1=x0; 1:r1=0; 2:r1=b;
6 0:r1=x0; 1:r1=0; 2:r1=y0;
7 0:r1=y0; 1:r1=0; 2:r1=y0;
8 Ok
9 Witnesses
10 Positive: 1 Negative: 4
11 Condition exists (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
12 Observation C-LB+ldref-o+o-o+o-dep-o Sometimes 1 4
13 Hash=0ee6280cfb52bb0721c6e110bbf79597
Note that dependencies enforce ordering only for trailing writes.
For example, consider the following trailing-read address-dependency
litmus test, which corresponds
to RCU-like insertion of an element into a linked list,
and is an example of the
message-passing pattern:
Example Litmus Test #4
1 C C-MP+o-assign+o-dep-o.litmus
2
3 {
4 x=y0;
5 0:r4=y;
6 }
7
8 P0(int *x, int *y)
9 {
10 WRITE_ONCE(*y, 1);
11 rcu_assign_pointer(*x, r4);
12 }
13
14 P1(int *x, int *y)
15 {
16 int r1;
17 int r2;
18
19 r1 = READ_ONCE(*x);
20 r2 = READ_ONCE(*r1);
21 }
22
23 exists
24 (1:r1=y /\ 1:r2=0)
The strong model shows that the cycle is permitted, and thus that
the above litmus test does not demonstrate a reliable way to
do RCU-like insertion, courtesy of DEC Alpha:
Outcome for Example Litmus Test #4 (strong model)
1 Test C-MP+o-assign+o-dep-o Allowed
2 States 3
3 1:r1=y; 1:r2=0;
4 1:r1=y; 1:r2=1;
5 1:r1=y0; 1:r2=0;
6 Ok
7 Witnesses
8 Positive: 1 Negative: 2
9 Condition exists (1:r1=y /\ 1:r2=0)
10 Observation C-MP+o-assign+o-dep-o Sometimes 1 2
11 Hash=15cc666886633d7ffe9f7d6129d54145
The following litmus test shows how to reliably carry out such an
insertion, using lockless_dereference(), which supplies the
memory barrier required by DEC Alpha, but only for DEC Alpha:
Example Litmus Test #5
1 C C-MP+o-assign+ldref-o.litmus
2
3 {
4 x=y0;
5 0:r4=y;
6 }
7
8 P0(int *x, int *y)
9 {
10 WRITE_ONCE(*y, 1);
11 rcu_assign_pointer(*x, r4);
12 }
13
14 P1(int *x, int *y)
15 {
16 int r1;
17 int r2;
18
19 r1 = lockless_dereference(*x);
20 r2 = READ_ONCE(*r1);
21 }
22
23 exists
24 (1:r1=y /\ 1:r2=0)
The strong model confirms that the bad outcome is prohibited:
Outcome for Example Litmus Test #5 (strong model)
1 Test C-MP+o-assign+ldref-o Allowed
2 States 2
3 1:r1=y; 1:r2=1;
4 1:r1=y0; 1:r2=0;
5 No
6 Witnesses
7 Positive: 0 Negative: 2
8 Condition exists (1:r1=y /\ 1:r2=0)
9 Observation C-MP+o-assign+ldref-o Never 0 2
10 Hash=6c282e8fa5aa6a72221fa5c9529d087b
This RCU insertion litmus test provides ordering for free on all CPU
families other than DEC Alpha.
Removing the control dependency allowed the cyclic results, as shown by
Example Litmus Test #3.
However, changing the READ_ONCE() in P1() to the
lightweight primitive
smp_load_acquire() can prohibit this cycle despite there being
no dependency, as shown below:
Example Litmus Test #6
1 C C-LB+ldref-o+acq-o+o-dep-o.litmus
2
3 {
4 a=x0;
5 c=y0;
6 1:r2=b;
7 }
8
9 P0(int **a)
10 {
11 int *r1;
12
13 r1 = READ_ONCE(*a);
14 WRITE_ONCE(*r1, 1);
15 }
16
17 P1(int *b, int **c)
18 {
19 int r1;
20
21 r1 = smp_load_acquire(b);
22 WRITE_ONCE(*c, r2);
23 }
24
25 P2(int **a, int **c)
26 {
27 int *r1;
28
29 r1 = READ_ONCE(*c);
30 WRITE_ONCE(*a, r1);
31 }
32
33 exists
34 (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
Again, the strong model confirms this:
Outcome for Example Litmus Test #6 (strong model)
1 Test C-LB+ldref-o+acq-o+o-dep-o Allowed
2 States 4
3 0:r1=b; 1:r1=0; 2:r1=b;
4 0:r1=x0; 1:r1=0; 2:r1=b;
5 0:r1=x0; 1:r1=0; 2:r1=y0;
6 0:r1=y0; 1:r1=0; 2:r1=y0;
7 No
8 Witnesses
9 Positive: 0 Negative: 4
10 Condition exists (0:r1=b /\ 1:r1=1 /\ 2:r1=b)
11 Observation C-LB+ldref-o+acq-o+o-dep-o Never 0 4
12 Hash=67f08dc71f9c2a03065bd1c425c24f09
The venerable smp_rmb() barrier also operates via
local execution-based ordering, as will be shown
later.
In the meantime, it is possible to substitute smp_load_acquire()
for all of the dependencies, and still prohibit the cycle:
Example Litmus Test #7
1 C C-LB+acq-o+acq-o+acq-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int *r1;
9
10 r1 = smp_load_acquire(a);
11 WRITE_ONCE(*b, 1);
12 }
13
14 P1(int *b, int *c)
15 {
16 int r1;
17
18 r1 = smp_load_acquire(b);
19 WRITE_ONCE(*c, 1);
20 }
21
22 P2(int *a, int *c)
23 {
24 int r1;
25
26 r1 = smp_load_acquire(c);
27 WRITE_ONCE(*a, 1);
28 }
29
30 exists
31 (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)
Again, the model agrees that the cycle is prohibited:
Outcome for Example Litmus Test #7 (strong model)
1 Test C-LB+acq-o+acq-o+acq-o Allowed
2 States 7
3 0:r1=0; 1:r1=0; 2:r1=0;
4 0:r1=0; 1:r1=0; 2:r1=1;
5 0:r1=0; 1:r1=1; 2:r1=0;
6 0:r1=0; 1:r1=1; 2:r1=1;
7 0:r1=1; 1:r1=0; 2:r1=0;
8 0:r1=1; 1:r1=0; 2:r1=1;
9 0:r1=1; 1:r1=1; 2:r1=0;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (0:r1=1 /\ 1:r1=1 /\ 2:r1=1)
14 Observation C-LB+acq-o+acq-o+acq-o Never 0 7
15 Hash=c7f0f09c322fb880cdd71087857a9071
However, this works only because of the causal rf-only nature of this
litmus test.
Removing the causality by replacing one of the rf links with a
write-to-write coherence or co link allows the
cycle, as can be seen from this litmus test:
Example Litmus Test #8
1 C C-WWC+o+acq-o+acq-o.litmus
2
3 {
4 }
5
6 P0(int *a)
7 {
8 WRITE_ONCE(*a, 1);
9 }
10
11 P1(int *a, int *b)
12 {
13 int r1;
14
15 r1 = smp_load_acquire(a);
16 WRITE_ONCE(*b, 1);
17 }
18
19 P2(int *a, int *b)
20 {
21 int r1;
22
23 r1 = smp_load_acquire(b);
24 WRITE_ONCE(*a, 2);
25 }
26
27 exists
28 (1:r1=1 /\ 2:r1=1 /\ a=1)
The Linux-kernel model shows that the cycle is allowed.
Outcome for Example Litmus Test #8 (strong model)
1 Test C-WWC+o+acq-o+acq-o Allowed
2 States 10
3 1:r1=0; 2:r1=0; a=1;
4 1:r1=0; 2:r1=0; a=2;
5 1:r1=0; 2:r1=1; a=1;
6 1:r1=0; 2:r1=1; a=2;
7 1:r1=1; 2:r1=0; a=1;
8 1:r1=1; 2:r1=0; a=2;
9 1:r1=1; 2:r1=1; a=1;
10 1:r1=1; 2:r1=1; a=2;
11 1:r1=2; 2:r1=0; a=1;
12 1:r1=2; 2:r1=0; a=2;
13 Ok
14 Witnesses
15 Positive: 1 Negative: 9
16 Condition exists (1:r1=1 /\ 2:r1=1 /\ a=1)
17 Observation C-WWC+o+acq-o+acq-o Sometimes 1 9
18 Hash=e8861be8533cb22ff500cef0e1e446ab
Forbidding this cycle requires transitivity, a weak form of which is
provided by release-acquire chains, which are the subject of the
next section.
Transitivity is a subtle concept in memory ordering, being provided by
a combination of A-cumulativity and B-cumulativity,
which is discussed in the
Memory barriers
section of the
strong model presentation.
In the meantime, for causal rf links,
transitivity can be obtained by pairing
smp_store_release() with smp_load_acquire(),
forming a link in a release-acquire chain.
As an added bonus, these two primitives generate no code on strongly
ordered systems such as x86 and s390.
In addition, the first smp_load_acquire() can be removed,
as shown below:
Example Litmus Test #9
1 C C-WWC+o+o-rel+acq-o.litmus
2
3 {
4 }
5
6 P0(int *a)
7 {
8 WRITE_ONCE(*a, 1);
9 }
10
11 P1(int *a, int *b)
12 {
13 int r1;
14
15 r1 = READ_ONCE(*a);
16 smp_store_release(b, 1);
17 }
18
19 P2(int *a, int *b)
20 {
21 int r1;
22
23 r1 = smp_load_acquire(b);
24 WRITE_ONCE(*a, 2);
25 }
26
27 exists
28 (1:r1=1 /\ 2:r1=1 /\ a=1)
And as the model says:
Outcome for Example Litmus Test #9 (strong model)
1 Test C-WWC+o+o-rel+acq-o Allowed
2 States 9
3 1:r1=0; 2:r1=0; a=1;
4 1:r1=0; 2:r1=0; a=2;
5 1:r1=0; 2:r1=1; a=1;
6 1:r1=0; 2:r1=1; a=2;
7 1:r1=1; 2:r1=0; a=1;
8 1:r1=1; 2:r1=0; a=2;
9 1:r1=1; 2:r1=1; a=2;
10 1:r1=2; 2:r1=0; a=1;
11 1:r1=2; 2:r1=0; a=2;
12 No
13 Witnesses
14 Positive: 0 Negative: 9
15 Condition exists (1:r1=1 /\ 2:r1=1 /\ a=1)
16 Observation C-WWC+o+o-rel+acq-o Never 0 9
17 Hash=d43b7fcba00cf52af389a7a87f51f46e
Note that although smp_store_release() and smp_load_acquire()
provide ordering to processes participating in the release-acquire chain,
they do not necessarily guarantee ordering from the perspective of processes
outside of that chain.
For example, in the following litmus test, P0() and P1()
form a (rather short) release-acquire chain and P2() is an
outside observer:
Example Litmus Test #10
1 C C-Z6.0+o-rel+acq-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_store_release(b, 1);
10 }
11
12 P1(int *b, int *c)
13 {
14 int r1;
15
16 r1 = smp_load_acquire(b);
17 WRITE_ONCE(*c, 1);
18 }
19
20 P2(int *a, int *c)
21 {
22 int r1;
23
24 WRITE_ONCE(*c, 2);
25 smp_mb();
26 r1 = READ_ONCE(*a);
27 }
28
29 exists
30 (1:r1=1 /\ 2:r1=0 /\ c=2)
Note that P2() must use smp_mb() in order to order the
prior write against the later read.
However, even that smp_mb() is not sufficient to provide ordering
in this case, given the co and fr links:
Outcome for Example Litmus Test #10 (strong model)
1 Test C-Z6.0+o-rel+acq-o+o-mb-o Allowed
2 States 8
3 1:r1=0; 2:r1=0; c=1;
4 1:r1=0; 2:r1=0; c=2;
5 1:r1=0; 2:r1=1; c=1;
6 1:r1=0; 2:r1=1; c=2;
7 1:r1=1; 2:r1=0; c=1;
8 1:r1=1; 2:r1=0; c=2;
9 1:r1=1; 2:r1=1; c=1;
10 1:r1=1; 2:r1=1; c=2;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (1:r1=1 /\ 2:r1=0 /\ c=2)
15 Observation C-Z6.0+o-rel+acq-o+o-mb-o Sometimes 1 7
16 Hash=30f1277a7ac1097b0934bce41caf6144
In this case, full ordering requires an smp_mb() on either
P0() or P1().
The following litmus test chooses P0():
Example Litmus Test #11
1 C C-Z6.0+o-mb-o+acq-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_mb();
10 WRITE_ONCE(*b, 1);
11 }
12
13 P1(int *b, int *c)
14 {
15 int r1;
16
17 r1 = smp_load_acquire(b);
18 WRITE_ONCE(*c, 1);
19 }
20
21 P2(int *a, int *c)
22 {
23 int r1;
24
25 WRITE_ONCE(*c, 2);
26 smp_mb();
27 r1 = READ_ONCE(*a);
28 }
29
30 exists
31 (1:r1=1 /\ 2:r1=0 /\ c=2)
The choice of P0() also allows the smp_store_release()
to be downgraded to a WRITE_ONCE(), as confirmed by the
model:
Outcome for Example Litmus Test #11 (strong model)
1 Test C-Z6.0+o-mb-o+acq-o+o-mb-o Allowed
2 States 7
3 1:r1=0; 2:r1=0; c=1;
4 1:r1=0; 2:r1=0; c=2;
5 1:r1=0; 2:r1=1; c=1;
6 1:r1=0; 2:r1=1; c=2;
7 1:r1=1; 2:r1=0; c=1;
8 1:r1=1; 2:r1=1; c=1;
9 1:r1=1; 2:r1=1; c=2;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (1:r1=1 /\ 2:r1=0 /\ c=2)
14 Observation C-Z6.0+o-mb-o+acq-o+o-mb-o Never 0 7
15 Hash=7cbe17aa8c52275b4b3a6c03658952b6
Because smp_store_release() is always a store and
smp_load_acquire() is always a load, a release-acquire chain
necessarily links from a store in one process to a load in the next.
The following section shows how to enforce ordering for other types of links.
The following litmus test uses smp_mb() to provide transitive
ordering despite the fact that all links from one process to the next
are fr links:
Example Litmus Test #12
1 C C-3.SB+o-mb-o+o-mb-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 int r1;
9
10 WRITE_ONCE(*x, 1);
11 smp_mb();
12 r1 = READ_ONCE(*y);
13 }
14
15 P1(int *y, int *z)
16 {
17 int r2;
18
19 WRITE_ONCE(*y, 1);
20 smp_mb();
21 r2 = READ_ONCE(*z);
22 }
23
24 P2(int *z, int *x)
25 {
26 int r2;
27
28 WRITE_ONCE(*z, 1);
29 smp_mb();
30 r2 = READ_ONCE(*x);
31 }
32
33 exists
34 (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)
The strong model says that this litmus test's cycle is forbidden:
Outcome for Example Litmus Test #12 (strong model)
1 Test C-3.SB+o-mb-o+o-mb-o+o-mb-o Allowed
2 States 7
3 0:r1=0; 1:r2=0; 2:r2=1;
4 0:r1=0; 1:r2=1; 2:r2=0;
5 0:r1=0; 1:r2=1; 2:r2=1;
6 0:r1=1; 1:r2=0; 2:r2=0;
7 0:r1=1; 1:r2=0; 2:r2=1;
8 0:r1=1; 1:r2=1; 2:r2=0;
9 0:r1=1; 1:r2=1; 2:r2=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)
14 Observation C-3.SB+o-mb-o+o-mb-o+o-mb-o Never 0 7
15 Hash=9e78339cf867c77d247d5a43f65645f7
However, omitting even one smp_mb() results in the cycle
being allowed:
Example Litmus Test #13
1 C C-3.SB+o-o+o-mb-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 int r1;
9
10 WRITE_ONCE(*x, 1);
11 r1 = READ_ONCE(*y);
12 }
13
14 P1(int *y, int *z)
15 {
16 int r2;
17
18 WRITE_ONCE(*y, 1);
19 smp_mb();
20 r2 = READ_ONCE(*z);
21 }
22
23 P2(int *z, int *x)
24 {
25 int r2;
26
27 WRITE_ONCE(*z, 1);
28 smp_mb();
29 r2 = READ_ONCE(*x);
30 }
31
32 exists
33 (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)
This is confirmed by the strong model:
Outcome for Example Litmus Test #13 (strong model)
1 Test C-3.SB+o-o+o-mb-o+o-mb-o Allowed
2 States 8
3 0:r1=0; 1:r2=0; 2:r2=0;
4 0:r1=0; 1:r2=0; 2:r2=1;
5 0:r1=0; 1:r2=1; 2:r2=0;
6 0:r1=0; 1:r2=1; 2:r2=1;
7 0:r1=1; 1:r2=0; 2:r2=0;
8 0:r1=1; 1:r2=0; 2:r2=1;
9 0:r1=1; 1:r2=1; 2:r2=0;
10 0:r1=1; 1:r2=1; 2:r2=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (0:r1=0 /\ 1:r2=0 /\ 2:r2=0)
15 Observation C-3.SB+o-o+o-mb-o+o-mb-o Sometimes 1 7
16 Hash=d42d9489b5750bd71d719a152bd76659
The next section shows how smp_mb() provides even stronger
guarantees than mere transitivity.
The canonical litmus test for sequential consistency is the celebrated
(though as far as we know, not used in practice) “independent reads
of independent writes”, or IRIW:
Example Litmus Test #14
1 C C-IRIW+o+o+o-mb-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *y)
12 {
13 WRITE_ONCE(*y, 1);
14 }
15
16 P2(int *x, int *y)
17 {
18 int r1;
19 int r2;
20
21 r1 = READ_ONCE(*x);
22 smp_mb();
23 r2 = READ_ONCE(*y);
24 }
25
26 P3(int *x, int *y)
27 {
28 int r1;
29 int r2;
30
31 r1 = READ_ONCE(*y);
32 smp_mb();
33 r2 = READ_ONCE(*x);
34 }
35
36 exists
37 (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)
If the cycle is permitted, then the two reader threads can disagree on
the order of the independent stores.
When both reader threads use smp_mb(), this cycle is forbidden:
Outcome for Example Litmus Test #14 (strong model)
1 Test C-IRIW+o+o+o-mb-o+o-mb-o Allowed
2 States 15
3 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=0;
4 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=1;
5 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=0;
6 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=1;
7 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=0;
8 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=1;
9 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=0;
10 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=1;
11 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=0;
12 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=1;
13 2:r1=1; 2:r2=0; 3:r1=1; 3:r2=1;
14 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=0;
15 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=1;
16 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=0;
17 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=1;
18 No
19 Witnesses
20 Positive: 0 Negative: 15
21 Condition exists (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)
22 Observation C-IRIW+o+o+o-mb-o+o-mb-o Never 0 15
23 Hash=b0840da8dd0f7482d111dab3f8dde214
The next litmus test replaces smp_mb() with release-acquire chains
from each writer to the corresponding reader:
Example Litmus Test #15
1 C C-IRIW+rel+rel+acq-o+acq-o.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 smp_store_release(x, 1);
9 }
10
11 P1(int *y)
12 {
13 smp_store_release(y, 1);
14 }
15
16 P2(int *x, int *y)
17 {
18 int r1;
19 int r2;
20
21 r1 = smp_load_acquire(x);
22 r2 = READ_ONCE(*y);
23 }
24
25 P3(int *x, int *y)
26 {
27 int r1;
28 int r2;
29
30 r1 = smp_load_acquire(y);
31 r2 = READ_ONCE(*x);
32 }
33
34 exists
35 (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)
In this case the cycle is allowed:
Outcome for Example Litmus Test #15 (strong model)
1 Test C-IRIW+rel+rel+acq-o+acq-o Allowed
2 States 16
3 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=0;
4 2:r1=0; 2:r2=0; 3:r1=0; 3:r2=1;
5 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=0;
6 2:r1=0; 2:r2=0; 3:r1=1; 3:r2=1;
7 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=0;
8 2:r1=0; 2:r2=1; 3:r1=0; 3:r2=1;
9 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=0;
10 2:r1=0; 2:r2=1; 3:r1=1; 3:r2=1;
11 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=0;
12 2:r1=1; 2:r2=0; 3:r1=0; 3:r2=1;
13 2:r1=1; 2:r2=0; 3:r1=1; 3:r2=0;
14 2:r1=1; 2:r2=0; 3:r1=1; 3:r2=1;
15 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=0;
16 2:r1=1; 2:r2=1; 3:r1=0; 3:r2=1;
17 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=0;
18 2:r1=1; 2:r2=1; 3:r1=1; 3:r2=1;
19 Ok
20 Witnesses
21 Positive: 1 Negative: 15
22 Condition exists (2:r1=1 /\ 2:r2=0 /\ 3:r1=1 /\ 3:r2=0)
23 Observation C-IRIW+rel+rel+acq-o+acq-o Sometimes 1 15
24 Hash=e740a02d7b6130c50c917c38998da1dc
In short, smp_mb() is much stronger than is release-acquire.
As well it should be, given that it is usually also much more expensive!
This section covers a some special cases, the first because it
is heavily used, the second because it is almost never used,
and the last two because they allow you to emulate primitives
that are not yet directly handled by the memory model:
- Message Passing
- Write-Only Scenarios
- Emulating Locking
- Emulating call_rcu()
The celebrated (and heavily used) message-passing
pattern is shown below:
Example Litmus Test #16
1 C C-MP+o-rel+acq-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_store_release(b, 1);
10 }
11
12 P1(int *a, int *b)
13 {
14 int r1;
15 int r2;
16
17 r1 = smp_load_acquire(b);
18 r2 = READ_ONCE(*a);
19 }
20
21 exists
22 (1:r1=1 /\ 1:r2=0)
This is of course forbidden:
Outcome for Example Litmus Test #16 (strong model)
1 Test C-MP+o-rel+acq-o Allowed
2 States 3
3 1:r1=0; 1:r2=0;
4 1:r1=0; 1:r2=1;
5 1:r1=1; 1:r2=1;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (1:r1=1 /\ 1:r2=0)
10 Observation C-MP+o-rel+acq-o Never 0 3
11 Hash=a238b2bd1fd4e5e95a6c2d1095975619
However, smp_store_release and smp_load_acquire()
can be replaced with
the old-style smp_wmb() and smp_rmb() primitives
as shown below:
Example Litmus Test #17
1 C C-MP+o-wmb-o+o-rmb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_wmb();
10 WRITE_ONCE(*b, 1);
11 }
12
13 P1(int *a, int *b)
14 {
15 int r1;
16 int r2;
17
18 r1 = READ_ONCE(*b);
19 smp_rmb();
20 r2 = READ_ONCE(*a);
21 }
22
23 exists
24 (1:r1=1 /\ 1:r2=0)
This is still forbidden:
Outcome for Example Litmus Test #17 (strong model)
1 Test C-MP+o-wmb-o+o-rmb-o Allowed
2 States 3
3 1:r1=0; 1:r2=0;
4 1:r1=0; 1:r2=1;
5 1:r1=1; 1:r2=1;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (1:r1=1 /\ 1:r2=0)
10 Observation C-MP+o-wmb-o+o-rmb-o Never 0 3
11 Hash=14c5bb4f4878756fe2adc3eaed5fd3ce
That said, smp_store_release() and smp_load_acquire()
are almost always easier to use and maintain, so they would normally
be preferred.
However, there are a few less-common cases where smp_wmb()
and smp_rmb() can produce more efficient code on some
weakly ordered systems:
Example Litmus Test #18
1 C C-MP2+o-o-wmb-o-o+o-rmb-o+o-rmb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b, int *c, int *d)
7 {
8 WRITE_ONCE(*a, 1);
9 WRITE_ONCE(*c, 1);
10 smp_wmb();
11 WRITE_ONCE(*b, 1);
12 WRITE_ONCE(*d, 1);
13 }
14
15 P1(int *a, int *b)
16 {
17 int r1;
18 int r2;
19
20 r1 = READ_ONCE(*b);
21 smp_rmb();
22 r2 = READ_ONCE(*a);
23 }
24
25 P2(int *c, int *d)
26 {
27 int r1;
28 int r2;
29
30 r1 = READ_ONCE(*d);
31 smp_rmb();
32 r2 = READ_ONCE(*c);
33 }
34
35 exists
36 (1:r1=1 /\ 1:r2=0) \/ (2:r1=1 /\ 2:r2=0)
The model shows that this is also forbidden:
Outcome for Example Litmus Test #18 (strong model)
1 Test C-MP2+o-o-wmb-o-o+o-rmb-o+o-rmb-o Allowed
2 States 9
3 1:r1=0; 1:r2=0; 2:r1=0; 2:r2=0;
4 1:r1=0; 1:r2=0; 2:r1=0; 2:r2=1;
5 1:r1=0; 1:r2=0; 2:r1=1; 2:r2=1;
6 1:r1=0; 1:r2=1; 2:r1=0; 2:r2=0;
7 1:r1=0; 1:r2=1; 2:r1=0; 2:r2=1;
8 1:r1=0; 1:r2=1; 2:r1=1; 2:r2=1;
9 1:r1=1; 1:r2=1; 2:r1=0; 2:r2=0;
10 1:r1=1; 1:r2=1; 2:r1=0; 2:r2=1;
11 1:r1=1; 1:r2=1; 2:r1=1; 2:r2=1;
12 No
13 Witnesses
14 Positive: 0 Negative: 9
15 Condition exists (1:r1=1 /\ 1:r2=0 \/ 2:r1=1 /\ 2:r2=0)
16 Observation C-MP2+o-o-wmb-o-o+o-rmb-o+o-rmb-o Never 0 9
17 Hash=5e1ec52282c90ed972723d7e044cab3e
Nevertheless, it is worth reiterating that smp_store_release()
and smp_load_acquire() are almost always preferable to
smp_wmb() and smp_rmb().
Write-only scenarios are interesting in that it is not clear that
reasonable hardware could fail to provide ordering, but it is also
not clear that there are any reasonable use cases.
Thus far, any useful write-only scenario has proven to have a more
conventional counterpart that is better and easier to use.
Nevertheless, here is the simplest write-only scenario:
Example Litmus Test #19
1 C C-2+2W+o-wmb-o+o-wmb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 2);
9 smp_wmb();
10 WRITE_ONCE(*b, 1);
11 }
12
13 P1(int *a, int *b)
14 {
15 WRITE_ONCE(*b, 2);
16 smp_wmb();
17 WRITE_ONCE(*a, 1);
18 }
19
20 exists
21 (b=2 /\ a=2)
The strong model prohibits the cycle:
Outcome for Example Litmus Test #19 (strong model)
1 Test C-2+2W+o-wmb-o+o-wmb-o Allowed
2 States 3
3 a=1; b=1;
4 a=1; b=2;
5 a=2; b=1;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (b=2 /\ a=2)
10 Observation C-2+2W+o-wmb-o+o-wmb-o Never 0 3
11 Hash=8264db947f1b73b8be16f98dd6bf1634
However, the weak model allows this same cycle:
Outcome for Example Litmus Test #19 (weak model)
1 Test C-2+2W+o-wmb-o+o-wmb-o Allowed
2 States 4
3 a=1; b=1;
4 a=1; b=2;
5 a=2; b=1;
6 a=2; b=2;
7 Ok
8 Witnesses
9 Positive: 1 Negative: 3
10 Condition exists (b=2 /\ a=2)
11 Observation C-2+2W+o-wmb-o+o-wmb-o Sometimes 1 3
12 Hash=8264db947f1b73b8be16f98dd6bf1634
This disagreement between the strong and the weak model documents the
current uncertainty as to whether or not this two-processes write-only
scenario should be supported.
The same uncertainty applies to the three-process extension of
Example Litmus Test #19:
Example Litmus Test #20
1 C C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 2);
9 smp_wmb();
10 WRITE_ONCE(*b, 1);
11 }
12
13 P1(int *b, int *c)
14 {
15 WRITE_ONCE(*b, 2);
16 smp_wmb();
17 WRITE_ONCE(*c, 1);
18 }
19
20 P2(int *c, int *a)
21 {
22 WRITE_ONCE(*c, 2);
23 smp_wmb();
24 WRITE_ONCE(*a, 1);
25 }
26
27 exists
28 (a=2 /\ b=2 /\ c=2)
This is therefore also prohibited by the strong model:
Outcome for Example Litmus Test #20 (strong model)
1 Test C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Allowed
2 States 7
3 a=1; b=1; c=1;
4 a=1; b=1; c=2;
5 a=1; b=2; c=1;
6 a=1; b=2; c=2;
7 a=2; b=1; c=1;
8 a=2; b=1; c=2;
9 a=2; b=2; c=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (a=2 /\ b=2 /\ c=2)
14 Observation C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Never 0 7
15 Hash=2054ae2b227f6081d169f318556b51a1
And similarly also allowed by the weak model:
Outcome for Example Litmus Test #20 (weak model)
1 Test C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Allowed
2 States 8
3 a=1; b=1; c=1;
4 a=1; b=1; c=2;
5 a=1; b=2; c=1;
6 a=1; b=2; c=2;
7 a=2; b=1; c=1;
8 a=2; b=1; c=2;
9 a=2; b=2; c=1;
10 a=2; b=2; c=2;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (a=2 /\ b=2 /\ c=2)
15 Observation C-3+2W+o-wmb-o+o-wmb-o+o-wmb-o Sometimes 1 7
16 Hash=2054ae2b227f6081d169f318556b51a1
Locking can be emulated using xchg(), for example as follows:
Example Litmus Test #21
1 C C-locktest.litmus
2
3 {
4 }
5
6 P0(int *x, int *l)
7 {
8 int r1;
9 int r2;
10
11 r1 = xchg_acquire(l, 1);
12 if (r1 == 0) {
13 r2 = READ_ONCE(*x);
14 WRITE_ONCE(*x, 1);
15 WRITE_ONCE(*x, 0);
16 smp_store_release(l, 0);
17 }
18 }
19
20 P1(int *x, int *l)
21 {
22 int r1;
23 int r2;
24
25 r1 = xchg_acquire(l, 1);
26 if (r1 == 0) {
27 r2 = READ_ONCE(*x);
28 WRITE_ONCE(*x, 1);
29 WRITE_ONCE(*x, 0);
30 smp_store_release(l, 0);
31 }
32 }
33
34 exists
35 ((0:r1=0 /\ 0:r2=1) \/ (1:r1 = 0 /\ 1:r2=1))
This litmus test contains code that checks that this really does act like
a lock, which the model confirms that it does, albeit a bit slowly:
Outcome for Example Litmus Test #21 (strong model)
1 Test C-locktest Allowed
2 States 3
3 0:r1=0; 0:r2=0; 1:r1=0; 1:r2=0;
4 0:r1=0; 0:r2=0; 1:r1=1; 1:r2=0;
5 0:r1=1; 0:r2=0; 1:r1=0; 1:r2=0;
6 No
7 Witnesses
8 Positive: 0 Negative: 4
9 Condition exists (0:r1=0 /\ 0:r2=1 \/ 1:r1=0 /\ 1:r2=1)
10 Observation C-locktest Never 0 4
11 Hash=46161c4de6c633ba172e91fb61ee7d54
However, two of the three states above correspond to rejected
executions, where the process went forward despite having failed
to acquire the lock.
This information is actually quite useful when debugging, but can
be very distracting otherwise.
Fortunately, there is a filter keyword, which can be used
to exclude states corresponding to rejected executions:
Example Litmus Test #22
1 C C-locktest.litmus
2
3 {
4 }
5
6 P0(int *x, int *l)
7 {
8 int r1;
9 int r2;
10
11 r1 = xchg_acquire(l, 1);
12 r2 = READ_ONCE(*x);
13 WRITE_ONCE(*x, 1);
14 WRITE_ONCE(*x, 0);
15 smp_store_release(l, 0);
16 }
17
18 P1(int *x, int *l)
19 {
20 int r1;
21 int r2;
22
23 r1 = xchg_acquire(l, 1);
24 r2 = READ_ONCE(*x);
25 WRITE_ONCE(*x, 1);
26 WRITE_ONCE(*x, 0);
27 smp_store_release(l, 0);
28 }
29
30 locations [0:r1;1:r1]
31 filter (0:r1=0 /\ 1:r1=0)
32 exists (0:r2=1 \/ 1:r2=1)
This results in only the single legal state for locking:
Outcome for Example Litmus Test #22 (strong model)
1 Test C-locktest Allowed
2 States 1
3 0:r1=0; 0:r2=0; 1:r1=0; 1:r2=0;
4 No
5 Witnesses
6 Positive: 0 Negative: 2
7 Condition exists (0:r2=1 \/ 1:r2=1)
8 Observation C-locktest Never 0 2
9 Hash=dd03f03dea03d8d2be6f678ffff703dc
The locations keyword causes the two r1
registers to be printed, despite their not appearing in the
exists clause.
The effect of call_rcu() can be emulated by adding a process, and
having that process check a flag, wait for a grace period, then execute
the code that would be in the callback function.
The location that would execute the call_rcu() instead uses
smp_store_release() to set the flag.
For example, the following litmus test emulates an RCU callback that sets
variable b to one, again using filter to reject
executions in which P2() executed before P0()
stored to r:
Example Litmus Test #23
1 C C-LB+o-rel+rl-o-o-rul+o-sync-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *r)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 smp_store_release(r, 1);
12 }
13
14 P1(int *b, int *a)
15 {
16 int r2;
17
18 rcu_read_lock();
19 r2 = READ_ONCE(*b);
20 WRITE_ONCE(*a, 1);
21 rcu_read_unlock();
22 }
23
24 P2(int *r, int *b)
25 {
26 int r1;
27
28 r1 = READ_ONCE(*r);
29 synchronize_rcu();
30 WRITE_ONCE(*b, 1);
31 }
32
33 filter (2:r1=1)
34 exists (0:r1=1 /\ 1:r2=1)
The model confirms that the cycle is forbidden, just as it is without the
emulated callback:
Outcome for Example Litmus Test #23 (strong model)
1 Test C-LB+o-rel+rl-o-o-rul+o-sync-o Allowed
2 States 3
3 0:r1=0; 1:r2=0;
4 0:r1=0; 1:r2=1;
5 0:r1=1; 1:r2=0;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (0:r1=1 /\ 1:r2=1)
10 Observation C-LB+o-rel+rl-o-o-rul+o-sync-o Never 0 3
11 Hash=190a2b21ba29cb53e597cfff2bfdec4e
This document has shown a progression of litmus tests requiring
increasingly strong memory-ordering primitives, due to increasing
numbers of non-causal non-rf links and decreasing numbers of
dependencies.
This progression give rise to the following rules of thumb:
- When the code has only causal rf links, and when
each process has properly placed address, control, or data
dependencies, no additional memory ordering is required.
- Code having only rf links can use either a release
(including RCU's rcu_assign_pointer()) or an acquire
on any process that lacks a properly placed dependency.
- Release-acquire chains (including RCU's
rcu_assign_pointer() and rcu_dereference())
can guarantee ordering even if one
of the links is a non-causal non-rf link, at long
as the remaining links are all rf links.
- An acquire leading to a write can be replaced by dependencies,
and in many cases a given process's acquire and release can also
be replaced by dependencies.
However, it is wise to run the model to check a given scenario.
- An acquire leading to a dependent read can usually be replaced by
rcu_dereference() or lockless_dereference().
Again, it is wise to run the model to check a given scenario.
- However, given more than one non-rf link,
release-acquire chains are not always able to provide ordering,
and smp_mb() might also be required.
As a rough rule of thumb, there will need to be at least one
smp_mb() between each successive pair of non-rf
links, but it is yet again wise to run the model to check a given
scenario.
- If any process needs to order a prior store against a subsequent
load, an smp_mb() is needed.
- When it is necessary to fully restore sequential
consistency, it is in the worst case necessary to place an
smp_mb() between each pair of accesses to shared memory
within each process.
The canonical sequential-consistency exmaple is shown in
Example Litmus Test #14.
- Finally, in the Linux kernel, either a
synchronous RCU grace-period operation
(such as synchronize_rcu()) or
a fully ordered value-returning atomic read-modify-write operations
(such as xchg())
may be substituted for an smp_mb().
We hope that this provides a useful intuitive approximation to
the full model.
Those desiring a full understanding of the memory model should look
here.
Quick Quiz 1:
So what will you do if compilers or CPUs start using value-speculation
optimizations?
Answer:
To be useful by concurrent software such as the Linux kernel,
such compilers and hardware will need to do one or both of two things:
- Permit value speculation only on accesses to non-shared memory.
For example, in C11, value speculation would need to be prohibited on
atomic accesses.
- Make sure that any speculation is rolled back as needed not only
to arrive at the correct value, but also to arrive at the correct
ordering.
It is also quite possible that those putting forward speculative
optimizations will receive considerable quantities of highly colorful
feedback from any number of flamboyant personalities.
Back to Quick Quiz 1.
Quick Quiz 2:
If it is not possible to make a change instantaneously visible,
then how do sequentially consistent systems work?
Answer:
First, there have not been many commercially available sequentially consistent
systems.
Second, sequential consistency does not guarantee simultaneity,
but rather ordering.
One way to guarantee ordering is to introduce delays, as shown by the
red triangle below:
Although there are any number of tricks that CPU designers can use to
hide the delay, some would argue that these speed-of-light delays are
a big reason why there have been so very few commercially available
sequentially consistent systems.
Back to Quick Quiz 2.
Quick Quiz 3:
Exactly how could the system “decide later” which write won?
Answer:
Consider the following sequence of events:
- The variable x is initially zero, and its cacheline
is in CPU 0's cache.
- CPU 1 writes the value 2 to x, but it does
not have the cacheline.
It therefore records the write in its write buffer and
requests exclusive access to the cacheline.
- CPU 0 writes the value 1 to x, and because the
cacheline is at CPU 0, the write completes almost
immediately, updating the value of x in the cacheline.
- CPU 1's request for the cacheline arrives at CPU 0,
which sends it (along with the new value of x)
to CPU 1.
- CPU 1 receives the cacheline, and moves the new value
of x to the cacheline.
CPU 1's earlier write therefore overwrites CPU 0's
later write.
As you can see, it is the later movement of the cacheline that determines
which write wins.
Back to Quick Quiz 3.
Quick Quiz 4:
Wouldn't it be way simpler if the last write always won?
Answer:
It might well be simpler, but it might not be better.
For example, the cache state of the system might be such that an
earlier write was stuck in its CPU's store buffer for an extended time.
Many CPU vendors would like to be able to decide which value wins
without needing an expensive inspection of the store buffer of each
and every CPU.
Back to Quick Quiz 4.
Quick Quiz 5:
Only the CPU?
Can't the compiler also reorder the reads and writes in
Example Litmus Test #1?
Answer:
No.
The volatile accesses in READ_ONCE() and WRITE_ONCE()
prohibit compiler reordering.
However, if the litmus test were to use C11 non-volatile
memory_order_relaxed accesses, then the compiler could
reorder these accesses.
Back to Quick Quiz 5.
Quick Quiz 6:
But wait!
DEC Alpha does not respect dependency ordering.
So wouldn't it fail to order P0() and P2() in
Example Litmus Test #2?
Answer:
No.
The Alpha does have weak ordering, but it still avoids speculative
writes.
However, if P0()'s and P1()'s final WRITE_ONCE()
was instead a READ_ONCE(), speculation would be possible and Alpha
would not guarantee ordering, as will be shown later.
Back to Quick Quiz 6.