Promising 2.0: Global Optimizations in Relaxed Memory Concurrency

Sung-Hwan Lee
Seoul National University
Korea
sunghwan.lee@sf.snu.ac.kr

Minki Cho
Seoul National University
Korea
minki.cho@sf.snu.ac.kr

Anton Podkopaev
National Research University Higher School of Economics & MPI-SWS
Russia & Germany
podkopaev@mpi-sws.org

Soham Chakraborty
IIT Delhi
India
soham@cse.iitd.ac.in

Chung-Kil Hur
Seoul National University
Korea
gil.hur@sf.snu.ac.kr

Ori Lahav
Tel Aviv University
Israel
orilahav@tau.ac.il

Viktor Vafeiadis
MPI-SWS
Germany
viktor@mpi-sws.org

Abstract
For more than fifteen years, researchers have tried to support global optimizations in a usable semantics for a concurrent programming language, yet this task has proven to be very difficult because of (1) the infamous "out of thin air" problem, and (2) the subtle interaction between global and thread-local optimizations.

In this paper, we present a solution to this problem by redesigning a key component of the promising semantics (PS) of Kang et al. Our updated PS 2.0 model supports all the results known about the original PS model (i.e., thread-local optimizations, hardware mappings, DRF theorems), but additionally enables transformations based on global value-range analysis as well as register promotion (i.e., making accesses to a shared location local if the location is accessed by only one thread). PS 2.0 also resolves a problem with the compilation of relaxed RMWs to ARMv8, which required an unintended extra fence.

CCS Concepts: • Theory of computation → Concurrency; Operational semantics; • Software and its engineering → Semantics.

Keywords: Relaxed Memory Concurrency; Operational Semantics; Compiler Optimizations

1 Introduction
A major challenge in programming language semantics has been to define a weak memory model for a concurrent programming language that supports efficient compilation to the mainstream hardware platforms (i.e., x86, POWER, ARMv7, ARMv8, RISC-V), including all applicable compiler optimizations, and yet avoids semantic quirks, such as "out of thin air" reads [16], that prevent formal reasoning about programs and break DRF guarantees (which provide simpler semantics for data-race-free programs). In particular, such a semantics must allow the following annotated outcome (assuming all variables are initialized to zero and all accesses are relaxed).

\[
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
y := 1 & x := b
\end{array}
\qquad \text{(LB)}
\]

This outcome is observable after a compiler transformation that reorders (the independent) accesses of thread 1, while on ARM [20] it is even observable without the transformation.

While there are multiple partial solutions to this challenge [7, 8, 12, 16, 18], none of them properly supports global compiler optimizations, namely program transformations whose validity depends on some global analysis. Examples of such transformations are (a) removal of null pointer checks based on global null-pointer analysis; (b) removal of array bounds checks based on global size analysis; and (c) register promotion, i.e., converting accesses to a shared variable that happens to be used by only one thread to local accesses. The latter is very important in languages like Java that have only...

The desire to support global optimizations in concurrent programming languages goes back at least 15 years, to the Java memory model (JMM) [16]. In fact, the very first JMM “causality test case” is centered around value-range analysis. Assuming all variables are initialized to 0, JMM allows the annotated outcome of the following example:

\[
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
\text{if } a \geq 0 \text{ then} & x := b \\
\quad y := 1 &
\end{array}
\qquad \text{(JMM1)}
\]

"Decision: Allowed, since interthread compiler analysis could determine that \(x\) and \(y\) are always non-negative, allowing simplification of \(a \geq 0\) to true, and allowing write \(y := 1\) to be moved early." [10]

Supporting global optimizations, however, is rather challenging because of their interaction with local transformations. Global optimizations generally depend on invariants deduced by some global analysis, but these invariants need not hold in the source program; they might hold only after some local transformations have been applied. In the following example, (only) after the local elimination of the overwritten \(x := 42\) assignment, the condition \(a < 10\) becomes a global invariant, and so can be simplified to true as in JMM1.

\[
\begin{array}{l||l}
a := x \quad \text{// 1} & x := 42 \\
\text{if } a < 10 \text{ then} & b := y \quad \text{// 1} \\
\quad y := 1 & x := b
\end{array}
\qquad \text{(LB-G)}
\]

In more complex cases, a global optimization may enable a local transformation, which may further enable another global optimization, which may enable another local optimization, and so on. As a result, supporting both global and local transformations is very difficult, and none of the solutions so far has managed to fully support global analysis along with all the expected thread-local transformations.

In this paper, we present the first memory model that solves this challenge: (i) it allows the aforementioned global optimizations (value-range analysis and register promotion); (ii) it validates the thread-local compiler optimizations that are validated by the C/C++11 model [13] (e.g., roach-motel reorderings [21]); (iii) it can be efficiently mapped to the mainstream hardware platforms (x86, POWER, ARMv7, ARMv8, RISC-V); and (iv) it supports reasoning principles in the form of DRF guarantees, allowing programmers to resort to simpler well-behaved models when data races are appropriately restricted. In developing our model we mainly use (i)–(iii) to conclude that some behavior should be allowed; while (iv) tells us which behaviors must be forbidden.

As a starting point, we take the promising semantics (PS) of Kang et al. [12], a concurrency semantics that satisfies almost all our desiderata. It supports almost all C/C++11 features, all expected thread-local compiler optimizations, and several DRF theorems. In addition, Podkopaev et al. [19] established the correctness of a mapping from PS to hardware.\(^1\) The main drawback of PS is that it does not support global optimizations.

PS is an operational semantics which represents shared memory as a set of messages (i.e., writes). To support out-of-order execution, PS employs a non-standard step, allowing a thread to promise to perform a write in the future, which enables other threads to read from it before the write is actually executed.

The technical challenge resides in identifying the exact conditions on such promise steps so that basic guarantees (like DRF and no “thin-air values”) are maintained.

In PS, these conditions are completely thread-local: the thread performing the promise must be able to run in isolation from all extensions of the current state and fulfill all its outstanding promises. While thread-locality is useful, quantifying over all extensions of the current state prevents optimizations based on global analysis because some extensions may well not satisfy the invariant produced by the analysis.

Checking for promise fulfillment only from the current state without extension enables global analysis, but breaks the DRF guarantee (see §4). Our solution is therefore to check promise fulfillment for a carefully crafted extension of the current state, which we call capped memory. Because capped memory does not contain any new values, it is consistent with optimizations based on global value analysis. However, it still does not allow optimizations like register promotion.

To support register promotion, we introduce reservations, which allow a thread to secure an exclusive right to perform an atomic read-modify-write instruction reading from a certain message without fixing the value that it will write (because, for example, that value might not yet have been resolved). In addition, reservations resolve a problem with the compilation of PS to ARMv8, whose intended mapping of RMWs was unsound and required an extra fence [19].\(^2\)

With these two new concepts, we are able to retain the thread-local nature of PS and yet fully support global optimizations and the intended mapping of RMWs along with all the results available for PS. Our redesigned PS 2.0 model is the first weak memory model that achieves these results. To establish confidence in our model, we have formalized our key results in the Coq proof assistant.

Outline. In the following, we first review the PS definition (§2), and why it does not support global optimizations (§3). We then present our PS 2.0 model both informally in an incremental fashion (§4) and formally all together (§5). In §6, we establish the correctness of mappings from PS 2.0 to hardware, and show that PS 2.0 supports all the local transformations and reasoning principles known to be allowed by PS, as well as register promotion, and the introduction of ‘assert’ statements for invariants derived by global analysis. The mechanization of our main results in Coq, the full model definitions, and written proofs of additional claims are available in [1].

---

\(^1\)Albeit, the mapping of RMWs to ARMv8 contains one more barrier (“ld fence”) than intended because the intended mapping is unsound.
\(^2\)Our current mechanized proof requires a fake control dependency from relaxed fetch-and-add instructions, which is currently not added by standard compilers. We believe that the compilation from our model without this dependency is sound as well, and leave the formal proof to future work (see also §6.5).

2 Introduction to the Promising Semantics

In this section, we introduce the promising semantics (PS) of Kang et al. [12]. For simplicity, we present only a fragment of PS containing three kinds of memory accesses: relaxed (the default mode), release writes (rel), and acquire reads (acq). Read-modify-write (RMW) instructions, such as compare-and-swap (CAS) and fetch-and-add (FADD), carry two access modes: one for the exclusive read and one for the write. We put aside other access modes, fences, and release sequences, as they are orthogonal to the contribution of this paper. We refer the reader to [12] for the full PS model.

Domains. We assume non-empty sets Loc of locations and Val of values. We also assume a set Time of timestamps, which is totally and densely ordered by < with 0 as its minimum. (In our examples, we take non-negative rational numbers as timestamps with their usual ordering.) A view, V ∈ View, is a function V : Loc → Time that records a timestamp for every location.

Memory. In PS, the memory is a set of messages representing all previously executed writes. A message m is of the form ⟨x : v@(f, t], R⟩, where x ∈ Loc is the location, v ∈ Val is the stored value, (f, t] is a timestamp interval, and R ∈ View is the message view. The latter is used to model release-acquire synchronization and will be explained shortly. Initially, the memory consists of an initialization message for every location. The timestamp (also called the “to”-timestamp) of a message ⟨x : v@(f, t], R⟩ is the upper bound t of the message’s timestamp interval. The lower bound f, called the “from”-timestamp, is needed to handle atomic updates (a.k.a. RMW operations) as explained below.

Machine State. PS is an operational model where threads execute in an interleaved fashion. The machine state is a pair Σ = ⟨\(\mathcal{TS}\), M⟩, where \(\mathcal{TS}\) assigns a thread state TS to every thread and M is a (global) memory. A thread state is a triple TS = ⟨σ, V, P⟩ where σ is the local store recording the values of the thread’s local variables, V ∈ View is the thread view, and P is a set of messages representing the thread’s outstanding promises.

Relaxed Reads and Writes. Thread views are instrumental in providing correct semantics to memory accesses. The thread view, V, records the “knowledge” of each thread, i.e., the timestamp of the most recent message that it has observed for each location. It is used to prevent a thread from reading a (stale) message m if the thread is aware of a “newer” message, i.e., when V(x) is greater than the message’s timestamp. Similarly, when a thread adds a message of location x to the memory, it has to pick a timestamp t for the added message that is greater than its view of x (V(x) < t):

READ. A thread can read from memory M by simply observing a message ⟨x : v@(f, t], ⊥⟩ ∈ M provided that V(x) ≤ t, and updating its view for x to t.

WRITE. A thread adds a new message m = ⟨x : v@(f, t], ⊥⟩ to the memory where the timestamp t is greater than the thread’s view of x (V(x) < t) and there is no other message with the same location and overlapping timestamp interval in the memory. Relaxed writes set the message view to ⊥, which maps each location to timestamp 0.

The following example illustrates how timestamps of messages and views interact. Note that we assume that both threads start with the initial thread view that maps x and y to 0, and that every location is initialized to 0: the initial memory only contains messages ⟨x : 0@(0, 0], ⊥⟩ and ⟨y : 0@(0, 0], ⊥⟩.

Here, both threads are allowed to read from the initialization messages, 0. When thread 1 performs the write to x, it will add a message ⟨x : 0@(f, t], ⊥⟩ by choosing some t > f ≥ 0. During this write, thread 1 should increase its view of x to t, while maintaining V(y) to be 0 as it was. Hence, thread 1 is still allowed to read 0 from y in the subsequent execution. As thread 2 can be executed in the same way, both threads are allowed to read 0.
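The READ and WRITE rules can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: message views are omitted (all accesses are relaxed), timestamps are rationals, and the names `Msg`, `Thread`, `read`, and `write` are ours rather than part of the PS formalization. The program run at the end is an assumed store-buffering shape in which each thread writes 1 to one location and then reads the other.

```python
from dataclasses import dataclass, field
from fractions import Fraction

ZERO = Fraction(0)

@dataclass(frozen=True)
class Msg:
    loc: str
    val: int
    frm: Fraction  # "from"-timestamp f
    to: Fraction   # "to"-timestamp t; the message occupies (f, t]

def overlaps(m, frm, to):
    # Half-open intervals (a, b] and (c, d] overlap iff a < d and c < b.
    return m.frm < to and frm < m.to

@dataclass
class Thread:
    view: dict = field(default_factory=dict)  # V: location -> timestamp

    def read(self, memory, msg):
        # READ: requires V(x) <= t, then sets V(x) to t.
        if msg not in memory or self.view.get(msg.loc, ZERO) > msg.to:
            raise ValueError("message is not readable by this thread")
        self.view[msg.loc] = msg.to
        return msg.val

    def write(self, memory, loc, val, frm, to):
        # WRITE: requires V(x) < t and an interval (f, t] that is still free.
        if not (frm < to and self.view.get(loc, ZERO) < to):
            raise ValueError("timestamp must exceed the thread's view")
        if any(m.loc == loc and overlaps(m, frm, to) for m in memory):
            raise ValueError("interval overlaps an existing message")
        memory.add(Msg(loc, val, frm, to))
        self.view[loc] = to

init_x = Msg("x", 0, ZERO, ZERO)
init_y = Msg("y", 0, ZERO, ZERO)
memory = {init_x, init_y}
t1, t2 = Thread(), Thread()

t1.write(memory, "x", 1, ZERO, Fraction(1))  # thread 1: x := 1
a = t1.read(memory, init_y)                  # thread 1: a := y
t2.write(memory, "y", 1, ZERO, Fraction(1))  # thread 2: y := 1
b = t2.read(memory, init_x)                  # thread 2: b := x
print(a, b)  # 0 0
```

Both reads succeed because each thread's view of the location it reads is still 0. By contrast, thread 1 can no longer read `init_x` after its own write, since its view of x has moved past timestamp 0.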

Relaxed Atomic Updates. Atomic updates (a.k.a. RMW operations) are essentially a pair of accesses to the same location—a read followed by a write—with an additional atomicity guarantee: the read reads from a message that immediately precedes the one added by the write. PS employs timestamp intervals (rather than single timestamps) to enforce atomicity.

UPDATE. When a thread performs an RMW, it first reads a message ⟨x : v@(f, t], ⊥⟩, and then writes the updated message with “from”-timestamp equal to t, i.e., a message of the form ⟨x : v′@(t, t′], ⊥⟩. This results in consecutive messages (with intervals (f, t] and (t, t′]), forbidding other writes from later being placed between the two messages (recall that messages with the same location must have disjoint timestamp intervals).

In all our code examples, we assume that all memory accesses are relaxed (rlx memory order) unless annotated otherwise.

This constraint, in particular, means that two competing RMWs cannot read from the same message, as the following “parallel increment” example demonstrates.\(^4\)

\[
a := \text{FADD}(x, 1) \quad // 0 \quad \| \quad b := \text{FADD}(x, 1) \quad // 0 \quad \text{(Upd)}
\]

Without loss of generality, suppose that thread 1 executed first. As it performs an RMW operation, it must “attach” the message it adds to an existing message. Since the only existing message in this stage is the initial one \(\langle x : 0@(0, 0], \bot \rangle\), thread 1 will first add a message \(m = \langle x : 1@(0, t], \bot \rangle\) with some \(t > 0\) to the memory. Then, the RMW of thread 2 cannot also read from the initial message because its interval would overlap with the \((0, t]\) interval of \(m\). Therefore, the annotated behavior is forbidden. More abstractly speaking, the timestamp intervals of PS express a dense total order on messages to the same location together with immediate adjacency constraints on this order, which are required for handling RMW operations.
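The adjacency argument can be replayed with a short Python sketch of the UPDATE rule. This is a simplification under our own assumptions: thread views and message views are left out, and `Msg`, `overlaps`, and `fadd` are hypothetical helper names, not definitions from PS.

```python
from fractions import Fraction
from typing import NamedTuple

class Msg(NamedTuple):
    loc: str
    val: int
    frm: Fraction  # "from"-timestamp
    to: Fraction   # "to"-timestamp; the message occupies (frm, to]

def overlaps(m, frm, to):
    # Half-open intervals (a, b] and (c, d] overlap iff a < d and c < b.
    return m.frm < to and frm < m.to

def fadd(memory, read_msg, delta, to):
    """UPDATE: read read_msg and attach the new message at (read_msg.to, to]."""
    frm = read_msg.to
    if not frm < to:
        raise ValueError("the new interval must be non-empty")
    if any(m.loc == read_msg.loc and overlaps(m, frm, to) for m in memory):
        raise ValueError("another message already follows the one being read")
    memory.add(Msg(read_msg.loc, read_msg.val + delta, frm, to))
    return read_msg.val  # FADD returns the value it read

init = Msg("x", 0, Fraction(0), Fraction(0))
memory = {init}

a = fadd(memory, init, 1, Fraction(1))            # thread 1 reads 0, adds x:1@(0, 1]
try:
    b_first = fadd(memory, init, 1, Fraction(1, 2))  # thread 2 also tries to read 0
except ValueError:
    b_first = None                                 # rejected: (0, 1/2] overlaps (0, 1]

m = next(msg for msg in memory if msg.val == 1)
b = fadd(memory, m, 1, Fraction(2))               # thread 2 must read thread 1's message
print(a, b_first, b)  # 0 None 1
```

However small thread 2 chooses its “to”-timestamp, an interval starting at 0 overlaps thread 1's message, so the only remaining option is to read the value 1.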

**Release and Acquire Accesses.** To provide the appropriate semantics to release and acquire accesses, PS uses the message views. Indeed, a release write should transfer the current knowledge of the thread to other threads that read the message by an acquire read. Thus, (i) a release write operation puts the current thread view in the message view of the added message; and (ii) an acquire read operation incorporates the view of the message being read in the thread view (by taking the pointwise maximum).

**READ** is defined the same as before, except that when the thread performs an acquire read, it increases its view to contain not only the (“to”) timestamp of the message read but also the view of that message.

**WRITE** is defined as before, except that release writes record the thread view in the message being added, whereas relaxed writes record the \(\bot\) view.

As a result, the acquiring thread is confined in its future reads at least as much as the releasing thread was confined when it “released” the message being “acquired”. As a simple example, consider the following:

\[
\begin{array}{l||l}
x := 1 & a := y^{\text{acq}} \quad \text{// 1} \\
y^{\text{rel}} := 1 & \text{if } a = 1 \text{ then } b := x \quad \text{// 0}
\end{array}
\qquad \text{(MP)}
\]

Here, if thread 2 reads 1 from \(y\), which is written by thread 1, both threads are synchronized through release and acquire. Thus, thread 2 obtains the knowledge of thread 1, namely its view for \(x\) is increased to include the timestamp of \(x := 1\) of thread 1. Therefore, after reading 1 from \(y\), thread 2 is not allowed to read the initial value 0 from \(x\).
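The view transfer in MP can be sketched in Python as follows. The sketch makes several simplifying assumptions of ours: “from”-timestamps are dropped, a message is a plain tuple, a release write stores the thread view as it stands right after the write, and the function names `join`, `write`, and `read` are illustrative only.

```python
from fractions import Fraction

ZERO = Fraction(0)

def join(v1, v2):
    """Pointwise maximum of two views (dicts from location to timestamp)."""
    return {x: max(v1.get(x, ZERO), v2.get(x, ZERO)) for x in v1.keys() | v2.keys()}

def write(view, loc, val, to, release=False):
    """Return (message, new thread view); a message is (loc, val, to, msg_view)."""
    if not view.get(loc, ZERO) < to:
        raise ValueError("timestamp must exceed the thread's view")
    new_view = {**view, loc: to}
    # Release writes record the thread view; relaxed writes record the bottom view.
    msg_view = dict(new_view) if release else {}
    return (loc, val, to, msg_view), new_view

def read(view, msg, acquire=False):
    """Return (value, new thread view), or raise if the message is stale."""
    loc, val, to, msg_view = msg
    if view.get(loc, ZERO) > to:
        raise ValueError("stale message")
    new_view = join(view, {loc: to})
    if acquire:
        new_view = join(new_view, msg_view)  # incorporate the message view
    return val, new_view

init_x = ("x", 0, ZERO, {})

# Thread 1: x := 1; y^rel := 1
mx, v1 = write({}, "x", 1, Fraction(1))
my, v1 = write(v1, "y", 1, Fraction(1), release=True)

# Thread 2: a := y^acq, then an attempt to read the initial value of x
a, v2 = read({}, my, acquire=True)
try:
    b, _ = read(v2, init_x)
except ValueError:
    b = None  # the initial message of x is no longer readable

print(a, b)  # 1 None
```

If thread 2 reads y with a relaxed access instead (`acquire=False`), its view of x stays at 0 and the initial message of x remains readable.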

Release/acquire RMW operations also transfer thread views via message views as release writes and acquire reads do.

**Promises.** The main novelty of PS lies in the way it enables the reordering of a read followed by a write (of different locations), needed to explain the outcome of the LB program in §1. Thus, besides step-by-step program execution, PS allows threads to non-deterministically promise their future writes. This is done by simply adding a message (whose interval does not overlap with that of any existing message to the same location) to the memory. Later, the execution of write instructions may also fulfill an existing promise (rather than add a message to the memory). Thread promises are kept in the thread state, and removed when the promise is fulfilled. Naturally, at the end of the execution all promises must be fulfilled.

**PROMISE.** At any point, a thread can add a message to both its set of promises and the memory.

**FULFILL.** A thread can fulfill its promise by executing a (non-release) write instruction, by removing a message from the thread’s set of promises. PS does not allow release writes to be promised, i.e., a promise cannot be fulfilled through a release write instruction.

In the LB program above, thread 1 may promise \(y := 1\) at first. This allows thread 2 to read 1 from \(y\) and write it back to \(x\). Then, thread 1 can read 1 from \(x\), which was written by thread 2, and fulfill its promise.

**Certification.** To ensure that promises do not make the semantics overly weak, each sequence of steps by a thread (before “yielding control to the scheduler”) has to be certified: the thread that took the steps should be able to fulfill all its promises when executed in isolation. Indeed, revisiting the LB program above, note that at the point of promising \(y := 1\) (in the very beginning of the run), thread 1 can run and perform \(y := 1\) without any “help” of other threads.

Certification (i.e., the thread-local run fulfilling all outstanding promises of the thread) is necessary to avoid “thin-air reads” as demonstrated by the following variant of LB:

\[
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
y := a & x := b
\end{array}
\qquad \text{(OOTA)}
\]

As every thread simply copies the value it reads, neither thread is supposed to read any value other than 0 from the memory. However, the annotated behavior, often called out-of-thin-air, is allowed in C11 [3]. In PS, if a thread could promise without certification, this behavior would be allowed by the same execution as the one for LB. However, with the certification requirement, thread 1 cannot promise \(y := 1\), as, when running in isolation, thread 1 will only write \(y := 0\).
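The difference between LB and OOTA at the moment of promising can be illustrated with a deliberately coarse Python sketch. It assumes a memory that maps each location to the set of values currently available, ignores timestamps and views, and certifies only against the current memory; the names `lb_thread1`, `oota_thread1`, and `can_certify` are hypothetical.

```python
from itertools import product

def lb_thread1(read):
    a = read("x")
    return [("y", 1)]  # y := 1 does not depend on the value read

def oota_thread1(read):
    a = read("x")
    return [("y", a)]  # y := a copies the value read

def can_certify(thread, memory, promise, reads):
    """Can `thread`, run in isolation on `memory`, perform the promised write?

    `memory` maps each location to the set of values it currently holds and
    `reads` lists the locations the thread reads, in program order.
    """
    for choice in product(*(sorted(memory[loc]) for loc in reads)):
        values = iter(choice)
        writes = thread(lambda loc: next(values))
        if promise in writes:
            return True
    return False

initial_memory = {"x": {0}, "y": {0}}
lb_ok = can_certify(lb_thread1, initial_memory, ("y", 1), ["x"])
oota_ok = can_certify(oota_thread1, initial_memory, ("y", 1), ["x"])
print(lb_ok, oota_ok)  # True False
```

In the initial memory the only value thread 1 can read from x is 0, so the copying thread can only write `("y", 0)` and the promise of y := 1 is rejected.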

PS requires a certification to exist for every future memory (i.e., any memory that extends the current memory). In §3, we explain the reason for this condition and its consequences.

Machine Step. A thread configuration $\langle TS, M \rangle$ can take one of the Read, Write, Update, Promise, and Fulfill steps, denoted by $\langle TS, M \rangle \rightarrow \langle TS', M' \rangle$. In addition, a thread configuration $\langle TS, M \rangle$ is called consistent if for every future memory $M_{future}$ of $M$, there exist $TS'$ and $M'$ such that (where $TS.prm$ denotes the set of outstanding promises in thread state $TS$):

$$\langle TS, M_{future} \rangle \rightarrow^{*} \langle TS', M' \rangle \land TS'.prm = \emptyset$$

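In turn, the machine step, in which some thread $i$ (whose state is $\mathcal{TS}(i)$ in the thread-state map $\mathcal{TS}$) takes one or more steps and ends in a consistent configuration, is defined as follows:

$$\frac{\langle \mathcal{TS}(i), M \rangle \rightarrow^{+} \langle TS', M' \rangle \qquad \langle TS', M' \rangle \text{ is consistent}}{\langle \mathcal{TS}, M \rangle \rightarrow \langle \mathcal{TS}[i \mapsto TS'], M' \rangle}$$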

We note that the machine step is completely thread-local: it is only determined by the local state of the executing thread and the global memory, independently of the other threads’ states. Thread-locality is a key design principle of PS. It is what makes PS conceptually well-behaved, and, technically speaking, it allows one to prove the validity of various local program transformations, which are performed by compilers and/or hardware, using standard thread-local simulation arguments.

To show a concrete example, we list the execution steps of PS leading to the annotated behavior of the LB program (items prefixed with “C” represent certification steps):

1. Thread 1 promises $\langle y : 1@(1,2], \bot \rangle$.
   (C1) Starting from an arbitrary extension of the current memory, thread 1 reads $\langle x : 0@(0,0], \bot \rangle$, the initial message of $x$.
   (C2) Thread 1 fulfills its promise $\langle y : 1@(1,2], \bot \rangle$.
2. Thread 2 reads $\langle y : 1@(1,2], \bot \rangle$.
3. Thread 2 writes $\langle x : 1@(1,2], \bot \rangle$.
4. Thread 1 reads $\langle x : 1@(1,2], \bot \rangle$.
   (C1) Starting from an arbitrary extension of the current memory, thread 1 fulfills its promise $\langle y : 1@(1,2], \bot \rangle$.
5. Thread 1 fulfills its promise $\langle y : 1@(1,2], \bot \rangle$.

DRF-RA Guarantee. We end this introductory section by informally describing DRF-RA, one of the main programming guarantees provided by PS. Generally speaking, DRF guarantees ensure that race-free programs have strong (i.e., more restrictive) semantics. To be more applicable and allow their use without even knowing the weaker semantics, race freedom is checked assuming the strong semantics.

In particular, DRF-RA is focused on release/acquire semantics (RA), and states that: if under RA semantics some program $P$ has no data race involving relaxed accesses (i.e., all races are on rel/acq accesses), then all behaviors that PS allows for $P$ are also allowed for $P$ by the RA semantics. Here, (i) by RA semantics we mean the model obtained from PS by treating all reads as acq reads, all writes as rel writes, and all RMWs as acqrel; and (ii) as PS is an operational model, data-races are naturally defined as states in which two different threads can access the same location and at least one of these accesses is writing.

For example, by analyzing the MP example under RA semantics, one can easily observe that the only race is on the rel/acq accesses to $y$. (Importantly, such analysis safely ignores promises, since these are not allowed under RA.) Then, DRF-RA implies that MP has only RA behaviors. In contrast, in the LB example, non-RA behaviors are possible, and, indeed, under RA semantics, there are races on relaxed accesses (to both $x$ and $y$).

In the sequel, DRF-RA provides us with the main guideline for making sure that our semantics is not overly weak (that is, we exclude any semantics that breaks DRF-RA). DRF-RA also serves as a main step towards “DRF-Lock”, which states that properly locked programs have only sequentially consistent semantics.\(^5\)

3 Problem Overview

As we will shortly demonstrate, the main challenge in PS is to come up with an appropriate thread-local condition for certifying the promises made by a thread. Maintaining thread-locality is instrumental in proving correctness of many compiler transformations, but is difficult to achieve given that promises of different threads may interact.

As we briefly mentioned above, PS requires a certification to exist for any memory that extends the current memory. We start by explaining why certifying promises only from the current memory (without quantifying over all future memories) is not good enough. Kang et al. [12] observed that such a model may deadlock: the promising thread may fail to fulfill its promise because the memory has changed since the promise was made. In this work, we observe that certifying promises only from the current memory has much more severe consequences: it actually breaks the DRF-RA guarantee, as illustrated below:

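$$
\begin{array}{l||l}
a := \text{FADD}^{\text{acqrel}}(x, 1) \quad \text{// 0} & b := \text{FADD}^{\text{acqrel}}(x, 1) \quad \text{// 0} \\
\text{if } a = 0 \text{ then} & \text{if } b = 0 \text{ then} \\
\quad y := 1 & \quad c := y \quad \text{// 1} \\
 & \quad \text{if } c = 1 \text{ then} \\
 & \quad\quad x := 0
\end{array}
\qquad \text{(CDRF)}
$$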

Under RA semantics only one thread can enter the if-branch, and the only race is between the two FADDs. Hence, to maintain DRF-RA, we need to disable the annotated behavior where both threads read 0 from $x$. To prevent this behavior, we need to disallow thread 1 to promise $y := 1$ in the beginning of the run. Indeed, by reading such a promise, thread 2 can write $x := 0$, and then, thread 1 can perform its update to $x$ and fulfill its outstanding promise. However, if we completely ignore the possible interference by other threads, thread 1 may promise \( y := 1 \), as it can be certified in a local run of thread 1 that starts from the initial memory and reads the initial message of \( x \).

\(^5\)The more standard DRF-SC, guaranteeing sequentially consistent semantics when all races (assuming SC semantics) are on SC accesses, is not applicable here since PS lacks SC accesses. The extension of PS with SC accesses is left to future work.

Abstractly, what went wrong is that two threads compete on the same resource (i.e., to perform an RMW reading from the initialization message); one of them makes a promise assuming it will get the resource first but the other thread wins the competition in the actual run. This not only causes deadlock (which is semantically inconsequential), but also breaks DRF-RA.

To address this, PS followed a simple approach: it required that threads certify their promises starting from any extension of the current memory. One such particular extension is the memory that will arise when the required resource is acquired by some other thread. Hence, this condition does not allow threads to promise writes assuming they will win a competition on some resource.

Revisiting CDRF, PS’s certification condition blocks the promise of \( y := 1 \). For example, when certifying against a memory \( M_{\text{future}} \) that, in addition to the initialization messages, consists of a message \( m = \langle x : 42@(0, \_], \_ \rangle \), thread 1 is forced to read from \( m \) when performing its FADD, and cannot fulfill its promise. Since \( M_{\text{future}} \) is a possible future memory of the initial memory, thread 1 cannot promise \( y := 1 \).
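The effect of such a future memory on thread 1 of CDRF can be checked with a small Python sketch. It is a sketch under assumptions of ours: thread and message views are ignored, the unspecified “to”-timestamp of the extra message is arbitrarily set to 1, only this single witness memory is examined (PS quantifies over all future memories), and `rmw_readable` and `thread1_can_fulfill` are hypothetical helper names.

```python
from fractions import Fraction
from typing import NamedTuple

class Msg(NamedTuple):
    loc: str
    val: int
    frm: Fraction  # "from"-timestamp
    to: Fraction   # "to"-timestamp; the message occupies (frm, to]

def rmw_readable(memory, loc):
    """Messages of `loc` that an RMW can still read, i.e. those with room
    for a new message to be attached immediately after them."""
    msgs = [m for m in memory if m.loc == loc]
    return [m for m in msgs
            if not any(n.frm <= m.to < n.to for n in msgs)]

def thread1_can_fulfill(memory):
    """Thread 1 of CDRF: a := FADD(x, 1); if a = 0 then y := 1.
    True if some isolated run from `memory` performs the write y := 1."""
    return any(m.val == 0 for m in rmw_readable(memory, "x"))

init_x = Msg("x", 0, Fraction(0), Fraction(0))
current = {init_x}
# A future memory with a message attached right after the initial one.
future = current | {Msg("x", 42, Fraction(0), Fraction(1))}

ok_current = thread1_can_fulfill(current)
ok_future = thread1_can_fulfill(future)
print(ok_current, ok_future)  # True False
```

Certification from the current memory alone succeeds, which is exactly the behavior that breaks DRF-RA, while the witness future memory forces the FADD to read 42 and so rules out the promise.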

PS’s future memory quantification maintains the thread-locality principle and suffices for establishing DRF-RA. However, next, we demonstrate that this very conservative over-approximation of possible interference is too strong to support global optimizations, and it is also the source of unsoundness of the intended compilation scheme to ARMv8.

Value-Range Analysis. PS does not support global optimizations based on value-range analysis. To see this, consider a variant of the LB-G program above that does not have the redundant store to \( x \) in thread 2 and has a CAS instruction in thread 1.

\[
\begin{array}{l||l}
a := \text{CAS}(x, 0, 1) \quad \text{// 1} & b := y \quad \text{// 1} \\
\text{if } a < 10 \text{ then} & x := b \\
\quad y := 1 &
\end{array}
\qquad \text{(GA)}
\]

In PS, the annotated behavior is disallowed. Indeed, to obtain this behavior, thread 1 has to promise \( y := 1 \). This promise, however, cannot be certified for every future memory \( M_{\text{future}} \). For example, if, in addition to the initialization messages, the future memory \( M_{\text{future}} \) consists of a single message of the form \( \langle x : 57@(0, \_], \_ \rangle \), then the CAS instruction can only read \( 57 \), and the write \( y := 1 \) is not executed. However, by observing the global invariant \( x < 10 \land y < 10 \), a global compiler analysis may transform this program to the following:

\[
\begin{array}{l||l}
a := \text{CAS}(x, 0, 1) \quad \text{// 1} & b := y \quad \text{// 1} \\
y := 1 & x := b
\end{array}
\qquad \text{(GA)}
\]

Now, the annotated behavior is allowed (the promise \( y := 1 \) is not blocked anymore), rendering the optimization unsound. This is particularly unsatisfying because PS ensures that \( x < 10 \) is globally valid in this program (via its “invariant logic” [12, §5.5]), but does not allow an optimizing compiler to make use of this fact.

Register Promotion. A similar problem arises for a different kind of global optimization, namely register promotion:

\[
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
c := \text{FADD}(z, a) \quad \text{// 0} & x := b \\
y := 1 + c &
\end{array}
\qquad \text{(RP)}
\]

PS disallows the annotated behavior. Again, thread 1 cannot promise \( y := 1 \), since an arbitrary future memory may not allow it to read \( z = 0 \) when performing the RMW. (Note also that the RMW writing \( z := 1 \) cannot be promised before \( y := 1 \), since it requires reading \( x := 1 \) first.) Nevertheless, a global compiler analysis may notice that \( z \) is a local variable in the source program, and perform register promotion, replacing \( c := \text{FADD}(z, a) \) with \( c := 0 \) (since this FADD always returns 0). Now, PS allows the annotated behavior (nothing blocks the promise \( y := 1 \)), rendering register promotion unsound.

Unsound Compilation Scheme to ARMv8. A different problem in PS, found while formally establishing the correctness of compilation to ARMv8 [19], is that the intended mapping of RMWs to ARMv8 is broken. In fact, this problem stems from the exact same reason as the two problems above. While PS disallows the annotated behavior of the RP program above, when following the intended mapping to ARMv8 [6], ARMv8 allows the annotated behavior for the target program. Roughly speaking, although the instructions cannot be reordered at the source level, they can be reordered at the micro-architecture level. FADD is effectively turned into two special instructions, a load exclusive followed by a store exclusive. Since there is no dependency between the load of \( x \) and the exclusive load of \( z \), the two loads could be executed out of order. Similarly, the two stores could be executed out of order, and so the store to \( y \) could effectively be executed before the load of \( x \), which in turn leads to the annotated behavior.

**What went wrong?** These three problems all arise because PS’s certification requirement against every memory extension is overly conservative in approximating the interference by other threads. The challenge lies in relaxing this condition in a way that will ensure the soundness of global optimizations while maintaining thread-locality.

As CDRF shows, simply relaxing the certification requirement by requiring certification only against the current memory is not an option. Another naive remedy would be to restrict the certification to extensions of the current memory that can actually arise in the given program. This approach, however, is bound to fail:

\footnote{Here the fact that no other thread accesses \( z \) is immaterial. ARMv8 allows this behavior also when, say, a third thread executes \( z := 5 \).}
First, it does not combine well with local optimizations. Consider the following variant of GA:

\[
\begin{array}{l||l}
a := \text{CAS}(x, 0, 1) & x := 42 \\
\text{if } a < 10 \text{ then } y := 1 & b := y \\
& x := b
\end{array}
\quad (\text{GA+E})
\]

Here, \( x := 42 \) occurs in a possible future memory, but a compiler may soundly eliminate this write.

Second, this approach is not thread-local, and, since other threads may promise as well, it immediately leads to troublesome cyclic reasoning: whether thread 1 may promise a write depends on the behavior of thread 2, which may include promise steps that again depend on the behavior of thread 1.

## 4 Solution Overview

In this section, we present the key ideas behind our modified PS model, which we call PS 2.0. Section 4.1 describes the notion of capped memory, which enables value-range analysis, while §4.2 discusses reservations, an additional mechanism needed to support register promotion and recover the correctness of the mapping to ARMv8. Section 4.3 discusses our modeling of undefined behavior, which we use to formally define global optimizations, and §4.4 discusses relaxed RMWs in certifications.

### 4.1 Capped Memory

We note that PS’s certification against every memory extension is quantifying over two aspects of possible interference: message values and message views.

We observe that quantifying only over message views suffices for DRF-RA. By carefully analyzing CDRF, we can see that for DRF-RA, one has to make sure that during the certification of promises, no acquire-release RMW reads from a message that already exists in the memory. Indeed, (i) due to interference by other threads, such RMW may not have the opportunity to read from that message in the actual run; and (ii) such racy RMWs may exist (the DRF-RA assumption does not prevent them). Together, (i) and (ii) invalidate the DRF-RA guarantee (as happens in CDRF). We observe here that this is the only role of the future memory quantification that is required for ensuring DRF-RA.

The conservative future memory quantification of PS indeed disallows such problematic RMWs during certification. In fact, even certification against memory extensions that do not introduce new values in the future memory suffices for DRF-RA. For example, in CDRF, when certifying against a memory \( M_{\text{future}} \) that, in addition to the initialization messages, has a message of the form \( m = \langle x : 0@(0, \_], R \rangle \) with \( R(y) \geq t \), thread 1 is forced to read \( m \) when performing its FADD. Since it is an acquire FADD, it will increase the thread view of \( y \) to \( R(y) \), which will not allow it to fulfill its promise. More generally, when a thread promises a message of the form \( \langle x : v@(f, t], V \rangle \) in the current memory \( M \), there is always a possible memory extension \( M_{\text{future}} \) of \( M \) that forces (non-promised) RMWs of location \( y \) performed during certification (which read from a message in \( M_{\text{future}} \)) to read from a specific message \( m_{\text{future}}^y \in M_{\text{future}} \) whose view of \( x \) is greater than or equal to \( t \). When such RMWs are acquire RMWs, this will force the thread to increase its view of \( x \) to at least \( t \), which, in turn, does not allow the thread to fulfill its promise.

**Remark 1.** Completely disallowing release-acquire RMWs during certification is too strong. We should allow them to read from local writes added during certification, since no other thread can prevent them from doing so.

We further observe that value-range analysis concerns message values, but it is insensitive to message views. As we saw for the GA program above, the conservative future memory quantification of PS is doing too much: it forbids any promise that depends on the value read by an RMW, which invalidates value-range analysis. However, we note that there is no problem in disallowing the following variant of GA that uses an acquire CAS instead of a relaxed one:

\[
\begin{array}{l||l}
a := \text{CAS}_{\text{acq}}(x, 0, 1) & b := y \\
\text{if } a < 10 \text{ then } y := 1 & x := b
\end{array}
\quad (\text{GAacq})
\]

Although value analysis may deduce that \( a < 10 \) is always true, it cannot justify the reordering of \( a := \text{CAS}_{\text{acq}}(x, 0, 1) \) and \( y := 1 \), since acquire accesses in general cannot be reordered with subsequent accesses. In other words, an analysis that is based solely on values does not give any information about the views of read messages, so any optimization based on such an analysis cannot enable the reordering of acquire RMWs.

Based on these observations, it seems natural to replace the conservative future memory quantification of PS with a requirement to certify against all extensions of the current memory \( M \) that employ values that already exist in \( M \) (for each location). While this approach makes value-range analysis sound and maintains DRF-RA, it is still too strong for the combination of local and global optimizations. Indeed, consider the following variant of the GA+E program above.

\[
\begin{array}{l||l||l}
x := 42 & f := \text{flag} & b := y \\
x := 0 & \text{if } f = 1 \text{ then} & x := b \\
\text{flag} :=_{\text{rel}} 1 & \quad a := \text{CAS}(x, 0, 1) & \\
& \quad \text{if } a < 10 \text{ then } y := 1 &
\end{array}
\quad (\text{GA+E'})
\]

In order for thread 2 to promise \( y := 1 \), the write to \( \text{flag} \) has to be executed first. (Note that a release write cannot be promised.) Therefore, the value 42 for \( x \) exists in memory when the promise \( y := 1 \) is made, but, to support both the elimination of overwritten values and global value analysis, \( x := 42 \) should not be considered as a possible extension of the current memory. We observe that it is enough, however, to consider memory extensions whose additional messages only use the values of the maximal messages (those not yet overwritten) to each location.

Now, instead of quantifying over a restricted set of memory extensions, we identify the most restrictive such extension, which we call the “capped memory”. This leads to a conceptually simpler certification condition, where certification is needed only against one particular memory, which is uniquely determined by the current memory. The capped memory $\hat{M}$ of a memory $M$ is obtained by:

- Filling all “gaps” between existing messages so that non-promised RMWs can only read from the maximal message of the relevant location. In other words, for every two messages $m_1 = \langle x : \_@(\_, t], \_ \rangle$ and $m_2 = \langle x : \_@(f, \_], \_ \rangle$ with $t < f$ and no message in between, we block the space between $t$ and $f$. (The exact mechanism to achieve this, “reservations”, is discussed in §4.2.)
- For every location $x$, attaching a “cap message” $\hat{m}_x$ with a globally maximal view to the latest message to $x$ in $M$:

$$\hat{m}_x = \langle x : \hat{v}_x@(\hat{t}_x, \hat{t}_x + 1], \hat{\lambda}_x \rangle$$

where $\hat{t}_x$ and $\hat{v}_x$ are the “to”-timestamp and the value of the message to $x$ in $M$ with the maximal “to”-timestamp, and $\hat{\lambda}_x$ is given by:

$$\hat{\lambda}_x = \lambda y. \max \{ t \mid \langle y : \_@(\_, t], \_ \rangle \in M \}.$$

Fig. 1 depicts an example of the capped memory construction. The shaded area in $\hat{M}$ represents the blocked space.

Starting from $\hat{M}$, any (non-promised) RMWs reading from a message in $\hat{M}$ are forced to read from the $\hat{m}_x$ messages (since the timestamp interval $[0, \hat{t}_x]$ is completely occupied). Because these messages carry maximal views, acquire RMWs reading from them cannot be executed during certification, as it will increase the thread view to $\hat{\lambda}_x$, which, in turn, will prevent the thread from fulfilling its outstanding promises.
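The two-step construction above can be sketched in Python as follows. This is only an illustrative model, not the paper's formalization: the `Msg` record, the `Fraction` timestamps, and the function name `capped_memory` are assumptions of the sketch, and the choice of $\hat{t}_x + 1$ as the cap's “to”-timestamp follows the formal definition in §5. Reservations already present in the memory, and the per-thread exception discussed in §4.2, are deliberately ignored.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional


@dataclass(frozen=True)
class Msg:
    loc: str
    frm: Fraction                 # "from"-timestamp (exclusive end of the interval)
    to: Fraction                  # "to"-timestamp (inclusive end of the interval)
    val: Optional[int]            # None marks a blocked interval (hypothetical encoding)
    view: Optional[tuple] = None  # message view as sorted (loc, timestamp) pairs


def capped_memory(mem):
    """Return mem extended with gap fillers and one cap message per location."""
    locs = sorted({m.loc for m in mem})
    # Globally maximal view: the latest "to"-timestamp of every location.
    cap_view = tuple((x, max(m.to for m in mem if m.loc == x)) for x in locs)
    out = list(mem)
    for x in locs:
        msgs = sorted((m for m in mem if m.loc == x), key=lambda m: m.to)
        # Step 1: block every gap between adjacent messages to x.
        for m1, m2 in zip(msgs, msgs[1:]):
            if m1.to < m2.frm:
                out.append(Msg(x, m1.to, m2.frm, None))
        # Step 2: attach a cap message directly after the latest message to x.
        last = msgs[-1]
        out.append(Msg(x, last.to, last.to + 1, last.val, cap_view))
    return out


F = Fraction
example = [Msg("x", F(0), F(0), 0), Msg("x", F(2), F(3), 42), Msg("y", F(0), F(0), 0)]
capped = capped_memory(example)
```

In this example the interval between the initial message to `x` and the message carrying 42 is blocked, so an RMW on `x` that reads from a message of the capped memory can only attach after the cap message, whose view is maximal for both locations.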

In turn, the new machine step is simplified as follows:

$$\frac{\langle TS(i), M \rangle \rightarrow^{+} \langle TS', M' \rangle \qquad \exists TS''.\ \langle TS', \widehat{M'} \rangle \rightarrow^{*} \langle TS'', \_ \rangle \land TS''.\text{prm} = \emptyset}{\langle TS, M \rangle \rightarrow \langle TS[i \mapsto TS'], M' \rangle}$$

Since the capped memory is one possible future memory, the semantics we obtain is weaker than PS. It is (i) weak enough to allow the annotated behaviors of GA and RP above: certification against the capped memory will not lead to $a \geq 10$ in GA or to $c \neq 0$ in RP; and, on the other hand, (ii) strong enough to forbid the annotated behavior of CDRF above: certification against the capped memory will not allow the $y := 1$ promise. In particular, because the capped memory is constructed from the maximal messages, thread 2 of GA+E' can promise $y := 1$ and certify it while the message $x := 42$ (which is overwritten by $x := 0$) is in the memory.

**Remark 2.** The original PS quantification over all future memories could equivalently quantify over all memories defined just like the capped memory, except for using arbitrary values for the cap messages. Capped memory is more than that: it sets the value of each cap message to that of the corresponding maximal message.

### 4.2 Reservations

While capped memory suffices for justifying the weak outcomes of the examples seen so far, it is still too strong to support register promotion and to validate the intended mapping to ARMv8. Consider the following variant of RP that uses an acquire RMW in thread 1.

$$
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
c := \text{FADD}^{\text{acq}}(z, a) \quad \text{// 0} & \\
y := 1 &
\end{array}
\quad (\text{RPacq})
$$

The weakening of PS presented in §4.1 disallows the annotated behavior. Thread 1 cannot promise $y := 1$ because its certification has to execute a non-promised acquire RMW reading from an existing message against the capped memory; and also it cannot promise the RMW $z := 1$ before $y := 1$ because its certification requires reading $x := 1$. Nevertheless, as for RP, a global analysis may notice that $z$ is accessed only by one thread and perform register promotion, yielding the annotated outcome. (Similarly, ARMv8 allows the annotated behavior of the corresponding target program.)

We note that the standard (Java) optimization of removing locks used by only one thread requires performing register promotion on local locations accessed by acquire RMWs. Indeed, lock acquisitions are essentially acquire RMWs.

So, how can we allow such behaviors without harming DRF-RA? Our idea here is to enhance PS by allowing one to declare which thread will win the competition to perform an RMW reading from a given message $m$. Once such a declaration is made, RMWs performed by other threads cannot read from $m$.

The technical mechanism for these declarations is simple: we add a “reservation” step to PS, allowing a thread to reserve a timestamp interval that it plans to use later, without committing to how it will use it (what value and view will be picked). Once an interval is reserved, other threads are blocked from reusing timestamps in this interval. Intuitively, a reservation corresponds to promising the “read part” of the RMW, which confines the behavior of other threads. In particular, if a thread reserves an interval $(t_1, t_2]$ attached to some message \((f, t_1]\), then other threads cannot read from the \((f, t_1]\) message with an RMW operation.

Since reservations are included in the machine memory (just like normal writes and promises), the semantics remains thread-local. Technically, reservations take the form \(\langle x : (f, t] \rangle\) where \(x \in \text{Loc}\) and \((f, t]\) is a timestamp interval. To meet their purpose, we allow attaching reservations only immediately after existing concrete messages (\(f\) should be the "to"-timestamp of some existing message to the same location). Threads are also allowed to cancel their reservations (provided they can still certify their outstanding promises) if they no longer need to block an interval. This is technically needed for the soundness of register promotion (see [1, §B]).

Returning to the RPacq program above, reservations allow the annotated outcome. Thread 1 can first reserve the interval \((0, 1]\) for \(z\). Then, it can promise \(y := 1\) and certify its promise by using its own reservation to perform the RMW.

Intuitively, reservations are closer to the implementation of RMWs in ARM: reserving the read part of an RMW first and then writing the RMW at the reserved space later corresponds to the execution of a load exclusive first and a (successful) store exclusive later.

Reservations are also used in the definition of the capped memory to fill the gaps between messages to the same location (§4.1). In the presence of reservations, however, the capped memory definition requires some care. First, the value of the cap message \(\hat{m}_x\) should be the value of the maximal concrete message to \(x\) (reservations do not carry values). Second, when constructing the capped memory for thread \(i\), if the maximal message to some location \(y\) is a reservation of thread \(i\) itself, then we do not add a cap message for \(y\). In effect, during certification, the thread can execute any RMW on \(y\) but only after filling the reserved space on \(y\). Other threads cannot execute an RMW on reservations of thread \(i\), and so cannot interfere with respect to \(y\).

### 4.3 Undefined Behavior

So far, we have described value-range optimizations by informally referring to a global analysis performed by the compiler. For our formal development, we introduce undefined behavior (UB). We note that UB, which is not supported in the original PS model, is also useful in a broader context (e.g., to give sensible semantics to expressions like \(x/0\)).

In order to formally define global optimizations, we include in our language an abort instruction, abort, which causes UB. In turn, for a global invariant \(I\) (formally defined in §6.2), we allow the program transformation that introduces, at arbitrary program points, the instruction assert\((I)\), which is syntactic sugar for if \(\neg I\) then abort. This paves the way to further local optimizations, such as:

\[
\begin{align*}
\text{assert}(x \in \{0, 1\}) & \quad a := x \\
& \quad \text{if } a \in \{0, 1\} \text{ then } c \\
& \quad b := c
\end{align*}
\]

The standard semantics of UB is “catch-fire”: UB should be thought of as allowing any arbitrary sequence of operations. This enables common compiler optimizations (e.g., if \(e\) then \(c\) else abort \(\leadsto c\)). Nevertheless, to make sure the semantics is not overly weak, the certification condition has to be satisfied for taking an abort-step, as for any other thread step (where the certifying thread may replace abort by any sequence of operations).

Our formal condition for taking an abort-step is somewhat simpler: we require that for every location \(x\), the current view of the aborting thread for \(x\) be lower than the "to"-timestamp of all the outstanding promises for \(x\) of that thread. We say a thread is promise-consistent when this condition is met. Recall that a thread can take a write step to a location \(x\) when the thread view of \(x\) is lower than the "to"-timestamp of the writing message. In turn, considering that taking an abort-step is capable of executing arbitrary write instructions, a thread is able to fulfill its outstanding promises when aborting if and only if it is promise-consistent.
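The promise-consistency condition described above can be checked with a few lines of Python. This is a minimal sketch under assumed encodings: a thread view is a dictionary from locations to timestamps (missing locations default to 0), and each outstanding promise is represented only by its location and “to”-timestamp. The name `promise_consistent` is hypothetical, and the strict comparison follows the wording “lower than” used in the text.

```python
def promise_consistent(view, promises):
    """Check that the thread view of every promised location is strictly
    below the "to"-timestamp of each outstanding promise to that location.

    view:     dict mapping location -> timestamp (assumed encoding)
    promises: iterable of (location, to_timestamp) pairs (assumed encoding)
    """
    return all(view.get(loc, 0) < to for loc, to in promises)


# A thread whose view of x is 1 can still write its promise at timestamp 2.
ok = promise_consistent({"x": 1}, [("x", 2), ("y", 1)])
# Once its view of x has reached 2, the promise at timestamp 2 is unfulfillable.
stuck = promise_consistent({"x": 2}, [("x", 2)])
```

Under this reading, an abort-step is enabled in the first situation and blocked in the second.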

### 4.4 Relaxed RMWs in Certifications

In PS 2.0, we opted to allow relaxed RMWs (that were non-promised before and read from a message that exists in the current memory) during certification of promises. This design choice can cause execution deadlocks:

\[
\begin{align*}
a & := \text{FADD}(x, 1) \quad \text{// 0} \\
y & := 1 + a
\end{align*}
\]

Suppose that at the beginning of the run thread 1 promises \(y := 1\). This promise can be certified against the capped memory by reading from the cap message of \(x\) (whose value is 0). Now, thread 2 can perform its RMW and block thread 1 from fulfilling its promise. Although allowing such deadlocks is awkward, they are inconsequential, since deadlocked runs are discarded from the definition of observable behavior.

Similarly, this choice enables somewhat dubious behaviors that seem to invalidate the atomicity of relaxed RMWs: for instance, CDRF can have the annotated behavior if one FADD is made rlx. Such behaviors are actually unavoidable if one insists on allowing all (local and global) optimizations allowed by PS 2.0 ([1, §C] provides an example).

A stronger alternative would be to disallow relaxed RMWs during certification unless they were promised before the certification, or they read from a message that is added to the memory during certification. This can be easily achieved by defining the capped memory (against which threads certify their promises) to include a reservation instead of a cap message, which prevents reading from cap messages during certification. The resulting model is deadlock-free and it supports all (global and local) optimizations supported by PS 2.0, except for the local reordering of a relaxed RMW followed by a write. To see this, consider the following example:

\[
\begin{array}{l||l}
a := \text{FADD}(x, 1) \quad \text{// 1} & b := y \quad \text{// 1} \\
y := 1 & x := b
\end{array}
\quad (\text{LB-RMW})
\]

To read the annotated values, the run must start with thread 1 promising \( y := 1 \). Such a promise can only be certified if we allow relaxed RMWs that read an existing message during certification. Nevertheless, reordering the two instructions in thread 1 clearly exhibits the annotated behavior. In particular, since ARMv8 performs such reorderings, the mapping to ARMv8 should always include a dependency from relaxed RMWs, thereby incurring some (probably small) overhead.

## 5 Formal Model

In this section, we present our formal model, called PS 2.0, which combines and makes precise the ideas outlined above. For simplicity, we omit some features that were included in PS (plain accesses, fences, release sequences, and split and lower of promises).\(^7\) All of these features are handled just like in PS and are included in our Coq formalization. The full operational semantics and the programming language are presented in [1, §A].

To keep the presentation simple and abstract, we do not fix a particular programming language syntax, and rather assume that the thread semantics is already provided as a labeled transition system, with transition labels \( \text{Silent} \) for a silent thread transition with no memory effect, \( \text{R}(o, x, v) \) for reads, \( \text{W}(o, x, v) \) for writes, \( \text{U}(o_r, o_w, \ldots) \) for RMWs, \( \text{Fail} \) for failing assertions, and \( \text{Sys}(o) \) for system calls.

The \( o, o_r, o_w \) variables denote access modes, which can be either \( \text{rlx} \) or \( \text{ra} \). We use \( \text{ra} \) for both release and acquire, and include two access modes in RMW labels: a read mode and a write mode. These naturally encode the syntax of the examples we discussed above, e.g.,

\[
\text{FADD} \rightarrow \text{U}(\text{rlx}, \text{rlx}, ...), \quad \text{FADD}^{\text{acq}} \rightarrow \text{U}(\text{ra}, \text{rlx}, ...)
\]

\[
\text{FADD}^{\text{acqrel}} \rightarrow \text{U}(\text{ra}, \text{ra}, ...), \quad \text{FADD}^{\text{rel}} \rightarrow \text{U}(\text{rlx}, \text{ra}, ...).
\]
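The encoding of RMW variants into pairs of access modes can be written out as a small Python helper. The function name `rmw_modes` and the string constants are hypothetical; only the mapping itself (acquire sets the read mode to `ra`, release sets the write mode to `ra`) is taken from the table above.

```python
RLX, RA = "rlx", "ra"


def rmw_modes(acquire=False, release=False):
    """Return the (read mode, write mode) pair of an RMW label U(o_r, o_w, ...)."""
    return (RA if acquire else RLX, RA if release else RLX)


fadd = rmw_modes()                                  # plain FADD
fadd_acq = rmw_modes(acquire=True)                  # FADD^acq
fadd_rel = rmw_modes(release=True)                  # FADD^rel
fadd_acqrel = rmw_modes(acquire=True, release=True) # FADD^acqrel
```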

Next, we present the components of the PS 2.0 model.

### Time

Time is a set of timestamps that is totally and densely ordered by \( < \) with a minimum value, denoted by \( 0 \).

### Views

A view is a function \( V \in \text{View} \triangleq \text{Loc} \rightarrow \text{Time} \). We use \( \bot \) and \( \sqcup \) to denote the natural bottom element and join operation for views (pointwise extensions of the timestamp 0 and the max operation on timestamps).
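As a small illustration, views with their bottom element and join can be modeled by Python dictionaries. The dictionary encoding and the names `bottom` and `join` are assumptions of this sketch; locations absent from a dictionary are treated as mapped to timestamp 0.

```python
def bottom(locs):
    """The bottom view: every location is mapped to timestamp 0."""
    return {x: 0 for x in locs}


def join(v1, v2):
    """Pointwise maximum of two views (missing locations count as 0)."""
    return {x: max(v1.get(x, 0), v2.get(x, 0)) for x in v1.keys() | v2.keys()}


joined = join({"x": 1, "y": 0}, {"x": 0, "y": 3})
```

Joining with the bottom view leaves a view unchanged, which is what makes it the unit of the join.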

### Concrete Messages

A concrete message takes the form \( m = \langle x : v@(f, t], R \rangle \) where \( x \in \text{Loc} \), \( v \in \text{Val} \), \( f, t \in \text{Time} \), and \( R \in \text{View} \), such that \( f < t \) or \( f = t = 0 \), and \( R(x) \leq t \). We denote by \( m.\text{loc} \), \( m.\text{val} \), \( m.\text{from} \), \( m.\text{to} \), and \( m.\text{view} \) the components of \( m \).

### Reservations

A reservation takes the form \( m = \langle x : (f, t] \rangle \), where \( x \in \text{Loc} \), and \( f, t \in \text{Time} \) such that \( f < t \). We denote by \( m.\text{loc} \), \( m.\text{from} \), and \( m.\text{to} \) the components of \( m \).

### Messages

A message is either a concrete message or a reservation. Two messages \( m_1 \) and \( m_2 \) are disjoint, denoted by \( m_1 \mathrel{\#} m_2 \), if they have different locations or disjoint timestamp intervals:

\[
m_1 \mathrel{\#} m_2 \triangleq m_1.\text{loc} \neq m_2.\text{loc} \lor m_1.\text{to} \leq m_2.\text{from} \lor m_2.\text{to} \leq m_1.\text{from}
\]

Two sets \( M_1 \) and \( M_2 \) of messages are disjoint, denoted by \( M_1 \mathrel{\#} M_2 \), if \( m_1 \mathrel{\#} m_2 \) for every \( m_1 \in M_1 \) and \( m_2 \in M_2 \).

### Memory

A memory \( M \) is a (nonempty) pairwise disjoint finite set of messages. We write \( M(x) \) for the sub-memory \( \{ m \in M \mid m.\text{loc} = x \} \) and \( \hat{M} \) for the set \( \{ m \in M \mid m = \langle \_ : \_@(\_, \_], \_ \rangle \} \) of concrete messages in \( M \).

### Memory Operations

A memory \( M \) supports the insertion of a message \( m \), denoted by \( M \leftrightarrow m \) and given by \( M \cup \{ m \} \). It is only defined if: (i) \( \{ m \} \) is disjoint from \( M \); (ii) if \( m \) is a concrete message with \( m.\text{loc} = x \), then no message \( m' \in M(x) \) has \( m'.\text{from} = m.\text{to} \); and (iii) if \( m \) is a reservation with \( m.\text{loc} = x \), then there is some concrete message \( m' \in M(x) \) such that \( m'.\text{to} = m.\text{from} \). Note that the second condition enforces that once a message is not an RMW (i.e., its “from”-timestamp is not attached to another message), it never becomes an RMW (i.e., its “from”-timestamp remains detached). Technically, this condition is required for the soundness of register promotion.
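The three side conditions on insertion can be made concrete with a small Python sketch. The `Msg` record, integer timestamps, and the names `disjoint` and `insert` are assumptions of the sketch, and message views are omitted. Intervals are read as half-open, $(f, t]$, so two intervals that merely share an endpoint count as disjoint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Msg:
    loc: str
    frm: int                   # "from"-timestamp (exclusive)
    to: int                    # "to"-timestamp (inclusive)
    val: Optional[int] = None  # None encodes a reservation (assumed encoding)

    @property
    def concrete(self):
        return self.val is not None


def disjoint(m1, m2):
    """Different locations, or non-overlapping half-open intervals (f, t]."""
    return m1.loc != m2.loc or m1.to <= m2.frm or m2.to <= m1.frm


def insert(mem, m):
    """Return mem with m added, or None when the insertion is undefined."""
    same = [m2 for m2 in mem if m2.loc == m.loc]
    if not all(disjoint(m, m2) for m2 in same):
        return None  # condition (i): m overlaps an existing message
    if m.concrete and any(m2.frm == m.to for m2 in same):
        return None  # condition (ii): would turn a non-RMW message into an RMW
    if not m.concrete and not any(m2.concrete and m2.to == m.frm for m2 in same):
        return None  # condition (iii): a reservation must follow a concrete message
    return mem | {m}


init = frozenset({Msg("x", 0, 0, 0)})
reserved = insert(init, Msg("x", 0, 1))          # reservation attached to the initial message
blocked = insert(reserved, Msg("x", 0, 1, 1))    # overlaps the reservation
detached = insert(init, Msg("x", 2, 3, 7))       # a write that is not an RMW
late_attach = insert(detached, Msg("x", 1, 2, 5))  # would attach in front of it
```

The last call fails because of condition (ii): the message at $(2, 3]$ was inserted as a non-RMW, and inserting a message ending at timestamp 2 would retroactively make it look like one.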

### Closed View

Given a view \( V \) and a memory \( M \), we write \( V \in M \) if, for every \( x \in \text{Loc} \), we have \( V(x) = m.\text{to} \) for some concrete message \( m \in \hat{M}(x) \).

### Thread States

A thread state is a triple \( TS = \langle \sigma, V, P \rangle \), where \( \sigma \) is a local state, \( V \) is a thread view, and \( P \) is a memory. We denote by \( TS.\text{st} \), \( TS.\text{view} \), and \( TS.\text{prm} \) the components of a thread state \( TS \).

### Thread Configuration Steps

A thread configuration is a pair \( (TS, M) \), where \( TS \) is a thread state and \( M \) is a memory. We use \( \perp \) as a thread configuration after a failure.

Fig. 2 presents the full list of thread configuration steps, which we discuss now. To avoid repetition, we use the helpers read-helper and write-helper. In these helpers, \( \{ x@t \} \) denotes the view assigning \( t \) to \( x \) and \( 0 \) to other locations.

### Promise

A thread can take a promise-step by adding a concrete message \( m \) to the set of outstanding promises \( P \) and updating the memory \( M \) to \( M \leftrightarrow m \).

### Reserve and Cancel

These two steps are specific to the PS 2.0 model. In a reserve-step, a thread reserves a timestamp interval by adding it to both the memory \( M \) and the set of outstanding promises \( TS.\text{prm} \). The thread is allowed to drop the reservation from the set of outstanding promises and the memory using the cancel-step.

### Read

In this step, a thread reads the value of a location \( x \) from a message \( m \in M \) and extends its view. Following the read-helper, the thread’s view of location \( x \) is extended to
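Machine Steps:

\[
\text{(MACHINE NORMAL)} \quad \frac{\langle TS(i), M \rangle \rightarrow^{*} \langle TS', M' \rangle \qquad \langle TS', M' \rangle \text{ is consistent}}{\langle TS, M \rangle \rightarrow \langle TS[i \mapsto TS'], M' \rangle}
\]

\[
\text{(MACHINE CALL)} \quad \frac{\langle TS(i), M \rangle \rightarrow^{*} \langle TS', M' \rangle \qquad \langle TS', M' \rangle \text{ is consistent}}{\langle TS, M \rangle \rightarrow \langle TS[i \mapsto TS'], M' \rangle}
\]

\[
\text{(MACHINE FAIL)} \quad \frac{\langle TS(i), M \rangle \rightarrow \perp}{\langle TS, M \rangle \rightarrow \perp}
\]

Thread Steps:

\[
\text{(PROMISE)} \quad \frac{m = \langle \_ : \_@(\_, \_], R \rangle \qquad M' = M \leftrightarrow m \qquad R \in M'}{\langle \langle \sigma, V, P \rangle, M \rangle \rightarrow \langle \langle \sigma, V, P \cup \{m\} \rangle, M' \rangle}
\]

\[
\text{(READ)} \quad \frac{\sigma \xrightarrow{\text{R}(o, x, v)} \sigma' \qquad m = \langle x : v@(\_, \_], \_ \rangle \in M \qquad V' \text{ is obtained from } V \text{ by read-helper on } m}{\langle \langle \sigma, V, P \rangle, M \rangle \rightarrow \langle \langle \sigma', V', P \rangle, M \rangle}
\]

\[
\text{(SILENT)} \quad \frac{\sigma \xrightarrow{\text{Silent}} \sigma'}{\langle \langle \sigma, V, P \rangle, M \rangle \rightarrow \langle \langle \sigma', V, P \rangle, M \rangle}
\]

\[
\text{(FAILURE)} \quad \frac{\sigma \xrightarrow{\text{Fail}} \perp \qquad \langle \sigma, V, P \rangle \text{ is promise-consistent}}{\langle \langle \sigma, V, P \rangle, M \rangle \rightarrow \perp}
\]
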

**Failure.** We only allow a thread configuration \( \langle TS, M \rangle \) to fail if \( TS \) is promise-consistent:

\[ \forall m \in TS.\text{prm}.\ TS.\text{view}(m.\text{loc}) < m.\text{to} \]

**Cap View and Messages.** The last message of a memory \( M \) to a location \( x \), denoted by \( \overline{m}_{M,x} \), is given by:

\[ \overline{m}_{M,x} = \arg\max_{m \in M(x)} m.\text{to} \]

The cap view of a memory \( M \), denoted by \( \widehat{V}_M \), is given by:

\[ \widehat{V}_M = \lambda x.\ \overline{m}_{M,x}.\text{to} \]

By definition, we have \( \widehat{V}_M \in M \). The cap message of a memory \( M \) to a location \( x \), denoted by \( \widehat{m}_{M,x} \), is given by:

\[ \widehat{m}_{M,x} = \langle x : \overline{m}_{M,x}.\text{val}@(\overline{m}_{M,x}.\text{to},\ \overline{m}_{M,x}.\text{to} + 1],\ \widehat{V}_M \rangle \]

**Capped Memory.** The capped memory of a memory \( M \) with respect to a set of promises \( P \), denoted by \( \check{M}_P \), is an extension of \( M \), constructed in two steps:

1. For every \( m_1, m_2 \in M \) with \( m_1.\text{loc} = m_2.\text{loc} \), \( m_1.\text{to} < m_2.\text{to} \), and no message \( m' \in M(m_1.\text{loc}) \) such that \( m_1.\text{to} < m'.\text{to} < m_2.\text{to} \), we include a reservation \( \langle m_1.\text{loc} : (m_1.\text{to}, m_2.\text{from}] \rangle \) in \( \check{M}_P \).

2. We include a cap message \( \widehat{m}_{M,x} \) for every location \( x \) unless \( \overline{m}_{M,x} \) is a reservation in \( P \).

**Consistency.** A thread configuration \( \langle TS, M \rangle \) is called consistent if there exist \( TS', M' \) such that:
\[
\langle TS, \check{M}_{TS.\text{prm}} \rangle \rightarrow^* \langle TS', M' \rangle \land TS'.\text{prm} = \emptyset
\]

**Machine steps.** A machine state is a pair \( MS = \langle TS, M \rangle \) consisting of a function \( TS \) assigning a thread state to every thread, and a memory \( M \). The initial state \( MS^0 \) (for a given program) consists of the function \( TS^0 \) mapping each thread \( i \) to its initial state \( \sigma_i^0 \), the \( \perp \) thread view (all timestamps are 0), and an empty set of promises; and the initial memory \( M^0 \) consisting of one message \( \langle x : 0@(0, 0], \perp \rangle \) for each location \( x \). The three possible machine steps are given at the bottom of Fig. 2. We use \( \perp \) as a machine state after a failure.

**Behaviors.** To define what is externally observable during executions of a program \( P \), we use the system calls that \( P \)'s executions perform. More precisely, every execution induces a sequence of system calls, and the set of behaviors of \( P \), denoted \( \text{Beh}(P) \), consists of all such sequences induced by executions of \( P \). When a \text{Fail} occurs during the execution, \( \text{Beh}(P) \) consists of the sequence of system calls performed before the failure followed by an arbitrary sequence of system calls (reflecting an undefined behavior).
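The treatment of failing executions in the definition of behaviors can be illustrated by a small membership test. This is a sketch under simplifying assumptions: executions are given as a finite list of pairs `(syscalls, failed)`, only finite sequences of system calls are considered, and the name `in_behaviors` is hypothetical.

```python
def in_behaviors(observed, runs):
    """Decide whether a finite sequence of system calls is a behavior.

    runs is a list of (syscalls, failed) pairs (assumed encoding): the system
    calls performed by one execution and whether that execution ended in Fail.
    """
    for calls, failed in runs:
        if failed:
            # After a failure, any continuation of the performed prefix is allowed.
            if observed[:len(calls)] == calls:
                return True
        elif observed == calls:
            return True
    return False


runs = [(["a", "b"], False), (["a"], True)]
```

With these two executions, `["a", "z", "q"]` counts as a behavior because it extends the prefix of the failing run, whereas `["b"]` does not match any execution.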

## 6 Results

We next present the results of PS 2.0. Except for Theorems 6.6 to 6.8 (whose proofs are given in [1]), all other results are fully mechanized in the Coq proof assistant. These results hold for the full model defined in [1, §A], not only for the simplified fragment presented in §5.

### 6.1 Thread-Local Optimizations

A transformation \( P_{src} \rightsquigarrow P_{tgt} \) is sound if it does not introduce new behaviors under any (parallel and sequential) context:

\[
\forall C, \text{Beh}(C[P_{src}]) \supseteq \text{Beh}(C[P_{tgt}])
\]

PS 2.0 allows all compiler transformations supported by PS. Additionally, it supports replacing \( \text{abort} \) by arbitrary code (more precisely, \( \text{abort}; C_1 \rightsquigarrow C_2 \)). Since \( \text{assert}(e) \) is defined as if \( \neg e \) then \( \text{abort} \), the following transformations are valid:

1. \( \text{assert}(e); C \rightsquigarrow \text{assert}(e); C[\text{true}/e] \)
2. \( \text{assert}(e) \rightsquigarrow \text{skip} \)

Thanks to thread-locality of PS and PS 2.0, we proved a theorem that combines and lifts the local simulation relations (almost without any reasoning on certifications) between pairs of threads \( S_i, T_i \) into a global simulation relation between the composed programs \( S_1 \parallel ... \parallel S_n \) and \( T_1 \parallel ... \parallel T_n \).

This theorem allows us to easily prove soundness of the thread-local transformations using sequential and thread-local simulation relations. See Kang [11] and our Coq formalization for more details.

#### 6.2 Value-Range Optimizations

First, we provide a global value-range analysis and prove its soundness in PS 2.0. A value-range analysis is a tuple \( A = (J, S_1, ..., S_n) \), where \( J \in \text{Loc} \rightarrow \mathcal{P}(\text{Val}) \) represents a set of possible values for each location and \( S_i \subseteq \text{State} \) is a set of possible local states of the underlying language (i.e., excluding the thread views) for each thread \( i \). The analysis is sound for a program \( P \) if (i) the initial value for each location is in \( J \) and the initial state of each thread \( i \) in \( P \) is in \( S_i \); and (ii) taking a step from each state in \( S_i \) necessarily leads to a state in \( S_i \), assuming that it only reads a value in \( J \), and guarantees that it only writes a value in \( J \).

Now, we show that a sound analysis for \( P \) holds in every reachable state of \( P \).

**Theorem 6.1 (Soundness of Value-Range Analysis).** For a sound value-range analysis \( (J, S_1, ..., S_n) \) for \( P \), if \( (TS, M) \) is a reachable machine state for \( P \), then \( TS(i) \in S_i \) for every thread \( i \), and \( m.\text{val} \in J(x) \) for every \( m \in M(x) \).
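The two soundness conditions of a value-range analysis can be checked mechanically on a finite toy program. The following is a minimal sketch, assuming that each thread is given by a hypothetical `step` function returning its possible actions, where a read is encoded as `("R", x, cont)` with `cont` mapping the value read to the next local state, and a write as `("W", x, v, next_state)`. Access modes and thread views are ignored.

```python
def is_sound(J, S, init_states, steps, init_val=0):
    """Check conditions (i) and (ii) of a value-range analysis (J, S_1..S_n)."""
    # (i) Initial values and initial local states are covered by the analysis.
    if any(init_val not in vals for vals in J.values()):
        return False
    if any(s0 not in S_i for s0, S_i in zip(init_states, S)):
        return False
    # (ii) Every step from a state in S_i stays in S_i, assuming reads
    # return values in J, and every written value is itself in J.
    for S_i, step in zip(S, steps):
        for s in S_i:
            for action in step(s):
                if action[0] == "R":
                    _, x, cont = action
                    if any(cont(v) not in S_i for v in J[x]):
                        return False
                else:
                    _, x, v, nxt = action
                    if v not in J[x] or nxt not in S_i:
                        return False
    return True


def thread0(s):
    # x := 1
    return [("W", "x", 1, "done")] if s == "init" else []


def thread1(s):
    # a := x; y := a
    if s == "init":
        return [("R", "x", lambda v: ("got", v))]
    if isinstance(s, tuple):
        return [("W", "y", s[1], "done")]
    return []


S = [{"init", "done"}, {"init", ("got", 0), ("got", 1), "done"}]
J_good = {"x": {0, 1}, "y": {0, 1}}
J_bad = {"x": {0, 1}, "y": {0}}  # misses that y may receive the value 1
```

Here `J_good` passes the check, while `J_bad` is rejected because thread 1 may write 1 to `y` after reading 1 from `x`.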

Second, we prove the soundness of global optimizations based on sound value-range analysis. An optimization based on a value-range analysis \( A = (J, S_1, ..., S_n) \) can be seen as inserting \text{assert}(e) at positions in thread \( i \) when \( e \) is always evaluated to \text{true}. For this, we define a relation, \( \text{global}_{\text{opt}}(A, P_{src}, P_{tgt}) \), which holds when \( P_{tgt} \) is obtained from \( P_{src} \) by inserting valid assertions based on \( A \).

**Theorem 6.2 (Soundness of Global Optimizations).** For a sound value-range analysis \( A \) for \( P_{src} \), if \( \text{global}_{\text{opt}}(A, P_{src}, P_{tgt}) \) holds, then \( \text{Beh}(P_{src}) \supseteq \text{Beh}(P_{tgt}) \).

#### 6.3 Register Promotion

We prove soundness of register promotion. We denote by promote\((s, x, r)\) the statement obtained from a statement \( s \) by promoting the accesses to memory location \( x \) to accesses to register \( r \) (see [1, §D] for a formal definition).

**Theorem 6.3 (Soundness of Register Promotion).** For a program \( s_1 \parallel ... \parallel s_n \), if memory location \( x \) is only accessed by \( s_i \) (i.e., not occurring in \( s_j \) for every \( j \neq i \)) and register \( r \) is fresh in \( s_i \) (i.e., not occurring in \( s_i \)), we have:

\[
\text{Beh}(s_1 \parallel ... \parallel s_n) \supseteq \text{Beh}(s_1 \parallel ... \parallel \text{promote}(s_i, x, r) \parallel ... \parallel s_n)
\]
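The effect of \( \text{promote}(s, x, r) \) can be illustrated on a toy statement language. The following is a minimal sketch and not the algorithm of [1, §D]: statements are encoded as hypothetical tuples, access modes and RMWs are omitted, and registers are assumed to start at 0, matching the initial value of every location.

```python
def promote(s, x, r):
    """Replace accesses to location x in statement s by accesses to register r."""
    tag = s[0]
    if tag == "load" and s[2] == x:       # reg := x   becomes   reg := r
        return ("assign", s[1], r)
    if tag == "store" and s[1] == x:      # x := e     becomes   r := e
        return ("assign", r, s[2])
    if tag == "seq":
        return ("seq", promote(s[1], x, r), promote(s[2], x, r))
    if tag == "if":
        return ("if", s[1], promote(s[2], x, r), promote(s[3], x, r))
    return s


def ev(e, regs):
    # Expressions are either integer constants or register names.
    return regs.get(e, 0) if isinstance(e, str) else e


def run(s, regs, mem):
    """Sequential interpreter used only to compare a statement with its promotion."""
    tag = s[0]
    if tag == "seq":
        run(s[1], regs, mem)
        run(s[2], regs, mem)
    elif tag == "if":
        run(s[2] if ev(s[1], regs) else s[3], regs, mem)
    elif tag == "load":
        regs[s[1]] = mem.get(s[2], 0)
    elif tag == "store":
        mem[s[1]] = ev(s[2], regs)
    elif tag == "assign":
        regs[s[1]] = ev(s[2], regs)


# x := 5; a := x; y := a
src = ("seq", ("store", "x", 5), ("seq", ("load", "a", "x"), ("store", "y", "a")))
tgt = promote(src, "x", "r")
```

Running `src` and `tgt` from empty registers and memory leaves the same value in `y`, while `tgt` no longer touches `x`. The side conditions of Theorem 6.3 (no other thread accesses `x`, and `r` is fresh) are assumed and not checked by the sketch.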

#### 6.4 DRF Theorems

We prove four DRF theorems for PS 2.0: DRF-Promise, DRF-RA, DRF-Lock-RA, and DRF-Lock-SC. First, we need several definitions:

- **Promise-free (PF) semantics** is the strengthening of PS 2.0 obtained by revoking the ability to make promises or reservations.

- **Release-acquire (RA)** is the strengthening of PF obtained by interpreting all memory operations as if they have ra access mode.

- **Sequential consistency (SC)** is the strengthening of RA obtained by forcing every read of a location \( x \) to read from the message with location \( x \) with the maximal timestamp and every write to a location \( x \) to write a message at a timestamp higher than any other \( x \)-message.

In the absence of promises, PS and PS 2.0 coincide:

**Theorem 6.4.** PF is equivalent to the promise-free fragment of PS, and thus the same holds for RA and SC.

We say that a machine state is \( rlx \)-race-free if, whenever two different threads may take a non-promise step accessing the same location and at least one of them is writing, both are ra accesses.
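The race-freedom condition is a simple check over the steps that are enabled in a machine state. The following is a minimal sketch, assuming each enabled non-promise step is summarised by a hypothetical tuple `(thread, kind, loc, mode)` with `kind` being `"R"` or `"W"`; updates are not modeled.

```python
from itertools import combinations


def is_rlx_race_free(enabled):
    """Check rlx-race-freedom for the enabled non-promise steps of one state."""
    for (t1, k1, x1, o1), (t2, k2, x2, o2) in combinations(enabled, 2):
        # Two steps conflict if they come from different threads, access the
        # same location, and at least one of them is a write.
        conflicting = t1 != t2 and x1 == x2 and "W" in (k1, k2)
        if conflicting and not (o1 == "ra" and o2 == "ra"):
            return False
    return True
```

For instance, a relaxed write racing with an ra read on the same location makes the state racy, whereas two reads never conflict.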

**Theorem 6.5 (DRF-Promise).** If every PF-reachable machine state for \( P \) is \( rlx \)-race-free, then \( Beh_{PF}(P) = Beh_{PS 2.0}(P) \).

This theorem is one of the key results of DRF theorems for PS 2.0. In our Coq formalization, we proved a stronger version of DRF-Promise, which is presented in [1, §E].

**Theorem 6.6 (DRF-RA).** If every RA-reachable machine state for \( P \) is \( rlx \)-race-free, then \( Beh_{RA}(P) = Beh_{PS 2.0}(P) \).

Thanks to Theorems 6.4 and 6.5, the proof of DRF-RA for PS 2.0 is identical to that for PS given in [12].

Our DRF-Lock theorems given below generalize those for PS given in [12] in two aspects: our Lock is implemented with an acquire CAS rather than the acquire-release CAS that was assumed in [12]; and our results cover tryLock, not just Lock and Unlock.

We define \( \text{tryLock} \), \( \text{Lock} \) and \( \text{Unlock} \) as follows:

\[
\begin{align*}
\text{tryLock}(L) & \triangleq a := \text{WCAS}^{\text{acq}}(L, 0, 1) \\
\text{Lock}(L) & \triangleq \text{do } a = \text{tryLock}(L) \text{ while } !a \\
\text{Unlock}(L) & \triangleq L^{\text{rel}} := 0
\end{align*}
\]

where \( \text{WCAS}^{\text{acq}} \) is the weak CAS operation, which can either return \( \text{true} \) after successfully performing \( \text{CAS}^{\text{acq}} \), or return \( \text{false} \) after reading any value from \( L \) with relaxed mode.
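The control structure of the three lock operations can be sketched as follows. This is a single-threaded illustration only: Python has no access modes, so the acquire, release, and relaxed annotations are not modeled, and the spurious failure of the weak CAS is driven by a hypothetical `spurious_fail` callback.

```python
class Loc:
    """A lock location, initialised to 0 (unlocked)."""

    def __init__(self):
        self.val = 0


def wcas(loc, expected, new, spurious_fail):
    """Weak CAS: fails spuriously or when the current value differs."""
    if spurious_fail() or loc.val != expected:
        return False
    loc.val = new
    return True


def try_lock(L, spurious_fail=lambda: False):
    return wcas(L, 0, 1, spurious_fail)


def lock(L, spurious_fail=lambda: False):
    """do a = tryLock(L) while !a; returns the number of attempts."""
    attempts = 1
    while not try_lock(L, spurious_fail):
        attempts += 1
    return attempts


def unlock(L):
    L.val = 0
```

A caller of `lock` may observe several failed attempts even on a free lock, which is exactly the behavior that a weak CAS permits.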

We prove DRF-Lock-RA and DRF-Lock-SC for programs using the three lock operations. We say such a program is well-locked if (1) locations are partitioned into lock and non-lock locations, (2) lock locations are accessed only by the three lock operations, and (3) \( \text{Unlock} \) is executed only when the thread holds the lock.

**Theorem 6.7 (DRF-Lock-RA).** For a well-locked program \( P \), if every RA-reachable machine state for \( P \) is \( rlx \)-race-free for all non-lock locations, then \( Beh_{RA}(P) = Beh_{PS 2.0}(P) \).

**Theorem 6.8 (DRF-Lock-SC).** For a well-locked program \( P \), if every SC-reachable machine state for \( P \) is race-free for all non-lock locations, then \( Beh_{SC}(P) = Beh_{PS 2.0}(P) \).

The proofs of these theorems are given in [1, §F].

#### 6.5 Compilation Correctness

Following Podkopaev et al. [19], we prove the correctness of mapping from PS 2.0 to hardware models (x86-TSO, POWER, ARMv7, ARMv8, RISC-V) using the Intermediate Memory Model, IMM, from which intended compilation schemes to the different architectures are already proved to be correct.

**Theorem 6.9 (Correctness of Compilation to IMM).** Every outcome of a program \( P \) under IMM is also an outcome of \( P \) under PS 2.0, i.e., \( Beh_{PS 2.0}(P) \supseteq Beh_{IMM}(P) \).

We note that this result (which is mechanized in Coq) requires the existence of a control dependency from the read part of each RMW operation. Such a dependency exists "for free" in CAS operations, since the write operation (a store-conditional instruction) is anyway control-dependent on the read operation (a load-link instruction). However, when compiling FADDs to ARMv8, the compiler has to place "fake" control dependencies to meet this condition (and be able to use our theorem). We conjecture that the slightly more efficient (standard) compilation scheme of FADDs that does not introduce such dependencies is also sound. We leave this proof to future work. In any case, our result is better than the one for PS by Podkopaev et al. [19] that requires an extra barrier ("ld fence") when compiling RMWs to ARMv8.

**Remark 3.** As in ARMv8, our compilation result to RISC-V uses release/acquire accesses. These accesses are not a part of RISC-V ISA, but the RISC-V memory model (RVWMO) is "designed to be forwards-compatible with the potential addition" of them [24, §14.1].

### 7 Related Work

We have already discussed the challenges in defining a "sweet spot" for a programming language concurrency model, which is neither too weak (i.e., it provides programmability guarantees) nor too strong (i.e., it allows efficient compilation). Java was the first language where considerable effort was put into defining such a formal model [16], but the model was found to be flawed in that it did not permit a number of desired transformations [21]. To remedy this, C/C++ introduced a very different model based on ‘per-execution’ axioms [3], which was also shown to be inadequate [2, 13, 22, 23]. More recently, PS [12], which has already been discussed at length, addressed this challenge using the idea of locally certifiable promises. PS 2.0 improves PS by supporting global optimizations and better compilation of RMWs to ARMv8. We note that the promise-free fragment of PS 2.0 is identical to the promise-free fragment of PS.

Besides PS, there are three other approaches based on event structures [7, 8, 18]. Pichon-Pharabod and Sewell [18] defined an operational model based on plain event structures. Execution starts with a structure representing all possible program execution paths, and proceeds either by committing a prefix of the structure or by transforming it in a way that imitates a compiler optimization (e.g., by reordering accesses). The model also has a speculation step, whose aim is to capture transformations based on global value-range analysis, but it has a side condition that is rather difficult to check. The main downside of this model is its complexity, which hinders the formal development of results about it.

Jeffrey and Riely [8] defined a rather different model based on event structures, which constructs an execution via a two-player game. The player tries to justify all the read events of an execution, while the opponent tries to prevent this. At each step, the player can extend the justified execution by one read event, provided that for any continuing execution chosen by the opponent, there is a corresponding write that produced the appropriate value. The basic model does not allow the reordering of independent reads, which means that compilation to ARM and POWER is suboptimal. Although the model was later revised to fix the reordering problem [9], optimal compilation to hardware remains unresolved. Moreover, it does not support global optimizations and/or elimination of overwritten stores, since it forbids the annotated outcome of LB-G (see §1).

Chakraborty and Vafeiadis [7] introduced weakestmo, a model based on justified event structures, which are constructed in an operational fashion by adding one event at a time provided it can be justified by already existing events. Justified event structures are then used to extract consistent executions, which in turn determine the possible outcomes of a program. While weakestmo resolves PS’s ARMv8 compilation problem [17], it does not formally support global optimizations. Moreover, weakestmo does not support a class of strengthening transformations such as $W_{rel} \rightsquigarrow F_{rel}; W_{rel}$. Both PS and PS 2.0 support these transformations.

More recently, Java has been extended with different access modes in JDK 9 [14, 15]. Bender and Palsberg [4] formalized this extension with a ‘per-execution’ axiomatic model similar to RC11 [13]. The model disallows load-store reordering (LB behaviors) for atomic accesses, while allowing out-of-thin-air values for plain accesses. Because of the latter, global value analysis is unsound in this model. It remains unclear, however, whether transformations based on such (unsound) analysis might be sound or not.

### 8 Conclusion

We have presented PS 2.0, the first model that formally enables transformations based on global analysis while supporting programmability (via DRF guarantees and soundness of value-range reasoning) and efficient compilation (including various compiler thread-local optimizations). The inherent tension between these desiderata, together with our goal to have a thread-local small-step operational semantics, naturally leads to a rather intricate model, which is less abstract than alternative declarative models. Nevertheless, we note that PS 2.0, like its predecessor PS, models weak behaviors with just two principles: (i) “views” for out-of-order execution of reads; and (ii) “promises” for out-of-order execution of writes. The added complexity of PS 2.0 is twofold: reservations and capped memory. We view reservations as a simple and natural addition to the promises mechanism. Capped memory is less natural and more complex. Fortunately, it is only a part of the certification process and not of normal execution steps. In addition, DRF-Promise and the other DRF theorems (Theorems 6.5 to 6.8) provide ways to simplify the semantics. Programmers may safely use the PF or the RA fragment of PS 2.0, which has only views (without any promises, certifications, reservations, or capped memory), when their programs avoid data races via release/acquire and lock synchronization.

We also note that PS 2.0 allows some seemingly dubious behaviors, such as “read from unexecuted branch” [5]:

\[
\begin{array}{l||l}
a := x \quad \text{// 42} & b := y \quad \text{// 42} \\
y := a & \text{if } b = 42 \\
 & \quad \text{then } x := b \\
 & \quad \text{else } x := 42
\end{array}
\tag{RFUB}
\]

The annotated behavior is allowed in PS 2.0 (as in PS and C/C++11). Given the aim of supporting local compiler optimizations, this is actually unavoidable. Practical compilers (including GCC and LLVM) may observe that thread 2 writes 42 to $x$ regardless of which branch is taken, and optimize the program of thread 2 to $b := y; x := 42$ (such an optimization is a “trace-preserving transformation” [12]). The resulting program is essentially the LB program (see §1), whose annotated behavior can be observed on mainstream architectures.

Finally, to the best of our knowledge, PS 2.0 supports all practical compiler optimizations performed by mainstream compilers. As a future goal, we plan to extend it with sequentially consistent accesses (backed up with DRF-SC guarantee) and C11-style consume accesses.

### Acknowledgments

We thank the PLDI’20 reviewers for their helpful feedback. Chung-Kil Hur is the corresponding author. Sung-Hwan Lee, Minki Cho, and Chung-Kil Hur were supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-IT1502-53. Anton Podkopaev was supported by JetBrains Research and RFBR (grant number 18-01-00380). Ori Lahav was supported by the Israel Science Foundation (grant number 5166651), by Len Blavatnik and the Blavatnik Family foundation, and by the Alon Young Faculty Fellowship.

### References


\[
\begin{array}{rcll}
p & ::= & s \parallel \ldots \parallel s & \text{program} \\
s \in St & ::= & \text{skip} & \\
 & \mid & s; s & \\
 & \mid & \text{if } e \text{ then } s \text{ else } s \ \mid\ \text{do } s \text{ while } e & \\
 & \mid & r := x^{o} & \\
 & \mid & x^{o} := r & \\
 & \mid & r := \text{FADD}^{o_r, o_w}(x, v) & \\
 & \mid & r := \text{CAS}^{o_r, o_w}(x, v, v) & \\
 & \mid & \text{fence}^{acq} \ \mid\ \text{fence}^{rel} \ \mid\ \text{fence}^{sc} & \text{fences} \\
 & \mid & r := \text{syscall}(e) & \text{system call} \\
 & \mid & \text{abort} & \text{abort} \\
o \in \text{Mode} & ::= & \text{pln} \mid \text{rlx} \mid \text{ra} & \text{access modes} \\
x & \in & \text{Loc} & \text{locations} \\
r & & & \text{registers} \\
v & \in & \text{Val} & \text{values} \\
e & & & \text{pure expressions (over registers, values, unary ops., and binary ops.)}
\end{array}
\]

**Figure 3.** The language

### A Full Model

In this section, we present our full formal model, which accounts for plain accesses, fences, and release sequences, together with the simple programming language of Fig. 3 that we used for constructing the formalized results. Note that we omit the descriptions of the components that are the same as in §5.

Now the model employs three modes for memory accesses, naturally ordered as follows:

\[ \text{pln} \sqsubset \text{rlx} \sqsubset \text{ra} \]

Furthermore, we introduce transition labels \( F_{acq}, F_{rel}, \) and \( F_{sc} \) for fences.

**Timemaps.** A timemap is a function \( T : \text{Loc} \rightarrow \text{Time} \).

**Views.** A view is a pair \( V = (T_{\text{pln}}, T_{\text{rlx}}) \) of timemaps satisfying \( T_{\text{pln}} \leq T_{\text{rlx}} \). We denote by \( V.\text{pln} \) and \( V.\text{rlx} \) the components of \( V \). View denotes the set of all views.

**Memory.** A memory is a (nonempty) pairwise disjoint finite set of messages. A memory \( M \) supports the following operations for a message \( m \), where \( m.\text{loc} = x, m.\text{from} = f, m.\text{to} = t \), and \( f < t \):

- The **additive insertion**, denoted by \( M \hookleftarrow_{A} m \), is given by \( M \cup \{ m \} \). It is only defined if (i) \( \{ m \} \mathrel{\#} M \); (ii) if \( m \) is a concrete message, then no message \( m' \in M \) has \( m'.\text{loc} = x \) and \( m'.\text{from} = t \); and (iii) if \( m \) is a reservation, then there exists \( m' \in M \) with \( m'.\text{loc} = x \) and \( m'.\text{to} = f \).

- The **splitting insertion**, denoted by \( M \hookleftarrow_{S} m \), is only defined if \( m \) is a concrete message and there exists \( m' = \langle x : v' @ (f, t'], R' \rangle \in M \) with \( t < t' \), in which case it is given by \( M \setminus \{ m' \} \cup \{ m, \langle x : v' @ (t, t'], R' \rangle \} \).

- The **lowering insertion**, denoted by \( M \hookleftarrow_{L} m \), is only defined if \( m \) is a concrete message \( \langle x : v @ (f, t], R \rangle \) and there exists \( m' \in M \) that is identical to \( m \) except for \( m.\text{view} \leq m'.\text{view} \), in which case it is given by \( M \setminus \{ m' \} \cup \{ m \} \).

- The **cancellation**, denoted by \( M \hookleftarrow_{C} m \), is given by \( M \setminus \{ m \} \). It is only defined if \( m \) is a reservation in \( M \).

We use \( \hookleftarrow^{P}_{A} \) to denote an additive insertion into a set of promises, which does not require the last condition of the additive insertion: for a memory \( P \) and a reservation \( m \), \( P \hookleftarrow^{P}_{A} m \) is defined if \( \{ m \} \mathrel{\#} P \). To simplify the presentation, we define \( \hookleftarrow^{P}_{S}, \hookleftarrow^{P}_{L}, \) and \( \hookleftarrow^{P}_{C} \) to be the same as \( \hookleftarrow_{S}, \hookleftarrow_{L}, \) and \( \hookleftarrow_{C} \), respectively.
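The four memory operations can be prototyped directly from their side conditions. The following is a minimal sketch under simplifying assumptions: a message is a hypothetical `Msg` record over the interval `(frm, to]`, a reservation is encoded by `val=None`, message views are abstracted to a single integer, and each operation returns `None` when it is undefined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Msg:
    loc: str
    frm: int
    to: int
    val: Optional[int] = None  # None encodes a reservation (sketch-only choice)
    view: int = 0              # message views abstracted to one integer


def disjoint(m, M):
    # Intervals (frm, to] of messages on the same location must not overlap.
    return all(n.loc != m.loc or m.to <= n.frm or n.to <= m.frm for n in M)


def add(M, m):
    """Additive insertion."""
    if not (m.frm < m.to and disjoint(m, M)):
        return None                       # violates (i)
    same = [n for n in M if n.loc == m.loc]
    if m.val is not None and any(n.frm == m.to for n in same):
        return None                       # violates (ii)
    if m.val is None and not any(n.to == m.frm for n in same):
        return None                       # violates (iii)
    return M | {m}


def split(M, m):
    """Splitting insertion: m takes the first part of an existing message."""
    for n in M:
        if (m.val is not None and n.val is not None and n.loc == m.loc
                and n.frm == m.frm and m.frm < m.to < n.to):
            rest = Msg(n.loc, m.to, n.to, n.val, n.view)
            return (M - {n}) | {m, rest}
    return None


def lower(M, m):
    """Lowering insertion: replace a message by one with a smaller view."""
    for n in M:
        same_but_view = (n.loc, n.frm, n.to, n.val) == (m.loc, m.frm, m.to, m.val)
        if m.val is not None and same_but_view and m.view <= n.view:
            return (M - {n}) | {m}
    return None


def cancel(M, m):
    """Cancellation: remove a reservation."""
    return M - {m} if (m in M and m.val is None) else None


M0 = frozenset({Msg("x", 0, 0, val=0)})
M1 = add(M0, Msg("x", 1, 3, val=5, view=2))
```

Condition (ii) is what prevents a new concrete message from being placed immediately before an existing one, while condition (iii) forces a reservation to be attached to the message it follows.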

**Closed Memory.** Given a timemap \( T \) and a memory \( M \), we write \( T \in M \) if, for every \( x \in \text{Loc} \), we have \( T(x) = m.\text{to} \) for some concrete message \( m \in M \) with \( m.\text{loc} = x \). For a view \( V \), we write \( V \in M \) if \( T \in M \) for each component timemap \( T \) of \( V \).

**Thread Views.** A thread view is a triple \( V = (\text{cur}, \text{acq}, \text{rel}) \), where \( \text{cur}, \text{acq} \in \text{View} \) and \( \text{rel} \in \text{Loc} \rightarrow \text{View} \) satisfying \( \text{rel}(x) \leq \text{cur} \leq \text{acq} \) for all \( x \in \text{Loc} \). We denote by \( V.\text{cur} \), \( V.\text{acq} \), and \( V.\text{rel} \) the components of \( V \).

**Thread States.** A thread state is a triple \( TS = (\sigma, V, P) \), where \( \sigma \) is a local state, \( V \) is a thread view, and \( P \) is a memory. We denote by \( TS.\text{st} \), \( TS.\text{view} \), and \( TS.\text{prm} \) the components of a thread state \( TS \).

**Thread Configuration Steps.** A thread configuration is a triple \( \langle TS, S, M \rangle \), where \( TS \) is a thread state, \( S \) is a timemap (the global SC timemap), and \( M \) is a memory.

Fig. 4 presents the full list of thread configuration steps. To avoid repetition, we use the additional rules \( \text{read-helper}, \text{write-helper}, \) and \( \text{sc-fence-helper} \). These employ several helpful notations: \( \bot \) and \( \sqcup \) denote the natural bottom elements and join operations, respectively; \( \{x@t\} \) denotes the timemap assigning \( t \) to \( x \) and 0 to all other locations; and \( (\text{cond}\ ?\ X) \) stands for \( X \) if cond holds and for \( \bot \) otherwise.

(Memory: new)

\[
\langle P, M \rangle \xrightarrow{m} \langle P, M \hookleftarrow_{A} m \rangle
\]

(Read-helper)

\[
\dfrac{
\begin{array}{l}
o = \text{pln} \Rightarrow \text{cur}.\text{pln}(x) \leq t \\
o \in \{\text{rlx}, \text{ra}\} \Rightarrow \text{cur}.\text{rlx}(x) \leq t \\
\text{cur}' = \text{cur} \sqcup V \sqcup (o = \text{ra}\ ?\ R) \\
\text{acq}' = \text{acq} \sqcup V \sqcup (o \sqsupseteq \text{rlx}\ ?\ R) \\
\text{where } V = [\text{pln} : (o \sqsupseteq \text{rlx}\ ?\ \{x@t\}),\ \text{rlx} : \{x@t\}]
\end{array}
}{
\langle \text{cur}, \text{acq}, \text{rel} \rangle \xrightarrow{R:o,x,t,R} \langle \text{cur}', \text{acq}', \text{rel} \rangle
}
\]

(Write-helper)

\[
\dfrac{
\begin{array}{l}
\text{cur}.\text{rlx}(x) < t \\
\text{cur}' = \text{cur} \sqcup V \\
\text{acq}' = \text{acq} \sqcup \text{cur}' \\
\text{rel}' = \text{rel}[x \mapsto \text{rel}(x) \sqcup V \sqcup (o = \text{ra}\ ?\ \text{cur}')] \\
R_w = (o \sqsupseteq \text{rlx}\ ?\ (\text{rel}'(x) \sqcup R)) \\
\text{where } V = [\text{pln} : \{x@t\},\ \text{rlx} : \{x@t\}]
\end{array}
}{
\langle \text{cur}, \text{acq}, \text{rel} \rangle \xrightarrow{W:o,x,t,R,R_w} \langle \text{cur}', \text{acq}', \text{rel}' \rangle
}
\]

(Sc-fence-helper)

\[
\dfrac{
\begin{array}{l}
\text{cur}' = \text{acq} \sqcup \text{cur} \\
\text{acq}' = \text{acq} \sqcup \text{cur}' \\
\text{rel}' = \lambda \_.\ \langle S', S' \rangle
\end{array}
}{
\langle \text{cur}, \text{acq}, \text{rel} \rangle, S \xrightarrow{F_{sc}} \langle \text{cur}', \text{acq}', \text{rel}' \rangle
}
\]

(System call)

\[
\sigma \xrightarrow{\text{Sys}(\text{call})} \sigma'
\]

(Memory: fulfill)

\[
\langle P, M \rangle \xrightarrow{m} \langle P \setminus \{m\}, M' \rangle
\]

(Failure)

\[
\sigma \xrightarrow{\text{Fail}} \sigma'
\]

**Figure 4.** Full operational semantics.

**Failure step.** A thread configuration \( \langle TS, S, M \rangle \) can fail if \( TS \) is promise-consistent:

\[
\forall m \in TS.\text{prm},\ TS.\text{view}.\text{cur}.\text{rlx}(m.\text{loc}) \leq m.\text{to}
\]

**Cap View and Messages.** The last message of a memory \( M \) to a location \( x \), denoted by \( \overline{m}_{M,x} \), is given by:

\[
\overline{m}_{M,x} \triangleq \arg\max_{m \in M(x)} m.\text{to}
\]

The cap timemap and cap view of a memory $M$ are given by:

$$\hat{T}_M \triangleq \lambda x.\ \overline{m}_{M,x}.\text{to} \quad \text{and} \quad \hat{V}_M \triangleq \langle \hat{T}_M, \hat{T}_M \rangle$$

The cap message of a memory $M$ to a location $x$, denoted by $\hat{m}_{M,x}$, is given by:

$$\hat{m}_{M,x} \triangleq \langle x : \overline{m}_{M,x}.\text{val} @ (\overline{m}_{M,x}.\text{to},\ \overline{m}_{M,x}.\text{to} + 1],\ \hat{V}_M \rangle$$
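The cap timemap, cap view, and cap message are simple functions of the last message of each location. The following is a minimal sketch, assuming a memory is a list of hypothetical `Msg` tuples without message views; the construction of the capped memory itself (which also fills the gaps between messages) is not covered here.

```python
from typing import NamedTuple


class Msg(NamedTuple):
    loc: str
    val: int
    frm: int
    to: int


def last_message(M, x):
    """The message to location x with the maximal `to` timestamp."""
    return max((m for m in M if m.loc == x), key=lambda m: m.to)


def cap_timemap(M):
    """Maps every location to the `to` timestamp of its last message."""
    return {x: last_message(M, x).to for x in {m.loc for m in M}}


def cap_view(M):
    """Both components of the cap view are the cap timemap."""
    T = cap_timemap(M)
    return (T, T)


def cap_message(M, x):
    """Cap message of x: the last value, placed right after the last message."""
    last = last_message(M, x)
    return Msg(x, last.val, last.to, last.to + 1), cap_view(M)


M = [Msg("x", 0, 0, 0), Msg("x", 1, 1, 2), Msg("y", 0, 0, 0)]
```

For the memory `M` above, the cap message of `x` carries the value 1 over the interval from 2 to 3, and its view maps `x` to 2 and `y` to 0.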

**Consistency.** A thread configuration $\langle TS, S, M \rangle$ is called consistent if, for the capped memory $\hat{M}_{TS.\text{prm}}$ of $M$ with respect to $TS.\text{prm}$ and the timemap $\hat{S} = \hat{T}_{\hat{M}_{TS.\text{prm}}}$ of $\hat{M}_{TS.\text{prm}}$, there exist $TS', S', M'$ such that:

$$\langle TS, \hat{S}, \hat{M}_{TS.\text{prm}} \rangle \rightarrow^* \langle TS', S', M' \rangle \land TS'.\text{prm} = \emptyset$$

### B An Example for Cancellation

We present an example that justifies that canceling of reservations is essential to support register promotion. Consider the following variant of the RPacq program:

$$
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
\text{if } a = 1 & x := b \\
\quad \text{then } c := \text{FADD}^{\text{acq}}(z, 1) & \\
\quad \text{else } d := \text{FADD}^{\text{acq}}(w, 1) & \\
y := 1 &
\end{array}
\quad \rightsquigarrow \quad
\begin{array}{l||l}
a := x \quad \text{// 1} & b := y \quad \text{// 1} \\
y := 1 & x := b
\end{array}
$$

In the source program (the left one), since both locations $z$ and $w$ are accessed only by thread 1, a compiler may promote these locations and remove the whole if-statement. However, if we do not allow a thread to cancel its reservations (i.e., a reservation should be fulfilled with a concrete message), the annotated behavior, which is clearly observable after the optimization, is not observable in the source program. Here, in order for thread 1 to promise $y := 1$, thread 1 should make a reservation to (at least) one of $z$ or $w$, as it will execute an acquire RMW in its certification. In fact, at the moment when thread 1 promises $y := 1$, the only value thread 1 can read from $x$ is 0, so that the only option for thread 1 is to reserve on $w$. After making a reservation to $w$, thread 1 will never be able to read 1 from $x$, even if thread 2 writes $x := 1$, since thread 1 is obligated to “fulfill” its reservation on $w$.

### C Weak Behaviors

In this section, we discuss a variant of CDRF, where RMW operations are replaced with relaxed ones. Consider the following program:

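$$
\begin{array}{l||l}
a := \text{CAS}(x, 0, 1) \quad \text{// 0} & b := \text{WCAS}(x, 0, 2) \quad \text{// true} \\
\text{if } a \leq 1 \text{ then} & \text{if } b \text{ then} \\
\quad y := 1 & \quad c := y \quad \text{// 1} \\
 & \quad \text{if } c = 1 \text{ then} \\
 & \quad \quad x := 0
\end{array}
$$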

Here, we use a weak compare-and-swap operation WCAS, which is allowed to spuriously fail even if it reads the desired value it wants to update. We assume that WCAS returns a boolean flag that represents whether the update was successful. PS 2.0 allows the annotated behavior, in particular, where both updates to $x$ succeed.

This might seem to be an overly weak behavior: when thread 2 succeeds in its WCAS (and updates $x$ to 2), it seemingly cannot read 1 from $y$, since thread 1 cannot update $x$ from 0 to 1.

In fact, however, this behavior is definitely allowed after applying several local optimizations and one global optimization. First, thread 2 can be (locally) optimized as follows:

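$$
\begin{array}{l}
c := y \\
\text{if } c = 1 \text{ then} \\
\quad b := \text{WCAS}(x, 0, 2) \\
\quad \text{if } b \text{ then } x := 0 \\
\text{else} \\
\quad b := \text{WCAS}(x, 0, 2) \qquad \rightsquigarrow \qquad \text{else } \_ := x
\end{array}
$$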

(1) We can reorder the update to $x$ and the subsequent read from $y$ by introducing a relaxed read in the else-branch of thread 2.

(2) The update to $x$ can be distributed into both branches.

(3) Since WCAS can always fail, we can replace the WCAS in the else-branch with a relaxed read.

(4) Finally, we can merge the update to \(x\) and the write to \(x\) since thread 2 executes the write only when WCAS was successful.

**Figure 5.** An algorithm for register promotion

Now, the optimized program is given by:

\[
\begin{array}{l||l}
a := \text{CAS}(x, 0, 1) \quad \text{// 0} & c := y \quad \text{// 1} \\
\text{if } a \leq 1 \text{ then } y := 1 & \text{if } c = 1 \text{ then } b := \text{WCAS}(x, 0, 0) \quad \text{// true} \\
 & \text{else } \_ := x
\end{array}
\]

Here, the global invariant \( x \leq 1 \land y \leq 1 \) holds, so \( a \leq 1 \) can be optimized to true. Then the update to \(x\) and the subsequent write to \(y\) can be reordered, so that the annotated behavior is allowed:

\[
\begin{array}{l||l}
y := 1 & c := y \\
a := \text{CAS}(x, 0, 1) & \text{if } c = 1 \text{ then } b := \text{WCAS}(x, 0, 0) \quad \text{// true} \\
 & \text{else } \_ := x
\end{array}
\]

### D An Algorithm for Register Promotion

We demonstrate an algorithm used for register promotion in Fig. 5.

### E A Stronger Version of DRF-Promise Theorem

In this section, we present a more general version of Theorem 6.5. We start by introducing a new access mode \(pf\), which appears to be stronger than \(rlx\) and weaker than \(ra\):

\[
\text{pln} \sqsubset \text{rlx} \sqsubset \text{pf} \sqsubset \text{ra}
\]

Like a release write, a \(pf\)-write is not allowed to be promised (thus, it cannot be reordered with a preceding read). However, unlike a release write, executing a \(pf\)-write does not increase the release view of the writing thread. The same applies to \(pf\)-fence operations.

More precisely, the rule (\text{write}) in Appendix A is updated, and the rule (\text{pf-fence}) is newly introduced as follows:

\[
\text{(write)} \quad
\dfrac{
\begin{array}{c}
\sigma \xrightarrow{W(o, x, v)} \sigma' \qquad
o \sqsupseteq \text{pf} \Rightarrow \forall m' \in P(x).\ m'.\text{view} = \bot \\
m = \langle x : v @ (\_, t], R \rangle \qquad
\langle P, M \rangle \xrightarrow{m} \langle P', M' \rangle \qquad
V \xrightarrow{W:o,x,t,R} V'
\end{array}
}{
\langle \langle \sigma, V, P \rangle, S, M \rangle \rightarrow \langle \langle \sigma', V', P' \rangle, S, M' \rangle
}
\]

\[
\text{(pf-fence)} \quad
\dfrac{
\sigma \xrightarrow{F_{\text{pf}}} \sigma' \qquad
\forall m \in P.\ m.\text{view} = \bot
}{
\langle \langle \sigma, \langle \text{cur}, \text{acq}, \text{rel} \rangle, P \rangle, S, M \rangle \rightarrow \langle \langle \sigma', \langle \text{cur}, \text{acq}, \text{rel} \rangle, P \rangle, S, M \rangle
}
\]

Now, we define and prove a generalization of Theorem 6.5.

**Definition E.1.** A machine configuration \( MS \) is promise-race-free if, whenever two different threads may take a non-promise step accessing the same location, where one of the steps is reading (R/U) and the other is writing (W/U), the access mode of the write is at least pf (\( W^{\sqsupseteq \text{pf}} \) or \( U^{o, \sqsupseteq \text{pf}} \)). In addition, if both competing steps are RMWs, then both have a mode at least as strong as \( U^{\text{ra}, \sqsupseteq \text{pf}} \).

**Theorem E.2 (DRF-Promise).** If every PF-reachable machine state for \( P \) is promise-race-free, then \( Beh_{PF}(P) = Beh_{PS 2.0}(P) \).

### F Proof of DRF-Lock Theorems

**Remark 4.** If a program P is well-locked with a set of lock locations L, the following invariant holds for every reachable memory M during the execution of P.

\[
\begin{array}{l}
\forall l \in L,\ \forall \langle l : v@(\_, \_], \_ \rangle \in M,\ v = 0 \lor v = 1 \ \land \\
\forall \langle l : 1@(t, \_], R^1 \rangle \in M,\ \exists \langle l : 0@(\_, t], R^0 \rangle \in M \ \land \\
\forall \langle l : v^0@(\_, t^0], R^0 \rangle, \langle l : v^1@(\_, t^1], R^1 \rangle \in M,\ t^0 \leq t^1 \Rightarrow R^0 \subseteq R^1 \ \land \\
\forall \langle l : 0@(f, \_], \_ \rangle \in M,\ t \sim t' \lor \exists \langle l : 1@(t, \_], \_ \rangle \in M
\end{array}
\]

We define \( \text{tryLock}^{o_1, o_2}(L) \) as \( a := \text{WCAS}^{\text{acq}, o_1, o_2}(L, 0, 1) \), where \( \text{WCAS}^{o_0, o_1, o_2} \) is a weak CAS operation, which can either return true after successfully performing CAS with \( o_0 \) for the read and \( o_1 \) for the write, or return false after reading any value from \( L \) with \( o_2 \). Note that tryLock in Theorem 6.7 and Theorem 6.8 is the same as \( \text{tryLock}^{\text{rlx}, \text{rlx}} \).

We define nondetLock as follows where choose\{a, b\} non-deterministically executes one of a or b.

\[\text{nondetLock}^{o_1, o_2}(l) \triangleq \text{choose}\{0, \text{Lock}^{o_1, o_2}(l)\}\]

For a program \( P \), we define \( P[o_1, o_2] \) to be the program obtained by replacing every \( \text{tryLock}(l) \) in \( P \) with \( \text{tryLock}^{o_1, o_2}(l) \), and \( P' \) to be the program in which every \( \text{tryLock}^{o_1, o_2}(l) \) in \( P \) is replaced with \( \text{nondetLock}^{o_1, o_2}(l) \).

**Lemma F.1 (Strengthening Lock).** For a well-locked program P, we have:

\[\text{Beh}_{PS 2.0}(P) = \text{Beh}_{PS 2.0}(P[ra, ra]).\]

**Proof:** We show that \( P[ra, ra] \) can simulate every execution of \( P \).

First, we define view\_attached, wf\_attached and wf\_attached\_in.

\[
\text{view\_attached}(l, t_l, R, V) \triangleq V.\text{rlx}(l) = t_l \Rightarrow R \subseteq V \land V.\text{pln}(l) = t_l
\]

\[
\begin{array}{l}
\text{wf\_attached}(\langle TS_{src}, S_{src}, M_{src} \rangle) \triangleq \forall l \in L,\ \forall \langle l : \_@(\_, t_l], R_l \rangle \in M_{src},\ \forall i, \\
\quad (\forall x,\ \text{view\_attached}(l, t_l, R_l, TS_{src}(i).V.\text{rel}(x))) \ \land \\
\quad \text{view\_attached}(l, t_l, R_l, TS_{src}(i).V.\text{cur}) \ \land \\
\quad \text{view\_attached}(l, t_l, R_l, TS_{src}(i).V.\text{acq}) \ \land \\
\quad \text{view\_attached}(l, t_l, R_l, \langle S_{src}, S_{src} \rangle) \ \land \\
\quad \forall \langle \_ : \_@(\_, \_], R \rangle \in M_{src},\ \text{view\_attached}(l, t_l, R_l, R)
\end{array}
\]

\[
\begin{array}{l}
\text{wf\_attached\_in}(\langle TS_{src}, S_{src}, M_{src} \rangle) \triangleq \forall l \in L,\ \forall \langle l : \_@(\_, t_l], R_l \rangle \in M_{src}, \\
\quad (\forall x,\ \text{view\_attached}(l, t_l, R_l, TS_{src}.V.\text{rel}(x))) \ \land \\
\quad \text{view\_attached}(l, t_l, R_l, TS_{src}.V.\text{cur}) \ \land \\
\quad \text{view\_attached}(l, t_l, R_l, TS_{src}.V.\text{acq}) \ \land \\
\quad \text{view\_attached}(l, t_l, R_l, \langle S_{src}, S_{src} \rangle) \ \land \\
\quad \forall \langle \_ : \_@(\_, \_], R \rangle \in M_{src},\ \text{view\_attached}(l, t_l, R_l, R)
\end{array}
\]

**Remark 5.** The following properties of view\_attached hold for every \(\langle l : \_@(\_, t_l), R \rangle \in M\).

- view\_attached(l, t_l, R, \perp)
- view\_attached(l, t_l, R, V_1) \land view\_attached(l, t_l, R, V_2) \Rightarrow view\_attached(l, t_l, R, V_1 \cup V_2)
- x@t \neq l@t \Rightarrow view\_attached(l, t_l, R, \{\text{pln} : \{x@t\}, \text{rlx} : \{x@t\}\})
- x@t \neq l@t \Rightarrow view\_attached(l, t_l, R, \{\text{pln} : (o \sqsupseteq \text{rlx} \;?\; \{x@t\}),\ \text{rlx} : \{x@t\}\})
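
The first two properties can be sanity-checked by brute force on a tiny model. The sketch below is only an illustration under stated assumptions: a hypothetical universe with two locations and three timestamps, views represented as dictionaries with `pln` and `rlx` timemaps, and every view assumed well-formed in the sense that `pln(y) <= rlx(y)` for each location `y`. The helper names are not taken from the paper's development.

```python
from itertools import product

LOCS = ("l", "x")   # hypothetical universe: one lock location and one other
TIMES = (0, 1, 2)   # hypothetical small timestamp domain


def views():
    """Enumerate well-formed views, i.e. pln(y) <= rlx(y) for every y."""
    per_loc = [(p, r) for p in TIMES for r in TIMES if p <= r]
    for combo in product(per_loc, repeat=len(LOCS)):
        yield {
            "pln": {y: c[0] for y, c in zip(LOCS, combo)},
            "rlx": {y: c[1] for y, c in zip(LOCS, combo)},
        }


def leq(a, b):
    """Pointwise view inclusion, written R ⊆ V in the text."""
    return all(a[k][y] <= b[k][y] for k in ("pln", "rlx") for y in LOCS)


def join(a, b):
    """Pointwise maximum of two views."""
    return {k: {y: max(a[k][y], b[k][y]) for y in LOCS} for k in ("pln", "rlx")}


def view_attached(l, t_l, R, V):
    """V.rlx(l) = t_l implies (R ⊆ V and V.pln(l) = t_l)."""
    if V["rlx"][l] != t_l:
        return True
    return leq(R, V) and V["pln"][l] == t_l


def join_closed():
    """Check closure of view_attached under join for all small views."""
    all_views = list(views())
    for t_l, R in product(TIMES, all_views):
        attached = [V for V in all_views if view_attached("l", t_l, R, V)]
        for V1, V2 in product(attached, repeat=2):
            if not view_attached("l", t_l, R, join(V1, V2)):
                return False
    return True


BOTTOM = {"pln": {y: 0 for y in LOCS}, "rlx": {y: 0 for y in LOCS}}
```

In this model, closure under join relies on the well-formedness assumption: it bounds the `pln` timestamp of the second view at `l` by its `rlx` timestamp, which in turn is bounded by `t_l`.
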

We define $\equiv^T$ to be a simulation relation between program states of $P[ra, ra]$ and $P$: $\sigma_{src} \equiv^T \sigma_{tgt}$ holds if $\sigma_{src}$ is the same as $\sigma_{tgt}$ except that every tryLock in the statement of $\sigma_{src}$ has the ordering $(ra, ra)$.

Then we define simulation relations between memories, thread states, and machine configurations as follows:

$$
\begin{aligned}
M_{src} \not\sim M_{tgt} \triangleq{} & \forall \langle x : v@(f, t), R_{src} \rangle \in M_{src}, \\
& \quad \exists R_{tgt}.\ (\langle x : v@(f, t), R_{tgt} \rangle \in M_{tgt} \land (x \notin L \lor v \neq 1 \Rightarrow R_{src} \subseteq R_{tgt})) \land {} \\
& \forall \langle x : v@(f, t), R_{tgt} \rangle \in M_{tgt}, \\
& \quad (\exists R_{src}.\ \langle x : v@(f, t), R_{src} \rangle \in M_{src}) \lor (x \in L \land v = 1 \land \langle x : (f, t) \rangle \in M_{src})
\end{aligned}
$$

$$
\begin{aligned}
TS_{src} \not\sim TS_{tgt} \triangleq{} & TS_{src}.st \equiv^T TS_{tgt}.st \land {} \\
& TS_{src}.view.cur \subseteq TS_{tgt}.view.cur \land {} \\
& TS_{src}.view.acq \subseteq TS_{tgt}.view.acq \land {} \\
& (\forall x \notin L,\ TS_{src}.view.rel(x) \subseteq TS_{tgt}.view.rel(x)) \land {} \\
& TS_{src}.P \not\sim TS_{tgt}.P \land {} \\
& \forall \langle x : \_@(f, t), R_{src} \rangle \in TS_{src}.P,\ \langle x : \_@(f, t), R_{tgt} \rangle \in TS_{tgt}.P, \\
& \quad (TS_{src}.view.rel(x) \subseteq R_{tgt} \Rightarrow TS_{src}.view.rel(x) \subseteq R_{src}) \land {} \\
& \quad (\forall \langle x : \_@(\_, f), R'_{src} \rangle \in TS_{src}.P,\ \langle x : \_@(\_, f), R'_{tgt} \rangle \in TS_{tgt}.P, \\
& \quad\quad R'_{src} \subseteq R_{tgt} \Rightarrow R'_{src} \subseteq R_{src})
\end{aligned}
$$

$$
\begin{aligned}
\langle TS_{src}, S_{src}, M_{src} \rangle \not\sim \langle TS_{tgt}, S_{tgt}, M_{tgt} \rangle \triangleq{} & TS_{src} \not\sim TS_{tgt} \land S_{src} \subseteq S_{tgt} \land M_{src} \not\sim M_{tgt} \land {} \\
& \text{wf\_attached\_th}(\langle TS_{src}, S_{src}, M_{src} \rangle) \\[1ex]
\langle TS_{src}, S_{src}, M_{src} \rangle \not\sim \langle TS_{tgt}, S_{tgt}, M_{tgt} \rangle \triangleq{} & (\forall i,\ TS_{src}(i) \not\sim TS_{tgt}(i)) \land S_{src} \subseteq S_{tgt} \land M_{src} \not\sim M_{tgt} \land {} \\
& \text{wf\_attached}(\langle TS_{src}, S_{src}, M_{src} \rangle)
\end{aligned}
$$

We start by simulating thread steps: for any thread configurations $TC^1_{src} = \langle TS_{src}, S_{src}, M_{src} \rangle$, $TC^1_{tgt} = \langle TS_{tgt}, S_{tgt}, M_{tgt} \rangle$, and $TC^2_{tgt}$ such that $TC^1_{src} \not\sim TC^1_{tgt}$ and $TC^1_{tgt} \rightarrow TC^2_{tgt}$, there exists a thread configuration $TC^2_{src}$ such that $TC^1_{src} \rightarrow TC^2_{src}$ and $TC^2_{src} \not\sim TC^2_{tgt}$. Consider the following cases of the thread step taken by the target thread configuration, $TC^1_{tgt} \rightarrow TC^2_{tgt}$:

1. **tryLock($l$) success**

   $TS_{src}$ takes the same step as $TS_{tgt}$, and both $\not\sim$ and $\equiv^T$ still hold. Suppose that $TS_{src}$ wrote $\langle l : 1@(\_, t), R \rangle$ and that the thread view becomes $\text{view}'$. Then $\text{view}'.cur.rlx(l) = \text{view}'.cur.acq(l) = t$. As $R$ is the join of the source thread's relaxed view and the view of the message the thread read, $\text{view\_attached}(l, t, R, \text{view}'.cur)$ and $\text{view\_attached}(l, t, R, \text{view}'.acq)$ are satisfied. Therefore, $\text{wf\_attached\_th}$ still holds.

2. **tryLock($l$) fail**

   There exists a message $\langle l : v@(\_, TS_{src}.view.cur.rlx(l)), R \rangle \in M_{src}$. $TS_{src}$ reads this message and fails to acquire the lock regardless of the value $v$. Since $\text{view\_attached}(l, TS_{src}.view.cur.rlx(l), R, TS_{src}.view.cur)$ holds, we get $R \subseteq TS_{src}.view.cur$. Therefore, after $TS_{src}$ reads the message, the thread view of $TS_{src}$ does not increase. As the thread view of $TS_{src}$ remains the same and the thread view of $TS_{tgt}$ may increase, $\not\sim$ and $\text{wf\_attached\_th}$ still hold.

3. **Unlock($l$)**

   $TS_{src}$ takes the same step as $TS_{tgt}$, and both $\not\sim$ and $\equiv^T$ still hold. Suppose that $TS_{src}$ wrote $\langle l : 0@(\_, t), R \rangle$ and that the thread view becomes $\text{view}'$. As $R$ is equal to $TS_{src}.view.cur$ and $TS_{src}.view.cur \subseteq \text{view}'.cur \subseteq \text{view}'.acq$, $\text{view\_attached}(l, t, R, \text{view}'.cur)$ and $\text{view\_attached}(l, t, R, \text{view}'.acq)$ are satisfied. Therefore, $\text{wf\_attached\_th}$ still holds.

4. **Promise**

   Suppose that $\langle x : v@(f, t), R_{tgt} \rangle$ is the message newly promised by $TS_{tgt}$. If $x \in L$, then $v$ must be 1, and $TS_{src}$ reserves $\langle x : (f, t) \rangle$ in the same place.

   In other cases, $TS_{src}$ promises $\langle x : v@(f, t), R_{src} \rangle$, where $R_{src}$ is determined as follows:

   $$
   \begin{aligned}
   & \forall l \notin L,\ R_{src}.pln(l) = R_{tgt}.pln(l) \land R_{src}.rlx(l) = R_{tgt}.rlx(l) \\
   & \forall l \in L,\ R_{src}.pln(l) = R_{src}.rlx(l) = t_l
   \end{aligned}
   $$

   where $t_l$ is the maximum timestamp such that

   $$
   \exists \langle l : \_@(\_, t_l), R \rangle \in M_{src},\ (\forall l' \notin L,\ R.pln(l') \subseteq R_{tgt}.pln(l') \land R.rlx(l') \subseteq R_{tgt}.rlx(l'))
   $$

   By construction, $\text{view\_attached}(l, t_l, R, R_{src})$ holds for every $\langle l : \_@(\_, t_l), R \rangle \in M_{src}$. Therefore, wf\_attached still holds.

5. **Other steps**

   $TS_{src}$ takes the same step as $TS_{tgt}$. Since $TS_{src}.view$, except the release views on $l \in L$, and the view of every related message in $M_{src}$ are lower than those of the target thread configuration, $\not\sim$ and $\equiv^T$ still hold. The source thread's new thread view $V'$, the new global timemap $S$, and the views of any added messages are obtained by joining existing views or a singleton view that does not contain $l$. By Remark 5, the resulting new source thread configuration satisfies wf\_attached.
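
The choice of the source message view in the Promise case can be sketched in Python as follows. This is a minimal illustration under assumptions, not the paper's formal development: views are dictionaries with `pln` and `rlx` timemaps over the same set of locations, the source memory is a list of `(location, timestamp, view)` triples, every lock location is assumed to have an initial message at timestamp 0 carrying the bottom view, and `build_r_src` is a hypothetical name.

```python
def build_r_src(r_tgt, m_src, locks):
    """Build R_src from R_tgt following the Promise case.

    r_tgt: {"pln": {loc: ts}, "rlx": {loc: ts}}
    m_src: list of (loc, timestamp, view) messages of the source memory
    locks: set of lock locations L
    """
    non_locks = [y for y in r_tgt["rlx"] if y not in locks]

    def below_on_non_locks(view):
        # The message view must not exceed R_tgt on any non-lock location.
        return all(
            view[k][y] <= r_tgt[k][y] for k in ("pln", "rlx") for y in non_locks
        )

    r_src = {"pln": {}, "rlx": {}}
    for y in r_tgt["rlx"]:
        if y in locks:
            # Maximum timestamp of a message on y whose view fits below R_tgt.
            # Assumption: the initial message guarantees a non-empty candidate set.
            t_l = max(
                ts for loc, ts, view in m_src
                if loc == y and below_on_non_locks(view)
            )
            r_src["pln"][y] = r_src["rlx"][y] = t_l
        else:
            # Non-lock locations are copied from the target view.
            r_src["pln"][y] = r_tgt["pln"][y]
            r_src["rlx"][y] = r_tgt["rlx"][y]
    return r_src


def view(l, x):
    """Hypothetical helper: a view with equal pln and rlx timemaps."""
    return {"pln": {"l": l, "x": x}, "rlx": {"l": l, "x": x}}


example_memory = [
    ("l", 0, view(0, 0)),  # initial message with the bottom view
    ("l", 1, view(1, 1)),  # fits below the target view on x
    ("l", 2, view(2, 5)),  # exceeds the target view on x, so it is skipped
]
example_r_src = build_r_src(view(3, 2), example_memory, locks={"l"})
```

In the example, the target view has timestamp 2 on the non-lock location `x`, so the message at timestamp 1 is the latest lock message whose view fits, and the lock entry of the resulting view is set to 1.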

In the same way, we can prove that the source thread can simulate the target thread's certification steps. The only difference is the cap messages on $x \in L$. Since there is no relaxed RMW on $x \in L$, cap messages do not change the above simulation arguments for thread steps. Therefore, if $TS_{src} \not\sim TS_{tgt}$ and $TS_{tgt}$ is consistent, then $TS_{src}$ is consistent as well.

Now, given a machine step of the target program, we construct a machine step of the source:

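$$
\forall MS^1_{src}, MS^1_{tgt}, MS^2_{tgt},\ MS^1_{src} \not\sim MS^1_{tgt} \land (MS^1_{tgt} \Rightarrow MS^2_{tgt}) \implies \exists MS^2_{src},\ (MS^1_{src} \Rightarrow MS^2_{src}) \land MS^2_{src} \not\sim MS^2_{tgt}
$$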

Let $MS^1_{src} = \langle TS^1_{src}, S^1_{src}, M^1_{src} \rangle$ and $MS^1_{tgt} = \langle TS^1_{tgt}, S^1_{tgt}, M^1_{tgt} \rangle$, and suppose that the $i$-th thread of $MS^1_{tgt}$, $TS^1_{tgt}(i)$, took the step, so that $MS^2_{tgt} = \langle TS^1_{tgt}[i \mapsto TS^2_{tgt}], S^2_{tgt}, M^2_{tgt} \rangle$ for some $TS^2_{tgt}$. From $MS^1_{src} \not\sim MS^1_{tgt}$, we have $TS^1_{src}(i) \not\sim TS^1_{tgt}(i)$. Since the source thread configuration can simulate the target steps and also becomes consistent, we have the following:

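$$
\begin{aligned}
\exists TS^2_{src}, S^2_{src}, M^2_{src},\ & (\langle TS^1_{src}(i), S^1_{src}, M^1_{src} \rangle \rightarrow^{+} \langle TS^2_{src}, S^2_{src}, M^2_{src} \rangle) \land {} \\
& \langle TS^2_{src}, S^2_{src}, M^2_{src} \rangle \text{ is consistent} \land {} \\
& \langle TS^2_{src}, S^2_{src}, M^2_{src} \rangle \not\sim \langle TS^2_{tgt}, S^2_{tgt}, M^2_{tgt} \rangle
\end{aligned}
$$

while leaving the same trace as the target thread steps. Therefore, we obtain the machine step of the source machine configuration, $MS^1_{src} \Rightarrow MS^2_{src}$, where $MS^2_{src} = \langle TS^1_{src}[i \mapsto TS^2_{src}], S^2_{src}, M^2_{src} \rangle \not\sim MS^2_{tgt}$.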

Since the initial machine configurations of $P[ra, ra]$ and $P$ are trivially related by $\not\sim$, we have

$$
\text{Beh}_{PS2.0}(P) \subseteq \text{Beh}_{PS2.0}(P[ra, ra])
$$

□
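
The step-by-step argument above follows the usual forward-simulation pattern: every step of the target from a related pair of states must be matched by the source, ending in a related pair again. The Python sketch below checks this condition on finite labeled transition systems. It is a simplified illustration with hypothetical names: it matches one target step by exactly one source step with the same label, whereas the proof above lets the source thread take several steps and additionally tracks consistency.

```python
def is_simulation(rel, src_steps, tgt_steps):
    """Check that rel is a one-step forward simulation of tgt by src.

    rel: set of (src_state, tgt_state) pairs
    src_steps, tgt_steps: dict mapping a state to a set of (label, next_state)
    """
    for s, t in rel:
        for label, t2 in tgt_steps.get(t, set()):
            # Look for a source step with the same label that lands in rel.
            matched = any(
                lbl == label and (s2, t2) in rel
                for lbl, s2 in src_steps.get(s, set())
            )
            if not matched:
                return False
    return True


# Toy systems (hypothetical): the source has one behavior more than the target.
toy_tgt = {"t0": {("a", "t1")}}
toy_src = {"s0": {("a", "s1"), ("b", "s2")}}
```

With the relation `{("s0", "t0"), ("s1", "t1")}` the check succeeds, which in this toy setting means every labeled trace of the target is also a trace of the source.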

**Lemma F.2.** For a well-locked program $P$, $\text{Beh}_{SC}(P') \subseteq \text{Beh}_{SC}(P)$. Furthermore, every machine state reachable in the SC execution of $P'$ is reachable in the SC execution of $P$.

**Proof.** Every execution of $P'$ in the SC semantics can be simulated by $P$ in the SC semantics. Whenever nondetLock in $P'$ fails (returns 0), tryLock in $P$ also fails. Whenever nondetLock in $P'$ is trying to get the lock in a loop, $P$ does nothing. Finally, whenever nondetLock in $P'$ succeeds in acquiring a lock, tryLock in $P$ may also acquire the lock, since we know that the lock is not held by any other thread.

□
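
The case analysis in this proof can be mirrored by a small Python sketch that lists the outcomes of the two operations under SC for a lock that is either free or held. The outcome labels and function names are hypothetical and serve only to illustrate the matching argument; memory and thread interleavings are not modeled.

```python
def try_lock_outcomes(held):
    """tryLock may always fail; it may acquire only a free lock."""
    return {"fail"} | (set() if held else {"acquire"})


def nondet_lock_outcomes(held):
    """choose{0, Lock(l)}: either return 0, or run Lock.

    Lock acquires a free lock and otherwise keeps spinning.
    """
    return {"fail"} | ({"spin"} if held else {"acquire"})


def matched(held):
    """Every nondetLock outcome is matched by tryLock.

    A spinning iteration is matched by the other program taking no step.
    """
    return all(
        outcome == "spin" or outcome in try_lock_outcomes(held)
        for outcome in nondet_lock_outcomes(held)
    )
```

Both lock states pass the check, reflecting that failure is matched by failure, acquisition by acquisition, and spinning by stuttering.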

**Lemma F.3.** For a well-locked program $P$, we have:

$$
\text{Beh}_{RA}(P) \subseteq \text{Beh}_{RA}(P').
$$

**Proof.** For each step that $P$ takes, $P'$ can simulate the same step under the following simulation relation. First, the machine state of $P'$ and the machine state of $P$ are the same except for the thread views and the messages in their memories. Second, the thread views and the message views in the memory of $P'$ are lower than those in the memory of $P$.

Whenever tryLock in $P$ succeeds in acquiring a lock, nondetLock in $P'$ also succeeds. Whenever tryLock in $P$ fails, nondetLock in $P'$ fails. Since a failing tryLock in $P$ reads a message and a failing nondetLock in $P'$ does not, the views in the machine state of $P'$ remain lower.

□

F.1 Proof of DRF-Lock-RA (Theorem 6.7)

Proof. By the definition of the RA semantics, the RA-execution of $P$ is equal to the RA-execution of $P[ra, ra]$. Thus, if every machine state reachable from an RA-execution of a program $P$ is rlx-race-free, then every machine state reachable from an RA-execution of the program $P[ra, ra]$ is rlx-race-free as well. As a result, we get the following equations.

\[
\begin{aligned}
\text{Beh}_{RA}(P) &= \text{Beh}_{RA}(P[ra, ra]) && \text{(by the definition of RA semantics)} \\
&= \text{Beh}_{PS2.0}(P[ra, ra]) && \text{(by Theorem 6.6)} \\
&= \text{Beh}_{PS2.0}(P) && \text{(by Lemma F.1)}
\end{aligned}
\]

□

F.2 Proof of DRF-Lock-SC (Theorem 6.8)

Proof. By Lemma F.2, we know that every machine state reachable from \(P'\) has no race on any non-lock location. Since \(P'\) accesses lock locations only using \text{Lock} and \text{Unlock}, we can apply the DRF-LOCK Theorem in [12].

\[
\begin{aligned}
\text{Beh}_{SC}(P) &\subseteq \text{Beh}_{RA}(P) && \text{(trivial)} \\
&\subseteq \text{Beh}_{RA}(P') && \text{(by Lemma F.3)} \\
&= \text{Beh}_{SC}(P') && \text{(by DRF-LOCK Theorem in [12])} \\
&\subseteq \text{Beh}_{SC}(P) && \text{(by Lemma F.2)}
\end{aligned}
\]

Therefore, we get \(\text{Beh}_{SC}(P) = \text{Beh}_{RA}(P) = \text{Beh}_{RA}(P') = \text{Beh}_{SC}(P')\). As Theorem 6.7 says \(\text{Beh}_{RA}(P) = \text{Beh}_{PS2.0}(P)\), we finally get \(\text{Beh}_{SC}(P) = \text{Beh}_{RA}(P) = \text{Beh}_{PS2.0}(P)\). □