Putting Weak Memory in Order via a Promising Intermediate Representation

SUNG-HWAN LEE, Seoul National University, Korea
MINKI CHO, Seoul National University, Korea
ROY MARGALIT, Tel Aviv University, Israel
CHUNG-KIL HUR, Seoul National University, Korea
ORI LAHAV, Tel Aviv University, Israel

We investigate the problem of developing an “in-order” shared-memory concurrency model for languages like C and C++, which executes instructions following their program order, and is thus more amenable to reasoning and verification compared to recent complex proposals with out-of-order execution. We demonstrate that it is possible to fully support non-atomic accesses in an in-order model in a way that validates all compiler optimizations that are performed in single-threaded code (including irrelevant load introduction). The key to doing so is to utilize the distinction between a source model (with catch-fire semantics) and an intermediate representation (IR) model (with undefined value for racy reads) and formally establish the soundness of mapping from source to IR. As for relaxed atomic accesses, an in-order model must forbid load-store reordering. We discuss the rather limited performance impact of this fact and present a pragmatic approach to this problem, which, in the long term, requires a new kind of hardware store instructions for implementing relaxed stores.

The source and IR semantics proposed in this paper are based on recent versions of the promising semantics, and the correctness proofs of the mappings from the source to the IR and from the IR to Armv8 are mechanized in Coq. This work is the first to formally relate an in-order source model and an out-of-order IR model with the goal of having an in-order source semantics without any performance overhead for non-atomics.

CCS Concepts:
- Theory of computation → Concurrency; Operational semantics
- Software and its engineering → Semantics; Compilers

Additional Key Words and Phrases: Relaxed Memory Concurrency; Operational Semantics; Compiler Optimizations; Intermediate Representation

ACM Reference Format:

1 INTRODUCTION

Despite decades of research, finding the right semantics for concurrent shared-memory programs in high-level languages is still considered to be a major open problem [Batty et al. 2015], which prevailing languages like C, C++, and Java have yet to overcome. The technical challenge lies in precisely identifying and balancing the conflicting desiderata of programmers, compilers, and modern multicore architectures. Generally speaking, programmers need a simple semantics that...
allows them to reason about their code; compilers strive to apply program optimizations and to be able to justify their correctness; and hardware aims at efficient non-blocking implementations that only provide rather weak consistency guarantees.

Recent years have shown multiple proposals of shared-memory concurrency models that aim to address this challenge (see, e.g., [Chakraborty and Vafeiadis 2019; Jagadeesan et al. 2020; Jeffrey et al. 2022; Kang et al. 2017; Lee et al. 2020; Paviotti et al. 2020]). These models typically focus on performance, aiming at a semantics that allows various compiler optimizations and efficient mapping to hardware. In particular, to support load-store reordering (of accesses to different addresses), either as a part of a compiler optimization or as a possible result of the hardware’s pipeline, all these models employ some sort of out-of-order execution that allows reads to read from future writes. For not sacrificing programmability, which typically means that “out-of-thin-air” values should be forbidden and the model should admit well-accepted data-race-freedom (DRF) guarantees [Adve and Hill 1990; Batty et al. 2015; Cho et al. 2021], such models have to restrict their speculation mechanisms in a way in which certain program behaviors have to be justified by the existence of other program behaviors. For instance, the promising semantics by Kang et al. [2017] requires promises of future writes to be justified by another thread-local run of the program, and event-structure models, as the one by Chakraborty and Vafeiadis [2019], enforce consistency constraints on a structure that captures several runs of the program. This makes these models rather complex to reason about, and, indeed, besides several notable exceptions for particular models (see, e.g., [Abdulla et al. 2021; Svendsen et al. 2018]), existing verification research cannot handle such models.

This paper is devoted to investigating an alternative approach that puts amenability to reasoning and verification in the center. For that, we are after an in-order semantics, where each allowed behavior is accounted for by one execution of the program in which the actions of the different threads follow the order dictated in their code, and every read reads from a previously executed write. An in-order semantics allows one to incrementally reason about the code line-by-line, considering at each step only the effect of the execution so far and the current instruction. In contrast, reasoning about out-of-order semantics is much harder as it requires considering future instructions (or revisiting previous decisions) based on other possible program executions.

The most intuitive example for an in-order semantics is the well-known model of sequential consistency (SC), where different threads take turns communicating with a single global memory in the form of address-to-value mapping, and every read obtains its value from the last previously executed write to the same address. Nevertheless, various other models, weaker than SC, are still in-order. In particular, RC11 [Lahav et al. 2017], a well-studied declarative model for C/C++ that follows the proposal in [Boehm and Demsky 2014] to forbid cycles in the union of the program order and the reads-from relation, is an in-order model. Verification for RC11-style models has been extensively studied, and multiple techniques have been developed, including program logics [Dang et al. 2020, 2022; Doherty et al. 2019], model checkers and fuzzers [Kokologiannakis et al. 2017, 2019; Luo and Demsky 2021], automatic robustness analyses [Margalit and Lahav 2021], and library abstraction theorems [Raad et al. 2019; Singh and Lahav 2023].

Accordingly, our goal is to study: How far can one go in an in-order semantics? More concretely, we aim to understand how in-order models can be designed in a way that minimizes the overhead they cause for compiler optimizations and mapping to modern hardware.

We target C/C++ as a source language [Batty et al. 2011; Boehm and Adve 2008]. Most importantly, this means that programmers distinguish between synchronization accesses (“atomics”) and weak accesses that should not be used for inter-thread synchronization (“non-atomics”), and can cause any behavior when they are misused for this purpose nonetheless. The latter allows us to rely on “undefined behavior” for racy non-atomics, which is a crucial ingredient of our proposed approach. 

(Thus, we do not provide a solution for “safe” languages that cannot tolerate undefined behavior.)
For atomics, we support the main shared memory constructs of C/C++11, including relaxed and release/acquire accesses, read-modify-writes, and release/acquire and sequentially consistent fences.

Non-atomic accesses account for the vast majority of memory accesses in concurrent programs, while atomics, which are used for inter-thread communication and synchronization, are relatively rare. In particular, among atomics, the only ones that are intended to allow the problematic load-store reordering are relaxed accesses, which are meant to be used by “very careful” programmers [Boehm and Adve 2008] and are often confined to libraries that are manually optimized by experts. Thus, we believe that the trade-off between performance and amenability to reasoning should be investigated differently for atomics and non-atomics. Next, we separately discuss the performance overhead that is imposed by an in-order semantics for supporting non-atomic accesses and atomic accesses.

**Overhead in Non-atomic Accesses.** Our first question is whether it is possible to have an in-order semantics without imposing any performance overhead for non-atomic accesses. This stems from a principled approach: being non-racy, non-atomics should allow all compiler optimizations that are performed in single-threaded code.¹ We observe that a significant challenge exists for validating this guiding principle in an in-order model, and we are not aware of any existing model that solves this challenge (even for a simple fragment with only non-atomics and strong synchronization accesses with, say, release/acquire semantics). In particular, RC11 invalidates (irrelevant) load introduction, a transformation widely used in sequential code with significant possible performance gains. In fact, the LLVM manual requires that non-atomics should validate all optimizations allowed on sequential accesses (the only exception is store introduction, which compilers avoid also in sequential code), and explicitly mentions that load introduction may be performed by the compiler, and the LLVM compiler indeed introduces non-atomic loads as a part in several of its optimization passes.² The assumptions of the GCC compiler are less clear, but some examples show that it introduces loads as well. A concrete example is given in Fig. 1.

We provide a full solution to this challenge, and design an in-order semantics that does not sacrifice any optimization on non-atomic accesses. Inspired by LLVM, the key to doing so is to utilize the distinction between a source semantics and an intermediate representation (IR) semantics. This allows the separation of concerns: compiler optimizations may be unsound in the source semantics, whereas the IR semantics does not have to be in-order. Indeed, the IR is not meant to be amenable to conventional verification and reasoning, and programmers in the source language only need to know the source semantics. This strategy, however, is not a magic potion: to have a sound

---

¹Compiler optimizations on single-threaded code effectively cover modern hardware’s behaviors of plain loads and stores, so it is sufficient to focus this discussion on validating compiler optimizations. Still, in our results, we prove the correctness of mapping non-atomics to plain machine loads and stores.

²See https://llvm.org/docs/Atomics.html#optimization-outside-atomic [Accessed November 2022].
compilation, these models have to be designed in a way that the IR semantics is stronger than the source semantics (i.e., all behaviors allowed by the IR should be allowed by the source).3

Our main contribution is to show that this approach works with the right choice of source and IR. Concretely, we develop an in-order source model, based on the promise-free fragment of the promising semantics, and an IR model based on a recent version of the promising semantics in [Cho et al. 2022], and prove the required relation between them. Our proposed source model is (slightly) stronger than RC11, which allows the application of previous work on verification under RC11. (In particular, we observe that certain races on non-atomics can be safely ignored in RC11’s catch-fire mechanism.) For the IR, we have ported the result of [Cho et al. 2022], which establishes the correctness of all optimizations on non-atomics that are allowed in sequential code. This means that most compiler optimizations can be formally validated based on sequential reasoning, so even most compiler developers need not understand the out-of-order IR model.

We note that while we mostly employ existing models (with some modifications and simplifications), to the best of our knowledge, this work is the first to formally relate an in-order source model and an out-of-order IR model with the goal of having an in-order source semantics without any performance overhead for non-atomics.

Overhead in Atomic Accesses. Naturally, the next question is about the performance overhead for atomic accesses. Here, the challenge concerns relaxed accesses, which are meant to allow load-store reordering that is in sharp contrast with in-order semantics. Unfortunately, we show that any in-order model that supports all optimizations on non-atomics has to forbid the reordering of a non-atomic/relaxed read followed by a relaxed write. (In particular, this reordering is forbidden in both the source and the IR models we propose.)

What is the practical impact of forbidding this reordering? First, we note that although compiler optimizations that reorder and eliminate atomics were extensively studied before (see, e.g., [Dodds et al. 2018; Vafeiadis et al. 2015]), to the best of our knowledge, existing compilers do not perform any of these optimizations. Then, it remains to understand the implications on the mapping to hardware. Indeed, load-to-store ordering between plain accesses is not guaranteed to be preserved by existing models of modern architectures, like those of Arm [Alglave et al. 2021; Pulte et al. 2017] and Power [Alglave et al. 2014; Sarkar et al. 2011],4 and so, forbidding this reordering seems to require a stronger mapping of relaxed accesses for these architectures.

Interestingly, we observe a significant gap between CPU models and observable behaviors in practice regarding the preservation of load-store ordering. While the abstract models of Arm and Power allow the reorder of loads followed by stores, such behaviors were observed in practice only in very few implementations.5 In our discussion with CPU architects from Arm, we confirmed that the load-store reordering is explicitly prohibited in Cortex processors, starting from Cortex-A76. From this discussion, we further understood the technical trade-offs involved in their design, and learned that, compared to other possible reorderings that the hardware performs, load-store reordering is hard to apply and has rather limited performance benefits.

3It is sufficient to have a correct efficient mapping from the source to the IR, where correctness is in the standard sense: every behavior that is allowed by the IR semantics (of the mapped program) is also allowed by the source semantics (of the source program). Since we do not want to sacrifice any performance in this mapping, we actually consider this mapping being the identity mapping, and, thus, we simply require that the IR semantics is stronger than the source.

4Intel’s architecture (assuming x86-TSO by Owens et al. [2009]), has rather strong semantics for plain loads and stores, which never reorders loads with later stores.

5Load-store reordering (concretely, the weak behavior of the LB litmus test) was never observed on Power as well as on various implementations of Armv8 that were tested in [Alglave et al. 2021, 2014]. An anonymous review of this paper provided information showing that this reordering is observed on Cortex A73, and mentioned that even on Cortex post A76 load-store reordering can be observed when memory locations are mapped to device, or when vector instructions are used.

Accordingly, we propose a practical approach to this challenge. In the long term, we believe the right way to go is for vendors to introduce new kinds of store instructions, which we call "strong stores", and officially preserve the order from loads to strong stores. We expect a minimal (to no) overhead for these instructions compared to plain stores. In particular, strong stores still admit store-store reorderings, which are commonly observed in practice, and are thus weaker than release stores that are more expensive to implement. Meanwhile, in the absence of such instructions, we propose to compile relaxed writes differently depending on the target hardware: (1) for a target that preserves load-store order, the compilation can use plain accesses; and (2) otherwise, relaxed writes have to be compiled as release writes.

Outline. The rest of this paper is structured as follows. In §2, we present the challenges, key ideas, and observations of this paper in more detail. In §3, we present (a simplified fragment of) the proposed source model and discuss its relation to RC11. In §4, we present (a simplified fragment of) the IR model and establish the soundness of mapping the source model to the IR model. In §5, we discuss the mapping to modern hardware, its soundness, and the proposed additions to hardware models. Finally, in §6, we discuss related work.

Supplementary Material. Our main results ((1) soundness of mapping from source to IR, (2) soundness of mapping from IR to ARMv8, (3) DRF guarantees for the source, and (4) adequacy of sequential reasoning for validating optimizations in the IR) are mechanized in Coq. The supplementary material available online [Lee et al. 2023] includes the Coq development, the full models, a (pen-and-paper) proof of the relation to RC11, and the results of our experiments.

2 CHALLENGES AND KEY IDEAS
In this section, we present more details on the main observations and contributions of this paper. To a significant extent, our central contributions are not in developing new concurrency models and proving their meta-theoretic properties but rather in providing a holistic analysis and approach to the problem of a shared-memory concurrency semantics in a high-level language like C, C++, or Rust. Like in §1, we separately discuss non-atomics (§2.1) and atomics (§2.2) while focusing on compiler optimizations for non-atomics and mapping to hardware for atomics. (Our results include the mapping of non-atomics to plain accesses on hardware, as well as compiler optimizations involving atomics, but these are not discussed in this section.)

2.1 Optimizing Non-Atoms in an In-Order Semantics
Supporting sequential optimizations for non-atomics in an in-order semantics is highly challenging, and, to the best of our knowledge, it was not addressed by previous work. Next, we demonstrate the challenge using the well-known load buffering example (§2.1.1); explain how the "catch-fire" addresses this challenge (§2.1.2); describe why catch-fire semantics cannot support load introduction by the compiler (§2.1.3); outline "undefined value" as an (informal) alternative to "catch-fire" and why it fails in combination with an in-order model (§2.1.4); and conclude with our proposal of having a two-layered model with catch-fire source semantics, and undefined-value-based semantics for the intermediate representation (§2.1.5).

2.1.1 Read-Write Reordering vs. In-Order Semantics. To understand the crux of the challenge, consider the classical example on the right, known as the load buffering litmus test (LB, for short), where all accesses are marked as non-atomics (na). Here and henceforth, we assume that all variables are implicitly initialized to 0. Our requirement on compiler

\[
\begin{align*}
& a := X^\text{na} & b := Y^\text{na} \\
& Y^\text{na} := 1 & X^\text{na} := 1 \\
& \text{print } a & \text{print } b
\end{align*}
\]  

(LB)
optimizations implies that the behavior in which both threads printing 1 must be allowed. Indeed, the compiler may reorder the read from $X$ and the write to $Y$ in the first thread (this is certainly possible in sequential code, thus non-atomics should allow the reordering as well), and then $a = b = 1$ is possible even under SC. This behavior is in tension with the requirement to have an in-order semantics for the source language, which will have to execute one of the reads first, and at that point the only available write to read from is the implicit initialization write of the value 0.

2.1.2 Catch-Fire as a Solution? A well-known approach to address the above example is to exploit the fact that non-atomics are not supposed to be used for inter-thread synchronization and avoid providing any guarantees on the program behaviors when non-atomics participate in data races. This idea, which we refer to as ”catch-fire” semantics, is the cornerstone of the C/C++11 [Batty et al. 2011], and its repaired version RC11 [Lahav et al. 2017], which explicitly states that a data race on non-atomic accesses implies undefined behavior (UB, for short) for the given program.

Accordingly, RC11 allows the annotated behavior of the LB example above, while still being an in-order semantics. A particular run, for instance, could perform both memory accesses of the first thread (read 0 and write 1), observe a forbidden data-race when executing the first (or second) access of the second thread, and then invoke UB. In turn, UB allows any possible continuation of the execution, which in particular includes the ability to print 1 by both threads. This is still an in-order semantics: threads execute their actions in the order specified by the program, a data-race is detected according to previously executed accesses, and UB only affects future decisions.

Remark 1. The original presentation of RC11 in [Lahav et al. 2017] identifies program behaviors with “final outcomes” (mapping each variable to the modification-order-maximal value written to it). The current discussion assumes that behaviors are captured by sequences of system calls (e.g., results of print statements) generated by a given program. RC11 can be easily adapted to this notion by assuming that consistent execution graphs are incrementally constructed during the program run, system calls are observed in the order they were executed along the run, and any suffix of system calls is allowed once a racy execution graph is reached.

2.1.3 Load Introduction. A catch-fire semantics validates various compiler optimizations on non-atomics, including access reordering and redundant access elimination. Indeed, whenever such transformations enable additional behaviors, it can be shown that the source program was already racy, and justify the target behaviors by UB invoked by the source. Catch-fire, however, falls short to fully admit our guiding principle: some transformations allowed on sequential code are still disallowed on non-atomics.

Concretely, the problem is with (irrelevant) load introduction. If the effect of the compiler’s optimization introduces a non-atomic load (which may happen, e.g., when transforming

\[
\text{while } B \text{ do } \{a := X^{\text{na}}, \ldots\} \quad \text{to} \quad a := X^{\text{na}}; \text{ while } B \text{ do } \{\ldots\}
\]

in traces where $B$ evaluates to false), then the target program may be racy (and invoke UB), while the source is not. Thus, any model based on catch-fire cannot validate load introduction.

Load introduction is necessary for multiple optimizations based on speculation, which are commonly performed by compilers (Clang, in particular) when hoisting loads, e.g., as a part of loop invariant code motion, loop unswitching, load-widening or when loading a vector while only a subset of elements is needed. In addition to Fig. 1 from §1, Fig. 2 demonstrates another case where load introduction has the potential to significantly improve performance.

---

6See https://llvm.org/docs/Passes.html [Accessed November 2022].
Fig. 2. An example of load introduction. The program on the left stores $x[j]$ into $y[j]$ for each odd $j$ (and 0 otherwise). The program on the right is a hand-optimized version: it introduces loads from $x[0]$, $x[2]$, $x[4]$, and $x[6]$; merges all loads into a single 8 bytes load; and stores the result with an appropriate mask into $y$ by a single 8 bytes store. External functions (foo and bar) are used to prevent the compiler from eliminating the loads and stores. By compiling both programs with Clang 15.0.1 and running them on ThunderX2 Armv8 server, we observed more than x2 performance gain (average execution time of 0.069s vs. 0.033s).

2.1.4 Undefined Value as a Solution? A natural idea for supporting load introduction is to limit the “undefinedness” to the value being read in racy reads: instead of invoking UB, just leave unspecified the value loaded by a non-atomic racy read, so if this value is never used (and the load is indeed irrelevant), we will not introduce additional behaviors. The LLVM semantics follows this idea: it keeps read-write races to be always well-defined and declares that non-atomic racy reads may return “undef” value. In turn, “undef” can be refined to any value.\footnote{Branching on “undef” is still considered UB. The “freeze” instruction recently introduced in LLVM is a tool to support branching on a possibly undefined value, which is often a result of load introduction [Lee et al. 2017].}

While being tempting at first sight, undefined value for racy reads will not solve our problem. Referring back to the LB example above, it is easy to see that any execution of an in-order semantics can observe a race only in one of the reads, so only one of them can return “undef”, which will not allow both threads to print 1. To fix this, one has to either speculate a data race when performing the first read, or revisit its previous decisions on the read value when performing the second write. Both options lead us to models that are much more complicated than in-order models.

2.1.5 Our Proposal: An Intermediate Representation. The key idea in our approach is to split the semantics into two models: a source model that accounts for the programmers’ needs, and an intermediate representation (IR) model that accounts for the compilers’ (and hardware’s) needs. Then, the compiler first maps the source program to the IR, and only then applies its optimizations. Programmers should be only aware of the source model, which can be in-order (e.g., with catch-fire) since it does not have to support compiler optimizations; and the IR semantics can support compiler optimizations (e.g., with undefined value for racy reads and out-of-order race detection) since it does not have to be in-order. Our manifestation of this approach consists of the following contributions:

(1) We propose a source model, which we denote by vRC11, obtained by adding non-atomic accesses to the promise-free fragment of the promising semantics [Kang et al. 2017; Lee et al. 2020]. This model, which is stronger than RC11, is formulated as an operational model using timestamps and thread-views to justify weak behaviors, and a simple race-detection mechanism that invokes UB for races on non-atomics. Interestingly, we observe that not all such races should invoke UB, and it is sufficient to consider races with previously executed writes (and ignore races with previously executed reads). Thus, we are able to restrict the catch-fire mechanism in a way that deems fewer programs racy but still achieves what catch-fire is needed for.
(2) For the IR model, we develop a simplification of the promising model by Cho et al. [2022], which we denote by PS\textsuperscript{IR}, where (simplified) promises are only needed for race detection. Thus, PS\textsuperscript{IR} justifies an out-of-order behavior by detecting a race with “promises” made by other threads. More concretely, a thread in PS\textsuperscript{IR} can promise to execute a non-atomic write to a location \(X\) in the future, whenever the thread can certify the promise by checking that it can perform a non-atomic write to \(X\) by executing alone. Once a promise is made, another thread reading from \(X\) races with the promise and reads “undef” value. We have ported the result of [Cho et al. 2022] to PS\textsuperscript{IR}, which establishes the correctness of all optimizations on non-atomics that are allowed in sequential code.

(3) We prove that PS\textsuperscript{IR} is stronger than vRC11. Roughly speaking, this is possible because catch-fire is sufficiently weak to account for the IR’s out-of-order behaviors. In other words, once a program exhibits any behavior that stems from an out-of-order execution under PS\textsuperscript{IR}, the same program has a race in a (possibly different) execution under vRC11, where a race leads to UB. To establish the proof, it is enough to show that the source can invoke UB for such an out-of-order execution in PS\textsuperscript{IR}. Here, the key idea is that the thread of vRC11 can follow the certification run (which is required to justify a promise under PS\textsuperscript{IR}) and perform a non-atomic write to \(X\) instead of making a promise to \(X\). Then, the other thread reading from \(X\) races with that non-atomic write, and the program invokes UB under vRC11.

Example 2.1. In the LB example, PS\textsuperscript{IR} allows \(a = b = \text{undef}\) through an out-of-order execution where the first thread promises to write to \(Y\), and the second thread reads “undef” from \(Y\) since the read races with the promise. The first thread could certify its promise before making it by reading 0 from \(X\) and executing \(Y^{na} := 1\). In vRC11, instead of promising the write, the first thread can execute and write to \(Y\) following the certification execution of PS\textsuperscript{IR}. Then, the second thread’s read from \(Y\) becomes racy, and the program invokes UB, which accounts for all possible behaviors.

Remark 2. In fact, since UB by the source accounts for any behavior of PS\textsuperscript{IR}, the proof of mapping the source to the IR can essentially assume that there is no race in the promise-free fragment of PS\textsuperscript{IR}, which makes the mapping proof similar to the proof of the DRF-PF theorem (a data-race-freedom guarantee w.r.t. the promise-free semantics) in [Cho et al. 2021].

We note that the fact that PS\textsuperscript{IR} is stronger than vRC11 allows one to soundly reason about programs under PS\textsuperscript{IR} semantics while assuming vRC11 (which may be needed when the intermediate language itself acts as a source language for another step of compilation and thus is not completely compiler-internal). Such reasoning would be incomplete, but we expect that only a small fraction of programs will need a precise analysis using the exact IR model.

2.2 Mapping Relaxed Accesses to Modern Hardware

In this section, we turn to the question of supporting atomic accesses focusing specifically on relaxed accesses. We demonstrate the challenge (§2.2.1); revisit the assumptions on hardware (§2.2.2); and propose two practical solutions: a long-term solution that depends on hardware vendors implementing our feature request (§2.2.3), and a short-term solution that requires strengthening the existing compiler mapping of relaxed accesses (§2.2.4).

2.2.1 Reordering of Relaxed Accesses in an In-Order Semantics. Like non-atomic accesses, relaxed accesses in C/C++11 were intended to be mapped to plain loads and stores in the hardware even when the hardware model allows load-store reordering. Clearly, this is in contrast with an in-order semantics (indeed, consider the LB example above with relaxed accesses). Moreover, since relaxed accesses are meant to be used in races (for improving the performance of certain concurrency idioms; see, e.g., [Sinclair et al. 2017]), catch-fire is not a possible solution here. In fact, as the next example shows, even the reordering of a non-atomic load followed by a relaxed atomic
(1) introduce a non-atomic read \( c := X^{na} \); (2) replace the non-deterministic choice with an expression; (3) forward the read \( c := X^{na} \) to the read \( b := X^{na} \) in the if-branch, turning it into \( b := 1 \); (4) forward \( b := 1 \) to the expression \( b := 1 \) and the print statement; (5) hoist the common write \( Y^{rlx} := 1 \) out of the branch; and (6) reorder \( c := X^{na} \) and \( Y^{rlx} := 1 \).

Fig. 3. A sequence of compiler transformations on non-atomic (1–5) and the problematic reordering of a non-atomic load followed by a relaxed store (6) applied to the second thread of LB-CHOICE.

store cannot be allowed in an in-order semantics:

<table>
<thead>
<tr>
<th></th>
<th>(2)</th>
<th>(3)</th>
<th>(4)</th>
<th>(5)</th>
<th>(6)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>( c := X^{na} )</td>
<td>( c := X^{na} )</td>
<td>( c := X^{na} )</td>
<td>( c := X^{na} )</td>
<td>( Y^{rlx} := 1 )</td>
</tr>
<tr>
<td>if * then</td>
<td>if ( b := X^{na} )</td>
<td>if ( c = 1 ) then</td>
<td>if ( c = 1 ) then</td>
<td>if ( c = 1 ) then</td>
<td>( c := X^{na} )</td>
</tr>
<tr>
<td></td>
<td>( Y^{rlx} := 1 )</td>
<td>( b := 1 )</td>
<td>( b := 1 )</td>
<td>( b := 1 )</td>
<td>( Y^{rlx} := 1 )</td>
</tr>
<tr>
<td>if ( b = 1 ) then</td>
<td>print ( b )</td>
<td>if ( b = 1 ) then</td>
<td>if ( b = 1 ) then</td>
<td>( Y^{rlx} := 1 )</td>
<td>print 1</td>
</tr>
<tr>
<td></td>
<td>( Y^{rlx} := 1 )</td>
<td>( Y^{rlx} := 1 )</td>
<td>print ( b )</td>
<td>print 1</td>
<td></td>
</tr>
<tr>
<td>else</td>
<td>else</td>
<td>else</td>
<td>else</td>
<td>( Y^{rlx} := 1 )</td>
<td></td>
</tr>
<tr>
<td>( Y^{rlx} := 1 )</td>
<td>( Y^{rlx} := 1 )</td>
<td>( Y^{rlx} := 1 )</td>
<td>( Y^{rlx} := 1 )</td>
<td>print 1</td>
<td></td>
</tr>
</tbody>
</table>

Here, “*” means a non-deterministic choice that non-deterministically returns arbitrary value.\(^8\) Assuming an in-order semantics, the source program on the left cannot print 1 since either one of \( a := Y^{rlx} \) or \( b := X^{na} \) executes first and can only read 0 (from the initial memory). However, as shown in Fig. 3, by applying a sequence of compiler transformations on non-atomics and finally reordering (by the compiler or the target hardware) \( c := X^{na} \) and \( Y^{rlx} := 1 \) in the second thread, the program on the left can be transformed into the program on the right. Then, the second thread printing 1 is easily observable (even under SC). Therefore, the reordering of a non-atomic load followed by a relaxed store must be forbidden in any in-order source semantics that aims to allow common compiler transformations on non-atomics.

### 2.2.2 Revisiting the Assumptions on Hardware.

We observe that there is a significant gap between CPU models and observable behaviors in practice regarding the preservation of load-store ordering. While the abstract models effectively allow the reordering of loads with subsequent stores, such behaviors are rarely observed in practice. Indeed, previous experiments performed to validate the hardware models rarely observed weak behaviors of the LB litmus test. First, such behaviors were never observed on any Power hardware [Alglave et al. 2014; Sarkar et al. 2011]. Second, while they were observed on several Armv7 implementations, to the best of our knowledge, for Armv8, LB was only observed on Qualcomm’s Snapdragon 820 mobile processors [Alglave et al. 2021] and on Cortex A73.\(^9\) To gain more confidence, we experimentally tested a newer version, Qualcomm’s Snapdragon 888 processor, and the weak behaviors of LB were not observed there.

After discussing with Arm engineers, we gained a better understanding of the architectural reasons why the potential performance improvement by allowing load-store reordering is relatively

---

\(^8\)A non-deterministic choice corresponds to “freezing” an undefined value in LLVM. See https://llvm.org/docs/LangRef.html#undefined-values and https://llvm.org/docs/LangRef.html#freeze-instruction [Accessed November 2022].

\(^9\)Snapdragon 820 exhibits various other weak behaviors that are forbidden by the official model (950 such tests reported in [Alglave et al. 2021]). The information about Cortex A73 was obtained from the anonymous PLDI reviewer.
small. Essentially, this stems from the fact that a store can be treated as completed in its own core when it is added to the core-local store buffer before being made visible to other cores. Thus, no intra-core optimization is prevented by preserving load-store ordering. The only exception is that such reordering may reduce the pressure on the store buffer (so that fewer stores stall due to the buffer being full), as it allows to commit a store from the store buffer to the shared storage possibly before previous loads were completed. However, committing stores early complicates the cache implementation regarding the ECC (error correction code) logic, and before committing a store, the core must check that all incomplete preceding loads will never raise exceptions and are not aliased with the store to be committed.

Remark 3. Unlike load-store ordering, preserving store-store ordering is rather expensive. For instance, in Cortex A76 and later versions, a store from the store buffer is committed to a merge buffer when it is the oldest store (i.e., all preceding loads are completed, and all preceding stores are already committed to the merge buffer). Then, stores in the merge buffer may be reordered to group together those writes that fit in the same cache line, which are merged and committed at once. Such reordering between stores greatly reduces cache accesses and is thus considered performance-critical, which is why store-store reordering visible to other cores is needed.

2.2.3 A Long Term Practical Solution. Based on the above discussion, we raise a clear “feature request” from hardware vendors. Concretely, we propose hardware vendors to introduce a new kind of store instructions, which we call "strong stores”, that will preserve load-store ordering. Then, the IR’s relaxed stores will be mapped to strong hardware stores. For most hardware architectures, where architects agree that load-store ordering is preserved, strong stores could be implemented as plain stores. Otherwise, the overhead is not expected to be significant, and, in any case, strong stores should be cheaper than release stores (since they do not need to preserve store-store order).

We believe that this is a case where the input from multiple years of research in concurrent programming language semantics may guide hardware developers. In fact, other features of Arm, such as sequentially consistent accesses and release sequences, were developed hand in hand with C/C++11 constructs. Our proposal is of a similar nature, identifying an opportunity for hardware vendors to significantly assist programming language design with a rather minimal cost.

In §5, we provide the proposed formal additions to the declarative models of Armv8 and Power for supporting strong stores. We have performed extensive validation of these revised models using the Herd model checker [Alglave et al. 2021, 2014], to see that, indeed, when strengthening all stores to be strong, the behaviors that become disallowed are, like LB, behaviors that were not observed on hardware (except for Snapdragon 820 and Cortex A73 as discussed above).

2.2.4 A Short Term Practical Solution. Without the availability of “strong stores” in hardware, we propose to change the compiler mappings to take into account the target CPU. For CPUs that preserve load-store ordering, it is still safe to map relaxed accesses to plain accesses. Otherwise, the compiler should map relaxed store as it maps release stores (e.g., to an stlr instruction on Armv8).

Following Ou and Demsky [2018], mapping relaxed stores as release entails a performance overhead of 3.6% on Arm (although it is rather hard to estimate performance for real-world programs). We note that the mapping scheme that enforces the preservation of load-store ordering by inserting a (fake) branch from every relaxed read, which is more efficient according to Ou and Demsky [2018] (with -0.3% overhead), is unsound for our needs. Indeed, as the LB-CHOICE example shows, we also need to forbid reordering of non-atomic reads followed by relaxed writes. This would require adding a branch from every non-atomic read, which, given the prevalence of non-atomic reads in concurrent programs, is expected to significantly harm performance.
In this section, we present the in-order source semantics vRC11 (standin for “view-based RC11”), which we obtain by adding transitions for non-atomic accesses to the promise-free fragment of the promising semantics (PS, for short). In §3.1, we discuss the relation between vRC11 and RC11 and show that vRC11 is stronger than RC11. Therefore, verification theory and tools developed for RC11 (or any weaker model), such as model checkers [Kokologiannakis et al. 2017, 2019; Luo and Demsky 2020], program logics [Dang et al. 2020; Doko and Vafeiadis 2017], and robustness analysis [Lahav and Margalit 2019], all apply to vRC11. In §3.2, we provide a declarative presentation of the model.

vRC11 is obtained from PS by (i) removing the notion of promises that models early execution of writes and all transitions and components of states related to promises; and (ii) adding transitions for non-atomic and racy accesses. Next, we introduce the fragment of vRC11 consisting of non-atomic, relaxed and release/acquire writes and reads. In turn, read-modify-writes (RMWs), fences, and release sequences are omitted by brevity. They are included in the full model in Coq and presented in [Lee et al. 2023, Appendix A]. Figure 4 summarizes the domains and the transitions of vRC11, highlighting the differences w.r.t. the promise-free fragment of the model in [Kang et al. 2017].

Program Semantics. We assume that the program of each thread is represented as a labeled transition system, whose states, denoted by σ, record the local register file and the continuation code, and transitions σ → σ′ are labeled with the action l that is performed. For silent transitions that do not communicate with the memory (e.g., conditionals and local assignments), we write σ → σ′. Read and write transitions have labels l = R(X, oₙ, v) and l = W(X, oₙ, v), respectively. We also assume transitions executing system calls, which are externally observable (e.g., resulting from print statements), with a label l = Sys(e) where e is the output of the call.

Memory. A memory M is a finite set of messages of the form m = ⟨X@t, v, oₙ, Vₙ⟩ representing a previously executed write of a value v ∈ Val to a location X ∈ Loc. Each message has a timestamp t ∈ Time, where Time is the set of non-negative rational numbers, a write access mode oₙ of the

As in previous work [Kang et al. 2017; Lee et al. 2020], timestamps are densely ordered, so one can always add a message between existing messages. This property is particularly useful when proving the soundness of compiler transformations such as “store merge” that merges two successive stores X := 1; X := 2 into a single store X := 2. In the proof, the source has to mimic the target program by finding a free timestamp for X := 1 before X := 2.
**States.** A machine state \( M = \langle \mathcal{T}, M \rangle \) consists of a function \( \mathcal{T} \) assigning a thread state to each thread identifier, and a memory \( M \) shared among the threads. A thread state is a pair \( T = \langle \sigma, V \rangle \) where \( \sigma \) is a local program state and \( V \in \text{View} \) is a thread view, recording the latest timestamp that has been observed the thread for each location. The initial thread state consists of the initial program state and the 0-view assigning 0 to each location.

**Read Step.** A thread can read a message \( \langle X@t, o, o_n, V_n \rangle \in M \) with a timestamp greater than or equal to the thread’s view of \( X \) (i.e., \( V(X) \leq t \)), updating its view of \( X \) to include the timestamp \( t \) of the message. If the read is an acquire (acq) read, the thread also acquires the message view \( V_n \) and joins it to its own view by taking pointwise maximum (denoted by \( \sqcup \)).

**Write Step.** A thread writes by adding a message \( m = \langle X@t, o, o_n, V_n \rangle \) to the memory \( M \) provided that \( t \) is greater than the thread’s view \( V(X) < t \) and that there is no existing message in \( M \) with location \( X \) and timestamp \( t \) (denoted by \( M \ni m \)). The access mode \( o_w \) of the write operation is recorded in \( m \). The thread updates its view to \( V' = V[X \mapsto t] \). A release write records the thread’s view \( V_n = V' \) in the message, while non-release writes have the 0-view in \( V_n \).

**Racy Access.** A memory access to location \( X \) by a thread with view \( V \) is racy if there is some message \( \langle X@t, o, o_n, V_n \rangle \in M \) with \( V(X) < t \) and either the message is written by a non-atomic write (\( o_w = \text{na} \)) or the access itself is non-atomic (as defined in (race) in Fig. 4). Executing a racy read or a racy write leads the thread to the \( \bot \) program state.

**Machine Step.** Machine steps are obtained as standard interleaving of thread steps \( \langle \mathcal{T}(\tau), M \rangle \rightarrow \langle \mathcal{T}', M' \rangle \). If the thread detects a race and steps to \( \bot \), the machine may take a (pf-machine: ub) step that leads to the \( \bot \) machine state, that is later interpreted as UB.

**Behavior.** An (observable) behavior is a sequence \( s = \langle e_1, e_2, \ldots, e_n \rangle \) of system calls. A machine state \( M \) generates a behavior \( s \), denoted by \( M \downarrow s \), if \( s \) is obtained by restricting a trace of vRC11 starting from \( M \) to system call labels and replacing UB by an arbitrary suffix of system calls. With standard notations for sequences, \( M \downarrow s \) is defined by:

<table>
<thead>
<tr>
<th>terminal ((M))</th>
<th>( M_1 \rightarrow M_2 ) ( M_2 \downarrow s )</th>
<th>( M \rightarrow \langle \bot, _ \rangle ) ( M \downarrow s )</th>
</tr>
</thead>
<tbody>
<tr>
<td>( M \downarrow e )</td>
<td>( M_1 \downarrow s )</td>
<td>( M \downarrow s )</td>
</tr>
</tbody>
</table>

Here, \( \text{terminal} \!(M) \) means that the machine state \( M \) is terminal (i.e., every thread has empty continuation code). As captured by the last rule, once a UB is invoked during the execution, the machine exhibits any behavior that is prefixed with the sequence of system calls occurred before the invocation of the UB. We let \( [\text{prog}]_{\text{vRC11}} = \{ s \mid \text{init}(\text{prog}) \downarrow s \} \), which denotes the set of all behaviors that an initial machine state \( \text{init}(\text{prog}) \) of a program \( \text{prog} \) exhibits.

**Example 3.1.** The “store buffering” test on the right demonstrates how the memory and the thread views of vRC11 captures weak behaviors exhibited by the reordering of a store followed by a load. The behavior of both threads printing 0 is allowed by vRC11.

Specifically, the first thread writes 1 to \( X \) by adding a message \( \langle X@t, 1, r1x, \lambda X. 0 \rangle \) with some timestamp \( t > 0 \) and increasing its thread view of \( X \) to \( t \). After the write, the thread reads from the initial message \( \langle Y@0, 0, \text{na}, \lambda X. 0 \rangle \). By executing the second thread in the same way, it can read either from the initial message \( \langle X@0, 0, \text{na}, \lambda X. 0 \rangle \) (since its view of \( X \) is still 0) or from the message of the first thread. Therefore, both threads can read 0 at the same execution.

**Example 3.2.** We show how vRC11 allows both threads printing 1 in the LB example in §2. Suppose that the first thread reads 0 from the initial message for \( X \), and writes 1 to \( Y \) by adding

\[ X^{r1x} := 1 \]
\[ a := X^{r1x} \]
\[ b := X^{r1x} \]  

(SB)

\[ \text{print } a \]
\[ \text{print } b \]
a message \( m_Y = \langle Y @ t, 1, \text{na}, \lambda X. 0 \rangle \) with some \( t > 0 \). Then, the read from \( Y \) by the second thread races with the message \( m_Y \) since it has a timestamp \( t \) greater than the timestamp of \( Y \) in the second thread’s view (i.e., \( V(Y) = 0 < t \)). Then, due to the racy read from \( Y \), the second thread invokes UB, which generates arbitrary behavior, including the behavior in which both threads print 1.

**Example 3.3.** Consider the message passing program on the right.

The two non-atomic accesses to the data \( D \) are well-synchronized by a release-acquire synchronization through the flag \( F \), and thus, they are not racy. Indeed, the first thread records its view in the message \( F = 1 \) and the view is transferred to the second thread when it reads \( F = 1 \). The read from \( D \) by the second thread is not racy since the timestamp of \( D \) it has in its view is already increased to include the timestamp of the message \( D = 42 \). Moreover, the second thread is only allowed to read 42 from \( D \). In contrast, the program becomes racy if any (or both) of the accesses to \( F \) is made relaxed. Then, there would not be a release-acquire synchronization between the two threads, and the timestamp of \( D \) in the second thread’s view would remain 0 (pointing to the initial message of \( D \)) even after reading 1 from \( F \). In turn, the read from \( D \) would be racy and invoke a UB, as the message \( D = 42 \) would have a higher timestamp than the second thread’s view of \( D \).

We have ported the Coq proof by Cho et al. [2022] to establish the local DRF guarantees, LDRF-RA and LDRF-SC, for vRC11. Generally speaking, data-race-freedom (DRF) guarantees ensure “strong” semantics for programs that are race-free under the “strong” semantics, and thus provide an essential formal justification for defensive programming. Local DRF (LDRF) guarantees further extend this idea to be applicable also in the presence of races on some unrelated locations (e.g., confined in optimized libraries). LDRF-RA means that we consider release/acquire semantics as the strong semantics, and LDRF-SC means that under the strong semantics, threads can only access messages with globally maximal timestamps.

### 3.1 Relating vRC11 to RC11

The RC11 [Lahav et al. 2017] memory model addresses two problems of the C/C++11 model: its flawed semantics for sequentially consistent accesses and fences (which is unrelated to the current paper) and the more crucial problem of “out-of-thin-air” reads [Batty et al. 2015] that breaks the fundamental DRF guarantee. To solve the latter problem, following [Boehm and Demsky 2014], RC11 takes a conservative approach and forbids cycles in the union of the program order and the reads-from relation. As discussed before, verification of concurrent programs under RC11 has been extensively studied and multiple verification methods and tools have been developed. The next theorem states that vRC11, the source model of the present paper, is stronger than RC11. Hence, the soundness of all verification approaches for RC11 applies to vRC11 as well.

**Theorem 3.4.** For every program \( \text{prog} \), \([\text{prog}]_{vRC11} \subseteq [\text{prog}]_{RC11}\).

We provide a (pen-and-paper) proof in [Lee et al. 2023, Appendix C], based on a declarative presentation of vRC11 (see §3.2), which can be more easily compared to RC11. Next, we demonstrate behaviors allowed by RC11 but disallowed by vRC11 using examples.

Putting presentation aside, the main difference between vRC11 and RC11 is related to the fact that an access in vRC11 can only race with previously executed writes, but not with previously executed reads. To illustrate this point, consider the following example:

\[
\begin{align*}
  a := X^\text{na} \\
  b := Y^\text{rlx} \\
  Y^\text{rlx} := 1 \\
  \text{if } b = 1 \text{ then } X^\text{na} := 42
\end{align*}
\]

(RW-RACE)

In both vRC11 and RC11, the read of \( X \) has to return 0, but this program is considered racy in RC11 but not in vRC11. Specifically, both vRC11 and RC11 allow the execution where the second thread...
reads 1 from Y (from the write of the first thread) and writes 42 to X. In RC11 this execution is
demed racy, since it has two accesses to the same location, such that (i) one of them is a write;
(ii) one of them is non-atomic; and (iii) they are not properly synchronized by release/acquire
accesses. In contrast, vRC11 does not view this execution as racy. In vRC11, an access can only race
with a message to the same location that already exists in the memory. Thus, the write to X by the
second thread is never racy since there has not been any other write to X. In other words, a write
can never race with a read executed before the write. Note that the execution $X^{na} := 42$ requires a
message $Y = 1$ in the memory, so it cannot precede the read $a := X^{na}$ by the first thread.

Remark 4. Deeming programs like RW-RACE as non-racy may allow performance improvements
in certain programming idioms that are forbidden in RC11. For example, consider the following
multiple-readers-single-writer (MRSW) lock pattern:

```
... a := X^{na}  |  ... b := X^{na}  |  writer-lock()
reader-unlock() | reader-unlock() | X^{na} := 42
```

An MRSW lock protecting a location $X$ allows multiple readers to read from $X$ concurrently, while
the writer should be exclusive, blocking any other reader or writer. A typical implementation of
an MRSW lock maintains a counter counting how many readers currently hold the reader-lock.
For such an implementation, `reader-unlock()` decreases the counter using a fetch-and-decrement
operation and, `writer-lock()` checks if the counter reaches 0 and atomically swaps the value of the
counter to some special value using a compare-and-swap. Under RC11, to prevent the race between
the reads and the later write, `reader-unlock()` and `writer-lock()` should form release-acquire
synchronization. In contrast, as in RW-RACE, such synchronization is unnecessary under vRC11
since a write never races with a read executed before the write. Therefore, vRC11 allows one to
relax the write access mode of the fetch-and-add in `reader-unlock()` from rel to rlx. Moreover,
when there is only one writer thread, `writer-lock()` can be further optimized to use a relaxed
RMW instead of an acquire RMW.

In addition to the above, even for races with previously executed writes, the operational race
condition of vRC11 is more restrictive than the race definition in RC11, where two accesses to the
same location are considered racy if they are not “well-synchronized” (which is formally defined
using the “happens-before” relation). This can be observed in programs when certain locations are
accessed by both atomic and non-atomic accesses, as in the following example:

```
X^{rlx} := 1  |  a := X^{rlx}  
\text{if} a = 1 \text{ then } b := X^{na}
```

(COH-RACE)

In both vRC11 and RC11, the read of $X$ has to return 1, but, again, this program is racy in RC11
but not in vRC11. To see this, consider an execution where the relaxed read $a := X^{rlx}$ by the
second thread reads 1 written by the first thread. The write $X^{rlx} := 1$ by the first thread and the
non-atomic read $b := X^{na}$ by the second are racy in RC11 since they are not well-synchronized via
a release-acquire synchronization. In vRC11, the two accesses are not racy: once the second thread
reads 1 by the relaxed read $a := X^{rlx}$, its view to $X$ increases to include the message $X = 1$. Then,
when the thread performs a non-atomic read from $X$, no message to $X$ has a timestamp higher than
the thread’s view (i.e., only the message $X = 1$ can be read by the second thread). Therefore, the
non-atomic read by the second thread does not race with the write of the first thread.

Both of the above examples demonstrate cases that RC11 assigns UB to a program, whereas
vRC11 gives it a defined semantics. We believe that races in vRC11 have a clear and simple meaning:
a read access is racy iff it can read from more than one message, and a write access is racy iff it
can “overwrite” more than one message. The examples above show cases where RC11 forces a non-atomic read to read from a particular write, but the read is still considered racy in RC11.

Finally, there also is a difference between the two models related to SC-fences (which are not presented above but included in the full model). In vRC11 the semantics of SC-fences is similar to the one in the RC20 model in [Margalit and Lahav 2021], which is stronger than their semantics in RC11. In particular, SC-fences in vRC11 model can be expressed in terms of a release and acquire fences and an RMW to an otherwise unused location (see [Margalit and Lahav 2021, Remark 1]). We do not discuss further this difference since it is orthogonal to our main topic.

### 3.2 A Declarative Presentation

For informed readers, we provide a declarative (a.k.a. axiomatic) presentation of vRC11. Such presentation is more concise than the operational one, and it is especially useful for comparing vRC11 to other models that are presented in a similar declarative fashion such as C/C++11. In the following, we consider the full model with RMWs and fences. Due to lack of space, we refer to [Lahav et al. 2017], which we build on, for more background and examples of this definition style. (In any case, this technical section can be skipped when reading the paper.) The equivalence between the operational and declarative models is proved in [Lee et al. 2023, Appendix B].

In declarative models, program executions are represented by execution graphs, whose nodes, called events, keep track of accesses to the shared memory, and edges provide several (partial) orders on these accesses. We assume that events are divided into three sets: writes (W), reads (R), and fences (F). We use standard notations to retrieve events properties (such as $loc(e)$ for the location accessed in $e$ and $mod(e)$ for the access mode) and to restrict sets accordingly (such as $W^{rel}$ for the set of release writes). Our execution graphs employ the standard basic relations: a program order (po) that totally orders the events of each thread; an RMW relation (rmw) that distinguishes the read-write pairs that together form an RMW; a reads-from relation (rf) that links each write event $w$ to the read events that read their value from $w$; and a modification order (mo), a.k.a. coherence order, that totally orders all writes to the same location. Based on these relations, several other relations are derived (all as in RC11) using standard relational notations:

\[
\begin{align*}
po \cap & \{ (e_1, e_2) \in po \mid loc(e_1) = loc(e_2) \} & \text{(po-same-location)} \\
rb & \equiv rf^{-1} \cup mo & \text{(reads-before, a.k.a. from-read)} \\
eco & \equiv (rf \cup mo \cup rb)^+ & \text{(extended-coherence-order)} \\
rs & \equiv [W] \cup po \cup [W^{rel}] \cup [W^{rlx}] \cup [R^{rf}] \cup [R^{rmw}] & \text{(release-sequence)} \\
sw & \equiv ([W^{rel}] \cup [F^{rel}] \cup [R^{rel}] \cup [F^{rlx}] \cup [R^{rlx}] \cup [R^{acq}] \cup [F^{acq}]) & \text{(synchronized-with)} \\
hb & \equiv (po \cup sw)^+ & \text{(happens-before)}
\end{align*}
\]

Now, to handle SC-fences we include another primitive relation in execution graphs that determines the order of SC-fences. We call this relation the SC-order, denoted by $sc$, and require it to be a total strict order on all the SC-fences (i.e., on $F^{sc}$) in the execution graph. (Like rf and mo, sc is existentially quantified—a behavior of a program is justified by some sc order of a corresponding graph.) Using sc we derive the execution order, which is a partial order on events that operational runs follow (note that $hb \subseteq exec$):

\[
exec \equiv (po \cup rf \cup sc)^+ & \text{(execution-order)}
\]

Then, consistent graphs are defined as follows.

**Definition 3.5.** An execution graph $G$ is vRC11-consistent if the following hold for its relations:

- $hb \cup eco$ is irreflexive. (COHERENCE)
- $rmw \cap (rb \cup mo) = \emptyset$. (ATOMICITY)
- $hb \cup sc ; hb : eco$ is irreflexive. (SC-FENCE)
- $exec$ is irreflexive. (NO-LB)
The **coherence** constraint is standard, and it ensures “SC-per-location”. The **sc-fence** constraint gives the semantics to SC-fences, so they forbid, e.g., store buffering behaviors when inserted between writes and reads. The **atomicity** constraint ensures the atomicity of RMWs. Finally, **no-lb** demonstrates that vRC11 is “in-order” as it entails the acyclicity of the union of the program order and the reads-from relation.

**Example 3.6.** As we saw in Example 3.3, in the MP program the second thread can only read 42 from D. To see how this follows from the declarative model, we depict the execution graph obtained when trying to read 0 (the initial value), and explain why it is inconsistent. The nodes, labels, and program order (po) arise from the program behavior we analyze. Then, the reads-from relation (rf) is forced since every read has to read its value from some write writing that value. The modification order (mo) between init and e1 is also forced: it has to order these two nodes (as both write to the same location), and going in the opposite order would violate coherence as we have hb from init to e1. Then, according to the definition above, a “reads-before” edge (rb) is induced from e4 to e1, which implies ⟨e4, e1⟩ ∈ eco. Now, since e2 and e3 are rel and acq, we have an sw edge between them, inducing hb from e1 to e4. Together, this violates coherence since we have ⟨e1, e4⟩ ∈ hb ∪ eco.

To complete the presentation of the model, we define what execution graphs are considered racy, with the help of additional derived relations:

\[
\begin{align*}
\text{conflict} & \triangleq \left\{ (e_1, e_2) \mid e_1 \neq e_2 \land (\text{typ}(e_1) = \text{W} \lor \text{typ}(e_2) = \text{W}) \land \text{loc}(e_1) = \text{loc}(e_2) \land (\text{mod}(e_1) = \text{na} \lor \text{mod}(e_2) = \text{na}) \right\} \\
\text{pb} & \triangleq [W] ; \text{rf}^2 ; \text{hb} ; \text{sc}^2 ; \text{hb}^2 \\
\text{raceWW} & \triangleq [W] ; \text{conflict} ; [W] \setminus (\text{pb} \cup \text{exec}^{-1}) \\
\text{raceWR} & \triangleq [W] ; \text{conflict} ; [R] \setminus (\text{pb} \cup \text{exec}^{-1}) \\
\end{align*}
\]

Roughly, ⟨w, e⟩ ∈ pb means that the write w has been observed by the thread executing e before it executes e, where observations are propagated through release/acquire synchronization and SC-fences. In the operational model, this means that (the message associated with) w has a timestamp lower than the timestamp of the location of w in the current view of the thread executing e. Then, raceWW relates two conflicting writes, w1 to w2, when w1 has not propagated before w2 (⟨w1, w2⟩ /∈ pb) and w1 can be executed before w2 (⟨w2, w1⟩ /∈ exec). Similarly, raceWR relates conflicting write and read, w to r, when w has not propagated before r (⟨w, r⟩ /∈ pb) and w can be executed before r (⟨r, w⟩ /∈ exec).

Finally, we say that execution graph G is vRC11-racy if raceWW ∪ raceWR ≠ ∅, and as in RC11, a program outcome is allowed if it is induced by some vRC11-consistent execution graph that is generated by the program, or if some racy vRC11-consistent execution graph is generated by the program. The latter disjunct corresponds to the program invoking UB.

**Remark 5.** An equivalent model is obtained if we include sc inside hb (together with po and sw). In this presentation, **sc-fence** is not needed, and the definition of pb can be simplified to rf^2 ; hb.

**Example 3.7.** The RW-RACE program above does not generate a racy vRC11-consistent execution graph. Indeed, the only vRC11-consistent execution graph of it executing both non-atomic
We explain the PSIR in the following. There are two major differences between PSIR and vRC11.

4 THE IR MODEL

In this section, we present our model of the intermediate representation, called PSIR, establish the soundness of mapping from vRC11 to PSIR, and show that every sound transformation under the sequential semantics SEQ in [Cho et al. 2022] is also sound under PSIR.

There are two major differences between PSIR and vRC11. First, to allow load introduction, a racy read in PSIR returns an undefined value instead of invoking UB. As demonstrated above, this change alone forbids the weak behavior of LB. PSIR addresses this issue by allowing promises, so it is not an in-order semantics. Intuitively, a thread may promise that it will perform a non-atomic write to a location X in the future, making any accesses to X by other threads to be racy.

In the following, we explain the PSIR model focusing on how it extends and modifies vRC11. Figure 5 summarizes the steps of PSIR. We note that the (silent) and (read) steps are adapted in an obvious way that does not alter the new components of the state, and the (race) definition is also exactly as in vRC11. The full PSIR model in our Coq development includes RMWs, fences, and release sequences, and is presented in [Lee et al. 2023, Appendix D].

Promises. Both each thread state and the global machine state are equipped with a set of locations, called local promises (P) and global promises (Pg), respectively. The (promise) step allows a thread to promise to write to a certain location X in the future by adding X both to its local promises and to the global promises, provided that X has not been already promised by some
thread \((X \notin P_C)\). Once \(X\) being promised, the thread can later fulfill its promise by writing to \(X\) via a non-atomic write. Accordingly, the \((\text{write})\) step is extended to update the local and global promises set by removing \(X\) from them if the write is non-atomic and \(X\) was previously promised.

**Certification.** At each machine step (see \((\text{MACHINE: NORMAL})\)), the thread taking a sequence of steps should certify its promises by demonstrating it is able to fulfill all its promises by taking multiple steps in isolation. The certification requirement is crucial in proving the soundness of mapping from vRC11 to PS\(_\text{IR}\) (see Example 4.4 below).

**Racy Reads.** A racy read retrieves an undefined value (denoted by \(\text{undef}\) in \((\text{racy-read})\)) (unlike invoking UB in vRC11), thereby allowing the compiler transformation that introduces unused loads. In addition, a race occurs with promised writes (see \((\text{promised race})\) definition): a memory access to \(X\) is considered racy also if there is a promise to \(X\) made by another thread.

**Racy Writes.** A racy write invokes UB (like in vRC11), and there is no need to consider promised writes for these races. A thread transition invoking UB directly fulfills the remaining promises (see \((\text{racy-write})\)), so the local promises set is made empty after the transition, which allows for a successful certification process.

**System Calls.** A system call requires the local promises to be empty \((P = \emptyset)\). In other words, a write cannot be promised over a system call. Intuitively, this means that a system call followed by a (non-atomic) write cannot be reordered.

**Behavior.** Program behaviors under PS\(_\text{IR}\) are defined as for vRC11, with one modification: when \(\text{undef}\) is a part of a system call output in an execution trace, then it can be refined to any concrete value in the program behavior (e.g., \(\text{print}(\text{undef})\) in a trace can be mapped to \(\text{print}(1)\) in the behavior). We denote by \(\{\text{prog}\}_{\text{PS}_{\text{IR}}}\) the set of behaviors a program \(\text{prog}\) exhibits under PS\(_{\text{IR}}\).

Example 4.1. In contrast to vRC11, in which a racy read invokes UB, in PS\(_\text{IR}\) a racy read returns \(\text{undef}\). Thus, PS\(_\text{IR}\) justifies LB example using a promise. Specifically, the first thread promises to \(Y\), certified by reading 0 from \(X\) and writing 1 to \(Y\). Then, the read from \(Y\) by the second thread is racy with that promise and it returns \(\text{undef}\). After the racy read, the second thread writes 1 to \(X\) and prints 1. (Since \(\text{undef}\) represents an arbitrary value, it can be refined to 1.) Now, the first thread reads 1 from \(X\), fulfills its promise to \(Y\) by performing a non-atomic write, and prints 1 as well.

Example 4.2. The following example demonstrates that a promise can be certified and fulfilled by different write instructions even with different written values.

\[
\begin{align*}
a &= X^a \\
Y^{\text{rlx}} &= a \\
\text{if } b \neq 0 \text{ then } X^a &:= 1 \\
\text{print } 42 \\
\text{else } X^a &:= 2 \\
\end{align*}
\]

\[
\begin{align*}
a &= X^a \\
Y^{\text{rlx}} &= a \\
X^a &:= 2 \\
\text{if } b \neq 0 \text{ then } X^a &:= 1 \\
\text{print } 42 \\
\end{align*}
\]

The program on the left can be transformed into the program on the right by applying a sequence of compiler transformations in the second thread: \((i)\) split the non-atomic write \(X^a := 1\) into two writes \(X^a := 2\) followed by \(X^a := 1\); \((ii)\) hoist the common write \(X^a := 2\) out of the branch; and \((iii)\) reorder the read \(b := Y^{\text{rlx}}\) and the write \(X^a := 2\). Since the program on the right is allowed to print 42 (even under SC), PS\(_\text{IR}\) should allow the same behavior for the program on the left. Indeed, the program on the left can print 42 through the following PS\(_\text{IR}\) execution: \((1)\) the second thread promises to \(X\), certifying it by reading 0 from \(Y\) and writing 2 to \(X\); \((2)\) the first thread reads \(\text{undef}\) from \(X\) by performing a racy read and writes \(\text{undef}\) to \(Y\); and \((3)\) the second thread reads \(\text{undef}\) from \(Y\), enters the then-branch as \(b \neq 0\) evaluates to \(\text{undef}\), fulfills its promise to \(X\) by writing 1.

\(^{11}\)Here, we assume that branching on \(\text{undef}\) non-deterministically takes either one of the then-branch or the else-branch. In LLVM, branching on \(\text{undef}\) is UB, and a "freeze" instruction should be used before the branching [Lee et al. 2017].

---

to $X$, and prints 42. Notably, in this execution, the promise to $X$ of the second thread is certified using the write $X^{na} := 2$ and fulfilled by the other write $X^{na} := 1$.

Our main result is the following theorem (a Coq proof is available in the supplementary material):

**Theorem 4.3.** For any program $prog$, $\llbracket prog \rrbracket_{PS^{IR}} \subseteq \llbracket prog \rrbracket_{vRC11}$.

The proof of this theorem is established using a simulation argument: for each transition in $PS^{IR}$, we identify a corresponding sequence of transitions in $vRC11$. While most thread transitions are identical in $vRC11$ and $PS^{IR}$, there are no corresponding transitions in $vRC11$ for the following two transitions of $PS^{IR}$: (1) (promise) transition; and (2) (racy-read) transition that races with a promise (i.e., when $race_{prm}(P, P_G, X)$ holds). For the former, the $vRC11$ machine simply takes no transition and the machine states of $vRC11$ and $PS^{IR}$ remains identical except for the sets of promises $(P$ and $P_G)$ in $PS^{IR}$. For the latter, we show that the $vRC11$ machine can take multiple steps and invoke UB by performing a racy read. Concretely, suppose that a thread $t_1$ of $PS^{IR}$ performs a racy read from a location $X$ that races with a promise made by another thread $t_2$. Since there is no promise in $vRC11$, we need to prove that $t_2$ of $vRC11$ can actually perform a non-atomic write to $X$ before $t_1$ takes a racy read transition. The key property in this is to turn a certification of the promise by $t_2$ in $PS^{IR}$ into a real execution of $vRC11$. To do so, we proved that once a thread certifies its promise to a location $X$, under any possible future memory, the thread can take multiple steps and perform a non-atomic write to $X$.

**Example 4.4.** Theorem 4.3 does not hold without the certification of promises. Under $vRC11$, the only possible execution for this program is that both threads read the initial messages (i.e., $a = b = 0$). For Thm. 4.3 to hold, $PS^{IR}$ cannot allow any other behavior. Indeed, under $PS^{IR}$, the first thread cannot promise to $Y$ since the only value that can be read from $X$ is $0$, and thus, the thread cannot certify the promise. However, if $PS^{IR}$ allowed a thread to promise without certifying it, then $a = b = undef$ would be allowed. Specifically, the first thread could unconditionally promise to $Y$; the second thread could read $undef$ and write $1$ to $X$; and then the first thread could read $undef$ and write $1$ to $Y$ while fulfilling its promise to $Y$.

In addition to Thm. 4.3, we also proved that program transformations sound in sequential semantics are also sound to apply on non-atomics in $PS^{IR}$. To do so, we adapted the sequential machine $SEQ$ from [Cho et al. 2022, Def. 3.3] to include non-promisable relaxed writes, as we have in $PS^{IR}$ (see [Lee et al. 2023, Appendix E]), and ported the proof in [Cho et al. 2022] to $PS^{IR}$ to show that all sound optimizations on non-atomics under (the adapted) $SEQ$ are also sound under $PS^{IR}$. Thanks to this result, not only the programmers but also compiler writers who develop optimizations on non-atomic code (including reorderings and eliminations of non-atomic accesses across atomics) do not have to understand the out-of-order IR model.

The sequential machine $SEQ$, however, is not helpful for validating reorderings and eliminations of *atomics*. These optimizations are not important for our current purpose (to the best of our knowledge, they are not performed by current compilers). In fact, we found out that reordering of relaxed writes (to different locations) is unsound in $PS^{IR}$. (Still, $PS^{IR}$ can be soundly mapped to Armv8, which effectively allows reordering in the target code; see §5.) The reason is related to the *reservation* mechanism, an addition that was introduced to $PS$ in [Lee et al. 2020] and used in our full $PS^{IR}$ model, in order to support an efficient mapping of RMWs to Armv8. Future work is required to understand whether $PS^{IR}$ can be changed to allow this reordering. We expect all other transformations on atomics that are sound in $RC11$ to be sound in $PS^{IR}$.

$freeze(v)$ returns $v$ when $v$ is a defined value (i.e., not $undef$) and non-deterministically returns any defined value (e.g., 42) when $v$ is $undef$. In the example, the same argument holds when $b \neq 0$ is replaced with $freeze(b) \neq 0$. 

5 MAPPING TO HARDWARE

In this section, we consider the compiler mapping of PSIR to hardware: we present the proposed addition of "strong stores" to hardware models (§5.1); discuss the implementation of strong stores in existing hardware (§5.2); and establish soundness of mapping PSIR to the extended models (§5.3).

5.1 Strong Stores in Hardware Models

We propose a new store instruction called a “strong store” that preserves load-store ordering in modern architectures. Strong stores are stronger than a plain hardware stores but weaker than release stores (or than “lightweight fence”, lwsync, followed by a plain store, as release stores are implemented on Power). Next, we describe the proposed extension of the Armv8 and Power models.

Armv8. We define Armv8S as the extension of the Armv8 memory model [Alglave et al. 2021; Pulte et al. 2017] with strong stores. The Armv8 memory model defines a relation called barrier-ordered-before (bob), modeling thread-local order of memory accesses that is induced by barriers and release/acquire accesses. For example, bob includes po; [L] that corresponds to the fact that a release store (denoted by L) is never reordered with an earlier instruction in the program order (denoted by po). In Armv8S, we extend bob to include also [R]; po; [S], where S represents the set of strong stores. This simple modification enforces the preservation of the order of any load followed by a strong store in the program order.

Power. Similarly to Armv8S, we define PowerS by extending the Power memory model of [Alglave et al. 2014]. Specifically, we propose a modest extension of the “no-thin-air” rule of the Power consistency predicate that requires acyclicity of ppo ∪ fence ∪ rfe to include [R]; po; [S] as well. Roughly, this constraint forbids load buffering behaviors when the order of the load followed by the store is preserved by certain dependencies (ppo) or fences (fence). The PowerS model extends this rule to prevent the load-store reordering also when the order is preserved by a strong store.

5.2 Implementing Strong Stores on Existing Hardware

As discussed in §2.2, the weak behavior of LB has been rarely observed in practice, despite massive testing on CPU implementations of multiple Arm and Power architectures. In particular, among the Armv8 and Power implementations that have been tested in [Alglave et al. 2021, 2014], only Qualcomm’s Snapdragon 820 processor exhibited this behavior.

To gain more knowledge about the Snapdragon anomaly (and extend the dataset of [Alglave et al. 2021]), we experimented with a new Snapdragon version. We acquired a Snapdragon 888 (SM8350) processor, and using the Litmus7 (part of DIY7) testing framework, we ran the 23 basic behavior tests. Like other processors and unlike Snapdragon 820, Snapdragon 888 did not exhibit the weak behavior of LB in 6000M runs. For comparison, weak behaviors of the well-known store buffering (SB) and message passing (MP) tests were observed in 93% and 0.676% (respectively) of the 6000M runs. The supplementary material [Lee et al. 2023] includes the full results for Snapdragon 888.

On all those implementations that do not exhibit load buffering, we believe that strong stores could be implemented just like plain stores, without any additional overhead. To validate this claim, we used the Herd7 model checker. We started from the available tests in the suite of [Alglave et al. 2021, 2014], which includes 3,773 tests for Armv8 and 3,116 tests for Power, and confirmed that all behaviors that are forbidden by the strengthened hardware models where every store is strong (i.e., the models obtained by including [R]; po; [W] in the bob relation of Armv8 or the

“no-thin-air” constraint of Power), but allowed by the existing models, were never observed on an implementation (except for Snapdragon 820). The supplementary material [Lee et al. 2023] includes the cat files defining the models, the tests we ran, and the logs of the results.

5.3 Mapping PS\textsuperscript{IR} to Hardware

Given strong stores in hardware, the mapping of PS\textsuperscript{IR} to hardware follows the standard schemes of C/C++ concurrency primitives,\textsuperscript{15} except that relaxed writes in PS\textsuperscript{IR} are mapped to strong stores (while non-atomic writes are compiled to plain stores). We do not assume here that the hardware generally preserves load-store ordering, in which case, strong stores are not needed at all. In addition, the soundness of the “short-term” solution (see §2.2), which, in the absence of strong stores, suggests mapping relaxed writes as if they were release, follows from the discussion below since release writes are mapped to constructs that provide stronger guarantees than strong stores.

Remark 6. As was observed in [Cho et al. 2021], to be able to match every out-of-order execution to an in-order racy execution (which we need for Thm. 4.3, and Cho et al. [2021] need for LDRF-PF), PS\textsuperscript{IR} has to forbid the reordering of RMWs with subsequent writes. Then, the mapping of certain RMW instructions to ARMv8 requires an extra “fake” control dependency from the read part of the RMW, so the hardware will not reorder RMWs with following plain writes (which arise from non-atomic writes in the source). We refer the reader to [Cho et al. 2021] for the exact mapping scheme and the (unnoticeable) performance impact of it. We note that for hardware that preserves load-store ordering for all stores, this additional fake dependency is not needed.

To formally state the correctness of this mapping, since there are no system calls in the hardware models, we define the set of outcomes of a program for representing the final memories obtained after program executions are completed. This notion is defined for PS\textsuperscript{IR} and Armv8S as follows.

Definition 5.1. A function \( o : \text{Loc} \rightarrow \text{Val} \) is an outcome of a program \( \text{prog} \) under PS\textsuperscript{IR} if some execution of \( \text{prog} \) terminates with a memory \( M \) (i.e., init(\( \text{prog} \)) \( \rightarrow^* \) \( \langle T, P_G, M \rangle \) \& terminal(\( \langle T, P_G, M \rangle \))), and \( o(X) = v \) where \( \langle X@t, v, _, _ \rangle \in M \) is the message to \( X \) with the greatest timestamp \( t \).

Definition 5.2. A function \( o : \text{Loc} \rightarrow \text{Val} \) is an outcome of a program \( \text{prog} \) under Armv8S (PowerS) if \( o \) assigns to every location \( X \) the value of the co-maximal write to \( X \) in some execution graph of \( \text{prog} \) that is Armv8S-consistent (PowerS-consistent).\textsuperscript{16}

Using these definitions, the soundness of mapping from PS\textsuperscript{IR} to Armv8S is stated as follows.

Theorem 5.3. For a PS\textsuperscript{IR} program \( \text{prog} \), we denote by \( \langle \text{prog} \rangle_A \) the Armv8S program obtained by mapping \( \text{prog} \) as described above. Then, given a program \( \text{prog} \) and an outcome \( o \) of \( \langle \text{prog} \rangle_A \) under Armv8S, we have that either \( o \) is an outcome of \( \text{prog} \) under PS\textsuperscript{IR} or \( \text{prog} \) has undefined behavior under PS\textsuperscript{IR} (i.e., it has an execution reaching a machine state of the form \( \langle \bot, _, _ \rangle \)).

To prove this theorem, we utilized the operational model for Armv8 by Pulte et al. [2019], who also showed (in Coq) its equivalence to the declarative formulation of Armv8. We extended their operational model with strong relaxed accesses, reestablished the equivalence of the extended models, and proved (in Coq), using a simulation argument, that runs of this extended operational model reaching a certain outcome corresponds to runs of PS\textsuperscript{IR} that yield the same outcome.

We believe the standard mapping from PS\textsuperscript{IR} to PowerS (with relaxed stores compiled as strong stores) is sound as well. Formally establishing the soundness of this mapping, possibly using the IMM memory model [Podkopaev et al. 2019], is left as future work.

\textsuperscript{15}http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html [Accessed November 2022].

\textsuperscript{16}Intuitively, the coherence order (co), which totally orders the writes to each location, corresponds to the timestamp order in PS\textsuperscript{IR}.
6 RELATED WORK

Our proposal for a concurrency semantics refines, simplifies, and combines existing ideas: catch-fire and preserving load-store ordering as in RC11 [Boehm and Adve 2008; Boehm and Demsky 2014; Lahav et al. 2017], an operational presentation of RC11 using the promising semantics without promises and certified promises as a speculation mechanism to allow load-store reordering [Kang et al. 2017], justifying compiler optimizations on non-atomics based on sequential reasoning [Cho et al. 2022] (see also [Zha et al. 2022]), and the operational Arm model as a bridge between the high-level semantics and the hardware model [Pulte et al. 2019]. We also rely on [Alglave et al. 2021, 2014] for the models of Power and Arm, the experimental data on observed behaviors, the Herd model checker, and the testing framework; and on [Ou and Demsky 2018] for the performance evaluation of different compilation schemes.

In particular, our PS_{IR} model is inspired by the PS_{na} model in [Cho et al. 2022]. The most significant difference between PS_{IR} and PS_{na} is that PS_{na} allows also promises of relaxed writes, which makes PS_{na} significantly more complex than PS_{IR}. First, in PS_{na} a thread promises messages with specific timestamp, value, and view, while PS_{IR} only maintains sets of locations that threads will write to in the future. (This is possible because promises of PS_{IR} are needed only for race detection.) Second, unlike PS_{IR}, PS_{na} allows a thread to lower or split their promises, which leads to substantial complications in proofs. Lastly, a non-atomic write in PS_{IR} adds a single message to the memory, while in PS_{na} multiple messages may be added by a single non-atomic write.

The idea to use an undefined value rather than catch-fire in order to validate load introduction comes from the LLVM (informal) model. To the best of our knowledge, the first attempt to apply this approach in a formal model was in [Chakraborty and Vafeiadis 2017], where previous read values can be revisited whenever relevant writes are executed. This requires a rather complicated event-structure-based model, which does not admit the DRF guarantee. Later improvements of this model [Chakraborty and Vafeiadis 2019; Moiseenko et al. 2020] admit DRF but fail to support load introduction. In turn, our PS_{IR} model (following PS_{na}) applies this approach together with promises. We are not aware of any previous proof relating an in-order source model based on catch-fire to an IR model that is based on undefined values.

Other weak memory models were recently proposed (see, e.g., [Jagadeesan et al. 2020; Jeffrey et al. 2022; Paviotti et al. 2020]), but they are all focused on generally allowing load-store reordering, while our models (both source and IR) allow it only for non-atomic accesses. Notably, supporting load introduction in these models is rather hard, and besides the promising models (e.g., the recent version in [Cho et al. 2022]), we are not aware of any model that fully supports load-store reordering as well as load introduction. In particular, Jeffrey et al. [2022] observe a tension between the kind of temporal reasoning supported by their model and load-introduction.

In contrast, other work, e.g., [Liu et al. 2021; Marino et al. 2011], propose SC as a concurrency semantics for programmers, and study its expected cost (which can be rather high). In turn, we believe that an in-order model enjoys the advantages of SC, while allowing for rather minimal performance overhead, provided that catch-fire for races on non-atomics is acceptable.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their valuable feedback. Chung-Kil Hur is the corresponding author. Minki Cho, Sung-Hwan Lee, and Chung-Kil Hur were supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-IT2102-03. Roy Margalit and Ori Lahav were supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 851811) and the Israel Science Foundation (grant numbers 1566/18 and 814/22).
REFERENCES


Mike Dodds, Mark Batty, and Alexey Gotsman. 2018. Compositional Verification of Compiler Optimisations on Relaxed Memory. In ESOP. Springer International Publishing, Cham, 1027–1055. https://doi.org/10.1007/978-3-319-89884-1_36


Received 2022-11-10; accepted 2023-03-31