"The Art of Concurrency Programming" - Reading Notes 03: Java Memory Model

Reading notes on Chapter 3 of "The Art of Concurrent Programming," starting from the basics.

If the foundation is not solid, the ground will shake.


1. Basics#

1. Two Key Issues in Concurrency#

Inter-thread communication and inter-thread synchronization.

Thread communication mechanisms:

  • ==Shared Memory==: Implicit communication, explicit synchronization
  • ==Message Passing==: Explicit communication, implicit synchronization

Java's concurrency adopts a shared memory model.

2. Java Memory Structure#

In Java, all instance fields, static fields, and array elements are stored in heap memory; local variables, method parameters, and exception handler parameters are not shared between threads.

The Java Memory Model (JMM) defines the abstract relationship between threads and main memory: shared variables between threads are stored in main memory, and each thread has a private local memory that stores a copy of the shared variables read/written by that thread.

Local memory is an abstract concept of JMM and does not actually exist; it encompasses caches, write buffers, registers, and other hardware and compiler optimizations.

For thread A and thread B to communicate, two steps must occur:

  1. Thread A flushes the updated shared variable from its local memory A to main memory.
  2. Thread B reads the shared variable that A has previously updated from main memory.

JMM provides memory visibility guarantees for Java programmers by controlling the interaction between main memory and each thread's local memory.
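A minimal sketch of this implicit communication, assuming two threads share one object (names are illustrative); the synchronized methods supply exactly the two steps above: the lock release flushes local memory to main memory, and the lock acquisition forces a re-read from main memory.

```java
class SharedMemoryCommunication {
    private int shared = 0;  // instance field: lives on the heap, shared between threads

    public synchronized void writer() {
        shared = 42;         // step 1: flushed to main memory when the lock is released
    }

    public synchronized int reader() {
        return shared;       // step 2: re-read from main memory after the lock is acquired
    }
}
```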

3. Instruction Reordering#

  1. Compiler optimization reordering. The compiler can rearrange the execution order of statements without changing the semantics of single-threaded programs.
  2. Instruction-level parallelism reordering. Modern processors use instruction-level parallelism (ILP) to overlap the execution of multiple instructions. If there are no data dependencies, the processor can change the execution order of the corresponding machine instructions.
  3. Memory system reordering. Due to the use of caches and read/write buffers by processors, load and store operations may appear to be executed out of order.

The final instruction sequence executed by Java:

Source code → 1) compiler optimization reordering → 2) instruction-level parallelism reordering → 3) memory system reordering → final instruction sequence

The first belongs to compiler reordering, while the latter two belong to processor reordering. JMM provides consistent memory visibility guarantees for programmers by prohibiting specific types of compiler and processor reordering.

For the compiler, JMM prohibits specific types of compiler reordering. For processor reordering, JMM's processor reordering rules require the Java compiler to insert specific memory barrier instructions when generating instruction sequences, thereby prohibiting specific types of processor reordering.

4. Types of Memory Barriers#

Load: a read from main memory (through the processor's read buffer).

Store: a write that flushes processor data to main memory (through the write buffer).

Four types of memory barriers:

  • LoadLoad Barriers: ensure that Load1's data is loaded before Load2 and all subsequent load instructions.
  • StoreStore Barriers: ensure that Store1's data is flushed to memory (and thus visible to other processors) before Store2 and all subsequent store instructions.
  • LoadStore Barriers: ensure that Load1's data is loaded before Store2 and all subsequent store instructions are flushed to memory.
  • StoreLoad Barriers: ensure that Store1's data becomes visible to all processors before Load2 and all subsequent loads; all memory access instructions before this barrier complete before any memory access instruction after it (high overhead).


5. Happens-Before#

The ordering rules JMM provides for programmers to reason about memory visibility.

In JMM, if the result of one operation needs to be visible to another operation, there must be a happens-before relationship between these two operations.

Some ordering rules (for programmers):

  • Program order rule: Each operation in a thread happens-before any subsequent operations in that thread.
  • Monitor lock rule: Unlocking a lock happens-before subsequent locking of that lock.
  • Volatile variable rule: Writing to a volatile field happens-before any subsequent reading of that volatile field.
  • Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.

Happens-before only requires that the result of the first operation be visible to the second operation, and that the first be ordered before the second; it does not mean the first operation must actually be executed before the second!

Why doesn't the visibility requirement imply actual execution order?

If A happens-before B, JMM does not require A to actually execute before B. JMM only requires that the execution result be logically the same as if A had executed before B in that order: when B does not truly depend on A's result, the actual order of A and B may still be adjusted through reordering, as long as the final result stays consistent, so the program merely appears to execute in happens-before order.

2. Reordering#

Data Dependency

If two operations access the same variable and at least one of them is a write, there is a data dependency between them; compilers and processors do not reorder two data-dependent operations executed in a single thread.

In single-threaded programs, reordering operations with control dependencies does not change the execution result; in multi-threaded programs, however, reordering operations with control dependencies may change the program's execution result.

See the example on page 30.
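A minimal sketch of this kind of example (illustrative; not necessarily the book's exact code): in reader(), the read of a is control-dependent on the check of flag, and compilers/processors may speculatively execute it in advance.

```java
class ReorderExample {
    int a = 0;
    boolean flag = false;

    public void writer() {
        a = 1;              // operation 1
        flag = true;        // operation 2: may be reordered before operation 1
    }

    public void reader() {
        if (flag) {         // operation 3
            int i = a * a;  // operation 4: control-dependent on 3, may be speculated early
        }
    }
}
```

In a single thread, reordering operations 3 and 4 (speculation) is harmless; with two threads, if operations 1 and 2 are also reordered, reader() may see flag == true while a is still 0.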

As-if-serial

No matter how compilers and processors reorder (to improve parallelism), the execution result of a single-threaded program must not change. Compilers, the runtime, and processors must all adhere to as-if-serial semantics.

3. Sequential Consistency Memory Model#

If a program is correctly synchronized, its execution will have sequential consistency (Sequentially Consistent) — that is, the execution result of the program is the same as that of the program in a sequentially consistent memory model.

  • All operations in a thread must be executed in program order.
  • (Regardless of whether the program is synchronized) all threads can only see a single operation execution order. In a sequentially consistent memory model, each operation must be executed atomically and immediately visible to all threads.

In the sequentially consistent model, even in an unsynchronized program, although the overall execution order may be arbitrarily interleaved, all threads see one and the same overall execution order. This is guaranteed because every operation in the sequentially consistent memory model must be immediately visible to all threads.

JMM does not provide this guarantee: a write becomes visible to other threads only after the current thread flushes the data written in its local memory to main memory.

JMM's processing logic: Code within critical sections can be reordered; JMM will perform special handling at the two key points of exiting and entering the critical section, ensuring that threads have the same memory view as in the sequentially consistent model at these two points.

The basic principle of JMM's implementation: give compiler and processor optimizations as much latitude as possible, as long as the execution results of (correctly synchronized) programs do not change.

Differences between the sequential consistency memory model and JMM:

  1. The sequential consistency model guarantees that operations within a single thread will be executed in program order, while JMM does not guarantee that operations within a single thread will be executed in program order (for example, the reordering of correctly synchronized multi-threaded programs within critical sections). This has been discussed earlier, so it will not be repeated here.
  2. The sequential consistency model guarantees that all threads can only see a consistent operation execution order, while JMM does not guarantee that all threads can see a consistent operation execution order. This has also been discussed earlier, so it will not be repeated here.
  3. JMM does not guarantee atomicity for write operations on 64-bit long and double variables, while the sequential consistency model guarantees atomicity for all memory read/write operations.

Processor Bus Working Mechanism

In a computer, data is transmitted between the processor and memory via a bus. Each data transfer between the processor and memory is completed through a series of steps.

Bus transactions include Read Transactions and Write Transactions. Read transactions transfer data from memory to the processor, while write transactions transfer data from the processor to memory, with each transaction reading/writing one or more physically contiguous words in memory.

  • The bus synchronizes transactions that attempt to concurrently use the bus. During a bus transaction executed by one processor, the bus prohibits other processors and I/O devices from performing memory reads/writes (thus ensuring that memory reads and writes within a single bus transaction are atomic).
  • On some 32-bit processors, if atomicity is required for writing 64-bit data, there will be significant overhead. When the JVM runs on such processors, it may split a write operation for a 64-bit long/double variable into two 32-bit write operations.

Note that in the old memory model prior to JSR-133, the read/write operation of a 64-bit long/double variable could be split into two 32-bit read/write operations. Starting from the JSR-133 memory model (i.e., from JDK5), it is only allowed to split a write operation for a 64-bit long/double variable into two 32-bit write operations; any read operation in JSR-133 must be atomic (i.e., any read operation must be executed in a single read transaction).

4. Volatile Memory Semantics#

Characteristics of volatile variables:

  • Visibility: reading a volatile variable always sees the last write to that variable by any thread.
  • Atomicity: reading or writing any single volatile variable is atomic (including long and double), but compound operations like volatile++ are not atomic.
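A minimal sketch of these characteristics, in the spirit of the book's VolatileFeaturesExample (the field name vl is illustrative):

```java
class VolatileFeaturesExample {
    volatile long vl = 0L;   // a 64-bit variable declared volatile

    public void set(long l) {
        vl = l;              // a single volatile write: atomic, even for long
    }

    public void getAndIncrement() {
        vl++;                // compound read-modify-write: NOT atomic; needs a lock or CAS
    }

    public long get() {
        return vl;           // a single volatile read: atomic
    }
}
```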

Volatile write-read memory semantics:

Write: When writing a volatile variable, JMM flushes the shared variable values in that thread's local memory to main memory.

Read: When reading a volatile variable, JMM invalidates the corresponding local memory of that thread. The thread then reads the shared variable from main memory.


When the first operation is a volatile read, no matter what the second operation is, they cannot be reordered; nothing after a volatile read may be moved in front of it.
When the second operation is a volatile write, no matter what the first operation is, they cannot be reordered; nothing before a volatile write may be moved after it.
Ordinary reads and writes act on local memory and are flushed to main memory at unspecified times, so by themselves they give no visibility guarantee.

Implementation of Volatile Memory Semantics#

To implement the memory semantics of volatile, the compiler inserts memory barriers into the instruction sequence to prohibit specific types of processor reordering.

JMM memory barrier insertion strategy (conservative!):

  • Insert a StoreStore barrier before each volatile write operation.
  • Insert a StoreLoad barrier after each volatile write operation.
  • Insert a LoadLoad and a LoadStore barrier after each volatile read operation.
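A sketch of where the conservative strategy places barriers, in the spirit of the book's VolatileExample (the barrier comments are conceptual; barriers cannot be written in Java source):

```java
class VolatileExample {
    int a = 0;
    volatile boolean flag = false;

    public void writer() {
        a = 1;               // ordinary write
        // StoreStore barrier: keeps the write to a from sinking below the volatile write
        flag = true;         // volatile write
        // StoreLoad barrier: keeps the volatile write from reordering with later reads
    }

    public void reader() {
        if (flag) {          // volatile read
            // LoadLoad + LoadStore barriers: keep later reads/writes from floating above
            int i = a;       // if flag was seen as true, a is guaranteed to be 1
        }
    }
}
```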


Why is there no need to insert a LoadStore barrier before a volatile write to prevent reordering between an ordinary read and the volatile write?
Speculation: the barriers subsume one another; for example, StoreLoad provides the effects of the other three.

A common usage pattern for volatile write-read memory semantics is: one writing thread writes to a volatile variable, and multiple reading threads read the same volatile variable. When the number of reading threads greatly exceeds the number of writing threads, choosing to insert a StoreLoad barrier after volatile writes will bring considerable performance improvement.


In practice, the compiler can omit unnecessary barriers when generating code, but the final StoreLoad generally cannot be omitted, because the compiler cannot tell whether another volatile read/write will follow after the method returns.

Different processor platforms may also optimize memory barriers.

Functionally, locks are more powerful than volatile; in terms of scalability and execution performance, volatile has advantages.

5. Lock Memory Semantics#

Memory semantics of lock release and acquisition (like volatile):

  • When a thread releases a lock, JMM flushes the shared variables in that thread's local memory to main memory (corresponding to a volatile write).
  • When a thread acquires a lock, JMM invalidates that thread's local memory, so the code in the critical section must read shared variables from main memory (corresponding to a volatile read).

Implementation of lock memory semantics:

Analyzing ReentrantLock Source Code#

The implementation of ReentrantLock relies on the Java synchronizer framework AbstractQueuedSynchronizer (hereinafter referred to as AQS). AQS uses an integer volatile variable (named state) to maintain synchronization state, which is key to the memory semantics implementation of ReentrantLock.

ReentrantLock can be either a fair lock or a non-fair lock.

Fair Lock

Compared with the non-fair version, tryAcquire changes only slightly: before letting the current thread take the lock, it calls hasQueuedPredecessors(), which checks whether any thread has been waiting longer than the current thread.
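A simplified sketch of fair acquisition, modeled on the JDK's ReentrantLock.FairSync.tryAcquire (abridged: the overflow check on reentrant acquisition is omitted):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

class FairSyncSketch extends AbstractQueuedSynchronizer {
    @Override
    protected boolean tryAcquire(int acquires) {
        final Thread current = Thread.currentThread();
        int c = getState();                          // volatile read of the AQS state
        if (c == 0) {                                // lock is free
            if (!hasQueuedPredecessors()             // fairness: yield to longer waiters
                    && compareAndSetState(0, acquires)) {
                setExclusiveOwnerThread(current);
                return true;
            }
        } else if (current == getExclusiveOwnerThread()) {
            setState(c + acquires);                  // reentrant acquire; lock already held
            return true;
        }
        return false;
    }
}
```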

Non-Fair Lock
Calls CAS: If the current state value equals the expected value, it atomically sets the synchronization state to the given update value.
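For contrast, a minimal non-fair acquisition sketch built directly on AQS (reentrancy omitted). compareAndSetState is a real AQS method; on x86 multiprocessors the underlying CAS compiles down to a lock-prefixed cmpxchg, which is what gives it the combined volatile read/write semantics discussed below:

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

class NonfairSyncSketch extends AbstractQueuedSynchronizer {
    @Override
    protected boolean tryAcquire(int acquires) {
        // Non-fair: grab the lock immediately with CAS, without checking
        // whether other threads have been queued and waiting longer.
        if (compareAndSetState(0, acquires)) {       // atomically: state 0 -> acquires
            setExclusiveOwnerThread(Thread.currentThread());
            return true;
        }
        return false;
    }
}
```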

Summary of Fair and Non-Fair Lock Semantics:

  • Both fair and non-fair locks must write a volatile variable state upon release.
  • When acquiring a fair lock, it first reads the volatile variable.
  • When acquiring a non-fair lock, it first uses CAS to update the volatile variable, which simultaneously has the memory semantics of volatile read and volatile write.

It can be seen that the memory semantics of lock release and acquisition can be implemented in at least two ways:

  1. Using the memory semantics of write-read of volatile variables.
  2. Using the volatile read and volatile write memory semantics associated with CAS.

How does CAS simultaneously have the memory semantics of volatile read and volatile write?

In a multi-processor environment, the cmpxchg instruction is prefixed with lock; single processors do not need this prefix (single processors maintain their own sequential consistency).

  • The lock prefix originally locked the bus; later, because of the overhead, cache locking was adopted instead, which greatly reduces the execution cost of lock-prefixed instructions.
  • The lock instruction can prevent reordering.
  • It flushes all data in the write buffer to memory.

The effects of the last two points provide sufficient memory barrier effects to simultaneously achieve the memory semantics of volatile read and volatile write.

General Implementation Pattern of the Concurrent Package#

  1. First, declare shared variables as volatile.
  2. Then, use atomic conditional updates with CAS to achieve synchronization between threads.
  3. Simultaneously, utilize the memory semantics of volatile reads, writes, and the volatile read and write memory semantics of CAS to achieve communication between threads.
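A minimal sketch of this pattern using AtomicInteger, whose underlying value is a volatile int updated via CAS (the class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger();  // a volatile int underneath

    public int increment() {
        for (;;) {                                    // classic CAS retry loop
            int current = value.get();                // volatile read
            int next = current + 1;
            if (value.compareAndSet(current, next)) { // CAS: volatile read + write semantics
                return next;
            }
            // another thread won the race; loop and retry
        }
    }
}
```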


6. Memory Semantics of Final Fields#

1. Reordering Rules for Final Fields#

(1) Write: Writing to a final field within the constructor cannot be reordered with subsequently assigning the reference of the constructed object to a reference variable.

(2) Read: The initial read of a reference to an object containing a final field cannot be reordered with subsequently reading this final field for the first time.
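A sketch in the spirit of the book's FinalExample, illustrating both rules (assume thread A calls writer() first and thread B later calls reader() with obj already non-null):

```java
public class FinalExample {
    int i;                             // ordinary field
    final int j;                       // final field
    static FinalExample obj;

    public FinalExample() {
        i = 1;                         // write ordinary field: no guarantee
        j = 2;                         // write final field: cannot be reordered
    }                                  // past the publication of obj below

    public static void writer() {      // executed by thread A
        obj = new FinalExample();      // publish the reference
    }

    public static void reader() {      // executed by thread B, after obj is non-null
        FinalExample object = obj;     // first read of the reference
        int a = object.i;              // may see 0: ordinary fields are not protected
        int b = object.j;              // guaranteed to see 2
    }
}
```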

2. Reordering Rules for Writing Final Fields#

The rule that writes to a final field must not be reordered outside the constructor covers two aspects:

  1. Compiler: JMM prohibits the compiler from reordering writes to final fields outside the constructor.
  2. Processor: The compiler will insert a StoreStore barrier after writing to the final field and before returning from the constructor. This barrier prevents the processor from reordering the write to the final field outside the constructor.

These rules ensure that:

Before the reference to the object is visible to any thread, the final fields of the object have been correctly initialized, while ordinary fields do not have this guarantee.

3. Reordering Rules for Reading Final Fields#

Processor: within a thread, JMM prohibits the processor from reordering the initial read of an object reference with the initial read of a final field of that object.

Compiler: The compiler will insert a LoadLoad barrier before the operation of reading the final field.

These reordering rules ensure that before a final field of an object is read, the reference to the object containing it has already been read (this rule targets the few processors that reorder such indirectly dependent operations).

4. When Final Fields are Reference Types#

For reference types, the reordering rules for writing final fields add the following constraint:

Writing to a final reference object's member field within the constructor cannot be reordered with subsequently assigning the reference of the constructed object to a reference variable outside the constructor.

If there is a data race between the writing thread and a reading thread, visibility is not guaranteed.
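A sketch in the spirit of the book's example for reference-typed final fields (names illustrative): the write in step 2 is guaranteed visible to a reader that sees the published reference; the racy write in step 4 is not.

```java
public class FinalReferenceExample {
    final int[] intArray;              // the final field is a reference type
    static FinalReferenceExample obj;

    public FinalReferenceExample() {
        intArray = new int[1];         // 1. write the final reference
        intArray[0] = 1;               // 2. write a member of the referenced object
    }

    public static void writerOne() {   // executed by thread A
        obj = new FinalReferenceExample();  // 3. publish the reference
    }

    public static void writerTwo() {   // executed by thread B
        obj.intArray[0] = 2;           // 4. racy write: visibility not guaranteed
    }

    public static void reader() {      // executed by thread C, after obj is non-null
        if (obj != null) {
            int temp = obj.intArray[0];  // guaranteed to see at least the 1 from step 2
        }
    }
}
```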

5. Why Final Fields Cannot Escape from the Constructor#

The reference to the object under construction must not be allowed to become visible to other threads before the constructor returns, because at that point the final fields may not yet be initialized (reordering can occur inside the constructor).

(Before the reference variable is visible to any thread, the final fields of the object pointed to by the reference variable must be correctly initialized; this is the reordering rule for writing final fields.)
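A sketch of the escape problem (names illustrative): assigning this to a shared variable inside the constructor lets the reference escape before construction completes.

```java
public class FinalReferenceEscapeExample {
    final int i;
    static FinalReferenceEscapeExample obj;

    public FinalReferenceEscapeExample() {
        i = 1;                         // write the final field
        obj = this;                    // "this" escapes: these two operations may be
    }                                  // reordered, so another thread may read i == 0

    public static void writer() {
        new FinalReferenceEscapeExample();
    }

    public static void reader() {
        if (obj != null) {
            int temp = obj.i;          // may see 0: the final-field guarantee is lost
        }                              // once the reference escapes the constructor
    }
}
```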

7. Happens-Before#

As-if-serial semantics creates an illusion for programmers writing single-threaded programs: Single-threaded programs execute in the order specified by the program.

The happens-before relationship creates an illusion for programmers writing correctly synchronized multi-threaded programs: Correctly synchronized multi-threaded programs execute in the order specified by happens-before.

The purpose of this is to maximize the parallelism of program execution without changing the execution results of the program.

Complete Happens-Before Rules

  • Program order rule: Each operation in a thread happens-before any subsequent operations in that thread.
  • Monitor lock rule: Unlocking a lock happens-before subsequent locking of that lock.
  • Volatile variable rule: Writing to a volatile field happens-before any subsequent reading of that volatile field.
  • Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.
  • start() rule: If thread A performs the operation ThreadB.start() (starting thread B), then A's ThreadB.start() operation happens-before any operation in thread B.
  • join() rule: If thread A performs the operation ThreadB.join() and successfully returns, then any operation in thread B happens-before thread A successfully returns from ThreadB.join().
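A minimal runnable sketch of the start() and join() rules (names illustrative):

```java
public class StartJoinExample {
    static int shared = 0;

    public static void main(String[] args) throws InterruptedException {
        shared = 42;                    // program order: happens-before t.start()
        Thread t = new Thread(() -> {
            int r = shared;             // start() rule + transitivity: guaranteed to see 42
            shared = r + 1;
        });
        t.start();
        t.join();                       // join() rule: everything in t happens-before this returns
        System.out.println(shared);     // guaranteed to print 43
    }
}
```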

8. Double-Checked Locking and Lazy Initialization#

Double-Checked Locking

Problem description: (generally in singleton patterns) Add a non-null check before locking, and perform the non-null check again after locking to achieve double-checked locking.

Reason: when constructing an instance, the write of the object reference and the initialization of the object may be reordered, so at the moment of if (instance == null) the object appears to have been created even though it has not yet been initialized.

1. Allocate memory space for the object.
2. Initialize the object.
3. Set instance to point to the allocated memory.
4. First access to the object.

Steps 2 and 3 may be reordered, producing the execution order 1 → 3 → 4 → 2: another thread sees a non-null instance after step 3 and accesses it at step 4 before the initialization in step 2 has happened.

Initially (with no synchronization), a singleton under concurrent access could end up initializing two objects.

Later, locking was added; locking solves the problem but incurs significant performance overhead.

Then a null check was added before locking, so initialization happens only on the first access and later accesses skip the lock entirely. But because of that unlocked check, a thread may skip the lock and read the reference directly while, due to reordering inside the locked region, the object has not yet been fully initialized. This can have serious consequences.
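A sketch of the broken pattern described above (Instance is a placeholder class):

```java
class Instance { }                                // placeholder for the real singleton type

class DoubleCheckedLocking {                      // broken without volatile
    private static Instance instance;             // note: NOT volatile

    public static Instance getInstance() {
        if (instance == null) {                   // first check, no lock
            synchronized (DoubleCheckedLocking.class) {
                if (instance == null) {           // second check, under the lock
                    instance = new Instance();    // the reference may be published
                }                                 // before initialization completes
            }
        }
        return instance;        // may return a not-yet-initialized object
    }
}
```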

Solution:

  1. Volatile. Use volatile semantics to prohibit the reordering. In the lazy initialization of the singleton pattern, instance must be declared volatile (see the volatile-based sketch after this list).

    Advantage: In addition to achieving lazy initialization for static fields, it can also achieve lazy initialization for instance fields.

  2. Class initialization-based solutions ensure that initialization is completed before any thread accesses the instance (see the holder-class sketch after this list).

    During the class initialization phase (i.e., after the Class is loaded and before it is used by threads), the JVM performs class initialization. During this initialization the JVM acquires a lock, which synchronizes multiple threads trying to initialize the same class.

    Note: Several situations for class initialization
    1) T is a class, and an instance of type T is created.
    2) T is a class, and a static method declared in T is called.
    3) A static field declared in T is assigned a value.
    4) A static field declared in T is used, and this field is not a constant field.
    5) T is a top-level class (Top Level Class, see Java Language Specification), and an assertion statement nested within T is executed.
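Sketches of both solutions, reusing the placeholder Instance class from the broken version above. First, the volatile-based fix: the volatile write forbids reordering the initialization with the publication of the reference.

```java
class SafeDoubleCheckedLocking {
    private static volatile Instance instance;    // volatile forbids the reordering

    public static Instance getInstance() {
        if (instance == null) {
            synchronized (SafeDoubleCheckedLocking.class) {
                if (instance == null) {
                    instance = new Instance();    // volatile write: initialization
                }                                 // cannot move after the publication
            }
        }
        return instance;
    }
}
```

Second, the class-initialization-based solution, in its common initialization-on-demand holder form: the JVM's initialization lock guarantees the instance is fully constructed before any thread can see it.

```java
class InstanceFactory {
    private static class InstanceHolder {
        public static Instance instance = new Instance();  // runs during class initialization
    }

    public static Instance getInstance() {
        return InstanceHolder.instance;  // first call triggers InstanceHolder initialization
    }
}
```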
    

Synchronization Process During JVM Initialization#

Initialization Phase

  • Phase 1: Control class or interface initialization by synchronizing on the Class object (i.e., acquiring the Class object's initialization lock). A thread that requests this lock waits until it can acquire it.

    At this point, only one thread can acquire the initialization lock.

  • Phase 2: Thread A executes the class initialization, while thread B waits on the condition corresponding to the initialization lock.

    The initialization of instance proceeds as follows (==reordering may occur here, but it is not visible to other threads==):

    1. Allocate memory space.
    2. Set the reference variable to point to the memory space.
    3. Initialize the object.
  • Phase 3: Thread A sets the initialization completion flag and then wakes up all threads waiting on the condition.

  • Phase 4: Thread B completes the class initialization processing.

    Thread A executes class initialization in Phase 2 at A1 and releases the initialization lock in Phase 3 at A4;

    Thread B acquires the same initialization lock in Phase 4 at B1 and starts accessing this class only after Phase 4 at B4.

  • Phase 5: Thread C executes class initialization processing.

    Thread A executes class initialization in Phase 2 at A1 and releases the lock in Phase 3 at A4;

    Thread C acquires the same lock in Phase 5 at C1 and starts accessing this class only after Phase 5 at C4.

9. Overview of the Java Memory Model#

Processor Memory Models#

Common processor memory models:

  • Total Store Ordering Memory Model (TSO)

    Relax the order of ==write-read== operations in the program.

  • Partial Store Ordering Memory Model (PSO)

    Further relax the order of ==write-write== operations in the program.

  • Relaxed Memory Order Memory Model (RMO)

    Further relax the order of ==read-write== operations in the program.

  • PowerPC Memory Model

    Further relax the order of ==read-read== operations.


All processor memory models allow write-read reordering because they all use write buffers.

  1. Write buffers may lead to write-read operation reordering.

  2. Since write buffers are only visible to the current processor, this feature allows the current processor to see the writes temporarily stored in its write buffer before other processors.


Relationships Between Memory Models#

Similar to processor memory models, the more a language pursues execution performance, the weaker its memory model design will be.


JMM's Memory Visibility Guarantees#

Divided by program type:

  • Single-threaded programs:

    There will be no memory visibility issues.

  • Correctly synchronized multi-threaded programs:

    They have sequential consistency (the execution result of the program is the same as that of the program in a sequentially consistent memory model).

  • Unsynchronized/incorrectly synchronized multi-threaded programs:

    JMM provides them with minimal safety guarantees: the values read by threads during execution are either values written by some previous thread or default values (0, null, false).

    However, minimal safety does not guarantee that the values read by threads are necessarily the values after some thread has finished writing. Minimal safety guarantees that the values read by threads will not appear out of thin air, but it does not guarantee that the values read by threads will be correct.

Fixes to the Old Memory Model#

Since JDK 5 (JSR-133), the memory semantics of volatile and final have been enhanced.
