"The Art of Concurrent Programming" - Reading Notes 02: The Underlying Implementation Principles of Java Concurrency Mechanism

Reading notes on Chapter 2 of "The Art of Concurrent Programming". Mainly summarizes volatile and synchronized.


1. Application of volatile#

If used properly, the volatile variable modifier has a lower usage and execution cost than synchronized, because it causes no thread context switching or scheduling.
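A minimal sketch of volatile's typical application, a stop flag shared between threads (class and field names are illustrative):

```java
// Without volatile, the worker may spin forever on a stale cached value;
// the volatile write/read pair makes the update visible across threads.
public class VolatileFlagDemo {
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) {
                // busy work; each iteration re-reads the volatile field
            }
            System.out.println("worker saw running == false, exiting");
        });
        worker.start();

        Thread.sleep(100);
        running = false; // volatile write: emits a lock-prefixed instruction
        worker.join();
    }
}
```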

1. Definition and Implementation Principles of volatile#

Writing to a shared variable declared with the volatile modifier generates an additional lock-prefixed instruction.

The lock-prefixed instruction triggers two things in multi-core processors:

(1) Writes back the data in the current processor's cache line to memory.

However, locking the bus is expensive, so on current processors the LOCK signal generally locks the cache instead, relying on the cache coherence mechanism to guarantee that the modification is atomic (cache locking).

(2) This write-back operation will invalidate the data cached at that memory address in other CPUs.

  • The MESI (Modified, Exclusive, Shared, Invalid) protocol keeps memory and the other processors' caches coherent.
  • Snooping ensures that internal caches, system memory, and the caches of other processors stay consistent across the bus.

To improve processing speed, processors do not communicate with memory directly; they first read data from memory into the cache before operating on it, but it is not known when the result will be written back to memory. If a write is performed on a variable declared volatile, the JVM sends a lock-prefixed instruction to the processor, which writes the data in that variable's cache line back to system memory. However, even after the write-back, the values cached in other processors may still be stale, and subsequent computations would go wrong. Therefore, under multiple processors, a cache coherence protocol is needed: each processor snoops the data propagated on the bus to check whether its cached values are out of date. If they are, the processor sets the corresponding cache line to the invalid state, and the next time it operates on that data, it re-reads it from system memory into its cache.

2. Optimization of volatile usage#

Append shared variables out to 64 bytes (this seems to be gone now; I did not find it in the current source code).

Example: LinkedTransferQueue

Some processors have 64-byte cache lines and do not support partially filled cache lines. Appending out to 64 bytes fills an entire cache line, which prevents the head node and the tail node from being loaded into the same cache line and thus from locking each other's cache line when they are modified.
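A hedged sketch of the padding trick, modeled on the PaddedAtomicReference that older JDKs used inside LinkedTransferQueue (the class and field names below are illustrative, not the actual JDK source; since JDK 8, the @Contended annotation is the preferred way to get this effect):

```java
import java.util.concurrent.atomic.AtomicReference;

// Unused reference fields pad the object so that two hot references
// (head and tail) cannot end up sharing one 64-byte cache line.
class PaddedAtomicReference<T> extends AtomicReference<T> {
    // enough padding to fill a 64-byte cache line
    Object p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, pa, pb, pc, pd, pe;

    PaddedAtomicReference(T initialValue) {
        super(initialValue);
    }
}

class TransferQueueSketch<T> {
    private final PaddedAtomicReference<T> head = new PaddedAtomicReference<>(null);
    private final PaddedAtomicReference<T> tail = new PaddedAtomicReference<>(null);
}
```

One caveat: a JIT is in principle free to eliminate or reorder unused fields, which may be part of why the explicit-padding approach disappeared from the source code.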

Not all uses of volatile variables need to be appended to 64 bytes.

  1. Processors whose cache lines are not 64 bytes wide do not benefit.
  2. Shared variables that are not read and written frequently do not benefit either, since the appended bytes add their own performance cost.

2. Implementation Principles and Applications of synchronized (Heavyweight Lock)#

The basis of synchronized: every object in Java can act as a lock. This manifests in three forms:

  1. For ordinary synchronized methods, the lock is the current instance object.
  2. For static synchronized methods, the lock is the class object of the current class.
  3. For synchronized blocks, the lock is the object specified in the synchronized parentheses.
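A small sketch of the three forms (names are illustrative):

```java
public class SyncForms {
    private static final Object LOCK = new Object();

    // 1. Ordinary synchronized method: the lock is the current instance (this).
    public synchronized void instanceMethod() { /* ... */ }

    // 2. Static synchronized method: the lock is the Class object (SyncForms.class).
    public static synchronized void staticMethod() { /* ... */ }

    // 3. Synchronized block: the lock is the object in the parentheses.
    public void block() {
        synchronized (LOCK) { /* ... */ }
    }
}
```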

Where exactly is the lock stored? What information is stored inside the lock?

Implementation Principles of Synchronized in JVM#

The JVM implements method synchronization and block synchronization based on entering and exiting the Monitor object, but the details of the two are different.

Block synchronization is implemented with the monitorenter and monitorexit instructions, while method synchronization uses a different mechanism whose details the JVM specification does not spell out. Method synchronization can, however, also be implemented with these two instructions.

  • After compilation, the monitorenter instruction is inserted at the beginning of the synchronized block, and monitorexit is inserted at the end of the block and at exception points. The JVM must ensure that every monitorenter has a matching monitorexit.
  • Every object has a monitor associated with it, and when a monitor is held, it will be in a locked state. When a thread executes the monitorenter instruction, it will attempt to acquire ownership of the monitor corresponding to the object, i.e., attempt to obtain the lock of the object.
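Compiling a method containing a synchronized block and disassembling it with javap -c makes the pairing visible. The output below is abridged, and the exact offsets vary by JDK version; note the second monitorexit on the exception path:

```java
public class MonitorDemo {
    private int i;

    public void inc() {
        synchronized (this) {
            i++;
        }
    }
}
```

```
$ javap -c MonitorDemo     (abridged)
  public void inc();
       0: aload_0
       1: dup
       2: astore_1
       3: monitorenter        // acquire the monitor of `this`
          ...                 // the i++ itself
      14: aload_1
      15: monitorexit         // normal exit path
      16: goto          24
      19: astore_2
      20: aload_1
      21: monitorexit         // exception exit path
      22: aload_2
      23: athrow
      24: return
```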

Synchronized methods: the JVM implements them with the ACC_SYNCHRONIZED flag. That is, synchronization is achieved by adding ACC_SYNCHRONIZED to the method's access flags; for a synchronized method, ACC_SYNCHRONIZED is stored in the access_flags field of the method's method_info structure in the class file.

Synchronized methods are implicit: a synchronized method carries the ACC_SYNCHRONIZED flag in its method_info structure. When a thread invokes the method, it checks whether the ACC_SYNCHRONIZED flag is set; if it is, the thread must first acquire the corresponding monitor lock before executing the method body, and it releases that monitor when the method finishes (whether it returns normally or throws an exception). While the monitor is held, any other thread trying to enter the method blocks because it cannot acquire the monitor. If an exception thrown inside a synchronized method is not caught within the method, the monitor that was acquired is released before the exception propagates out.
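For comparison, declaring the method itself synchronized produces no monitor instructions; javap -v instead shows the flag (abridged, and exact formatting varies by JDK; this assumes inc() from the sketch above is declared synchronized):

```
  public synchronized void inc();
    descriptor: ()V
    flags: (0x0021) ACC_PUBLIC, ACC_SYNCHRONIZED
```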

Monitor#

A monitor is a program structure in which multiple subprograms (objects or modules) access shared resources under mutual exclusion.

It is a concept aimed at simplifying synchronization: it encapsulates the synchronization operations and avoids direct use of P/V semaphores. The concrete implementation in the JVM is ObjectMonitor.

Key fields of ObjectMonitor

  • _count: Records the number of times the owner thread has acquired the lock; this is what makes synchronized reentrant.
  • _owner: Points to the thread that owns the object.
  • _WaitSet: Stores the queue of threads that are in the wait state.
  • _EntryList: Stores the queue of threads that are blocked waiting for the lock.


  • Threads that want to acquire the monitor first enter the _EntryList queue and block there, waiting.
  • If wait() is called in the program, the thread releases the monitor lock (setting _owner to null) and enters the _WaitSet queue, where it blocks waiting.
  • When another thread calls notify/notifyAll, a thread in _WaitSet is awakened (or all are); an awakened thread tries to acquire the monitor lock again and, if it succeeds, becomes the owner of the monitor.
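At the Java level this flow is exactly the synchronized + wait/notify idiom; a minimal sketch (class and field names are illustrative, and the comments map the steps onto the ObjectMonitor queues described above):

```java
public class WaitNotifyDemo {
    private final Object monitor = new Object();
    private boolean ready = false;

    public void awaitReady() throws InterruptedException {
        synchronized (monitor) {        // contenders block in _EntryList
            while (!ready) {
                monitor.wait();         // releases the monitor (_owner = null),
            }                           // thread moves into _WaitSet
        }
    }

    public void signalReady() {
        synchronized (monitor) {
            ready = true;
            monitor.notifyAll();        // wakes _WaitSet threads to recompete
        }
    }
}
```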

Specific Implementation


Many detailed write-ups of the concrete implementation are available online.

1. Java Object Header#

The lock used by synchronized exists in the Java object header.

If the object is an array type, the object header is 3 words wide (3 × 4 bytes on a 32-bit VM).

Non-array types are 2 words wide (2 × 4 bytes).

[Figure: storage structure of the Java object header]

The data stored in the Mark Word changes as the lock flag bits change.

[Figure: Mark Word contents under the different lock states]

Storage structure under a 64-bit virtual machine.

[Figure: Mark Word storage structure on a 64-bit VM]

2. Lock Upgrade and Comparison#

Introduced in JDK 1.6.

Locks have four states, from lowest to highest: no lock, biased lock, lightweight lock, and heavyweight lock. A lock gradually upgrades through these states as contention grows.
Locks can be upgraded but never downgraded (this improves the efficiency of acquiring and releasing locks).

Biased Lock:#

In most cases, a lock is not only free of multi-thread contention but is always acquired repeatedly by the same thread. Biased locks were introduced to make lock acquisition cheaper for that thread.

When a thread accesses a synchronized block and acquires the lock, it stores the biased thread ID in the object header and in the lock record in the stack frame. From then on, when that thread enters or exits the synchronized block, it does not need CAS operations to lock and unlock; it only needs to test whether the Mark Word in the object header holds a bias pointing to the current thread.

If the test succeeds, the thread already holds the lock. If it fails, the thread then checks whether the biased-lock flag in the Mark Word is set to 1: if it is not set, it competes for the lock with CAS; if it is set, it tries to use CAS to point the biased lock in the object header at the current thread.

(1) Revocation of Biased Lock

Biased locks use a mechanism that releases the lock only when contention appears: when another thread tries to compete for the biased lock, the thread holding it releases it. Revoking a biased lock requires waiting for a global safepoint (a point at which no bytecode is executing).

First, <mark>pause</mark> the thread holding the biased lock, then check whether that thread is still alive. If it is not active, the object header is set to the unlocked state. If it is still alive, the stack of the thread holding the biased lock is walked and the lock records of the biased object are traversed; the lock records in the stack and the Mark Word in the object header are then either re-biased to another thread, restored to the unlocked state, or marked as unsuitable for biasing.

This part is still unclear to me; I need to look it up.
If another thread competes with the existing thread for the biased lock, how is it decided whether to keep the bias?

(2) Disabling Biased Lock

Java 6 and 7 enable biased locking by default, but it is only activated a few seconds after the program starts; if necessary, this delay can be removed with -XX:BiasedLockingStartupDelay=0. If you are sure that all the locks in your program are normally contended, biased locking can be disabled with the JVM parameter -XX:-UseBiasedLocking, and the program will then go straight to the lightweight-lock state.

Lightweight Lock#

(1) Locking

Before a thread executes a synchronized block, the JVM first creates space in the current thread's stack frame to store the lock record and copies the Mark Word from the object header into it (the Displaced Mark Word). The thread then attempts to use CAS to replace the Mark Word in the object header with a pointer to the lock record. If this succeeds, the current thread acquires the lock; if it fails, other threads are competing for the lock, and the current thread tries to acquire it by spinning.

(2) Unlocking

During unlocking, an atomic CAS operation is used to replace the Displaced Mark Word back into the object header. If this succeeds, no contention occurred; if it fails, the lock is contended and inflates into a heavyweight lock.

Since spinning consumes CPU, to avoid useless spinning, a lock that has been upgraded to a heavyweight lock never reverts to the lightweight state. While the lock is in this state, any other thread trying to acquire it blocks; when the holder releases the lock, those threads are woken up, and the awakened threads start a new round of contention.

3. Comparison of Advantages and Disadvantages of Locks#

  • Biased Lock:
    Advantages: Locking and unlocking incur no additional cost.
    Disadvantages: If there is lock contention between threads, revoking the lock incurs extra cost.

    Applicable scenarios: <mark>Suitable for scenarios where only one thread accesses the synchronized block.</mark>
    
  • Lightweight Lock:
    Advantages: Competing threads do not block, improving the program's response speed.
    Disadvantages: Threads that never obtain the lock keep spinning, which consumes CPU.
    Applicable scenarios: Response time matters; the synchronized blocks execute very quickly.

  • Heavyweight Lock:
    Advantages: Thread competition does not use spinning, so it does not consume CPU.
    Disadvantages: Thread blocking leads to slow response time.
    Applicable scenarios: Pursuing throughput, synchronization block execution time is relatively long.

Self-Understanding#

  • The biased lock initially points at the thread holding it, and the bias is revoked when lock contention occurs. Whether a biased lock then evolves into a lightweight lock is something I still question.

  • The lightweight lock first copies part of the Mark Word content (the hashcode and other lock information) into the lock record in the local thread's stack frame, then modifies the Mark Word to hold a pointer into that thread's stack.

  • If lock contention occurs, the lock inflates into a heavyweight lock.

3. Implementation Principles of Atomic Operations#

1. Terminology#

  • CAS: Compare and Swap.
  • Cache line: the smallest unit of operations on the cache.
  • Memory order conflict: generally caused by false sharing; when a memory order conflict occurs, the CPU must flush its pipeline.
  • False sharing: multiple CPUs simultaneously modify different parts of the same cache line, invalidating one CPU's operation.

2. Processor Implementation of Atomic Operations#

(1) Ensuring Atomicity through Bus Locking

If multiple processors read and modify a shared variable at the same time (e.g., i++), the shared variable is operated on by several processors simultaneously; the read-modify-write sequence is not atomic, and the shared variable's value after the operation may not match expectations.
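A minimal demonstration of the lost-update problem (class name and counts are illustrative; the final value is usually, though not deterministically, below 200000):

```java
public class LostUpdateDemo {
    private static int counter = 0;     // deliberately not atomic, not volatile

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++;              // read-modify-write: three steps, not atomic
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter);    // typically less than 200000: updates were lost
    }
}
```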

Bus locking: When a processor outputs the LOCK# signal on the bus, requests from other processors will be blocked, allowing that processor to exclusively access shared memory.

(2) Ensuring Atomicity through Cache Locking

Bus locking locks the communication between the CPU and memory, and while the bus is locked, other processors cannot operate on data at other memory addresses either, so the overhead of bus locking is relatively high.

Cache locking: if the memory area is cached in the processor's cache line and remains locked for the duration of the LOCK operation, then when the processor executes the lock operation and writes back to memory, it does not assert the LOCK# signal on the bus; instead it modifies the memory address internally and relies on the cache coherence mechanism to guarantee atomicity. Cache coherence prevents two or more processors from simultaneously modifying data for the same cached memory area, and when another processor writes back data from a locked cache line, that cache line is invalidated.

Cache locking will not be used in two situations:

  1. When the data cannot be cached inside the processor, or the data being operated on spans multiple cache lines, bus locking is used instead.
  2. Some processors do not support cache locking.

3. Java Implementation of Atomic Operations (Locks and Spin CAS)#

(1) Spin CAS Mechanism

Implemented with the processor's CMPXCHG instruction.

Spin CAS: Continuously perform CAS operations until successful.
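A minimal spin-CAS increment built on AtomicInteger.compareAndSet (the class name is illustrative; JDK atomic classes implement getAndIncrement in essentially this way):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpinCasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        for (;;) {                               // spin until the CAS succeeds
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;                     // no other thread intervened
            }
            // CAS failed: another thread changed the value first; retry
        }
    }
}
```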

Three major issues with CAS implementing atomic operations:

  1. ABA Problem: a value goes from A to B and back to A; when CAS checks it, it appears unchanged when in fact it has changed. The solution is to attach a version number: 1A → 2B → 3A.

    The AtomicStampedReference class solves this problem:

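A small usage sketch (values and stamps chosen for illustration):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        // reference plus an integer stamp acting as a version number
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 1);

        int[] stampHolder = new int[1];
        String value = ref.get(stampHolder);    // read value and stamp atomically
        int stamp = stampHolder[0];             // stamp == 1

        // simulate another thread doing A -> B -> A, bumping the stamp each time
        ref.compareAndSet("A", "B", 1, 2);
        ref.compareAndSet("B", "A", 2, 3);

        // a plain value check would pass, but the stamp comparison exposes ABA
        boolean swapped = ref.compareAndSet(value, "C", stamp, stamp + 1);
        System.out.println(swapped);            // false: the stamp is now 3, not 1
    }
}
```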

  2. Long spin times incur high overhead: If spin CAS is unsuccessful for a long time, it imposes a significant execution cost on the CPU.

  3. Can only guarantee atomic operations on one shared variable: In this case, use locks, or combine several shared variables into one.

(2) Lock Mechanism

Apart from biased locks, the other two kinds of locks are implemented with the spin-CAS mechanism: when a thread enters a synchronized block it uses spin CAS to acquire the lock, and when it exits the block it uses CAS to release the lock.
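The idea can be sketched as a toy spin lock built on CAS (a sketch in the spirit of lightweight locking, not the JVM's actual code; the class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SpinLock {
    private final AtomicReference<Thread> owner = new AtomicReference<>();

    public void lock() {
        Thread current = Thread.currentThread();
        // spin CAS: busy-wait until the lock slot is free, then claim it
        while (!owner.compareAndSet(null, current)) {
            // spinning consumes CPU, which is why real locks eventually block
        }
    }

    public void unlock() {
        // release with CAS so that only the owner can unlock
        owner.compareAndSet(Thread.currentThread(), null);
    }
}
```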
