Move thin lock acquire/release in CoreCLR to managed code #129502
VSadov wants to merge 27 commits into
Tagging subscribers to this area: @JulieLeeMSFT, @VSadov
Pull request overview
This PR moves the thin-lock (object header) acquire/release fast paths from CoreCLR native code into managed implementations in System.Private.CoreLib, removing the associated FCALL/ecall surface and native inline helpers.
Changes:
- Removed native thin-lock helpers (`syncblk.inl`, `ObjHeader::*HeaderThinLock`, FCALL entries) and associated includes/build references.
- Implemented thin-lock acquire/release in managed `System.Threading.ObjectHeader` (CoreCLR) and updated `Monitor` to call the new managed entrypoints.
- Kept NativeAOT parity by renaming/updating its thin-lock entrypoints and adjusting call sites accordingly.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/coreclr/vm/syncblk.inl | Removes native inline thin-lock acquire/release implementation. |
| src/coreclr/vm/syncblk.h | Removes native HeaderLockResult and thin-lock method declarations from ObjHeader. |
| src/coreclr/vm/ecalllist.h | Drops the ObjectHeader FCALL mapping entries. |
| src/coreclr/vm/comsynchronizable.h | Removes FCDECLs for thin-lock FCALL entrypoints. |
| src/coreclr/vm/comsynchronizable.cpp | Removes FCIMPL implementations for thin-lock FCALL entrypoints. |
| src/coreclr/vm/common.h | Removes syncblk.inl from global inline includes. |
| src/coreclr/vm/CMakeLists.txt | Removes syncblk.inl from VM header lists. |
| src/coreclr/System.Private.CoreLib/src/System/Threading/ObjectHeader.CoreCLR.cs | Adds managed thin-lock acquire/release logic and exposes AcquireThinLock(...) and managed Release(...). |
| src/coreclr/System.Private.CoreLib/src/System/Threading/Monitor.CoreCLR.cs | Routes Monitor.Enter/TryEnter/... to the new managed thin-lock entrypoints. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/ObjectHeader.cs | Renames/reshapes NativeAOT thin-lock entrypoint to AcquireThinLock(...) and adjusts uncommon-path handling. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Monitor.NativeAot.cs | Updates NativeAOT Monitor to call AcquireThinLock(...). |
    return HeaderLockResult.UseSlowPath;
}

if (Interlocked.CompareExchange(pHeader, oldBits | currentThreadID, oldBits) == oldBits)
Doing the CAS in managed code is one of the motivations. The native CAS needs to check for the presence of LSE on ARM64; JITed code does not need that check.
This affects Linux-arm64 perhaps even more than Windows-arm64.
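To make the discussion concrete, here is a minimal sketch of a thin-lock style acquire built on a single compare-and-swap. It is not the PR's implementation: `SimulatedHeader` is a hypothetical stand-in that uses an ordinary `int` field in place of the real object header, and it ignores recursion, hash codes and sync block indices.

```csharp
using System.Threading;

// Hypothetical stand-in for an object header word. The real header sits in
// front of the object and is only reachable from runtime-internal code.
public sealed class SimulatedHeader
{
    // 0 means "unused": no owner, no hash code, no sync block index.
    private int _bits;

    // One-shot acquire: succeeds only when the header is completely unused.
    public bool TryAcquire(int threadId)
    {
        // Atomically replace 0 with the caller's thread ID.
        return Interlocked.CompareExchange(ref _bits, threadId, 0) == 0;
    }

    // Release: only the recorded owner can clear the header.
    public bool TryRelease(int threadId)
    {
        return Interlocked.CompareExchange(ref _bits, 0, threadId) == threadId;
    }
}
```

Because the compare-and-swap is issued from JIT-compiled code, the instruction selection is left to the JIT, which is the property the comment above relies on.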
@MihuBot benchmark System.Collections.Concurrent -arm -intel
@MihuBot benchmark System.Threading -arm
System.Collections.Concurrent.IsEmpty_String_
System.Collections.Concurrent.IsEmpty_Int32_
System.Collections.Concurrent.Count_String_
System.Collections.Concurrent.Count_Int32_
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/coreclr/System.Private.CoreLib/src/System/Threading/Monitor.CoreCLR.cs:90
- Monitor.TryEnter(object, int) always falls back to GetLockObject(obj).TryEnter(millisecondsTimeout) after a failed one-shot thin-lock attempt. For millisecondsTimeout == 0, this can unnecessarily allocate/create the Lock (inflation) even though we already know the thin lock is currently owned by another thread. Since TryEnter(…, 0) is a one-shot operation, it can return false immediately on HeaderLockResult.Failure and avoid the slow path/inflation cost.
public static bool TryEnter(object obj, int millisecondsTimeout)
{
ArgumentOutOfRangeException.ThrowIfLessThan(millisecondsTimeout, -1);
ObjectHeader.HeaderLockResult result = ObjectHeader.AcquireThinLock(obj, isOneShot: true);
if (result == ObjectHeader.HeaderLockResult.Success)
return true;
return GetLockObject(obj).TryEnter(millisecondsTimeout);
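The suppressed suggestion above can be sketched as follows. This is only an illustration of the proposed early return, not code from the PR: the `HeaderLockResult` enum is a local copy of the names mentioned in the comment, and the two delegates are hypothetical stand-ins for `ObjectHeader.AcquireThinLock(obj, isOneShot: true)` and `GetLockObject(obj).TryEnter(...)`.

```csharp
using System;

// Local copy of the result names used in the review comment (assumed shape).
public enum HeaderLockResult { Success, Failure, UseSlowPath }

public static class TryEnterSketch
{
    // acquireThinLock and slowPathTryEnter are hypothetical stand-ins for the
    // runtime-internal thin-lock attempt and the inflated Lock-based path.
    public static bool TryEnter(
        Func<HeaderLockResult> acquireThinLock,
        Func<int, bool> slowPathTryEnter,
        int millisecondsTimeout)
    {
        if (millisecondsTimeout < -1)
            throw new ArgumentOutOfRangeException(nameof(millisecondsTimeout));

        HeaderLockResult result = acquireThinLock();
        if (result == HeaderLockResult.Success)
            return true;

        // The one-shot attempt saw another owner. With a zero timeout there is
        // nothing to wait for, so return without touching the slow path.
        if (result == HeaderLockResult.Failure && millisecondsTimeout == 0)
            return false;

        return slowPathTryEnter(millisecondsTimeout);
    }
}
```

With this shape, a zero-timeout call against a lock held by another thread never reaches the delegate that would create the `Lock` object.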
@MihuBot benchmark System.Threading -arm
@MihuBot benchmark System.Collections.Concurrent -arm
System.Threading.Tests.Perf_Volatile
System.Threading.Tests.Perf_Timer
System.Threading.Tests.Perf_ThreadStatic
System.Threading.Tests.Perf_ThreadPool
System.Threading.Tests.Perf_Thread
System.Threading.Tests.Perf_SpinLock
System.Threading.Tests.Perf_SemaphoreSlim
System.Threading.Tests.Perf_Monitor
System.Threading.Tests.Perf_Lock
System.Threading.Tests.Perf_Interlocked
System.Threading.Tests.Perf_EventWaitHandle
System.Threading.Tests.Perf_CancellationToken
System.Threading.Tasks.Tests.Perf_AsyncMethods
System.Threading.Tasks.ValueTaskPerfTest
System.Threading.Channels.Tests.UnboundedChannelPerfTests
System.Threading.Channels.Tests.SpscUnboundedChannelPerfTests
System.Threading.Channels.Tests.BoundedChannelPerfTests
// This is a case when we have:
// * a fat lock - the most likely case by far, or
// * we don't own the lock and need to throw and it is ok if the lock gets inflated.
// Let the slow path handle this.
Monitor.GetLockObject(obj).Exit();
This is intentional. A typical program would not see these exceptions unless it has bugs.
// if unused for anything, try setting our thread id
// N.B. hashcode, thread ID and sync index are never 0, and hashcode is largest of all
if (oldBits == 0)
{
int* pHeader = GetHeaderPtr(ppMethodTable);
int oldBits = *pHeader;
// if unused for anything, try setting our thread id
// N.B. hashcode, thread ID and sync index are never 0, and hashcode is largest of all
if ((oldBits & MASK_HASHCODE_INDEX) == 0)
// Thread IDs are allocated sequentially starting from 1 and recycled, so it's
// unusual to have a thread ID that doesn't fit in the thin-lock field.
// Check here rather than at entry to keep the hot path as tight as possible.
// The uninitialized 0 id is also ruled out by this check.
// If the id doesn't fit, we fall through and call TryAcquireUncommon outside the
// fixed block to avoid keeping the object pinned while potentially spinning.
if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID)
{
if (Interlocked.CompareExchange(pHeader, oldBits | currentThreadID, oldBits) == oldBits)
if (Interlocked.CompareExchange(pHeader, currentThreadID, oldBits) == oldBits)
{
GC_RESERVE is set only during a GC pause, so we will never see it set when acquiring the lock.
It is possible to see FINALIZER_RUN, but the chances are nearly 0.
These are uncommon cases.
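The range check `(uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID` in the snippet above folds two conditions (the ID is not 0, and the ID fits in the thin-lock field) into one unsigned comparison. A small standalone sketch of that trick follows; the constant's value here is an assumption chosen for illustration, and `ThreadIdCheck` is a hypothetical helper, not part of the runtime.

```csharp
public static class ThreadIdCheck
{
    // Assumed value for illustration only; the real constant is defined by the runtime.
    public const int SBLK_MASK_LOCK_THREADID = 0x0000FFFF;

    // True when 1 <= id <= SBLK_MASK_LOCK_THREADID.
    // Subtracting 1 makes id == 0 wrap to uint.MaxValue, so the uninitialized ID
    // and any negative value fail the same comparison that rejects oversized IDs.
    public static bool FitsInThinLock(int id)
        => unchecked((uint)(id - 1)) < (uint)SBLK_MASK_LOCK_THREADID;
}
```

For example, `FitsInThinLock(0)` and `FitsInThinLock(SBLK_MASK_LOCK_THREADID + 1)` are both false, while `FitsInThinLock(1)` and `FitsInThinLock(SBLK_MASK_LOCK_THREADID)` are true.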
For perf measurements I use the following benchmark. It measures lock throughput in ~0.5 sec time samples in a variety of scenarios. The higher the score, the better.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
internal class Program
{
private static readonly int ProcessorCount = Environment.ProcessorCount;
private static void Main(string[] args)
{
System.Console.WriteLine("MonitorEnterExitThroughput_ThinLock");
MonitorEnterExitThroughput(1, false, false);
System.Console.WriteLine("MonitorEnterExitThroughput_FatLock");
MonitorEnterExitThroughput(1, false, true);
System.Console.WriteLine("MonitorReliableEnterExitThroughput_ThinLock");
MonitorReliableEnterExitThroughput(1, false, false);
System.Console.WriteLine("MonitorReliableEnterExitThroughput_FatLock");
MonitorReliableEnterExitThroughput(1, false, true);
System.Console.WriteLine("MonitorTryEnterExitWhenUnlockedThroughput_ThinLock");
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock(1);
System.Console.WriteLine("MonitorTryEnterExitWhenUnlockedThroughput_FatLock");
MonitorTryEnterExitWhenUnlockedThroughput_FatLock(1);
System.Console.WriteLine("MonitorTryEnterWhenLockedThroughput_ThinLock");
MonitorTryEnterWhenLockedThroughput_ThinLock(1);
System.Console.WriteLine("MonitorTryEnterWhenLockedThroughput_FatLock");
MonitorTryEnterWhenLockedThroughput_FatLock(1);
System.Console.WriteLine("MonitorEnterExitThroughput_ThinLock 4 threads");
MonitorEnterExitThroughput(4, false, false);
}
private static void MonitorReliableEnterExitThroughput(int threadCount, bool delay, bool convertToFatLock)
{
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
if (convertToFatLock)
Monitor.Enter(m);
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localDelay = delay;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
var rng = localDelay ? new Random(threadIndex) : null;
threadReady.Set();
if (convertToFatLock)
{
Monitor.Enter(localM);
Monitor.Exit(localM);
}
startTest.WaitOne();
if (localDelay)
{
while (true)
{
var d0 = RandomShortDelay(rng);
var d1 = RandomShortDelay(rng);
lock (localM)
Delay(d0);
++localThreadOperationCounts[threadIndex];
Delay(d1);
}
}
else
{
while (true)
{
lock (localM)
{
}
++localThreadOperationCounts[threadIndex];
}
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
if (convertToFatLock)
{
Thread.Sleep(50);
Monitor.Exit(m);
}
Run(startTest, threadOperationCounts);
}
private static void MonitorEnterExitThroughput(int threadCount, bool delay, bool convertToFatLock)
{
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
if (convertToFatLock)
Monitor.Enter(m);
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localDelay = delay;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
var rng = localDelay ? new Random(threadIndex) : null;
threadReady.Set();
if (convertToFatLock)
{
Monitor.Enter(localM);
Monitor.Exit(localM);
}
startTest.WaitOne();
if (localDelay)
{
while (true)
{
var d0 = RandomShortDelay(rng);
var d1 = RandomShortDelay(rng);
Monitor.Enter(localM);
Delay(d0);
Monitor.Exit(localM);
++localThreadOperationCounts[threadIndex];
Delay(d1);
}
}
else
{
while (true)
{
Monitor.Enter(localM);
Monitor.Exit(localM);
++localThreadOperationCounts[threadIndex];
}
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
if (convertToFatLock)
{
Thread.Sleep(50);
Monitor.Exit(m);
}
Run(startTest, threadOperationCounts);
}
private static void MonitorTryEnterExitThroughput(int threadCount, bool delay, bool convertToFatLock)
{
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
if (convertToFatLock)
Monitor.Enter(m);
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localDelay = delay;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
var rng = localDelay ? new Random(threadIndex) : null;
threadReady.Set();
if (convertToFatLock)
{
Monitor.Enter(localM);
Monitor.Exit(localM);
}
startTest.WaitOne();
if (localDelay)
{
while (true)
{
var d0 = RandomShortDelay(rng);
var d1 = RandomShortDelay(rng);
if (!Monitor.TryEnter(localM, -1))
return;
Delay(d0);
Monitor.Exit(localM);
++localThreadOperationCounts[threadIndex];
Delay(d1);
}
}
else
{
while (true)
{
if (!Monitor.TryEnter(localM, -1))
return;
Monitor.Exit(localM);
++localThreadOperationCounts[threadIndex];
}
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
if (convertToFatLock)
{
Thread.Sleep(50);
Monitor.Exit(m);
}
Run(startTest, threadOperationCounts);
}
private static void MonitorTryEnterExitWhenUnlockedThroughput_ThinLock(int threadCount)
{
threadCount = 1;
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
threadReady.Set();
startTest.WaitOne();
while (true)
{
if (!Monitor.TryEnter(localM))
return;
Monitor.Exit(localM);
++localThreadOperationCounts[threadIndex];
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
Run(startTest, threadOperationCounts);
}
private static void MonitorTryEnterExitWhenUnlockedThroughput_FatLock(int threadCount)
{
threadCount = 1;
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
Monitor.Enter(m);
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
threadReady.Set();
Monitor.Enter(localM);
Monitor.Exit(localM);
startTest.WaitOne();
while (true)
{
if (!Monitor.TryEnter(localM))
return;
Monitor.Exit(localM);
++localThreadOperationCounts[threadIndex];
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
Thread.Sleep(50);
Monitor.Exit(m);
Run(startTest, threadOperationCounts);
}
private static void MonitorTryEnterWhenLockedThroughput_ThinLock(int threadCount)
{
threadCount = 1;
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
Monitor.Enter(m);
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
threadReady.Set();
startTest.WaitOne();
while (true)
{
if (Monitor.TryEnter(localM))
return;
++localThreadOperationCounts[threadIndex];
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
Run(startTest, threadOperationCounts);
Monitor.Exit(m);
}
private static void MonitorTryEnterWhenLockedThroughput_FatLock(int threadCount)
{
threadCount = 1;
var threadReady = new AutoResetEvent(false);
var startTest = new ManualResetEvent(false);
var threadOperationCounts = new int[(threadCount + 1) * 16];
var m = new object();
Monitor.Enter(m);
ParameterizedThreadStart threadStart = data =>
{
int threadIndex = (int)data;
var localThreadOperationCounts = threadOperationCounts;
var localM = m;
threadReady.Set();
if (Monitor.TryEnter(localM, 50))
return;
startTest.WaitOne();
while (true)
{
if (Monitor.TryEnter(localM))
return;
++localThreadOperationCounts[threadIndex];
}
};
var threads = new Thread[threadCount];
for (int i = 0; i < threads.Length; ++i)
{
var t = new Thread(threadStart);
t.IsBackground = true;
t.Start((i + 1) * 16);
threadReady.WaitOne();
threads[i] = t;
}
Thread.Sleep(50);
Run(startTest, threadOperationCounts);
Monitor.Exit(m);
}
private static void Run(
ManualResetEvent startTest,
int[] threadOperationCounts,
bool hasOneResult = false,
int iterations = 4)
{
var sw = new Stopwatch();
int threadCount = threadOperationCounts.Length / 16 - 1;
var afterWarmupOperationCounts = new long[threadCount];
var operationCounts = new long[threadCount];
startTest.Set();
// Warmup
Thread.Sleep(100);
//while (true)
for (int j = 0; j < iterations; ++j)
{
for (int i = 0; i < threadCount; ++i)
afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];
// Measure
sw.Restart();
Thread.Sleep(500);
sw.Stop();
for (int i = 0; i < threadCount; ++i)
operationCounts[i] = threadOperationCounts[(i + 1) * 16];
for (int i = 0; i < threadCount; ++i)
operationCounts[i] -= afterWarmupOperationCounts[i];
double score = operationCounts.Sum() / sw.Elapsed.TotalMilliseconds;
Console.WriteLine("Score: {0:0.000000}", score);
}
}
internal static class Clock
{
private static readonly long s_swFrequency = Stopwatch.Frequency;
private static readonly double s_swFrequencyDouble = s_swFrequency;
public static long Ticks => Stopwatch.GetTimestamp();
public static double TicksToS(long ticks) => ticks / s_swFrequencyDouble;
public static double TicksToMs(long ticks) => ticks * 1000 / s_swFrequencyDouble;
public static double TicksToUs(long ticks) => ticks * (1000 * 1000) / s_swFrequencyDouble;
}
private static uint RandomShortDelay(Random rng) => (uint)rng.Next(4, 10);
private static uint RandomMediumDelay(Random rng) => (uint)rng.Next(10, 15);
private static uint RandomLongDelay(Random rng) => (uint)rng.Next(15, 20);
private static int[] s_delayValues = new int[32];
private static void Delay(uint n)
{
Interlocked.MemoryBarrier();
s_delayValues[16] += (int)Fib(n);
}
private static uint Fib(uint n)
{
if (n <= 1)
return n;
return Fib(n - 2) + Fib(n - 1);
}
}
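The benchmark above keeps each thread's operation counter 16 `int` slots apart and reports a score in operations per millisecond. The index arithmetic can be isolated as in the sketch below; `PaddedCounters` is a hypothetical helper written for illustration, and reading the 16-slot stride as a way to keep counters on separate 64-byte cache lines is an assumption rather than something stated in the PR.

```csharp
public static class PaddedCounters
{
    // 16 ints * 4 bytes = 64 bytes between counters (assumed to match a cache line).
    public const int Stride = 16;

    // One leading pad block plus one block per thread, as in the benchmark.
    public static int[] Allocate(int threadCount) => new int[(threadCount + 1) * Stride];

    // Slot used by the zero-based thread number; mirrors t.Start((i + 1) * 16).
    public static int SlotFor(int threadNumber) => (threadNumber + 1) * Stride;

    // Inverse of Allocate; mirrors threadOperationCounts.Length / 16 - 1 in Run.
    public static int ThreadCountOf(int[] counters) => counters.Length / Stride - 1;

    // Score as printed by Run: total operations divided by elapsed milliseconds.
    public static double Score(long operations, double elapsedMs) => operations / elapsedMs;
}
```

For four threads this allocates 80 slots and uses indices 16, 32, 48 and 64, so no two threads increment neighbouring array elements.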
Benchmark results on x64 (AMD EPYC 7763, 32-core VM). Higher throughput score is better.
=== Baseline: ===
=== The PR: ===
Benchmark results on ARM64 (Ampere Altra, 32-core VM). Higher throughput score is better.
=== Baseline: ===
=== The PR: ===
int currentThreadID = ManagedThreadId.Current;
if ((uint)currentThreadID <= (uint)SBLK_MASK_LOCK_THREADID)
{
    if (Interlocked.CompareExchange(pHeader, currentThreadID, oldBits) == oldBits)
    {
        return HeaderLockResult.Success;
    }
}
On CoreCLR the runtime sets this field to a nonzero value before a thread can observe it.
The overall effect is a 5%-40% improvement in throughput, depending on the platform and on the thin/fat lock scenario.