Shared buffers as browser IPC
I don't know if the "browser is the operating system" meme is true, but if you squint enough you can certainly see similarities. If web pages are processes, and pages that share the same browsing context are different instances of the same process, then Web Workers and worklets are isolated threads / coroutines, and postMessage is a message-passing form of IPC that can connect different threads (by holding a reference to them) or different processes (by using BroadcastChannel). I'm not sure how tight this abstraction is - probably not much - but the new Lock, Atomics and SharedArrayBuffer APIs are quite similar - although much less sophisticated - to the locking mechanisms you'd expect to use in an operating system to orchestrate different processes working on the same shared memory.
postMessage is probably enough for almost all the inter-thread or inter-tab communication a web application needs to do. Still, it's an asynchronous API that deep-clones or irreversibly transfers memory from one place to the other. It would be interesting to see if - at the expense of simplicity - we could build a synchronous (blocking) interface to operate on a single shared piece of memory. In addition, this interface should be decentralized: all concurrent agents should be able to modify the underlying shared buffer at any moment, without any knowledge of the existence of other agents, eliminating the need for leader election or presence protocols.
This is not easy: as JavaScript is a garbage-collected language, we don't have much to work with. A SharedArrayBuffer is just a contiguous data buffer that workers can read from or write to. There is no memory allocation API, apart from the grow() method, which dynamically expands a growable buffer, and slice(), which copies a portion of it into a new buffer. This means that, if we want to back data structures with a SharedArrayBuffer, we need to start from scratch and write not only our own data structures but also our own concurrent memory allocator, which reserves blocks of contiguous memory for the workers that request them and reclaims the blocks they release. This allocator should be able to handle concurrent requests: two workers entering the same part of the code at the same time should always get two distinct memory blocks, no matter when their execution is preempted.
One way to do that would be using the Lock API: locks are not-too-low-level synchronization primitives, and it's easy to create a readers-writer lock by requesting the same lock in exclusive and shared mode. But locking the whole allocator structure might be a bit too expensive if access to the allocator is often contended, and if we can reduce our problem to replacing an address in a list, we might be able to write a lockless algorithm for our allocator.
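For reference, here's roughly what that readers-writer pattern looks like with the Web Locks API. This is a minimal sketch; the lock name and the callbacks are made up for illustration.

async function readSharedState(read: () => void) {
  // Shared mode: any number of readers may hold the lock at once.
  await navigator.locks.request("allocator-lock", { mode: "shared" }, async () => {
    read();
  });
}

async function writeSharedState(write: () => void) {
  // Exclusive mode: a writer waits until all readers and writers are done.
  await navigator.locks.request("allocator-lock", { mode: "exclusive" }, async () => {
    write();
  });
}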
Fixed-size allocators
There are, broadly, two types of memory allocators: fixed-size block allocators (also called memory pools) and variable-size ones. Fixed-size allocators allow allocation and disposal of equal blocks of a predetermined size - in other words, the allocator decides the block size and how many blocks are available, and workers can only request one or more blocks. Fixed-size allocators are fast, simple and don't suffer from external fragmentation. If you have a data structure with fixed-size records, you might not need anything else. Variable-size allocators, on the contrary, come in all shapes and sizes, have more complex logic and use dynamic data structures (for example, linked lists) that can themselves be backed by fixed-size allocators. In this article, for simplicity, we'll discuss the implementation of a shared fixed-size allocator.
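To make the goal concrete, this is the minimal surface such an allocator needs to expose (the interface name is mine; the class below implements exactly these two operations):

// A shared fixed-size allocator boils down to two operations:
// alloc() hands out the byte offset of a free block (or null if none is left),
// and free() gives a previously allocated block back.
interface FixedSizeAllocator {
  alloc(): number | null;
  free(offset: number): void;
}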
Initialization
class LocklessFixedAllocator {
  private buffer: SharedArrayBuffer;
  private uint32View: Uint32Array;

  constructor(
    memoryBlocks: number,
    blockSize: number, // in bytes; must be a multiple of 4 for the Uint32Array view
    sab?: SharedArrayBuffer,
  ) {
    // One extra block (Block 0) is reserved to hold the head of the free list.
    const totalBlocks = memoryBlocks + 1;
    this.buffer = sab || new SharedArrayBuffer(totalBlocks * blockSize);
    this.uint32View = new Uint32Array(this.buffer);
    if (!sab) {
      // Each free block stores the byte offset of the next free block...
      for (let i = 0; i < totalBlocks - 1; i++) {
        this.uint32View[(i * blockSize) >> 2] = (i + 1) * blockSize;
      }
      // ...except the last one, which stores 0 to mark the end of the list.
      this.uint32View[((totalBlocks - 1) * blockSize) >> 2] = 0;
    }
  }

  getBuffer(): SharedArrayBuffer {
    return this.buffer;
  }

  // ...
}
The initialization process is straightforward: the user specifies the quantity and size of the blocks for the allocator to manage. If the optional SharedArrayBuffer argument isn't given, the allocator creates and sets it up; if it is provided, the allocator skips initialization and simply wraps it. The principle here is that the main context initializes a new allocator, gets the shared buffer via the getBuffer method and finally distributes this buffer to all the workers that it spawns.
Each worker then creates its own LocklessFixedAllocator instance, applying the same memoryBlocks and blockSize parameters and passing the received SharedArrayBuffer as the final argument. As a result, all allocator instances refer to the same buffer and use the same shared data structure for coordinated allocations.
Here's an illustration of what a typical main context might do:
import LocklessFixedAllocator from "./LocklessFixedAllocator";
import AllocatorWorker from "./worker?worker"; // Using Vite worker syntax here; you might want to create workers differently
import { totalBlocks, blockSize, numWorkers } from "./constants"; // assuming numWorkers is also defined in constants

const allocator = new LocklessFixedAllocator(totalBlocks, blockSize);

// Spawn the workers and hand each of them the shared buffer.
const workers = Array(numWorkers)
  .fill(0)
  .map(() => new AllocatorWorker());

workers.forEach((worker, index) => {
  worker.postMessage({
    type: "start",
    sab: allocator.getBuffer(),
    workerId: index,
  });
});
And what a worker would do:
import LocklessFixedAllocator from "./LocklessFixedAllocator";
import { totalBlocks, blockSize } from "./constants";

let allocator: LocklessFixedAllocator;

self.onmessage = function (e: MessageEvent) {
  const data = e.data;
  switch (data.type) {
    case "start":
      allocator = new LocklessFixedAllocator(totalBlocks, blockSize, data.sab);
      break;
    // ...
  }
};
Algorithm
The interesting part of the constructor is the loop that actually initializes the blocks:
for (let i = 0; i < totalBlocks - 1; i++) {
  this.uint32View[(i * blockSize) >> 2] = (i + 1) * blockSize;
}
this.uint32View[((totalBlocks - 1) * blockSize) >> 2] = 0;
This progressively assigns an address to every block. For instance, if our blockSize is 64 bytes, block 0 would hold 64 (the memory address of block 1), block 1 would hold 128 (the memory address of block 2), and so on. Each free block holds the address of the subsequent free block, with the exception of the final block, which is explicitly set to 0 to signify the absence of a "next block". The >> 2 shift operations divide by 4 to get the correct 32-bit (4-byte) offsets, since we use a Uint32Array view.
In our implementation, we allocate one more block than requested, reserving this block (by convention, Block 0), which can't be allocated, to hold the index of the next free block. We need to do this because every variable that describes the state of the allocator has to be readable and writable by any worker and, for this reason, must be stored in the shared buffer itself. Here's a breakdown of the algorithm:
- Initialization: each block is set to point to the next block, meaning that:
  - Block 0, which is unique in that it is reserved and always contains the index of the next free block, points to the address of the first allocable block (Block 1).
  - Every free block points to the address of the succeeding free block.
  - The last block points to 0, indicating that no further free blocks are available.
  The initialization step is executed only once and has an O(n) cost, where n is the number of blocks.
- Allocation: when a block is requested:
  - Retrieve the address of the first free block from Block 0.
  - The retrieved block contains the address of the next free block: save this in Block 0.
  - Return the address of the allocated block to the requester for use.
  The allocation step has a constant O(1) cost, but might be repeated if it clashes with another concurrent operation (see the "Allocation" section below).
- Deallocation: when a block is freed:
  - Obtain the address of the block to be freed from the user.
  - Overwrite the to-be-freed block with the address of the next free block (found in Block 0).
  - Overwrite Block 0 with the address of the freed block, making it the next block available for allocation.
  The deallocation step has a constant O(1) cost, but might be repeated if it clashes with another concurrent operation (see the "Deallocation" section below).
To give a simple example: when we start with an empty allocator, Block 0 points to the first allocable block, Block 1. If a worker allocates a block, it will get the address of Block 1, and Block 0 will now point to the address contained in Block 1, which is Block 2's. If, at some point, the worker decides to deallocate Block 1, Block 1 will first be overwritten with the address currently in Block 0 (Block 2's), then Block 0 will be overwritten with the address of the deallocated block (Block 1), and so on.
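In code, the walkthrough above looks like this (a sketch using the class from this article with three allocable 64-byte blocks; the inline values follow from the initialization scheme described earlier):

const allocator = new LocklessFixedAllocator(3, 64); // Block 0 is reserved, Blocks 1-3 are allocable

const a = allocator.alloc(); // 64: Block 1's offset; Block 0 now points to 128 (Block 2)
const b = allocator.alloc(); // 128: Block 2's offset; Block 0 now points to 192 (Block 3)

// Freeing pushes the block back on top of the free list: Block 1 now stores
// 192 (the old head) and Block 0 points to 64 again.
allocator.free(a!);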
Allocation
alloc(): number | null {
  let firstFreeBlockIndex: number,
    nextFreeBlockIndex: number,
    prevHead: number;
  do {
    // Read the head of the free list (stored in Block 0).
    firstFreeBlockIndex = Atomics.load(this.uint32View, 0);
    if (firstFreeBlockIndex === 0) {
      return null; // no free blocks left
    }
    // The free block stores the offset of the next free block.
    nextFreeBlockIndex = this.uint32View[firstFreeBlockIndex >> 2];
    // Replace the head with the next free block, but only if nobody
    // changed it in the meanwhile; otherwise retry.
    prevHead = Atomics.compareExchange(
      this.uint32View,
      0,
      firstFreeBlockIndex,
      nextFreeBlockIndex
    );
  } while (prevHead !== firstFreeBlockIndex);
  return firstFreeBlockIndex;
}
This code pretty much does what was described in the previous section: it extracts the next free block's address from Block 0, reads the subsequent free block's address from it, and then returns the former while overwriting Block 0 with the latter. The difference here is the use of Atomics.load to read from Block 0 and Atomics.compareExchange to write to it, all inside a while loop. Why?
Every concurrent algorithm relies on the ability to update a memory cell atomically; if this is not the case, the execution could be preempted, allowing another context to alter the cell before the original context resumes. Upon resumption, the original context may attempt to change the memory cell based on an outdated state. Locks circumvent this issue by permitting only a single instance of code to enter a mutually exclusive section at once. When code operates within a mutex, it has a guarantee of exclusivity and the assurance that no other instance can preempt it.
However, lockless algorithms don't have such guarantees; instead, they work more like a transaction: they attempt a write using a special primitive that can determine if the value they aim to modify has changed in the meanwhile. If it has indeed changed, they simply restart the entire process until it's successful. This pattern is called compare-and-swap or compare-and-exchange.
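As a standalone illustration of the pattern, detached from the allocator, here is what a lockless counter increment could look like (for a plain increment you'd just use Atomics.add; the explicit retry loop is shown because it generalizes to updates, like ours, that can't be expressed as a single arithmetic operation):

// The compare-and-swap retry loop: read the current value, compute the
// desired one, and start over if another agent changed the cell in between.
function atomicIncrement(view: Uint32Array, index: number): number {
  let current: number, previous: number;
  do {
    current = Atomics.load(view, index);
    // Writes current + 1 only if view[index] still equals current,
    // and always returns the value it actually found there.
    previous = Atomics.compareExchange(view, index, current, current + 1);
  } while (previous !== current);
  return current + 1;
}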
This is what the line featuring Atomics.compareExchange does: it attempts to update Block 0 (consistently located at address 0) of the shared array with nextFreeBlockIndex, which holds the address of the subsequent free block, read from within the block at firstFreeBlockIndex. This update only occurs on the condition that Block 0, the target of the write, remains unaltered during the process. If it has been modified, the code must re-fetch all values and attempt the update again, persisting until successful.
The Atomics.compareExchange method accepts, in sequence, a shared array, a location within it, the expected value (what the code believes is stored at that location), and the new value (the value the code intends to replace the original with). Atomics.compareExchange only updates the array if the expected value matches the actual value at the time of its atomic invocation, and it always returns the actual value at that location.
To verify that the update was successful, one can compare the returned value with the expected one. If they match, the location in the shared array has been updated to the new value. Otherwise, the shared array was left unaltered, and the code needs to restart the entire operation. This approach guarantees that the write only occurs if that specific location hasn't been changed by another concurrent operation in the interim (or it has been changed to the same value). This level of granularity is much more efficient than locking the entire data structure just to update a small segment of it.
The last question is why we are reading Block 0 with the Atomics.load method. When a shared array is potentially being updated by other threads, a non-atomic read could return a corrupted (torn) or stale value. Atomics.load guarantees that the read operation on a shared array is always executed atomically, i.e. no write can occur at the same time.
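In practice, the difference is a one-liner (assuming view is a Uint32Array over a SharedArrayBuffer):

const racyHead = view[0];           // plain read: unordered with respect to concurrent writes
const head = Atomics.load(view, 0); // atomic, sequentially consistent read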
Deallocation
free(index: number): void {
  let firstFreeBlockIndex, prevHead;
  do {
    // Read the current head of the free list from Block 0.
    firstFreeBlockIndex = Atomics.load(this.uint32View, 0);
    // Tentatively store it inside the block being freed.
    Atomics.store(this.uint32View, index >> 2, firstFreeBlockIndex);
    // Make the freed block the new head, unless Block 0 changed in the meanwhile.
    prevHead = Atomics.compareExchange(
      this.uint32View,
      0,
      firstFreeBlockIndex,
      index
    );
  } while (prevHead !== firstFreeBlockIndex);
}
The free method performs the inverse of alloc. While alloc doesn't require any argument and returns the index of a freshly allocated block, free takes in the index of a previously allocated block, frees it, and doesn't return anything. Mirroring alloc, free attempts to write the index of the current next free block (found in Block 0) into the block being released, and then write the index of the released block into Block 0.
This is achieved by initially using Atomics.store - the write counterpart of Atomics.load - to tentatively store the next free index in the block being freed. Then, Atomics.compareExchange is used to write the index of the block being freed into Block 0, provided that Block 0 hasn't been altered in the meanwhile. If a change in Block 0 is detected, the entire procedure is restarted, taking into account the new value of Block 0.
Browser support
The journey of SharedArrayBuffer has been rather fragmented. Some browsers originally offered support for this feature as early as 2017. However, due to the Spectre and Meltdown vulnerabilities, it was disabled across all browsers in January 2018. It was only in 2020 that browsers reinstated support for SharedArrayBuffer, with stricter security requirements. Now, pages utilizing SharedArrayBuffer are required to be served with the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. If these headers are not present, attempting to use the SharedArrayBuffer constructor results in an exception.
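For instance, with a plain Node server the two headers could be set like this (a minimal sketch using Node's built-in http module; adapt it to whatever server or hosting platform you use):

import { createServer } from "node:http";

const server = createServer((req, res) => {
  // These two headers make the page cross-origin isolated,
  // which is what unlocks the SharedArrayBuffer constructor.
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  // ...serve the page and its scripts here...
  res.end();
});

server.listen(8080);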
Also, at the time of writing this article, all evergreen browsers support SharedArrayBuffer. However, Chromium-based browsers have a limitation: they don't support transmitting SharedArrayBuffers via BroadcastChannel. This restricts the use of shared buffers to web workers, effectively meaning that two tabs from the same browsing context can't access the same shared memory buffer in Chromium-based browsers.
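You can feature-detect all of this at runtime before touching the API: the standard crossOriginIsolated global reflects whether the COOP/COEP requirements above are satisfied.

// crossOriginIsolated is true only when the page is served with the right headers.
if (typeof SharedArrayBuffer === "undefined" || !crossOriginIsolated) {
  console.warn("SharedArrayBuffer is unavailable; fall back to postMessage.");
} else {
  const sab = new SharedArrayBuffer(1024); // safe to construct here
}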
Demo
I made a simple interactive visualization of how the algorithm works. Please note that while the demo is static, it requires the COOP and COEP headers to be set, so it is hosted on https://glitch.com. Since I'm on the free plan, the demo may occasionally get too many requests and respond with HTTP 429. You can see (and remix) the demo here.
There is also a "Test" button below the visualization that will spawn 50 web workers trying to randomly and concurrently allocate and deallocate the 20 blocks available in the shared buffer. If you let it run for a while then press "stop", the developer console will log out which blocks every thread owns. It's not easy to test the correctness of concurrent code deterministically, but the simulation should present a good empirical proof of how blocks are never doubly-allocated by the algorithm.