Shared buffers as browser IPC

I don't know if the "browser is the operating system" meme is true, but for sure, if you squint enough you can see similarities. If web pages are processes, and pages that share the same browsing context are different instances of the same process, then Web Workers and worklets are isolated threads / coroutines, and postMessage is a message-passing form of IPC that can connect different threads (by holding a reference to them) or different processes (by using BroadcastChannel). I'm not sure how tight this abstraction is - probably not very - but the new Web Locks, Atomics, and SharedArrayBuffer APIs are quite similar - although much less sophisticated - to the locking mechanisms you'd expect an operating system to use to orchestrate different processes working on the same shared memory.

postMessage is probably enough for almost all the inter-thread or inter-tab communication a web application needs to do. Still, it's an asynchronous API that deep-clones or irreversibly transfers memory from one place to the other. It would be interesting to see if - at the expense of simplicity - we could build a synchronous (blocking) interface to operate on a single shared piece of memory. In addition, this interface should be decentralized: all concurrent agents should be able to modify the underlying shared buffer at any moment, without any knowledge of the existence of other agents, eliminating the need for leader election or presence protocols.

This is not easy: as JavaScript is a garbage-collected language, we don't have much to work with. A SharedArrayBuffer is just a contiguous data buffer that workers can read from or write to. There is no memory allocation API, apart from the grow() method, which dynamically expands a growable buffer, and slice(), which copies a portion of the buffer into a new one. This means that, if we want to back data structures with a SharedArrayBuffer, we need to start from scratch and write not only our own data structures but also our own concurrent memory allocator, which reserves blocks of contiguous memory for workers that request them and reclaims them when they are released. This allocator should be able to handle concurrent requests: two workers entering the same part of the code at the same time should always get two distinct memory blocks, no matter when their execution is preempted.

One way to do that would be using the Web Locks API: locks are not-too-low-level synchronization primitives, and it's easy to create a readers-writers lock by requesting the same lock in exclusive and shared mode. But locking the whole allocator structure might be a bit too expensive if access to the allocator is frequently contended, and if we can reduce our problem to replacing an address in a list, we might be able to write a lockless algorithm for our allocator.
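
For reference, here's a minimal sketch of what that readers-writers pattern could look like with the Web Locks API (the lock name "allocator" and the helper names are illustrative, not part of any standard):

async function withReadLock<T>(fn: () => Promise<T> | T): Promise<T> {
  // Any number of "shared" holders may run concurrently...
  return navigator.locks.request("allocator", { mode: "shared" }, fn);
}

async function withWriteLock<T>(fn: () => Promise<T> | T): Promise<T> {
  // ...while an "exclusive" request waits for all current holders to finish
  // and keeps new ones out until its callback completes.
  return navigator.locks.request("allocator", { mode: "exclusive" }, fn);
}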

Fixed-size allocators

There are, broadly, two types of memory allocators: fixed-size block allocators (also called memory pools) and variable-size ones. Fixed-size allocators allow allocation and disposal of equal blocks of a pre-determined size - in other words, the allocator decides the block size and how many blocks are available, and workers can only request one or more blocks. Fixed-size allocators are fast, simple, and don't suffer from fragmentation. If you have a data structure with fixed-size records, you might not need anything else. Variable-size allocators, by contrast, come in all shapes and sizes, have more complex logic, and use dynamic data structures (for example, linked lists) that can themselves be backed by fixed-size allocators. In this article, for simplicity, we'll discuss the implementation of a shared fixed-size allocator.

Initialization

class LocklessFixedAllocator {
  private buffer: SharedArrayBuffer;
  private uint32View: Uint32Array;

  constructor(
    memoryBlocks: number,
    blockSize: number, // in bytes; must be a multiple of 4, since blocks are addressed through a Uint32Array view
    sab?: SharedArrayBuffer,
  ) {
    // One extra block (Block 0) is reserved to hold the head of the free list.
    const totalBlocks = memoryBlocks + 1;
    this.buffer = sab || new SharedArrayBuffer(totalBlocks * blockSize);
    this.uint32View = new Uint32Array(this.buffer);

    // Only the context that creates the buffer initializes the free list;
    // contexts receiving an existing buffer just wrap it.
    if (!sab) {
      // Each block's first word holds the byte address of the next block...
      for (let i = 0; i < totalBlocks - 1; i++) {
        this.uint32View[(i * blockSize) >> 2] = (i + 1) * blockSize;
      }

      // ...except the last one, which holds 0 to mark the end of the list.
      this.uint32View[((totalBlocks - 1) * blockSize) >> 2] = 0;
    }
  }

  getBuffer(): SharedArrayBuffer {
    return this.buffer;
  }

  // ...
}

The initialization process is straightforward: the user specifies the quantity and size of the blocks for the allocator to manage. If the optional SharedArrayBuffer argument isn't given, the allocator creates and initializes it; if it is provided, the allocator skips initialization and simply wraps the existing buffer. The principle here is that the main context initializes a new allocator, gets the shared buffer via the getBuffer method, and finally distributes this buffer to all the workers it spawns.

Each worker then creates its own LocklessFixedAllocator instance, passing the same memoryBlocks and blockSize parameters and the received SharedArrayBuffer as the final argument. As a result, all allocator instances refer to the same buffer and use the same shared data structure to coordinate allocations.

Here's an illustration of what a typical main context might do:

import LocklessFixedAllocator from "./LocklessFixedAllocator";
import AllocatorWorker from "./worker?worker"; // Using Vite worker syntax here; you might want to create workers differently
import { totalBlocks, blockSize } from "./constants";

const numWorkers = 4; // how many workers to spawn; an arbitrary choice

const allocator = new LocklessFixedAllocator(totalBlocks, blockSize);

const workers = Array(numWorkers)
  .fill(0)
  .map(() => new AllocatorWorker());

workers.forEach((worker, index) => {
  worker.postMessage({
    type: "start",
    sab: allocator.getBuffer(),
    workerId: index,
  });
});

And what a worker would do:

import LocklessFixedAllocator from "./LocklessFixedAllocator";
import { totalBlocks, blockSize } from "./constants";

let allocator: LocklessFixedAllocator;

self.onmessage = function (e: MessageEvent) {
  const data = e.data;
  switch (data.type) {
    case "start":
      allocator = new LocklessFixedAllocator(totalBlocks, blockSize, data.sab);
      break;
    // ...
  }
};

Algorithm

The interesting part of the constructor is the loop that actually initializes the blocks:

for (let i = 0; i < totalBlocks - 1; i++) {
  this.uint32View[(i * blockSize) >> 2] = (i + 1) * blockSize;
}

this.uint32View[((totalBlocks - 1) * blockSize) >> 2] = 0;

This essentially assigns an address to all blocks, progressively. For instance, if our blockSize is 64 bytes, block 0 would hold 64 (the memory address of block 1), block 1 would hold 128 (the memory address of block 2), and so on. Each free block retains the address of the subsequent block, with the exception of the final block, which is explicitly set to 0 to signify the absence of a "next block". The >> 2 shifts divide byte addresses by 4, converting them into indices into the Uint32Array view, whose elements are 4 bytes (32 bits) wide.
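
As a concrete instance of that arithmetic (reusing the 64-byte block size from the example above):

// Block 1 starts at byte address 64. Uint32Array elements are 4 bytes wide,
// so that byte address corresponds to element 16 of the view:
const blockSize = 64;
const address = 1 * blockSize; // 64
const index = address >> 2; // 16, i.e. 64 / 4
// uint32View[16] is therefore the first 32-bit word of Block 1.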

In our implementation, we allocate one more block than requested, reserving the extra block (by convention, Block 0) to hold the address of the next free block; this block can never be handed out. We need to do this because every variable describing the state of the allocator has to be read and written by any worker and, for this reason, must be stored in the shared buffer itself. Here's a breakdown of the algorithm:

  • Initialization: Each block is set to point to the next one, meaning that:

    • Block 0, which is reserved and always holds the address of the next free block, points to the first allocable block (Block 1).
    • Every free block points to the address of the succeeding free block.
    • The last block points to 0, indicating that no further free blocks are available.
    • The initialization step is executed only once and has an O(n) cost, where n is the number of blocks.
  • Allocation: When a block is requested:

    • Retrieve the address of the first free block from Block 0.
    • The retrieved block contains the address of the next free block: save this in Block 0.
    • Return the address of the allocated block to the requester for use.
    • The allocation step has a constant O(1) cost, but might be repeated if it clashes with another concurrent operation (see the "Allocation" section below).
  • Deallocation: When a block is freed:

    • Obtain the address of the block to be freed from the user.
    • Overwrite the to-be-freed block with the address of the next free block (found in Block 0).
    • Overwrite Block 0 with the address of the freed block, making it the next block available for allocation.
    • The deallocation step has a constant O(1) cost, but might be repeated if it clashes with another concurrent operation (see the "Deallocation" section below).

To make a simple example, when we start with an empty allocator, Block 0 points to the first allocable block, Block 1. If a worker allocates a block, it will get the address of Block 1, and Block 0 will now point to the address contained in Block 1, which is Block 2's. If, at some point, the worker decides to deallocate Block 1, Block 1 will be overwritten with the address currently contained in Block 0 (Block 2's), and Block 0 will then be overwritten with the address of the deallocated block (Block 1), and so on.
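
Here's how the buffer contents would evolve in that scenario, assuming three allocable blocks and a blockSize of 64 bytes (numbers chosen purely for illustration):

// Initial state (addresses are byte offsets; indices are address >> 2):
//   view[0]  = 64    Block 0 (free-list head) -> Block 1
//   view[16] = 128   Block 1 -> Block 2
//   view[32] = 192   Block 2 -> Block 3
//   view[48] = 0     Block 3 -> end of list
//
// After alloc() returns 64 (Block 1):
//   view[0]  = 128   the head now skips straight to Block 2
//
// After free(64):
//   view[16] = 128   Block 1 -> Block 2 again
//   view[0]  = 64    the head points back to Block 1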

Allocation

alloc(): number | null {
  let firstFreeBlockIndex: number,
    nextFreeBlockIndex: number,
    prevHead: number;
  do {
    // Read the address of the first free block from Block 0.
    firstFreeBlockIndex = Atomics.load(this.uint32View, 0);
    if (firstFreeBlockIndex === 0) {
      // 0 marks the end of the free list: no blocks available.
      return null;
    }
    // Read the address of the next free block from the block we're claiming.
    nextFreeBlockIndex = Atomics.load(this.uint32View, firstFreeBlockIndex >> 2);
    // Make it the new head, but only if Block 0 hasn't changed in the meantime.
    prevHead = Atomics.compareExchange(
      this.uint32View,
      0,
      firstFreeBlockIndex,
      nextFreeBlockIndex
    );
  } while (prevHead !== firstFreeBlockIndex);

  return firstFreeBlockIndex;
}

This code does pretty much what was described in the previous section: it extracts the next free block from Block 0, reads the subsequent free block's address from it, and then returns the former while overwriting Block 0 with the latter. The difference here is the use of Atomics.load to read from Block 0 and Atomics.compareExchange to write to it, all inside a while loop. Why?

Every concurrent algorithm relies on the ability to update a memory cell atomically; if this is not the case, the execution could be preempted, allowing another context to alter the cell before the original context resumes. Upon resumption, the original context may attempt to change the memory cell based on an outdated state. Locks circumvent this issue by permitting only a single instance of code to enter a mutually exclusive section at once. When code operates within a mutex, it has a guarantee of exclusivity: no other instance can enter the same section until it's done.

However, lockless algorithms don't have such guarantees; instead, they work more like a transaction: they attempt a write using a special primitive that can determine whether the value they aim to modify has changed in the meantime. If it has, they simply restart the entire process until it succeeds. This pattern is called compare-and-swap or compare-and-exchange.

This is what the line featuring Atomics.compareExchange does: it attempts to update Block 0 (always located at address 0) of the shared array with nextFreeBlockIndex, the address of the next free block that was just read from the block at firstFreeBlockIndex. The update only goes through if Block 0, the target of the write, hasn't been altered in the meantime. If it has been modified, the code must re-fetch all values and attempt the update again, persisting until successful.

The Atomics.compareExchange method accepts, in sequence, a shared array, a location within it, the anticipated value (what the code believes is stored at that location), and the new value (the value the code intends to replace the original with). Atomics.compareExchange only updates the array if the expected value matches the actual value at the time of its atomic invocation, and it always returns the actual value at that location.

To verify whether the update was successful, one can compare the returned value with the expected one. If they match, the location in the shared array has been updated to the new value. Otherwise, the shared array has been left unaltered, and the code needs to restart the entire operation. This approach guarantees that the write only occurs if that specific location hasn't been changed by another concurrent operation in the interim (or it has been changed back to the same value). This level of granularity is much more efficient than locking the entire data structure just to update a small segment of it.
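
As a standalone illustration of these semantics (the buffer and values are arbitrary):

const sab = new SharedArrayBuffer(16);
const view = new Uint32Array(sab);
view[0] = 64;

// The expected value (64) matches the actual one: the write happens.
const first = Atomics.compareExchange(view, 0, 64, 128);
// first === 64, and view[0] is now 128.

// The expectation is now stale: nothing is written.
const second = Atomics.compareExchange(view, 0, 64, 192);
// second === 128 (the actual value), and view[0] is still 128.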

The last question is why we read Block 0 with the Atomics.load method. When a shared array is potentially being updated in other threads, a non-atomic read could return corrupted (torn) or stale values. Atomics.load guarantees that the read on a shared array is executed atomically, i.e. no write can interleave with it.

Deallocation

free(index: number): void {
  let firstFreeBlockIndex, prevHead;
  do {
    // Read the current head of the free list from Block 0.
    firstFreeBlockIndex = Atomics.load(this.uint32View, 0);
    // Tentatively link the block being freed to the rest of the free list.
    Atomics.store(this.uint32View, index >> 2, firstFreeBlockIndex);
    // Make the freed block the new head, unless Block 0 changed in the meantime.
    prevHead = Atomics.compareExchange(
      this.uint32View,
      0,
      firstFreeBlockIndex,
      index
    );
  } while (prevHead !== firstFreeBlockIndex);
}

The free method performs the inverse of alloc. While alloc doesn't take any argument and returns the index of a freshly allocated block, free takes the index of a previously allocated block, frees it, and returns nothing. Mirroring alloc, free attempts to write the index of the current next free block (found in Block 0) into the block being released, and then to write the index of the released block into Block 0.

This is achieved by initially using Atomics.store - the write counterpart of Atomics.load - to tentatively store the next free index in the block being freed. Then, Atomics.compareExchange is used to write the index of the block being freed into Block 0, provided that Block 0 hasn't been altered in the meantime. If a change in Block 0 is detected, the entire procedure is restarted, taking into account the new value in Block 0.
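
Putting the two methods together, a worker could use the allocator like this (a hypothetical sketch; the payload written into the block is invented for the example):

const address = allocator.alloc();
if (address === null) {
  // The free list is empty: every block is currently allocated.
} else {
  const view = new Uint32Array(allocator.getBuffer());
  // While allocated, the block belongs entirely to the caller; the free-list
  // pointer only occupies its first word while the block is on the free list.
  view[address >> 2] = 42;
  // ...eventually, hand the block back to the pool:
  allocator.free(address);
}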

Browser support

The journey of SharedArrayBuffers has been rather fragmented. Originally, some browsers offered support for this feature as early as 2017. However, due to the Spectre and Meltdown vulnerabilities, this feature was disabled across all browsers in January 2018. It was only in 2020 that browsers reinstated support for SharedArrayBuffers, with stricter security requirements. Now, pages utilizing SharedArrayBuffer are required to be served with Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers. If these headers are not present, attempting to use the SharedArrayBuffer constructor results in an exception.
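
For example, a minimal Node.js server that serves a page with those headers could look like this (a sketch; any server or framework that can set response headers works just as well):

import { createServer } from "node:http";

createServer((req, res) => {
  // These two headers opt the page into cross-origin isolation,
  // which browsers require before exposing SharedArrayBuffer.
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  res.end('<!doctype html><script src="main.js"></script>');
}).listen(8080);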

Also, at the time of writing this article, all evergreen browsers offer support for SharedArrayBuffers. However, Chromium-based browsers have a limitation: they don't support sending SharedArrayBuffers via BroadcastChannel. This restricts shared buffers to the web workers spawned by a single page, effectively meaning that two tabs from the same browsing context can't access the same shared memory buffer in Chromium-based browsers.

Demo

[Screenshot of the lockless fixed allocator demo]

I made a simple interactive visualization of how the algorithm works. Please note that while the demo is static, it requires the COOP and COEP headers to be set, so it is hosted on https://glitch.com. Since I'm on the free plan, the demo may occasionally get too many requests and respond with HTTP 429. You can see (and remix) the demo here.

There is also a "Test" button below the visualization that will spawn 50 web workers trying to randomly and concurrently allocate and deallocate the 20 blocks available in the shared buffer. If you let it run for a while then press "stop", the developer console will log out which blocks every thread owns. It's not easy to test the correctness of concurrent code deterministically, but the simulation should present a good empirical proof of how blocks are never doubly-allocated by the algorithm.