Why are we worried about memory access semantics? Full barriers should be enough for anybody

Commenter Shawn wondered why we are so worried about memory access semantics. Back in my day, we just used full barriers everywhere, and I didn't hear nobody complainin'.

Moore's law says that the number of transistors in an integrated circuit doubles about every two years. For a long time, these resources were employed to make CPUs faster: Accelerating clock speeds, expanding caches, deepening pipelines, performing more prediction and speculation, increasing the number of registers, that sort of thing. All of these improvements sped up single-threaded operations.

The problem is that you can make things only so small before you start running into physical limits. It's hard to build things that are smaller than atoms. The speed of light constrains how quickly two components can communicate with each other. CPU speeds have largely leveled off in the 4 GHz range. You can't go faster that way any more, so you have to look for other ways to get your performance gains.

Instead of scaling up, we have to scale out. Rather than making a CPU core faster, we give each CPU more cores. By analogy, if you have to transport more people per hour by bus, and you can't make the bus go any faster, then you have to get more buses.¹

More CPU cores means that programs need to use concurrency in order to extract maximum performance. That in turn led to discoveries like lock convoys, the performance cost of fairness when under heavy load, and (the issue at hand) too many barriers.

Continuing the bus analogy: If you have a fleet of a dozen buses, you need to make sure that nobody tries to use the same ticket to board two different buses. This means that the buses have to somehow communicate with each other to make sure each ticket is used only once. If you have a lot of buses, the ticket validation could end up being the slowest part of loading passengers!

The x86 family of processors has fairly strong memory model semantics. You can dig into the Intel documents to see the precise wording, but it roughly comes down to every load having acquire semantics and every store having release semantics by default. Most other processors, however, have a more relaxed memory model by default, allowing them to do more aggressive reordering to improve performance.

In order to ensure any ordering beyond the default, you need to issue explicit fences. And those fences tend to be expensive. (After all, if they were cheap, then the architecture would just do them by default.) Your CPU that has grown to have caches the size of a small country? Yeah, those caches need to be flushed because the memory now has to be made visible to another processor, and they need to be invalidated because they may have been changed by another processor.

Therefore, when writing code that may be used in high-performance scenarios, you want to avoid unnecessary stalls and flushes. And that means choosing the weakest barrier that still achieves the desired result.

In other words, people didn't complain back then because it wasn't a problem back then. CPUs were not fast enough, and programs were not sufficiently multi-threaded, for fences to show up in performance traces. But now they do.
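To make the "weakest barrier that works" idea concrete, here's a minimal sketch (my own illustration, not code from any particular library): publishing data through a flag needs only a release store paired with an acquire load. The default std::memory_order_seq_cst would also be correct, but it asks the hardware for a stronger, and on weakly-ordered architectures more expensive, guarantee than this algorithm needs.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag used to publish the data

void producer()
{
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publish: everything before
                                                   // this store becomes visible
                                                   // to an acquire load that
                                                   // observes 'true'
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // wait for the publish
        ;                                          // (spin for brevity)
    std::printf("%d\n", payload);                  // guaranteed to print 42:
                                                   // the acquire load synchronizes
                                                   // with the release store
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```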
Bonus chatter: Commenter Ben noted that the libc++ implementation of shared_ptr is even more aggressive about avoiding barriers and skips the write-release barrier if the reference count is decrementing from 1 to 0, because decrementing to zero means that nobody else has access to the object, so it doesn't matter that they can't see what you're doing with the memory. (A sketch of the idea appears after the footnote.)

Bonus bonus chatter: That article was primarily written as a note to myself for future reference.

¹ You could also make the bus bigger so it carries more passengers. That's what SIMD tried to do: Let the CPU process more data at once. But it requires workloads that parallelize in a SIMD-friendly way.
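Here is a rough sketch of the shape of that shortcut. This is my own illustration based on Ben's description, not the actual libc++ source; it ignores the weak-reference count that a real shared_ptr control block also has to manage, and the names (ControlBlock, destroy) are made up for the example.

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical reference-counted control block, for illustration only.
struct ControlBlock
{
    std::atomic<long> refcount{1};

    void release()
    {
        // If the count is already 1, this thread holds the only reference:
        // nobody else can be touching the object, so there is nothing to
        // publish and no release barrier is needed. The acquire load ensures
        // we see the writes of whichever thread dropped the count to 1.
        if (refcount.load(std::memory_order_acquire) == 1) {
            destroy();
            return;
        }

        // Otherwise, the usual pattern: release on the decrement so our
        // writes are visible to the thread that eventually destroys the
        // object, and an acquire fence before destruction so that thread
        // sees everyone else's writes.
        if (refcount.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            destroy();
        }
    }

    void destroy() { std::puts("object destroyed"); } // stand-in for running
                                                      // the deleter and
                                                      // freeing the block
};
```

The up-front check is safe because a count of 1 means the current thread holds the only remaining reference, so no other thread can increment the count back up while we are deciding what to do.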