This patch fixes AtomicExpand pass to set branch weights when it creates new basic blocks. The weights are set in a way that optimizes for the case where there are few contentions.
Details
Diff Detail
Event Timeline
First, please update with full context. I need to see the surrounding code to interpret a few of your changes.
I'm a bit concerned about the general approach. Without profiling data, unconditionally predicting no contention seems questionable. Can you quantify the performance impact here for both the contented and uncontended case?
| lib/CodeGen/AtomicExpandPass.cpp | ||
|---|---|---|
| 388 | Please extract a helper function for this shared code | |
Defined a new function which returns weights for both destination BBs.
The rationale for optimizing for the no contention case is that, if there are a lot of contentions, the program won't run fast anyway, however, I don't have data to support my decision.
Do you have any suggestions for benchmark programs that I can run to evaluate the changes I made in my patch?
For the performance question, even a simple micro benchmark would be sufficient. What is the difference in a contented vs uncontented access to a single word value with and without your change? Is the difference material enough to worry about getting it wrong when the access is contented?
| include/llvm/Analysis/BranchProbabilityInfo.h | ||
|---|---|---|
| 118 | I know you're copying the StackProtector code here, but I'm not sure I see the need for this function to be exposed on BPI. I'd just make it a static helper in the transform file. | |
Sorry for not replying soon.
I wrote several mirco-benchmarks and compiled and ran them to evaluate the impact of my patch on performance. For most of the benchmarks, my patch made no difference in the generated code or made no measurable difference in performance, but there was one benchmark that showed degradation in performance when a large number of threads (16 threads) were contending for access to a atomic variable.
Given that I couldn't demonstrate optimizing for the low-contention case generates code that is at least as fast as the code currently generated, I think I should retract this patch for now and resubmit it later when I figure out a way to set the weights that makes the code run faster.
Seems like a reasonable plan.
I'm going to list a couple of ideas for alternate approaches below, but it's been long enough I've completely lost context on this patch. Feel free to ignore if my suggestions don't seem worthwhile.
- We could add a piece of metadata to record how contented a given RMW operation is. This could be as simple as "uncontended" or a more generic "continition_count(x, of y)". With that in place, you change could be applied only for uncontended accesses. (I'd also be open to switching the default to uncontended and then having the metadata for the contended case. Your reported data made this sound plausible; you just need to make the argument on llvm-dev.)
- You could investigate why the branch weights cause a slow down in the contended cases. Looking at the loop structure, I find it slightly odd that it has any impact at all since it's not likely to effect code placement. There might be an independent tweak that could be made here.
Also, can you explain *why* you expected the branch weights to help in the first place? (i.e. what part of the optimizer/micro-architecture were you trying to exploit?) Maybe that will spark an idea for another approach.
Judging from the code clang generates, machine block placement is the pass that is making the difference. The basic blocks are laid out in a way that appears to hurt performance in the contended cases (more backward and forward branches to non-consecutive blocks).
This is part of the program that ran slower because of my patch. You can probably see the difference if you compile it for arm64 or aarch64.
struct Node {
  Node(int ii, Node *nn) : i(ii), next(nn) {}
  Node() : i(0), next(nullptr) {}
  unsigned i;
  Node *next;
};
struct List {
  atomic<Node*> head;
};
List list1;
std::atomic<unsigned> flag;
unsigned sum = 0;
void addNodes(unsigned b, unsigned e) {
  while (b != e) {
    Node *n = new Node(b++, nullptr);
    Node *expected = list1.head;
     
    do {
      n->next = expected;
    } while (!atomic_compare_exchange_weak(&list1.head, &expected, n));
  }
}Also, can you explain *why* you expected the branch weights to help in the first place? (i.e. what part of the optimizer/micro-architecture were you trying to exploit?) Maybe that will spark an idea for another approach.
I was simply assuming real workloads would typically have little contention, and if I set a low weight for the "failure" branch and a high weight for the "success" branch, the optimizing passes would generate code that would run faster in the uncontended case. I wasn't trying to exploit any optimization passes in particular, I was just expecting it should make a difference in the generated code.
I know you're copying the StackProtector code here, but I'm not sure I see the need for this function to be exposed on BPI. I'd just make it a static helper in the transform file.