
[AtomicExpand] Set branch weights when new basic blocks are created
Needs RevisionPublic

Authored by ahatanak on Feb 20 2015, 5:49 PM.

Details

Reviewers
reames
Summary

This patch fixes the AtomicExpand pass to set branch weights when it creates new basic blocks. The weights are set to optimize for the case where there is little contention.

Diff Detail

Event Timeline

ahatanak updated this revision to Diff 20448.Feb 20 2015, 5:49 PM
ahatanak retitled this revision from to [AtomicExpand] Set branch weights when new basic blocks are created.
ahatanak updated this object.
ahatanak edited the test plan for this revision. (Show Details)
ahatanak added a subscriber: Unknown Object (MLST).
reames added a subscriber: reames.Feb 25 2015, 9:42 AM

First, please update with full context. I need to see the surrounding code to interpret a few of your changes.

I'm a bit concerned about the general approach. Without profiling data, unconditionally predicting no contention seems questionable. Can you quantify the performance impact here for both the contended and uncontended cases?

lib/CodeGen/AtomicExpandPass.cpp
388

Please extract a helper function for this shared code

ahatanak updated this revision to Diff 20728.Feb 25 2015, 7:41 PM

Defined a new function that returns the weights for both destination BBs.

The rationale for optimizing for the no-contention case is that, if there is a lot of contention, the program won't run fast anyway. However, I don't have data to support that decision.

Do you have any suggestions for benchmark programs that I can run to evaluate the changes I made in my patch?

reames added a comment.Mar 3 2015, 9:37 AM

For the performance question, even a simple micro-benchmark would be sufficient. What is the difference between a contended and an uncontended access to a single word value, with and without your change? Is the difference material enough to worry about getting it wrong when the access is contended?

include/llvm/Analysis/BranchProbabilityInfo.h
118

I know you're copying the StackProtector code here, but I'm not sure I see the need for this function to be exposed on BPI. I'd just make it a static helper in the transform file.

Sorry for not replying sooner.

I wrote several micro-benchmarks and compiled and ran them to evaluate the impact of my patch on performance. For most of the benchmarks, the patch made no difference in the generated code or no measurable difference in performance, but one benchmark showed a performance degradation when a large number of threads (16) were contending for access to an atomic variable.

Given that I couldn't demonstrate that optimizing for the low-contention case generates code at least as fast as what is currently generated, I think I should retract this patch for now and resubmit it later, once I figure out a way to set the weights that makes the code run faster.

reames requested changes to this revision.May 19 2015, 9:23 AM
reames added a reviewer: reames.

Seems like a reasonable plan.

I'm going to list a couple of ideas for alternate approaches below, but it's been long enough I've completely lost context on this patch. Feel free to ignore if my suggestions don't seem worthwhile.

  1. We could add a piece of metadata to record how contended a given RMW operation is. This could be as simple as "uncontended" or a more generic "contention_count(x, of y)". With that in place, your change could be applied only to uncontended accesses. (I'd also be open to switching the default to uncontended and then having the metadata for the contended case. Your reported data made this sound plausible; you just need to make the argument on llvm-dev.)
  2. You could investigate why the branch weights cause a slowdown in the contended cases. Looking at the loop structure, I find it slightly odd that they have any impact at all, since they're not likely to affect code placement. There might be an independent tweak that could be made here.

Also, can you explain *why* you expected the branch weights to help in the first place? (i.e. what part of the optimizer/micro-architecture were you trying to exploit?) Maybe that will spark an idea for another approach.

This revision now requires changes to proceed.May 19 2015, 9:23 AM

Seems like a reasonable plan.

I'm going to list a couple of ideas for alternate approaches below, but it's been long enough I've completely lost context on this patch. Feel free to ignore if my suggestions don't seem worthwhile.

  1. We could add a piece of metadata to record how contended a given RMW operation is. This could be as simple as "uncontended" or a more generic "contention_count(x, of y)". With that in place, your change could be applied only to uncontended accesses. (I'd also be open to switching the default to uncontended and then having the metadata for the contended case. Your reported data made this sound plausible; you just need to make the argument on llvm-dev.)
  2. You could investigate why the branch weights cause a slowdown in the contended cases. Looking at the loop structure, I find it slightly odd that they have any impact at all, since they're not likely to affect code placement. There might be an independent tweak that could be made here.

Judging from the code clang generates, machine block placement is the pass that makes the difference. The basic blocks are laid out in a way that appears to hurt performance in the contended cases (more backward and forward branches to non-consecutive blocks).

This is the part of the program that ran slower because of my patch. You can probably see the difference if you compile it for arm64 or aarch64.

#include <atomic>

struct Node {
  Node(int ii, Node *nn) : i(ii), next(nn) {}
  Node() : i(0), next(nullptr) {}
  unsigned i;
  Node *next;
};

struct List {
  std::atomic<Node*> head;
};

List list1;

std::atomic<unsigned> flag;
unsigned sum = 0;

void addNodes(unsigned b, unsigned e) {
  while (b != e) {
    Node *n = new Node(b++, nullptr);
    Node *expected = list1.head;

    do {
      n->next = expected;
    } while (!std::atomic_compare_exchange_weak(&list1.head, &expected, n));
  }
}

Also, can you explain *why* you expected the branch weights to help in the first place? (i.e. what part of the optimizer/micro-architecture were you trying to exploit?) Maybe that will spark an idea for another approach.

I was simply assuming that real workloads would typically have little contention, and that if I set a low weight for the "failure" branch and a high weight for the "success" branch, the optimization passes would generate code that runs faster in the uncontended case. I wasn't trying to exploit any particular optimization pass; I was just expecting it to make a difference in the generated code.