Page MenuHomePhabricator

[AArch64] Expand bcmp() for small comparisons

Authored by evandro on Jul 16 2019, 9:20 AM.



Patch D56593 by @courbet causes bcmp() to be emitted in some cases. Unfortunately, TTI::MemCmpExpansionOptions() is not overridden by AArch64, which results in calls to bcmp() even for small constant values. In a proprietary benchmark we see a performance drop of about 12% on PNG compression before this patch, though it passes all tests.

This patch mirrors how X86 initializes TTI::MemCmpExpansionOptions() to then expand calls to bcmp() when appropriate. No tuning of the parameters was performed, but, at this point, it's enough to recover the performance drop above.

This problem also exists on ARM. Once a consensus is reached for AArch64, we can work to fix ARM as well.

Authors: Evandro Menezes (@evandro) <>, Brian Rzycki (@brzycki) <>

Diff Detail


Event Timeline

brzycki created this revision.Jul 16 2019, 9:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 16 2019, 9:20 AM
SjoerdMeijer added inline comments.Jul 17 2019, 1:04 AM
625 ↗(On Diff #210116)

why 2?

626 ↗(On Diff #210116)

How about vector loads and alignment? Can we generate them for this, and do we then also need 16 byte loads to loadsizes?

2 ↗(On Diff #210116)

I think tests are missing for:

  • code-size
  • edge cases for NumLoadsPerBlock and MaxNumLoads

Hello @SjoerdMeijer , thank you for the review.

625 ↗(On Diff #210116)

@evandro selected these based on the x8_64 values set in llvm/lib/Target/X86/X86TargetTransformInfo.cpp:

Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize);
Options.NumLoadsPerBlock = 2;
// ...
if (ST->is64Bit()) {

I do not know why x86_64 selected these values.

626 ↗(On Diff #210116)

Interesting question, I'll defer to @evandro when he can comment on the ticket. There is a bit more logic in x86_64 he stripped out when building this patch:

if (IsZeroCmp) {
  // Only enable vector loads for equality comparison. Right now the vector
  // version is not as fast for three way compare (see #33329).
  // TODO: enable AVX512 when the DAG is ready.
  // if (ST->hasAVX512()) Options.LoadSizes.push_back(64);
  const unsigned PreferredWidth = ST->getPreferVectorWidth();
  if (PreferredWidth >= 256 && ST->hasAVX2()) Options.LoadSizes.push_back(32);
  if (PreferredWidth >= 128 && ST->hasSSE2()) Options.LoadSizes.push_back(16);
  // All GPR and vector loads can be unaligned. SIMD compare requires integer
  // vectors (SSE2/AVX2).
  Options.AllowOverlappingLoads = true;
2 ↗(On Diff #210116)

Understood. I'll try to build cases to accommodate these checks.

FYI there are some benchmarks in that are not submitted but could be used as basis for making decisions for this patch.

625 ↗(On Diff #210116)

This does trades loads for jumps: with this setting, each block will be twice as big (more work per block) but there will be half as many blocks.

This was found to be more efficient in general on x86, and might not be true for arm. This will have to be benchmarked.

626 ↗(On Diff #210116)

For x86, vector loads are enabled only for equality comparison because there is no efficient vector-sized bswap.

I don't know much about arm, but:

  • if vector loads and equality comparisons are efficient there is no reason not to use them if IsZeroCmp
  • if there is an efficient vector-sized bswap vector sizes can be used in both cases
evandro added inline comments.Jul 23 2019, 6:43 PM
625 ↗(On Diff #210116)

As @courbet said, the best figure may be different, but this seemed like a sensible starting point. Since A64 has more registers, methinks that a higher value might be beneficial, but it depends on how many live registers there are.

626 ↗(On Diff #210116)

Vector loads are efficient, performance wise, on A64, but typically not so much power wise. That's why I'm not adding them just yet. If so, then vector byte swaps may be used.

evandro commandeered this revision.Jul 24 2019, 5:06 PM
evandro edited reviewers, added: brzycki; removed: evandro.
evandro updated this revision to Diff 211642.Jul 24 2019, 5:09 PM
evandro retitled this revision from [AArch64] Don't call bcmp for small byte comparisons to [AArch64] Expand bcmp() for small comparisons.
evandro edited the summary of this revision. (Show Details)
evandro marked 10 inline comments as done.Jul 24 2019, 5:12 PM

Still think regressions tests are missing, e.g. for code-size (see a previous comments).

SjoerdMeijer added inline comments.Aug 5 2019, 8:38 AM
29 ↗(On Diff #211642)

ah, sorry, that test case is here

SjoerdMeijer added inline comments.Aug 5 2019, 8:46 AM
625 ↗(On Diff #210116)

These options look a lot more sensible now as TLI is used, but has this actually been benchmarked?

626 ↗(On Diff #210116)

Adding this comment as a // FIXME or // TODO would be good I guess.

evandro updated this revision to Diff 213375.Aug 5 2019, 9:07 AM
evandro marked 2 inline comments as done.
evandro edited the summary of this revision. (Show Details)
evandro added inline comments.Aug 5 2019, 10:13 AM
625 ↗(On Diff #210116)

As I said above, not extensively. But enough to close the gap that motivated this patch. As it stands, even without carefully mined values, at least the egregious regression from not inlining bcmp() is addressed.

SjoerdMeijer accepted this revision.Aug 5 2019, 10:40 AM

Okay, cheers, looks reasonable to me.

This revision is now accepted and ready to land.Aug 5 2019, 10:40 AM
This revision was automatically updated to reflect the committed changes.