This is an archive of the discontinued LLVM Phabricator instance.

[X86] Tune bypassing of slow division for Intel CPUs
ClosedPublic

Authored by n.bozhenov on Dec 31 2016, 6:43 AM.

Details

Summary

64-bit integer division in Intel CPUs is extremely slow, much slower
than 32-bit division. On the other hand, 8-bit and 16-bit divisions
aren't any faster. The only important exception is Atom where DIV8
is fastest. Because of that, the patch

  1. Enables bypassing of 64-bit division for Atom, Silvermont and all big cores.
  2. Modifies 64-bit bypassing to use 32-bit division instead of 16-bit one. This doesn't make the shorter division slower but increases chances of taking it. Moreover, it's much more likely to prove at compile-time that a value fits 32 bits and doesn't require a run-time check (e.g. zext i32 to i64).

Diff Detail

Repository
rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 82763.Dec 31 2016, 6:43 AM
n.bozhenov retitled this revision from to [X86] Tune bypassing of slow division for Intel CPUs.
n.bozhenov updated this object.

This patch is part of a larger patch set:

  1. D28196
  2. D28197
  3. D28198
  4. D28199
  5. D28200
jlebar edited edge metadata.Dec 31 2016, 10:25 AM

I am confused about exactly which CPUs we do and don't want to do this transformation on, but I'll leave that up to people who know something about x86.

jlebar removed a reviewer: jlebar.Dec 31 2016, 10:25 AM
jlebar added a subscriber: jlebar.

cc'ing Simon for AMD knowledge. Based on Agner's tables, it seems like some/most of the AMD uarch's do this in hardware? Ie, the *minimum* reported latency is often the same for 32-bit and 64-bit divides even thought the maximum may be substantially longer for 64-bit. This suggests that the divider unit has some shortcut paths when the operands are determined to fit into a smaller width.

That said, it probably shouldn't hold this patch up because we can always add the feature flag to more CPUs as needed.

Can you update the CHECK lines in both of the test files using utils/update_llc_test_checks.py as a preliminary step ahead of this patch (no review required)? This has 3 benefits:

  1. We get tighter checks.
  2. It's easier to update the test files for this and future transforms.
  3. It's easier to see the diffs induced by this patch.

Note that you'll want to change the RUN lines in atom-bypass-slow-division-64.ll to include a triple ( -mtriple=x86_64-unknown-unknown ) rather than an arch, or you'll probably trigger bot failures.

test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗(On Diff #82763)

Is there some reason to choose skylake here rather than sandybridge? If not, I'd prefer SNB because that's the oldest big core where the feature is applied?

test/CodeGen/X86/slow-div.ll
29–43 ↗(On Diff #82763)

Add an 'optsize' test for the 64-bit divide too?

RKSimon added inline comments.Jan 8 2017, 4:25 AM
lib/Target/X86/X86.td
214 ↗(On Diff #82763)

Is losing the idivq-to-divw option likely to cause any problems to existing users?

test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗(On Diff #82763)

Drop the -march and move the triple in there instead.

Hi Sanjay,

Can you update the CHECK lines in both of the test files using utils/update_llc_test_checks.py as a preliminary step ahead of this patch (no review required)? This has 3 benefits:

  1. We get tighter checks.
  2. It's easier to update the test files for this and future transforms.
  3. It's easier to see the diffs induced by this patch.

Not sure what you mean exactly. Please take a look at D28469. Is it what you mean? Is any additional postprocessing needed after running update_llc_test_checks.py?

spatel added a comment.Jan 9 2017, 7:41 AM

Hi Sanjay,

Can you update the CHECK lines in both of the test files using utils/update_llc_test_checks.py as a preliminary step ahead of this patch (no review required)? This has 3 benefits:

  1. We get tighter checks.
  2. It's easier to update the test files for this and future transforms.
  3. It's easier to see the diffs induced by this patch.

Not sure what you mean exactly. Please take a look at D28469. Is it what you mean? Is any additional postprocessing needed after running update_llc_test_checks.py?

Yes, that's what I was suggesting in general. Once you have generated the baseline checks with the script, it is very simple to apply your code patch and then run the script again to update the checks.

Usually for small tests like this, we prefer to leave the auto-generated checks as-is (no hand edits), but if you think there is a lot of unnecessary checking with the script, you can remove those lines. Similarly, if the script does not capture some useful info (like loaded constant values), you can add those lines to the auto-generated checks.

n.bozhenov added inline comments.Jan 9 2017, 7:51 AM
lib/Target/X86/X86.td
214 ↗(On Diff #82763)

Is losing the idivq-to-divw option likely to cause any problems to existing users?

Well, in theory it is possible for Atoms, because DIV16 is somewhat faster for them than DIV32. But still DIV64 is way way slower than DIV32, so this change is probably a win even for Atoms because obviously DIV32 can be taken more often than DIV16. For all other Intel architectures (including Silvermont) there's no sense in idivq-to-divw transformation.

test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗(On Diff #82763)

Is there some reason to choose skylake here rather than sandybridge?

It doesn't make much difference because in D28197 I change this into -mattr=+idivq-to-divl

3 ↗(On Diff #82763)

Drop the -march and move the triple in there instead.

Ok, I will. But what is the difference?

RKSimon added inline comments.Jan 9 2017, 8:55 AM
test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗(On Diff #82763)

Drop the -march and move the triple in there instead.

Ok, I will. But what is the difference?

Triple sets the arch and the abi for you, so just avoids duplication.

n.bozhenov updated this revision to Diff 83932.Jan 11 2017, 2:19 AM
n.bozhenov edited edge metadata.

Rebased atom-bypass-slow-division*.ll tests on D28469.

Didn't rebase slow-div.ll because in the next patch I would add several RUN lines into the test which produce similar but not exactly the same assembly. So, I decided to check here only whether the transformation is applied or not.

n.bozhenov added inline comments.Jan 11 2017, 2:43 AM
test/CodeGen/X86/slow-div.ll
29–43 ↗(On Diff #82763)

Such a test is added in D28551 (see bypass-slow-division-tune.ll file)

Rebased atom-bypass-slow-division*.ll tests on D28469.

Didn't rebase slow-div.ll because in the next patch I would add several RUN lines into the test which produce similar but not exactly the same assembly. So, I decided to check here only whether the transformation is applied or not.

Please correct me if I'm wrong, but this patch is now making 2 functional changes, but only testing for the 1st one:

  1. Fix the definition of FeatureSlowDivide64 for existing CPUs (Bonnell/Silvermont)
  2. Add the FeatureSlowDivide64 to SNB and later big cores

If that's correct, you should remove the 2nd code change from this patch and submit it as a follow-up to this patch with a corresponding test. Alternatively, you could bring the minimum test changes from the other test file back into this patch since that's a small difference. But we don't want to have a functional change with no testing. I know this is a patch series with appropriate testing at the end, but it's important to have the right tests in at every step just in case one or more of those steps needs to be reverted.

n.bozhenov updated this revision to Diff 83992.Jan 11 2017, 9:33 AM

Rebased atom-bypass-slow-division*.ll tests on D28469.

Didn't rebase slow-div.ll because in the next patch I would add several RUN lines into the test which produce similar but not exactly the same assembly. So, I decided to check here only whether the transformation is applied or not.

Please correct me if I'm wrong, but this patch is now making 2 functional changes, but only testing for the 1st one:

  1. Fix the definition of FeatureSlowDivide64 for existing CPUs (Bonnell/Silvermont)
  2. Add the FeatureSlowDivide64 to SNB and later big cores

If that's correct, you should remove the 2nd code change from this patch and submit it as a follow-up to this patch with a corresponding test. Alternatively, you could bring the minimum test changes from the other test file back into this patch since that's a small difference. But we don't want to have a functional change with no testing. I know this is a patch series with appropriate testing at the end, but it's important to have the right tests in at every step just in case one or more of those steps needs to be reverted.

Does it look better now? I have added a Sandy Bridge test into atom-bypass-slow-division-64.ll

spatel accepted this revision.Jan 11 2017, 10:07 AM
spatel edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jan 11 2017, 10:07 AM
This revision was automatically updated to reflect the committed changes.