This is an archive of the discontinued LLVM Phabricator instance.

[x86] enable storeOfVectorConstantIsCheap() target hook
ClosedPublic

Authored by spatel on Sep 4 2017, 4:30 PM.

Details

Summary

This allows vector-sized store merging of constants in DAGCombiner using the existing code in MergeConsecutiveStores(). All of the twisted logic that decides exactly which vector operations are legal and fast for each particular CPU is handled separately there using the appropriate hooks.
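
For context, a minimal sketch of the base TargetLowering hook being enabled here (signature approximated from the tree at the time; the conservative default is what keeps DAGCombiner from widening constant stores for targets that don't opt in):

  // Approximate shape of the base hook in TargetLowering.h. DAGCombiner's
  // MergeConsecutiveStores() consults this before merging scalar constant
  // stores into a single vector store; the default keeps that off.
  virtual bool storeOfVectorConstantIsCheap(EVT MemVT, unsigned NumElem,
                                            unsigned AddrSpace) const {
    return false;
  }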

Some notes:

  1. For the motivating tests in merge-store-constants.ll, we already produce the same vector code in IR via the SLP vectorizer. So this is just providing a backend backstop for code that doesn't go through that pass (-O1). More details in PR24449:

https://bugs.llvm.org/show_bug.cgi?id=24449 (this change should be the last step to resolve that bug)

  2. At the minimum vector size limit (16 bytes), we're trading two 8-byte scalar immediate stores for one 16-byte constant pool load and one 16-byte store (e.g., fold-vector-sext-crash2.ll::test_sext1). I think that's a reasonable trade-off because offloading any work to vector registers should ease pressure on register-starved scalar code, but let me know if there are other considerations. We could adjust this in the hook by returning true only for >2*max scalar size, so we know there would be an instruction reduction (see the sketch after these notes).
  3. There's a likely regression in fold-vector-sext-crash2.ll::test_zext1 and mod128.ll where we materialize a constant in scalar code and then send it over to the vector unit. I know we have some bug reports related to that. A quick scan turned up:

https://bugs.llvm.org/show_bug.cgi?id=26301
...but there are probably others.
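
A hypothetical sketch of the adjustment suggested in note 2 (the in-class definition and the hard-coded 8-byte scalar width are illustrative assumptions, not committed code):

  // In X86TargetLowering (X86ISelLowering.h), hypothetically: only claim the
  // merged vector store is cheap when it is wider than 2x the widest scalar
  // immediate store (8 bytes on x86-64), so merging strictly reduces the
  // instruction count.
  bool storeOfVectorConstantIsCheap(EVT MemVT, unsigned NumElem,
                                    unsigned AddrSpace) const override {
    return MemVT.getStoreSize() > 2 * 8;
  }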

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Sep 4 2017, 4:30 PM
zvi edited edge metadata. Sep 7 2017, 8:58 AM

Intuitively, it seems to me that choosing a minimum threshold, as suggested in note 2, is a better option.
Another concern for store-merging in general: are we more susceptible to losing store-to-load forwarding? I know that Intel pre-Nehalem processors had some limitations that were later improved. Sorry, I can't recall the full details off the top of my head; I'll look at the Optimization Manual for the info later.

In D37451#863494, @zvi wrote:

Intuitively, it seems to me that choosing a minimum threshold, as suggested in note 2, is a better option.

Yes, that should be a clearer win. It should also sidestep the scalar imm -> move to vector -> store regressions we see here.

Another concern for store-merging in general: are we more susceptible to losing store-to-load forwarding? I know that Intel pre-Nehalem processors had some limitations that were later improved. Sorry, I can't recall the full details off the top of my head; I'll look at the Optimization Manual for the info later.

I can see that concern in general, but we're only dealing with constant stores in this patch, so I would hope there's no problem rematerializing constants.

spatel updated this revision to Diff 114224. Sep 7 2017, 11:51 AM

Patch updated:
Don't enable vector store of constants unless we can replace more than 2 scalar stores. This avoids the borderline cases and sidesteps the regressions in the earlier rev.

big_nonzero_16_bytes() for x86-64 shows a different merging problem: we can't directly store 64-bit immediates, so we have to materialize the constants in registers and then store them with a separate instruction. The test is trying to store four 32-bit imms, so we probably shouldn't have done any merging there?

On 2nd thought, that doesn't make sense. We probably should merge that case, but the hook doesn't allow us to distinguish constant values, so we don't know whether an immediate can be placed directly in the store or not.

I think it's ok to only handle the larger cases in this patch and make that potential enhancement a follow-up. Fixing that might also solve the scalar-imm-to-vector-to-memory bug.
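
One purely hypothetical shape for that follow-up (this parameter does not exist in the hook; the ArrayRef and its name are made up for illustration):

  // Hypothetical extension, not an existing LLVM API: pass the constants
  // being merged so the target can decline when every value fits a direct
  // scalar immediate store (e.g., four 32-bit imms on x86-64).
  virtual bool storeOfVectorConstantIsCheap(ArrayRef<APInt> StoredVals,
                                            EVT MemVT, unsigned NumElem,
                                            unsigned AddrSpace) const;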

spatel updated this revision to Diff 114245. Sep 7 2017, 12:59 PM

Patch updated:
On 3rd thought... :)
We can simplify the code and handle the previously mentioned test on 64-bit too. I've added another test to show the potential follow-up improvement.
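
For reference, the updated override amounts to something like this minimal sketch (per the rule described above; not necessarily the exact committed code):

  // In X86TargetLowering (X86ISelLowering.h): merging is cheap only when one
  // vector store replaces more than 2 scalar stores, so there is a net
  // instruction reduction even after adding the vector constant-pool load.
  bool storeOfVectorConstantIsCheap(EVT MemVT, unsigned NumElem,
                                    unsigned AddrSpace) const override {
    return NumElem > 2;
  }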

RKSimon accepted this revision. Sep 15 2017, 9:39 AM

LGTM

This revision is now accepted and ready to land. Sep 15 2017, 9:39 AM
This revision was automatically updated to reflect the committed changes.