This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable interleaved memory accesses by default
ClosedPublic

Authored by mkuper on Oct 6 2016, 4:10 PM.

Details

Summary

Following r283480, we should, hopefully, have no regressions from this.
If you want to give this a spin with your workloads before I commit, let me know.

Diff Detail

Event Timeline

mkuper updated this revision to Diff 73865.Oct 6 2016, 4:10 PM
mkuper retitled this revision from to [X86] Enable interleaved memory accesses by default.
mkuper updated this object.
mkuper added subscribers: llvm-commits, zvi, Farhana and 2 others.
mkuper added inline comments.Oct 6 2016, 5:29 PM
test/Transforms/LoopVectorize/X86/interleaving.ll
1

Note that this test's only purpose is to verify that interleaved memory access is *turned on* for x86, it's not a test for interleaved memory access functionality.
The general functional test is Transforms/LoopVectorize/interleaved-accesses.ll

RKSimon edited edge metadata.Oct 10 2016, 8:02 AM

Any performance test gains/regressions?

Any performance test gains/regressions?

We've had performance gains on internal benchmarks. (And, anecdotally, one apparent regression that turned out to be a gain: we started vectorizing a loop we were not vectorizing before, and it really should perform better when vectorized, but the dynamic trip count of that loop tends to be 1 or 2...)
As far as I know Intel also had performance gains on the benchmarks they run, but I'll let Ayal/Elena/Dorit speak for themselves.

As to public benchmarks - I don't have an up-to-date SPEC run with this unfortunately. I can run one if you want - unless Intel already happen to have the results handy?

Anyway, if you have internal benchmarks you want to run this on pre-commit, please do - codegen isn't necessarily happy with the shuffle sequences this generates. We initially had some really bad regressions, which led to r283480. I don't expect any more big surprises, since the x86 cost model is still *really* conservative w.r.t. interleaved memory accesses (teaching it to be more precise is a separate issue, and probably ties in with Farhana's work in D24681), but who knows.

mkuper added a subscriber: dorit.Oct 10 2016, 8:25 AM

SPEC2006 looks flat.

We're seeing some minor but consistent regressions (< 1%) on some internal tests.

We're seeing some minor but consistent regressions (< 1%) on some internal tests.

That's really interesting - that means it's firing. :-)

Can you provide a reproducer? I can see basically three possible sources of regressions:

  1. We're still doing very bad lowering for some shuffle sequences.
  2. The cost model is too optimistic regarding the cost of the sequence, even with optimal lowering.
  3. "Normal" vectorizer flakiness - the cost model is doing the right thing about the interleave sequence, but we shouldn't be vectorizing regardless. So without interleaving enabled we were just getting lucky.

I really hope what you're seeing is a case of (1).
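For context, this is the kind of loop the interleaved-access analysis targets (a hypothetical sketch, not code from the patch or from the reported regressions): two stride-2 loads over interleaved {real, imag} pairs, which the vectorizer can turn into one wide load per group plus shuffles once interleaving is enabled.

```c
#include <stddef.h>

/* Hypothetical example: a stride-2 loop over interleaved {real, imag}
 * pairs. With interleaved accesses enabled, the loop vectorizer can
 * replace the two strided scalar loads with a single wide load per
 * iteration group, followed by shuffles to separate the even and odd
 * elements. */
void sum_interleaved(const float *data, size_t n_pairs,
                     float *re_sum, float *im_sum) {
    float re = 0.0f, im = 0.0f;
    for (size_t i = 0; i < n_pairs; ++i) {
        re += data[2 * i];     /* stride-2 access, offset 0 */
        im += data[2 * i + 1]; /* stride-2 access, offset 1 */
    }
    *re_sum = re;
    *im_sum = im;
}
```

Whether a loop like this regresses or improves then comes down to how well the backend lowers the resulting shuffle sequence, which is exactly the distinction between cases (1) and (2) above.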

We're seeing some minor but consistent regressions (< 1%) on some internal tests.

That's really interesting - that means it's firing. :-)

Can you provide a reproducer? I can see basically three possible sources of regressions:

We're trying to drill down to the codegen diffs - it's a large codebase and it's putting up a fight... The regression was reduced when we enabled LTO and/or PGO.

Hi Michael,

I thought the plan was to avoid interleaved vectorization when targeting Atom. Did that plan change or get handled in some other way?

Hi Michael,

I thought the plan was to avoid interleaved vectorization when targeting Atom. Did that plan change or get handled in some other way?

No, this is my mistake, sorry about that.
I'll update the patch, thanks.

mkuper updated this revision to Diff 74553.Oct 13 2016, 11:11 AM
mkuper edited edge metadata.

Updated to disable this on Atom.
Intel are still looking into why they're getting regressions on Atom - for the same code that performs much better on Intel big cores.

Hi Michael,

We see a few eembc (mp2decode on SLM) and coremark regressions of around 3-6%, and a couple of 3-6% geekbench (HSW) regressions, which we can follow up on.

In addition to this, we see significant (60%+) gains in denbench/rgb tests on HSW, which are nice.

Performance-wise, I think the change looks good to us.

I can give you (or anyone else) more specifics on the gains/regressions if interested.

Thanks,
Zia.

Thanks Zia!

As I've discussed with Ayal, the SLM-specific regressions should probably be looked at separately - but I'd appreciate more details on the HSW regressions.

Michael

Simon, any news on your end?

Simon, any news on your end?

So looking through the before + after code we're seeing 2 types of diff:

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.). This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind) and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.
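The "clamp + sum" horizontal-reduction pattern from point 1 might look roughly like the following (a hypothetical sketch with illustrative names and bounds, not the actual regressing code): each element is clamped to a range, which maps to pminsd/pmaxsd when vectorized, and the clamped values are summed, which maps to hadds or shuffles plus adds.

```c
#include <stdint.h>

/* Hypothetical sketch of a "clamp + sum" horizontal reduction: clamp
 * each element to [lo, hi] and accumulate. When vectorized, the two
 * branches become pmaxsd/pminsd against splatted bounds, and the final
 * accumulation becomes a horizontal add sequence. */
int32_t clamp_sum(const int32_t *v, int n, int32_t lo, int32_t hi) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        int32_t x = v[i];
        if (x < lo) x = lo;   /* pmaxsd with the splatted lower bound */
        if (x > hi) x = hi;   /* pminsd with the splatted upper bound */
        sum += x;
    }
    return sum;
}
```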

Thanks for investigating this, Simon!

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.).

Is this AVX or AVX2? I mean, do we get this just because of having to perform integer ops on xmms, or is this just part of the resulting shuffle sequence?

This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

We should be able to create one and address this as a follow up issue.

2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.).

Is this AVX or AVX2? I mean, do we get this just because of having to perform integer ops on xmms, or is this just part of the resulting shuffle sequence?

This is AVX1 on a Jaguar CPU - so internally it's a 128-bit ALU that double-pumps ymm instructions. It can be sensitive to large amounts of dependent ymm code like this.

This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

I have a possible shuffle patch that covers both of these, but haven't had time to finish it - it's a rewrite of lowerVectorShuffleByMerging128BitLanes that acts a bit like lowerShuffleAsRepeatedMaskAndLanePermute, but in reverse (a multiple-input lane permute followed by a repeated mask).

and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

Sorry, I meant such a change could fix the regression - and possibly allow a great deal more shuffle folding. It'll be a fine balance as to when to let it fire though.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

Yes, I think I'm happy for this to go ahead; the regression areas we can work on afterward. Most can be solved during lowering and are existing issues - it's just that interleaving makes them a little more obvious!

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

We should be able to create one and address this as a follow up issue.

Great, please CC me on the PR when you have one, I'll look into it.

This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

I have a possible shuffle patch that covers both of these, but haven't had time to finish it - it's a rewrite of lowerVectorShuffleByMerging128BitLanes that acts a bit like lowerShuffleAsRepeatedMaskAndLanePermute, but in reverse (a multiple-input lane permute followed by a repeated mask).

Sounds good, feel free to add me to the review.

and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

Sorry, I meant such a change could fix the regression - and possibly allow a great deal more shuffle folding. It'll be a fine balance as to when to let it fire though.

Sorry, I wasn't clear - I understood what you meant. I'm just confused about why we're producing this pattern.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

Yes, I think I'm happy for this to go ahead; the regression areas we can work on afterward. Most can be solved during lowering and are existing issues - it's just that interleaving makes them a little more obvious!

Of course this doesn't introduce CG issues, it was just a question of which of the ones it exposes we "prefetch" vs. which ones we handle later. :)

Anyway, this sounds good to me, thanks a lot.
I'm going to push this in. If we see significant regressions coming from other people, I'm perfectly ok with reverting and reapplying after they're fixed.

RKSimon accepted this revision.Oct 20 2016, 1:51 PM
RKSimon edited edge metadata.

LGTM

This revision is now accepted and ready to land.Oct 20 2016, 1:51 PM
This revision was automatically updated to reflect the committed changes.