This is an archive of the discontinued LLVM Phabricator instance.

[X86] Replace avx2.pbroadcast intrinsics with native IR.
ClosedPublic

Authored by ab on Jun 18 2015, 4:45 PM.

Details

Summary

While working on PR23464, the broadcast intrinsics annoyed me. Let's remove them: it's one of the simplest shuffle kind, IR is good enough.

Diff Detail

Event Timeline

ab updated this revision to Diff 27972.Jun 18 2015, 4:45 PM
ab retitled this revision from to [X86] Replace avx2.pbroadcast intrinsics with native IR..
ab updated this object.
ab edited the test plan for this revision. (Show Details)
ab added a reviewer: spatel.
ab added a subscriber: Unknown Object (MLST).

Hi Ahmed -

I had a similar patch in D8276, but then it was pointed out that this may not be the best solution. The concern is that doing it this way may alter/optimize code at -O0 (debugging vector code is hard enough without the damn compiler getting in the way!).

So for my next patch on this front, D8486, I used instcombine so there would be no changes in the front-end or -O0.

Adding some other folks who might have an opinion about the best way forward.

ab added a comment.Jun 19 2015, 11:39 AM

Hi Ahmed -

I had a similar patch in D8276, but then it was pointed out that this may not be the best solution. The concern is that doing it this way may alter/optimize code at -O0 (debugging vector code is hard enough without the damn compiler getting in the way!).

So for my next patch on this front, D8486, I used instcombine so there would be no changes in the front-end or -O0.

Ah, I saw the vinsert/vextract commits and figured "this is fine"; I missed the other half!

To make sure I understand: this is only a problem because of DAGCombines running at -O0, right? (and perhaps some of the lowering being too smart? though without combines I'd find that surprising)
And this in turn is only a problem because the C intrinsics (_mm_*) are always inlined, and thus can be combined, right?

I guess there's a reason not to, but would it make sense to avoid inlining them at -O0? That oughta give you the best debuggability we can give, right?

-Ahmed

In D10555#191124, @ab wrote:

To make sure I understand: this is only a problem because of DAGCombines running at -O0, right? (and perhaps some of the lowering being too smart? though without combines I'd find that surprising)
And this in turn is only a problem because the C intrinsics (_mm_*) are always inlined, and thus can be combined, right?

I think the problem is independent of inlining and DAGCombines. As an example, consider this:

__m128 foo(__m256 a) {
  return _mm256_extractf128_ps(a, 0);
}

After D8275, this becomes a shufflevector in clang, and there's not much hope of turning it back into a vextractf128. It becomes an ISD::EXTRACT_SUBVECTOR in the DAG without any combiner opts AFAICT. Then, it turns into a EXTRACT_SUBREG machine inst. Then, it's either just a move or nothing at all in x86.

silvas edited edge metadata.Jun 19 2015, 4:48 PM
In D10555#191124, @ab wrote:

To make sure I understand: this is only a problem because of DAGCombines running at -O0, right? (and perhaps some of the lowering being too smart? though without combines I'd find that surprising)
And this in turn is only a problem because the C intrinsics (_mm_*) are always inlined, and thus can be combined, right?

I think the problem is independent of inlining and DAGCombines. As an example, consider this:

__m128 foo(__m256 a) {
  return _mm256_extractf128_ps(a, 0);
}

If _mm256_extractf128_ps is a proper function instead of a macro (using the enable_if trick if necessary), would Ahmed's suggestion work for keeping these debuggable?

In D10555#191124, @ab wrote:

To make sure I understand: this is only a problem because of DAGCombines running at -O0, right? (and perhaps some of the lowering being too smart? though without combines I'd find that surprising)
And this in turn is only a problem because the C intrinsics (_mm_*) are always inlined, and thus can be combined, right?

I think the problem is independent of inlining and DAGCombines. As an example, consider this:

__m128 foo(__m256 a) {
  return _mm256_extractf128_ps(a, 0);
}

If _mm256_extractf128_ps is a proper function instead of a macro (using the enable_if trick if necessary), would Ahmed's suggestion work for keeping these debuggable?

I tried an experiment with:

__m128i foo(__m128i x) {
  return _mm_add_epi32(x, _mm_set1_epi32(0));  // so easy to optimize, but...must...resist!
}

...because that's defined as a proper function:

static __inline__ __m128i DEFAULT_FN_ATTRS
_mm_add_epi32(__m128i __a, __m128i __b)
{
  return (__m128i)((__v4si)__a + (__v4si)__b);
}

The add is present in the unoptimized IR, but it's gone in the asm. Removing the 'inline' didn't appear to change anything in this example.
Removing 'inline' could cause a different problem - vector coders really don't want those header files showing up in profiles or stepping in/out while debugging. IIRC, that happened for some reason with gcc about 10 years ago and had to be worked around.

RKSimon edited edge metadata.Aug 15 2015, 11:15 AM

So I revisited this as I've been messing with instcombiner reduction of intrinsics a lot recently.

Looking at the O0/O1/O2 codegen, the pbroadcast (and the broadcastss/broadcastsd register variants) are well behaved and keep to the expected instructions - we're not doing anything different here to how many of the other shuffle intrinsics are already implemented in the headers. The only one that has problems is _mm256_broadcastsi128_si256 (vbroadcasti128) which isn't being proposed here.

Along with an update of avx2intrin.h to call __builtin_shufflevector directly (and suitable tests to ensure that debug code doesn't change in the future) I'd say that this should be a win, but if people are still hesitant we should at least push forward with support in instcombiner now instead of putting it off.

Along with an update of avx2intrin.h to call __builtin_shufflevector directly (and suitable tests to ensure that debug code doesn't change in the future) I'd say that this should be a win, but if people are still hesitant we should at least push forward with support in instcombiner now instead of putting it off.

My vote is for an instcombine solution just for the sake of implementation consistency...of course, I don't think there's been any consensus that the instcombine solution is the right way forward.

Possibly related: I just hit the same problem as https://llvm.org/bugs/show_bug.cgi?id=24125 , so anything we can do to make that fix easier should be considered. The current crashing is really unpleasant.

chandlerc edited edge metadata.Aug 17 2015, 11:21 AM

FWIW, I really like this patch. Is there anything we can do to make this work?

FWIW, I really like this patch. Is there anything we can do to make this work?

It appears we have a few things that need to be decided before going any further:

1 - When is it permitable to replace a (sub)target-specific intrinsic with a non-specific implementation in the headers (e.g. using __builtin_shufflevector for these broadcasts)?

As long as the expected instruction remains in debug code I'm keen for this to be encouraged - we can add suitable tests, remove those builtin intrinsics to AutoUpgrade.cpp until 4.0 and get much cleaner headers.

2 - When is it permitable to replace a (sub)target-specific intrinsic in IR/DAG creation, and should that occur in InstCombine or in the target ISel code someplace?

I'd vote for InstCombine as we already appear to have a critical mass of intrinsics here.

3 - What are we going to do to fix the issue introduced by the header refactor removing the target guards, causing a tricky to decipher 'Cannot select: intrinsic %llvm.x86.vcvtps2ph.128' style backend error for intrinsics that are implemented as macros?

A quick+nasty solution would be to add header guards at least around each of those macros.

spatel accepted this revision.Aug 18 2015, 8:38 AM
spatel edited edge metadata.

FWIW, I really like this patch. Is there anything we can do to make this work?

It appears we have a few things that need to be decided before going any further:

1 - When is it permitable to replace a (sub)target-specific intrinsic with a non-specific implementation in the headers (e.g. using __builtin_shufflevector for these broadcasts)?

As long as the expected instruction remains in debug code I'm keen for this to be encouraged - we can add suitable tests, remove those builtin intrinsics to AutoUpgrade.cpp until 4.0 and get much cleaner headers.

2 - When is it permitable to replace a (sub)target-specific intrinsic in IR/DAG creation, and should that occur in InstCombine or in the target ISel code someplace?

I'd vote for InstCombine as we already appear to have a critical mass of intrinsics here.

3 - What are we going to do to fix the issue introduced by the header refactor removing the target guards, causing a tricky to decipher 'Cannot select: intrinsic %llvm.x86.vcvtps2ph.128' style backend error for intrinsics that are implemented as macros?

A quick+nasty solution would be to add header guards at least around each of those macros.

At some point while working on one of these, Andrea told me about one more place where we do some builtin/intrinsic handling. I think that was CGBuiltin.cpp in clang's CodeGen. It seems messy that we have at least 3 ways of dealing with these things, but there are probably good reasons for each.

I don't want to hold up progress, so I don't object to this patch going in as-is (especially since Simon confirmed that -O0 code looks fine for these cases). But it would be great to answer the design questions that Simon has raised here for our collective future reference. There will surely be more intrinsics where these came from. :)

This revision is now accepted and ready to land.Aug 18 2015, 8:38 AM
This revision was automatically updated to reflect the committed changes.