
[X86] Lowering rotation intrinsics to native IR

Authored by tkrupa on May 17 2018, 10:07 AM.



This patch relies on changes introduced in D46946 and must be upstreamed after it.

Diff Detail

Event Timeline

tkrupa created this revision.May 17 2018, 10:07 AM

I don't really feel qualified to review this. But FWIW, in Mesa we never used rotate instructions, and for that matter we were not relying on shift intrinsics either (we've got some cases where we actually need modulo-width behavior, so we're masking off the bits).
But the larger-than-or-equal-to-bit-width cases do make things more complex. Shift is such a nice instruction - if only everybody could agree on what the behavior should be when the shift count is larger than the bit width...
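The masking trick mentioned above can be sketched in plain C. This is a hypothetical helper (not code from the patch): masking the count with 31 gives the modulo-width behavior the hardware rotates have and sidesteps the undefined shift-by-32 case entirely.

```c
#include <stdint.h>

/* Hypothetical UB-free 32-bit rotate-left. The count is reduced
 * modulo the bit width, matching the semantics of the AVX512/XOP
 * rotate instructions; the second mask avoids a shift by 32 when
 * n is a multiple of 32. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;                              /* count modulo bit width */
    return (x << n) | (x >> ((32 - n) & 31));
}
```

Note that both shift counts stay in [0, 31], so neither shift is undefined behavior for any input.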

RKSimon added inline comments.May 18 2018, 5:35 AM

XOP handling?

Have you seen the current llvm-dev thread about adding a generic rotate intrinsic?

This transform has problems when some of the instructions get hoisted from loops (and that's likely the most important consideration for perf).

Here's a minimal example to demonstrate:

#include <immintrin.h>

void rotateInLoop(unsigned *x, unsigned N, __m128i *a, __m128i b) {
  for (unsigned i = 0; i < N; ++i)
    x[ _mm_extract_epi32(_mm_rolv_epi32(a[i], b), 0) ] = i;
}

Before this patch:

$ ./clang rotv.c -S -O1 -o - -mavx512vl
	vmovdqa	(%rdx), %xmm1
	vprolvd	%xmm0, %xmm1, %xmm1
	vmovd	%xmm1, %esi
	movslq	%esi, %rsi
	movl	%ecx, (%rdi,%rsi,4)
	incq	%rcx
	addq	$16, %rdx
	cmpq	%rcx, %rax
	jne	LBB0_2

After this patch:

	vmovdqa	(%rdx), %xmm2
	vpsllvd	%xmm0, %xmm2, %xmm3
	vpsrlvd	%xmm1, %xmm2, %xmm2
	vpor	%xmm3, %xmm2, %xmm2
	vmovd	%xmm2, %esi
	movslq	%esi, %rsi
	movl	%ecx, (%rdi,%rsi,4)
	incq	%rcx
	addq	$16, %rdx
	cmpq	%rcx, %rax
	jne	LBB0_2

I think you'll either need to implement this first:
...or limit this patch to the non-variable rotates, or just wait for the generic intrinsic?

Why are we going through shift intrinsics to do this? Why can't we just emit shl and lshr instructions directly?


Just create an And with 31 or 63? I believe one of the signatures of CreateAnd even takes a uint64_t as an argument.

Because emitting shifts in IR is more complicated than just adding a shl/lshr node due to those poison values (see D46946), and it would create some redundant code. I guess I can use simplifyX86immShift directly instead of emitting a call here.
As for the bug - much more than one instruction gets hoisted out of the loop after applying the shift lowering patch - I'm leaning toward leaving only the non-variable intrinsics in this patch and implementing the variable ones after the generic intrinsic is introduced.

@tkrupa If you're still interested in working on this, converting the rotations to generic funnel shift intrinsics is the better way to go - it's a single intrinsic call so you don't have IR splitting issues, the AVX512 + XOP rotates respect the modulo amount, and both the variable and splat-immediate variants are fully supported by the x86 backend for lowering. Instead of InstCombine I'd probably suggest performing this both in the clang frontend and as an auto upgrade for the existing intrinsics.
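The funnel-shift mapping suggested here can be sketched in LLVM IR. A rotate-left is `llvm.fshl` with the same value in both data operands; this is an illustrative sketch of how `_mm_rolv_epi32(a, b)` could lower, not the exact autoupgrade output:

```llvm
; Rotate-left as a generic funnel shift: fshl concatenates the first
; two operands and shifts left by the third, which is interpreted
; modulo the element width - matching the AVX512/XOP rotate semantics.
define <4 x i32> @rolv(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %a, <4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}
declare <4 x i32> @llvm.fshl.v4i32(<4 x i32>, <4 x i32>, <4 x i32>)
```

Because the whole rotate is a single intrinsic call, nothing can be partially hoisted out of a loop the way the separate shl/lshr/or sequence was in the example above.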

I'm no longer working on this. AFAIK this task and D46946 have been reassigned to @Jianping or @LuoYuanke.

In which case I may do the rotation work myself - D55747 is the only outstanding issue AFAICT and it'd be good to start getting more thorough funnel shift usage into the code.

@tkrupa This can be abandoned now that the x86 vector rotation intrinsics emit/autoupgrade to generic funnel shifts.

tkrupa abandoned this revision.Jan 2 2019, 12:19 AM