This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Generalize X86ISD::BLENDI support to more value types (WIP)
ClosedPublic

Authored by RKSimon on Feb 7 2019, 5:28 AM.

Details

Summary

WIP patch for comments.

D42042 introduced the ability for the ExecutionDomainFixPass to more easily change between BLENDPD/BLENDPS/PBLENDW as the domains required.

With this ability, we can avoid most bitcasts/scaling in the DAG that was occurring with X86ISD::BLENDI lowering/combining, blend with the vXi32/vXi64 vectors directly and use isel patterns to lower to the float vector equivalent vectors.

This helps the shuffle combining and SimplifyDemandedVectorElts be more aggressive as we lose track of fewer UNDEF elements than when we go up/down through bitcasts.

I've introduced a basic blend(bitcast(x),bitcast(y)) -> bitcast(blend(x,y)) fold, there are more generalizations I can do there (e.g. widening/scaling and handling the tricky v16i16 repeated mask case).

I haven't gotten to the bottom of the vector-reduce-smin/smax regression either yet.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Feb 7 2019, 5:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2019, 5:28 AM
craig.topper accepted this revision.Feb 8 2019, 8:51 AM

LGTM. I assume you'll continue following up on the reduce-smin/smax regression.

This revision is now accepted and ready to land.Feb 8 2019, 8:51 AM
This revision was automatically updated to reflect the committed changes.
RKSimon reopened this revision.Feb 11 2019, 6:45 AM
RKSimon added reviewers: sammccall, dlj.
RKSimon added subscribers: sammccall, dlj.

Let's try this again....

Reopening as it was reverted at rL353699 due to a rather weird regression.....

@dlj @sammccall Please can you post your repros here?

This revision is now accepted and ready to land.Feb 11 2019, 6:45 AM

I bootstrapped clang with this patch applied: on x86-64 with -O3 -msse4.2 and assertions off.

clang -c elfcore.c

yields the following

/usr/local/google/home/sammccall/elfcore.c:4:14: warning: format string contains '\0' within the string body [-Wformat]
void run() { MACRO2(a + b + c); }
             ^~~~~~~~~~~~~~~~~
/usr/local/google/home/sammccall/elfcore.c:3:19: note: expanded from macro 'MACRO2'
#define MACRO2(x) PASTE_AND_FORMAT(a, x)
                  ^~~~~~~~~~~~~~~~~~~~~~
/usr/local/google/home/sammccall/elfcore.c:2:42: note: expanded from macro 'PASTE_AND_FORMAT'
#define PASTE_AND_FORMAT(a, b) format(#a #b)
                                      ~~~^~
<scratch space>:3:8: note: expanded from here
"a P b <U+0000> c"
       ^
1 warning generated.

operator tokens (and some other characters like () get deterministically garbled when tokens are pasted in scratch space.

Thanks @sammccall I've been able to repro it now.

This revision was automatically updated to reflect the committed changes.

@dlj @sammccall I think I've fixed the root issue now (rL354358) - please reply on this ticket if there are any more issues. Cheers!

rnk added a subscriber: rnk.Feb 22 2019, 5:02 PM

@dlj @sammccall I think I've fixed the root issue now (rL354358) - please reply on this ticket if there are any more issues. Cheers!

I think I need to revert this again, the following fails after this commit:

$ cat autofit.cpp
typedef struct {
  long b, c;
  long d, f
} g;
typedef struct {
  long h;
  long height;
  long i;
  long j
} k;
struct l {
  int library;
  int *e;
  struct l *n;
  int o;
  k metrics
};
struct l *p;
q() {
  g bbox;
  r(bbox);
  bbox.d = bbox.d & 3;
  bbox.f = bbox.f & 3;
  p->metrics.h = bbox.d - bbox.b;
  p->metrics.height = bbox.f - bbox.c;
  p->metrics.i = bbox.b;
  p->metrics.j = bbox.f;
}

$ clang -cc1 -triple x86_64-unknown-linux-android -emit-obj -disable-free -main-file-name autofit.c -mrelocation-model pic -pic-level 2 -mthread-model posix -fmerge-all-constants -relaxed-aliasing -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -target-feature +sse4.2 -target-feature +popcnt -momit-leaf-frame-pointer -ffunction-sections -fdata-sections -Oz -std=c11 -ferror-limit 1 -vectorize-slp  -x c -fcomplete-member-pointers -w autofit.cpp
...
clang: /usr/local/google/home/rnk/llvm-project/llvm/include/llvm/CodeGen/MachineInstr.h:418: llvm::MachineOperand &llvm::MachineInstr::getOperand(unsigned int): Assertion `i < getNumOperands() && "getOperand() out of range!"' failed.

The issue is that blendps/blendpd can be commuted to movss/movsd under optsize. This changes the number of operands. The code in TwoAddressInstructionPass tries to look for other commutable operands after making a commute in order to handle FMA3 and VPTERNLOG instructions. But the loop doesn't handle the number of operands changing.

Apparently this wasn't a problem before because FP types start with MOVSS/MOVSD due to a shuffle combine. So we leave isel with movss/movsd and two address instruction pass can commute that to blend, but won't go the other way. This increases the number of operands which isn't an issue.

To fix this I propose to use PBLENDW for the 128-bit integer case in the new isel patterns added by this patch. This keeps the blend in the integer domain and avoids the possibility of commuting to movss/movsd. I'll also fix two address instruction pass to resample the operand count after commuting to avoid this issue if there are any corner cases we haven't found yet.