This is an archive of the discontinued LLVM Phabricator instance.

[X86] Custom legalize v16i64->v16i8 truncate with avx512.
ClosedPublic

Authored by craig.topper on May 1 2020, 1:03 AM.

Details

Summary

Default legalization will create two v8i64 truncs to v8i32, concat
them to v16i32, and then truncate the rest of the way to v16i8.

Instead we can truncate directly from v8i64 to v8i8 in the lower
half of an xmm. Then concat the two halves to use vpunpcklqdq.
This is the same number of uops, but the dependency chain through
the uops is better since the halves are merged at the end.

I had to had SimplifyDemandedBits support for VTRUNC to prevent
a regression on vector-trunc-math.ll. combineTruncatedArithmetic
no longer gets a chance to shrink vXi64 mul so we were producing
the v8i64 multiply sequence using multiple PMULUDQs. With the
demanded bits fix we are able to prune out the extra ops leaving
just two PMULUDQs, one for each v8i64 half. This is twice the
width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1
uop and PMULLD is 2. We also save some truncates. It's probably
worth using PMULUDQ even when PMULLQ is available since the latter
is 3 uops, but that will require a different change.

Diff Detail

Event Timeline

craig.topper created this revision.May 1 2020, 1:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 1 2020, 1:03 AM
Herald added a subscriber: hiraditya. · View Herald Transcript

Does anything need to be done for truncstore cases?

llvm/lib/Target/X86/X86ISelLowering.cpp
36969

Handle DemandedElts as well?

Add DemandedElts support

Does anything need to be done for truncstore cases?

It looks like previously we would form a truncstore using vmovdb after type legalization. After this patch we no longer do that.

Does anything need to be done for truncstore cases?

It looks like previously we would form a truncstore using vmovdb after type legalization. After this patch we no longer do that.

Is it still a perf gain if we lose the truncstore? I can't see any truncstore test changes in the patch, but we might just be missing test coverage.

Does anything need to be done for truncstore cases?

It looks like previously we would form a truncstore using vmovdb after type legalization. After this patch we no longer do that.

Is it still a perf gain if we lose the truncstore? I can't see any truncstore test changes in the patch, but we might just be missing test coverage.

I don’t think the store unit does the truncate. I think its still does the truncate in the shuffle unit and then does a separate store. I’ll double check uops.info.

Does anything need to be done for truncstore cases?

It looks like previously we would form a truncstore using vmovdb after type legalization. After this patch we no longer do that.

Is it still a perf gain if we lose the truncstore? I can't see any truncstore test changes in the patch, but we might just be missing test coverage.

I don’t think the store unit does the truncate. I think its still does the truncate in the shuffle unit and then does a separate store. I’ll double check uops.info.

Confirmed with uop.info that vpmovdb is just 2 shuffles and a store address and store data uop. The store address and store data appear to microfused which is pretty normal for stores.

RKSimon accepted this revision.May 3 2020, 11:47 AM

LGTM - thanks for checking.

This revision is now accepted and ready to land.May 3 2020, 11:47 AM
This revision was automatically updated to reflect the committed changes.