This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Try a lot harder to emit scalar loads
ClosedPublic

Authored by arsenm on Jun 6 2018, 6:16 AM.

Details

Summary

This has two main components. First, widen
widen short constant loads in DAG when they have
the correct alignment. This is already done a bit in
AMDGPUCodeGenPrepare, since that has access to
DivergenceAnalysis. This can't help kernarg loads
created in the DAG. Start to use DAG divergence analysis
to help this case.

The second part is to avoid kernel argument lowering
breaking the alignment of short vector elements because
calling convention lowering wants to split everything
into legal register types.

When loading a split type, load the nearest 4-byte aligned
segment and shift to get the desired bits. This extra
load of the earlier argument piece ends up merging,
and the bit extract hopefully folds out.

There are a number of improvements and regressions with
this, but I think as-is this is a better compromise between
several of the worst parts of SelectionDAG.

Particularly when i16 is legal, this produces worse code
for i8 and i16 element vector kernel arguments. This is
partially due to the very weak load merging the DAG does.
It only looks for fairly specific combines between pairs
of loads which no longer appear. In particular this
causes v4i16 loads to be split into 2 components when
previously the two halves were merged.

Worse, because of the newly introduced shifts, there
is a lot more unnecessary vector packing and unpacking code
emitted. At least some of this is due to reporting
false for isTypeDesirableForOp for i16 as a workaround for
the lack of divergence information in the DAG. The cases
where this happens it doesn't actually matter, but the
relevant code in SimplifyDemandedBits doens't have the context
to know to ignore this.

The use of the scalar cache is probably more important
than the mess of mostly scalar instructions doing this packing
and unpacking. Future work can fix this, possibly by making better
use of the new DAG divergence information for controlling promotion
decisions, or adding another version of shift + trunc + shift
combines that doesn't only know about the used types.

Diff Detail

Event Timeline

arsenm created this revision.Jun 6 2018, 6:16 AM
rampitec added inline comments.Jun 6 2018, 11:27 AM
lib/Target/AMDGPU/SIISelLowering.cpp
1104

It can happen that offset diff is zero and then shift is not needed. Say you have kernarg layout: i32 a, i16 b, i16 c. For 'b' align will be 2, but offset will be 4.

5364

Why do not you want to get pointer, pointer info, AA and the rest from original load?

arsenm added inline comments.Jun 6 2018, 11:59 AM
lib/Target/AMDGPU/SIISelLowering.cpp
1104

getNode already avoids creating a shift if the passed offset is 0

5364

Question wording is broken.

Everything remains the same, except the memory and possibly result type. I don't think the pointer info access size needs to increase, although I'm not 100% sure

rampitec added inline comments.Jun 6 2018, 12:05 PM
lib/Target/AMDGPU/SIISelLowering.cpp
1104

Ok, thanks.

5364

If later there will be some more info on an original load it will be lost here and it will need patching. IMO it is better to copy what is possible to copy.

arsenm added inline comments.Jun 6 2018, 2:47 PM
lib/Target/AMDGPU/SIISelLowering.cpp
5364

This is already copying everything?

rampitec added inline comments.Jun 6 2018, 2:58 PM
lib/Target/AMDGPU/SIISelLowering.cpp
5364

I would copy a pointer instead if creating a new undef. Ponter info, tbaa and range are probably cannot be just copied.

arsenm added inline comments.Jun 6 2018, 11:42 PM
lib/Target/AMDGPU/SIISelLowering.cpp
5364

Range needs to be adjusted or dropped, but tbaa and the rest of the pointer info should be fine. This isn't changing anything from an aliasing perspective and we don't actually use the high bits, so tbaa and the rest of the pointer info should be fine

arsenm updated this revision to Diff 150259.Jun 6 2018, 11:48 PM

Drop range metadata

rampitec accepted this revision.Jun 7 2018, 2:19 AM

LGTM. Thanks.

This revision is now accepted and ready to land.Jun 7 2018, 2:19 AM
arsenm closed this revision.Jun 7 2018, 2:59 AM

r334180