This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Supported ds_read_b128 generation; Widened vector length for local address-space
ClosedPublic

Authored by FarhanaAleen on Mar 7 2018, 8:06 AM.

Details

Summary

Starting from GCN 2nd generation, ISA supports ds_read_b128 on top of ds_read_b64. This patch supports ds_read_b128 instruction pattern and generation of this instruction.

In the vectorizer, this patch also widen the vector length so that vectorizer generates 128 bit loads for local address-space which gets translated to ds_read_b128.

Diff Detail

Repository
rL LLVM

Event Timeline

FarhanaAleen created this revision.Mar 7 2018, 8:06 AM

Have you tested this on real hardware? I remember reading that there is a hardware bug on gfx7 with this instruction. The bug may apply only to early gfx7 chips.

test/CodeGen/AMDGPU/reorder-stores.ll
2 ↗(On Diff #137397)

What does SEA mean? We usually use CI for Sea Islands.

I've implemented this before: https://github.com/arsenm/llvm/tree/ds-128

This looks mostly the same. It's not clear to me it's always better to use this. I don't think this executes any faster, and at least for ds_write_b128, this has an additional constraint that the inputs must now be in a contiguous 128-bit register instead of 2 independent 64-bit pairs, which increases register pressure and may require copies. It might be better to defer forming this until later, like in the LoadStoreOptimizer pass. Jeff had a benchmark he wanted to try with this.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
246–248 ↗(On Diff #137397)

It might be OK to say 128 anyway. You could still do adjacent ds_read2_b64 even when not using ds_read_b128. I don't think we try to do the same trick we do with 4-byte aligned 8 byte reads for the 64-bit equivalent, but you might want to look into that.

Anything you change here would also equally apply to REGION_ADDRESS

lib/Target/AMDGPU/SIISelLowering.cpp
5427 ↗(On Diff #137397)

This should be hidden inside a Subtarget->hasDS128() check

5428 ↗(On Diff #137397)

You don't need the isAligned16 helper. You just need to check that the alignment is >= 16, not % 16

test/CodeGen/AMDGPU/ds_read2_superreg.ll
100–135 ↗(On Diff #137397)

These tests have the unfortunate side effect of breaking what this test intended, which was the pass forming the read2. Maybe change all of these to reduce the alignment so you still get read2?

Given performance benefit is somewhat unclear can you put it under an option?

lib/Target/AMDGPU/SIISelLowering.cpp
5428 ↗(On Diff #137397)

Second to that.

Enabled ds_read_b128 under a switch and incorporated additional comments.

rampitec added inline comments.Mar 8 2018, 4:21 PM
lib/Target/AMDGPU/SIISelLowering.cpp
5434 ↗(On Diff #137670)

You only have pattern for v4i32, but enable operation for all 128 bit. Will it work with v8i16 for example?

FarhanaAleen added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
5434 ↗(On Diff #137670)

Yes, it works for i16/i8.

During dag combine, AMDGPU loadCombiner combines vector types of 8/16/64 to vector types of 32 bit type.

rampitec accepted this revision.Mar 9 2018, 9:28 AM

LGTM. Thanks.

This revision is now accepted and ready to land.Mar 9 2018, 9:28 AM
This revision was automatically updated to reflect the committed changes.