AMDGPU: Add replacement export intrinsics
ClosedPublic

Authored by arsenm on Dec 12 2016, 11:39 AM.

Details

Summary

Replace llvm.SI.export with llvm.amdgcn.exp and llvm.amdgcn.exp.compr. The compr bit is a separate intrinsic because it changes the number and type of the operands.

This is a WIP patch because I'm not sure whether the en mask should be an explicit operand, or whether it should be computed from which inputs are undef. Right now dead code is emitted for disabled inputs.
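To make the second option concrete, here is a minimal host-side sketch of deriving the en mask from which inputs are undef. It is only an illustration of the idea under discussion, not code from this patch: `ExportSrc` and `computeEnMask` are hypothetical names, and the assumption that bit i of the mask enables source i is made for the sake of the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one export source: either a defined value or undef.
// This is not an LLVM type; it only stands in for "operand is undef".
struct ExportSrc {
  bool isUndef;
  float value;
};

// Derive the enable mask from the operands instead of taking it explicitly.
// Assumption of this sketch: bit i of the mask enables source i.
inline std::uint32_t computeEnMask(const std::array<ExportSrc, 4> &srcs) {
  std::uint32_t en = 0;
  for (std::size_t i = 0; i < srcs.size(); ++i) {
    if (!srcs[i].isUndef)
      en |= 1u << i; // only defined inputs contribute a bit
  }
  return en;
}
```

With an explicit en operand, the mask and the set of undef inputs could disagree; deriving it as above removes that redundancy, at the cost of making undef semantically significant.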

Event Timeline

arsenm updated this revision to Diff 81117.Dec 12 2016, 11:39 AM
arsenm retitled this revision from to AMDGPU: Add replacement export intrinsics.
arsenm updated this object.
arsenm added a subscriber: llvm-commits.
mareko added a subscriber: mareko.Dec 13 2016, 6:55 AM
mareko added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
458

// src3

470

v2f16 isn't the best choice here. The compressed export can be used with one of these types:

  • v2f16
  • v2i16
  • v2u16

Using i32 would be better, because the last 2 are packed as i32 anyway. It really depends on the output type of packing instructions. This is the complete list of instructions we should be using for compressed exports:

  • v_cvt_pkrtz_f16_f32
  • v_cvt_pknorm_u16_f32
  • v_cvt_pknorm_i16_f32
  • v_cvt_pk_u16_u32
  • v_cvt_pk_i16_i32
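As a rough illustration of why the packed integer forms are "i32 anyway", the sketch below models packing two 16-bit lanes into one 32-bit dword on the host. It is written in the spirit of v_cvt_pk_u16_u32, but the saturating behaviour and the choice of putting the first source in the low half are assumptions of this sketch, and `packU16Sat` and `lane` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstdint>

// Pack two unsigned 32-bit values into one dword holding two 16-bit lanes.
// Assumptions: each source saturates to 0xFFFF, and src0 lands in the low
// half. Check the ISA documentation before relying on either.
inline std::uint32_t packU16Sat(std::uint32_t src0, std::uint32_t src1) {
  const std::uint32_t lo = std::min<std::uint32_t>(src0, 0xFFFFu);
  const std::uint32_t hi = std::min<std::uint32_t>(src1, 0xFFFFu);
  return (hi << 16) | lo;
}

// Read one 16-bit lane back out of the dword: the "v2i16 view" of an i32.
// index 0 is the low half, index 1 the high half.
inline std::uint16_t lane(std::uint32_t packed, unsigned index) {
  return static_cast<std::uint16_t>(packed >> (16u * (index & 1u)));
}
```

Under these assumptions, `packU16Sat(1, 2)` yields `0x00020001`, and `lane` recovers the two halves without changing any bits, which is the same relationship as a bitcast between i32 and a two-element 16-bit vector.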
473

While IntrInaccessibleMemOnly makes sense for EXP in theory, in practice we might need something more restrictive. The first executed EXP instruction limits parallelism and therefore reduces the ability to hide latency: it triggers EXP_ALLOC, and if there is not enough EXP memory, the wave has to wait. So we don't want to move the first EXP across any load or store that's above it.

Depending on the chip and other parameters, EXP_ALLOC is sometimes done at wave launch, in which case the EXP scheduling doesn't matter. These are the only cases where EXP_ALLOC is done at wave launch:

  • SI: all vertex shaders (not configurable)
  • CIK-VI: all vertex shaders if the number of good CUs is <= 4 (e.g. Kabini, Mullins, Stoney, some Kaveri chips, Carrizo B4); this is configurable via a context register.
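The two cases above can be restated as a small predicate. This is only a sketch of the rule as described in this comment, not driver or backend code: `Gen`, `WaveInfo`, and `expAllocAtWaveLaunch` are hypothetical names, and `launchAllocEnabled` is an assumed stand-in for whatever the context register setting actually is.

```cpp
// Hypothetical generation tags; CIK and VI are treated identically here.
enum class Gen { SI, CIK, VI };

// Hypothetical description of a wave for the purposes of this sketch.
struct WaveInfo {
  Gen gen;
  bool isVertexShader;
  unsigned goodCUs;        // number of usable compute units on the chip
  bool launchAllocEnabled; // assumed stand-in for the context register
};

// True when EXP_ALLOC happens at wave launch, so the placement of the first
// EXP relative to loads and stores no longer matters.
inline bool expAllocAtWaveLaunch(const WaveInfo &w) {
  if (!w.isVertexShader)
    return false;
  switch (w.gen) {
  case Gen::SI:
    return true; // not configurable
  case Gen::CIK:
  case Gen::VI:
    // Assumption: both the CU limit and the register setting must hold.
    return w.goodCUs <= 4 && w.launchAllocEnabled;
  }
  return false;
}
```

In every other case the first EXP triggers the allocation itself, which is when the scheduling concern about moving it across loads and stores applies.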
arsenm updated this revision to Diff 83775.Jan 9 2017, 9:28 PM
arsenm edited edge metadata.
arsenm marked 3 inline comments as done.

Allow mangling to v2i16 or v2f16

LGTM if I can just bitcast from i32 to v2i16.

arsenm updated this revision to Diff 84011.Jan 11 2017, 1:06 PM
arsenm edited edge metadata.

Also mangle for llvm.amdgcn.exp

arsenm accepted this revision.Jan 16 2017, 11:47 PM

r292205

This revision is now accepted and ready to land.Jan 16 2017, 11:47 PM
arsenm closed this revision.Jan 16 2017, 11:47 PM