This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Make scalable vector FMA commutable for register allocation.
ClosedPublic

Authored by craig.topper on Feb 1 2021, 10:27 AM.

Details

Summary

This adds support for commuting operands and converting between
vfmadd and vfmacc to avoid register copies.

To avoid messing up intrinsic behavior, I've added new pseudo
instructions that have the isCommutable flag set. These pseudos also
force a tail agnostic policy. The intrinsic version still use
the tail undisturbed policy.

For best results it looks like we need to start with fmadd and only
pick fmacc if its beneficial. MachineCSE commutes without contraining
the operands and then commutes back if it didn't help with CSE. So
I've made sure that when the operand choice isn't constrained, we
will keep fmadd for MachineCSE and when it does the second commute,
we get back the original instruction.

Diff Detail

Event Timeline

craig.topper created this revision.Feb 1 2021, 10:27 AM
craig.topper requested review of this revision.Feb 1 2021, 10:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 1 2021, 10:27 AM
Herald added a subscriber: MaskRay. · View Herald Transcript

There seem to be a lot of linter errors which might make working on this code difficult. At the least it makes the diff harder to read. I know the macro cases confuse clang-tidy but is there something we can do?

llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
945

minued -> minuend.

llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
1784

Is this a FIXME, as in, are we likely to change the policy on the intrinsics?

Also I'm still a bit unclear on what the tail policy has to do with commuting operands. What am I missing?

craig.topper added inline comments.Feb 2 2021, 10:03 AM
llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
1784

We're currently using tail undisturbed policy on any instruction with a sourced tied to a destination.

I believe the example I was shown where someone expected this to work was something like this.

float foo(float *src1, float *src2, size_t n) {       
  size_t len;                    
  len = vsetvlmax_e32m8();
  vfloat32m8_t v16 = vfmv_v_f_f32m8(0.0, len);
  len = vsetvl_e32m1();
  vfloat32m1_t v24 = vfmv_s_f_f32m1(vundefined_f32m1(), 0.0, len);
  for (; (len = vl_extract(vsetvl_e32m8(n))) > 0; n -= len) {
    vfloat32m8_t v0 = vle32_v_f32m8(src1, len);
    vfloat32m8_t v0 = vle32_v_f32m8(src2, len);
    v16 = vfmacc_vv_f32m1(v16, v0, v8, len);
    src1 += len;
    src2 += len;                                                                                                                                                                         
  }
  len = vsetvlmax_e32m8();                                                                                                                                                                                                                                                                                                       
  vfloat32m1_t result = vfredosum_vs_f32m8_f32m1(v16, v24, len);                                                                                                                                                                                                                                                                
  return vfmv_f_s_f32m1_f32(result);                                                                                                                                                                                                                                                                                       
}

On the last loop iteration, len might be less than vlmax and the code depends on the tail elements of v16 being preserved from the previous iterations. After the loop a reduction is done using vlmax that will access those elements.

If we commute fmacc to fmadd, then the register used for the v16 input will not be tied to the output register used for v16 for the fmacc. This would prevent the tail elements from being preserved.

I'm not sure we should be allowing this code to work, but tail agnostic is a valid implementation of tail undisturbed. So even if we picked tail agnostic this code might work on in order CPUs and then break in the future on an out of order CPU.

jrtc27 added a comment.EditedFeb 2 2021, 10:24 AM

If you wanted to appease the linter you could do something gross like:

#define CASE_VFMA_OPCODE_COMMON(OP, TYPE, LMUL) \
  RISCV::PseudoV##OP##_##TYPE##_##LMUL##_COMMUTABLE

#define CASE_VFMA_OPCODE_LMULS(OP, TYPE) \
  CASE_VFMA_OPCODE_COMMON(OP, TYPE, MF8): \
  case CASE_VFMA_OPCODE_COMMON(OP, TYPE, MF4): \
  case CASE_VFMA_OPCODE_COMMON(OP, TYPE, MF2): \
  case CASE_VFMA_OPCODE_COMMON(OP, TYPE, M1): \
  case CASE_VFMA_OPCODE_COMMON(OP, TYPE, M2): \
  case CASE_VFMA_OPCODE_COMMON(OP, TYPE, M4): \
  case CASE_VFMA_OPCODE_COMMON(OP, TYPE, M8)

#define CASE_VFMA_SPLATS(OP) \
  CASE_VFMA_OPCODE_LMULS(OP, VF16): \
  case CASE_VFMA_OPCODE_LMULS(OP, VF32): \
  case CASE_VFMA_OPCODE_LMULS(OP, VF64)

(i.e. the leading case and trailing : are omitted) that you then use as, say:

case CASE_VFMA_OPCODE_LMULS(FMADD, VV):
case CASE_VFMA_OPCODE_LMULS(FMSUB, VV):
case CASE_VFMA_OPCODE_LMULS(FNMADD, VV):
case CASE_VFMA_OPCODE_LMULS(FNMSUB, VV): {

which is hopefully enough to get it to recognise the macros as case statements. Maybe you can even keep the case keyword down in CASE_VFMA_OPCODE_COMMON and it's just the colon that's needed in order to make it look label-ish; if so that's rather nicer (and mirrors how statement macros tend to leave off the final semicolon).

Attempt to appease clang-format

jrtc27 added inline comments.Feb 2 2021, 11:37 AM
llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
899

Ouch; maybe tactful use of // clang-format [on|off] around these macro definitions would be better?

jrtc27 added inline comments.Feb 2 2021, 12:18 PM
llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
1105

I'd leave a blank line after this given there's one between the macro definitions and the function

Disable clang-format around macros

rogfer01 added inline comments.Feb 2 2021, 10:55 PM
llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
1784

I'm not sure we should be allowing this code to work, but tail agnostic is a valid implementation of tail undisturbed.

I think you mean tail undisturbed is a valid implementation of tail agnostic?

I don't think this code should work either, but 4 ops instructions like vfmacc implicitly tie the output to one input.

Whether we should be exposing this detail to intrinsics is not clear to me (I wouldn't) and this code should be using some alternative intrinsic (like the ones sketched https://github.com/riscv/rvv-intrinsic-doc/issues/27#issuecomment-649433549 here).

Most vector codes can work well with just tail agnostic mode. So far only accumulations (or reductions) like the one you show seem to be the ones benefiting from tail undisturbed (as the accumulation happens on the full register due to the varying vl during the execution of the loop). Reading fma as "float multiply accumulate" doesn't help to make the distinction clearer.

craig.topper added inline comments.Feb 2 2021, 11:25 PM
llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
1784

I think you mean tail undisturbed is a valid implementation of tail agnostic?

Yes. Thanks

frasercrmck added inline comments.Feb 3 2021, 4:57 AM
llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
1784

Great example, thanks. I think the NOTE is clearer than the FIXME, too.

llvm/lib/Target/RISCV/RISCVInstrInfoVSDPatterns.td
556

You could perhaps put fvti.ScalarSuffix # "_" # fvti.LMul.MX # "_COMMUTABLE" into a defvar to reduce the line length.

Use defvar to shorten line length

Fix typo minued->minuend

This revision is now accepted and ready to land.Feb 8 2021, 1:40 AM
This revision was landed with ongoing or failed builds.Feb 8 2021, 10:06 AM
This revision was automatically updated to reflect the committed changes.