This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Use tail agnostic policy for vsetvli instruction emitted in the custom inserter
ClosedPublic

Authored by craig.topper on Dec 10 2020, 5:23 PM.

Details

Summary

The compiler is making no effort to preserve upper elements. To do so would require another source operand tied with the destination and a different intrinsic interface to give control of this source to the programmer.

This patch changes the tail policy to agnostic so that the CPU doesn't need to make an effort to preserve them.

Diff Detail

Event Timeline

craig.topper created this revision.Dec 10 2020, 5:23 PM
craig.topper requested review of this revision.Dec 10 2020, 5:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 10 2020, 5:23 PM
Herald added a subscriber: MaskRay. · View Herald Transcript
This revision is now accepted and ready to land.Dec 10 2020, 5:27 PM
khchen added a subscriber: khchen.Dec 10 2020, 5:57 PM

Hi @craig.topper
I think maybe default tail undisturbed would be more friendly and intuitive for programmer or vectorizer in reduction case.
please see below example:

//scalar
float sum=0;
for(int i=0;i<n;++i) {
  sum += src1[i]*src2[i];
}
return sum;
float foo(float *src1, float *src2, size_t n) {
  size_t len;
  vsetvlmax_e32m8();
  vfloat32m8_t v16 = vfmv_v_f_f32m8(0.0);
  vsetvl_e32m1();
  vfloat32m1_t v24 = vfmv_s_f_f32m1(vundefined_f32m1(), 0.0);
  for (; (len = vl_extract(vsetvl_e32m8(n))) > 0; n -= len) {
    vfloat32m8_t v0 = vle32_v_f32m8(src1);
    vfloat32m8_t v8 = vle32_v_f32m8(src2);
#if 0
    if maxvl = 2, n = 3;
    src1 = [1, 2, 3]
    src2 = [2, 3, 4]
    1st iteration, vl=2, input v16 = [0, 0], result v16 = [2, 6]
    2nd iteration, vl=1, input v16 = [2, 6], result v16 = [14, 6] // tail is still 6 because tail undisturbed. 
#endif
    v16 = vfmacc_vv_f32m1(v16, v0, v8); 
    src1 += len;
    src2 += len;
  }
  vsetvlmax_e32m8();
  // input v16 = [14, 6], result = [20, ?]
  vfloat32m1_t result = vfredosum_vs_f32m8_f32m1(v16, v24);
  return vfmv_f_s_f32m1_f32(result);
}

Community also discussed the difference in issue before.

Hi @craig.topper
I think maybe default tail undisturbed would be more friendly and intuitive for programmer or vectorizer in reduction case.
please see below example:

//scalar
float sum=0;
for(int i=0;i<n;++i) {
  sum += src1[i]*src2[i];
}
return sum;
float foo(float *src1, float *src2, size_t n) {
  size_t len;
  vsetvlmax_e32m8();
  vfloat32m8_t v16 = vfmv_v_f_f32m8(0.0);
  vsetvl_e32m1();
  vfloat32m1_t v24 = vfmv_s_f_f32m1(vundefined_f32m1(), 0.0);
  for (; (len = vl_extract(vsetvl_e32m8(n))) > 0; n -= len) {
    vfloat32m8_t v0 = vle32_v_f32m8(src1);
    vfloat32m8_t v8 = vle32_v_f32m8(src2);
#if 0
    if maxvl = 2, n = 3;
    src1 = [1, 2, 3]
    src2 = [2, 3, 4]
    1st iteration, vl=2, input v16 = [0, 0], result v16 = [2, 6]
    2nd iteration, vl=1, input v16 = [2, 6], result v16 = [14, 6] // tail is still 6 because tail undisturbed. 
#endif
    v16 = vfmacc_vv_f32m1(v16, v0, v8); 
    src1 += len;
    src2 += len;
  }
  vsetvlmax_e32m8();
  // input v16 = [14, 6], result = [20, ?]
  vfloat32m1_t result = vfredosum_vs_f32m8_f32m1(v16, v24);
  return vfmv_f_s_f32m1_f32(result);
}

Community also discussed the difference in issue before.

Maybe we should use tail undisturbed for instructions that have something like "let Constraints = "$rd = $rs3"?

Maybe we should use tail undisturbed for instructions that have something like "let Constraints = "$rd = $rs3"?

Yes. It remind me that we had discussed realted issue here before.

If intrinsic start to model tail behavior, when user giving a non vundefined() value in maskedoff argument, the tail behavior should be tail undisturbed.