Page MenuHomePhabricator

rampitec (Stanislav Mekhanoshin)
User

Projects

User does not belong to any projects.

User Details

User Since
Apr 4 2014, 4:14 AM (245 w, 4 d)

Recent Activity

Thu, Dec 13

rampitec accepted D55539: [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .

LGTM

Thu, Dec 13, 11:13 PM
rampitec added inline comments to D55539: [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .
Thu, Dec 13, 10:47 PM
rampitec added inline comments to D55539: [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .
Thu, Dec 13, 3:53 PM
rampitec accepted D55652: [RegAllocGreedy] IMPLICIT_DEF values shouldn't prefer registers.

LGTM

Thu, Dec 13, 1:26 PM
rampitec added a comment to D55652: [RegAllocGreedy] IMPLICIT_DEF values shouldn't prefer registers.

It looks good to me, but @arsenm could you also take a look?

Thu, Dec 13, 1:00 PM
rampitec added a reviewer for D55652: [RegAllocGreedy] IMPLICIT_DEF values shouldn't prefer registers: arsenm.
Thu, Dec 13, 12:59 PM
rampitec accepted D55634: AMDGPU/GlobalISel: Legalize/regbankselect fneg/fabs/fsub.

LGTM

Thu, Dec 13, 12:22 PM
rampitec accepted D55645: AMDGPU/GlobalISel: Handle legality/regbanks for 32/64-bit shifts.

LGTM

Thu, Dec 13, 10:56 AM

Wed, Dec 12

rampitec committed rL349006: [AMDGPU] Fix build failure, second attempt.
[AMDGPU] Fix build failure, second attempt
Wed, Dec 12, 9:55 PM
rampitec committed rL349005: [AMDGPU] Fix build failure.
[AMDGPU] Fix build failure
Wed, Dec 12, 9:24 PM
rampitec committed rL349003: [AMDGPU] Simplify negated condition.
[AMDGPU] Simplify negated condition
Wed, Dec 12, 7:22 PM
rampitec closed D55402: [AMDGPU] Simplify negated condition.
Wed, Dec 12, 7:22 PM
rampitec accepted D55637: AMDGPU: Legalize/regbankselect frame_index.

LGTM

Wed, Dec 12, 6:51 PM
rampitec accepted D55636: AMDGPU/GlobalISel: Legalize/regbankselect block_addr.

LGTM

Wed, Dec 12, 6:51 PM
rampitec accepted D55635: AMDGPU: Legalize/regbankselect fma.

LGTM

Wed, Dec 12, 6:50 PM
rampitec accepted D55632: AMDGPU/GlobalISel: RegBankSelect some simple operations.

LGTM

Wed, Dec 12, 6:50 PM
rampitec accepted D55633: AMDGPU/GlobalISel: Legalize f64 fadd/fmul.

LGTM

Wed, Dec 12, 6:49 PM
rampitec added inline comments to D55634: AMDGPU/GlobalISel: Legalize/regbankselect fneg/fabs/fsub.
Wed, Dec 12, 6:49 PM
rampitec added inline comments to D55539: [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .
Wed, Dec 12, 4:30 PM

Tue, Dec 11

rampitec added inline comments to D55570: [AMDGPU] Improve SDWA generation for V_OR_B32_E32..
Tue, Dec 11, 9:38 PM
rampitec added a comment to D55539: [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .

Can we have a mir test with more than two loads? I want to see a situation where 3 loads are foldable with the same offset, but lowest address is in the middle. I.e.:

Tue, Dec 11, 2:11 PM
rampitec added inline comments to D55402: [AMDGPU] Simplify negated condition.
Tue, Dec 11, 1:19 PM
rampitec added inline comments to D55570: [AMDGPU] Improve SDWA generation for V_OR_B32_E32..
Tue, Dec 11, 1:10 PM
rampitec added a comment to D55539: [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediate during the instruction selection if they are 13bit which is the allowed size for globals. This patch is trying to find a 13bit offset from base-address of the nearby instructions by recomputing the relative offset from nearby base address. So, it's more of a global optimization where it needs to traverse the whole program (currently it's limited to the basic block). The optimization also caches all the information to save the compile time. Are you suggesting to do this optimization during instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

Tue, Dec 11, 12:20 PM

Sun, Dec 9

rampitec accepted D55435: [AMDGPU] Fix discarded result of addAttribute.

LGTM

Sun, Dec 9, 9:44 AM

Fri, Dec 7

rampitec accepted D55453: AMDGPU: Fix offsets for < 4-byte aggregate kernel arguments.

LGTM

Fri, Dec 7, 2:02 PM
rampitec updated the diff for D55402: [AMDGPU] Simplify negated condition.

Rebased.

Fri, Dec 7, 1:43 PM
rampitec updated the diff for D55402: [AMDGPU] Simplify negated condition.

Switched to MDT use.
Added tests with cross block defines.

Fri, Dec 7, 1:28 PM
rampitec added inline comments to D55402: [AMDGPU] Simplify negated condition.
Fri, Dec 7, 1:28 PM
rampitec updated the diff for D55402: [AMDGPU] Simplify negated condition.

Reduced test.

Fri, Dec 7, 12:33 PM
rampitec added inline comments to D55402: [AMDGPU] Simplify negated condition.
Fri, Dec 7, 9:18 AM

Thu, Dec 6

rampitec added inline comments to D55402: [AMDGPU] Simplify negated condition.
Thu, Dec 6, 10:04 PM
rampitec updated the diff for D55402: [AMDGPU] Simplify negated condition.
Thu, Dec 6, 10:04 PM
rampitec created D55402: [AMDGPU] Simplify negated condition.
Thu, Dec 6, 5:31 PM
rampitec added inline comments to D55301: RegAlloc: Allow targets to split register allocation.
Thu, Dec 6, 5:27 PM

Tue, Dec 4

rampitec accepted D55285: AMDGPU: Scavenge register instead of findUnusedReg.

LGTM

Tue, Dec 4, 10:08 AM

Sun, Dec 2

rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Sun, Dec 2, 6:18 PM
rampitec accepted D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

LGTM

Sun, Dec 2, 6:01 PM
rampitec accepted D55178: AMDGPU: Add f32 vectors to SGPR register classes.

LGTM

Sun, Dec 2, 2:57 PM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Sun, Dec 2, 12:02 PM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Sun, Dec 2, 9:34 AM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Sun, Dec 2, 12:20 AM

Fri, Nov 30

rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Fri, Nov 30, 5:44 PM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Fri, Nov 30, 2:41 PM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Fri, Nov 30, 8:46 AM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Fri, Nov 30, 8:43 AM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Fri, Nov 30, 8:14 AM
rampitec accepted D54340: AMDGPU: Fix various issues around the VirtReg2Value mapping.

LGTM

Fri, Nov 30, 7:29 AM

Thu, Nov 29

rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Thu, Nov 29, 11:47 AM
rampitec added inline comments to D54340: AMDGPU: Fix various issues around the VirtReg2Value mapping.
Thu, Nov 29, 7:16 AM
rampitec accepted D51994: TableGen/ISel: Allow PatFrag predicate code to access captured operands.

As far as I understand if PredicateCodeUsesOperands is not used nothing changes.
LGTM.

Thu, Nov 29, 7:06 AM
rampitec abandoned D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.

The error is that condition is actually has to be negated. However, if I just negate it it goes into an endless loop inside the combiner. I need to fix our BRCOND lowering instead.

Thu, Nov 29, 6:32 AM

Wed, Nov 28

rampitec planned changes to D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.

It does not validate in some tests. I will investigate it as I do not see right away what is wrong.

Wed, Nov 28, 7:48 PM
rampitec added inline comments to D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.
Wed, Nov 28, 5:08 PM
rampitec updated the diff for D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.

Added sext handling.

Wed, Nov 28, 3:31 PM
rampitec added a comment to D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.

Should this be a generic combine?

Wed, Nov 28, 3:04 PM
rampitec created D55030: [AMDGPU] Fold brcond (setcc zext(i1 x), 1, ne) -> brcond x.
Wed, Nov 28, 2:34 PM

Tue, Nov 27

rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Tue, Nov 27, 2:50 PM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }
 
  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);
 
    continue;
  }
}

OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

i think the following code does make sure that there are no intervening uses, i can also strengthen it to make sure the defining instruction of the CarryIn is the first ADD instruction.
+ if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
+ return false;

Tue, Nov 27, 2:42 PM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }
 
  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);
 
    continue;
  }
}
Tue, Nov 27, 9:38 AM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Tue, Nov 27, 7:38 AM
rampitec committed rL347659: [AMDGPU] Disable DAG combine at -O0.
[AMDGPU] Disable DAG combine at -O0
Tue, Nov 27, 7:16 AM
rampitec closed D54358: [AMDGPU] Disable DAG combine at -O0.
Tue, Nov 27, 7:16 AM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

i tried an experiment of simply invoking the Shrink pass a 2nd time.

addPass(createSIShrinkInstructionsPass());
addPass(createSIShrinkInstructionsPass());

which resulted in 74 failures, and they do seem to be commute changes primarily (did not look at them all)
So then, i added a 3rd invocation and zero failures (i'm still laughing at this one).

Tue, Nov 27, 6:52 AM
rampitec added a comment to D54358: [AMDGPU] Disable DAG combine at -O0.

Ping.

Tue, Nov 27, 6:33 AM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

Tue, Nov 27, 6:27 AM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Tue, Nov 27, 6:20 AM

Mon, Nov 26

rampitec accepted D54761: [SelectionDAG] Improve SimplifyDemandedBits to SimplifyDemandedVectorElts simplification.

LGTM (for AMDGPU tests and core).

Mon, Nov 26, 12:00 PM
rampitec accepted D54910: AMDGPU: Record SGPR spills when restoring too.

LGTM

Mon, Nov 26, 11:49 AM
rampitec added a reviewer for D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos: vpykhtin.
Mon, Nov 26, 9:39 AM
rampitec added a comment to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.

Essentially this is a limited version of shrinking. So I have several questions:

Mon, Nov 26, 9:28 AM
rampitec added inline comments to D54882: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos.
Mon, Nov 26, 8:49 AM

Tue, Nov 20

rampitec accepted D54775: AMDGPU: Don't optimize exec masks at -O0.

LGTM

Tue, Nov 20, 5:07 PM

Mon, Nov 19

rampitec committed rL347274: Implement computeKnownBits for scalar_to_vector.
Implement computeKnownBits for scalar_to_vector
Mon, Nov 19, 3:37 PM
rampitec closed D54728: Implement computeKnownBits for scalar_to_vector.
Mon, Nov 19, 3:37 PM
rampitec created D54728: Implement computeKnownBits for scalar_to_vector.
Mon, Nov 19, 2:44 PM
rampitec committed rL347259: [AMDGPU] Restored selection of scalar_to_vector (v2x16).
[AMDGPU] Restored selection of scalar_to_vector (v2x16)
Mon, Nov 19, 12:01 PM
rampitec closed D54718: Restored selection of scalar_to_vector (v2x16).
Mon, Nov 19, 12:01 PM
rampitec added inline comments to D54358: [AMDGPU] Disable DAG combine at -O0.
Mon, Nov 19, 11:58 AM
rampitec created D54718: Restored selection of scalar_to_vector (v2x16).
Mon, Nov 19, 11:58 AM
rampitec committed rL347231: [AMDGPU] Convert insert_vector_elt into set of selects.
[AMDGPU] Convert insert_vector_elt into set of selects
Mon, Nov 19, 9:43 AM
rampitec closed D54606: [AMDGPU] Convert insert_vector_elt into set of selects.
Mon, Nov 19, 9:43 AM

Nov 16 2018

rampitec committed rL347115: Moved dag-combine-select-undef.ll into amdgpu. NFC..
Moved dag-combine-select-undef.ll into amdgpu. NFC.
Nov 16 2018, 4:20 PM
rampitec updated the diff for D54606: [AMDGPU] Convert insert_vector_elt into set of selects.

Rebased.

Nov 16 2018, 3:58 PM
rampitec committed rL347112: Fixed test after r347110.
Fixed test after r347110
Nov 16 2018, 3:42 PM
rampitec committed rL347110: DAG combiner: fold (select, C, X, undef) -> X.
DAG combiner: fold (select, C, X, undef) -> X
Nov 16 2018, 3:17 PM
rampitec closed D54646: DAG combiner: fold (select, C, X, undef) -> X.
Nov 16 2018, 3:17 PM
rampitec updated the diff for D54646: DAG combiner: fold (select, C, X, undef) -> X.

Added negative checks to select-undef.ll and rebased.

Nov 16 2018, 2:38 PM
rampitec added a parent revision for D54606: [AMDGPU] Convert insert_vector_elt into set of selects: D54646: DAG combiner: fold (select, C, X, undef) -> X.
Nov 16 2018, 2:23 PM
rampitec added a child revision for D54646: DAG combiner: fold (select, C, X, undef) -> X: D54606: [AMDGPU] Convert insert_vector_elt into set of selects.
Nov 16 2018, 2:23 PM
rampitec updated the diff for D54606: [AMDGPU] Convert insert_vector_elt into set of selects.

Converted tests to non-undef vectors, added a test with an undef vector.
This is now based on D54646 and insert_vector_elt undef, ... results just in a splat.

Nov 16 2018, 2:23 PM
rampitec updated the diff for D54646: DAG combiner: fold (select, C, X, undef) -> X.

Added test from D48376.

Nov 16 2018, 2:03 PM
rampitec added a comment to D54646: DAG combiner: fold (select, C, X, undef) -> X.

This pretty much reproduces D48376, but you should copy the test I added

The test you have added there does not work anymore.
What's the new syntax for @llvm.amdgcn.image.sample.f32.v2f32.v8i32? Is it relevant to use image sample in this test at all?

I think it's because it avoids the fact that regular loads are converted to integer loads, and it was to ensure it was an FP type. You should be able to just copy any other FP returning intrinsic use from another test

Matt, I do not understand your test. I believe it does not test what's intended. For some reason your checks are:

; GCN: s_waitcnt
; GCN-NEXT: s_setpc_b64

I.e. you seem to check there is no v_cndmask_b32, though it is never checked. But it looks like you believe that an intrinsic call with undef operands will be combined into undef. This is not the case most of the times and not even a case for image sample anymore. This is almost the test for image_sample(undef) folding, not select(undef).

I understand the point of the test, just do not believe it is correct. I see two possible ways: 1) commit this as is and extend tests from D54606 to check both undef and non-undef cases 2) Commit D54606 first and then add select-undef.ll tests to this review which would use insertelement.

Nov 16 2018, 2:01 PM
rampitec added a comment to D54646: DAG combiner: fold (select, C, X, undef) -> X.

This pretty much reproduces D48376, but you should copy the test I added

The test you have added there does not work anymore.
What's the new syntax for @llvm.amdgcn.image.sample.f32.v2f32.v8i32? Is it relevant to use image sample in this test at all?

I think it's because it avoids the fact that regular loads are converted to integer loads, and it was to ensure it was an FP type. You should be able to just copy any other FP returning intrinsic use from another test

Nov 16 2018, 1:50 PM
rampitec added a comment to D54646: DAG combiner: fold (select, C, X, undef) -> X.

This pretty much reproduces D48376, but you should copy the test I added

Nov 16 2018, 1:27 PM
rampitec created D54646: DAG combiner: fold (select, C, X, undef) -> X.
Nov 16 2018, 12:54 PM
rampitec added a comment to D54606: [AMDGPU] Convert insert_vector_elt into set of selects.

However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.

Yes, I noticed that too. That needs to be a separate optimization. As far as I understand "insert_vector_element undef, %var, %idx" should not even come to this point. It needs to be replaced by build_vector (n x %var) regardless of the thresholds and heuristics I am using, e.g. earlier (higher in the same function I think).

I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_element undef, %var, %idx implies only 1 store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.

Hm... Ok, I will add it here.

Nov 16 2018, 10:52 AM
rampitec added a comment to D54606: [AMDGPU] Convert insert_vector_elt into set of selects.

Another change I've wanted to look at is changing what AMDGPUPromoteAlloca tries to produce. The dynamic vector indexing is going to be worse if the waterfall loop is going to be required, but it's currently preferred if both are possible.

Nov 16 2018, 10:43 AM
rampitec added a comment to D54606: [AMDGPU] Convert insert_vector_elt into set of selects.

However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.

Yes, I noticed that too. That needs to be a separate optimization. As far as I understand "insert_vector_element undef, %var, %idx" should not even come to this point. It needs to be replaced by build_vector (n x %var) regardless of the thresholds and heuristics I am using, e.g. earlier (higher in the same function I think).

I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_element undef, %var, %idx implies only 1 store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.

Nov 16 2018, 10:16 AM
rampitec added a comment to D54606: [AMDGPU] Convert insert_vector_elt into set of selects.

Mostly looks good to me.

However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.

Nov 16 2018, 9:37 AM