ruiling (Ruiling, Song)
User

Projects

User does not belong to any projects.

User Details

User Since
Feb 21 2017, 5:58 PM (344 w, 3 d)

Recent Activity

Sun, Sep 24

ruiling committed rG4c59f809c926: Coroutines: Cleanup typed pointer code in CoroFrame.cpp. NFC (authored by ruiling).
Sun, Sep 24, 8:31 PM · Restricted Project, Restricted Project

Thu, Sep 21

ruiling added a comment to D158602: [AMDGPU] Don't wait for counters in chain function prologues.

I don't quite understand the change. Since the counters are now in an unknown state, shouldn't we wait for them so that we can access the registers holding the arguments (which may have been loaded from memory in the caller)? I think we definitely need these waits at the entry of a cs_chain* function.

Thu, Sep 21, 8:41 AM · Restricted Project, Restricted Project

Aug 21 2023

ruiling accepted D158167: [AMDGPU] Handle isUndef flag in LiveVariables::recomputeForSingleDefVirtReg.

LGTM

Aug 21 2023, 6:53 PM · Restricted Project, Restricted Project

Jul 19 2023

ruiling added a comment to D155343: MachineSink: Fix sinking VGPR def out of a divergent loop.

I think a more accurate description of the real problem is: sinking a VGPR def that has a loop-variant SGPR source out of a divergent loop is wrong for AMDGPU. In other cases it is still legal to move the def out of the loop. The problem is not specific to nested loops; it also applies to a single loop. The test case has been over-simplified and no longer shows the original problem (both the .ll and .mir tests). We can keep the code as it is now, but I still think it is better to update the tests to show the real problem, so that the issue is easy to revisit.

Jul 19 2023, 7:53 PM · Restricted Project, Restricted Project
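
The point in the comment above can be illustrated with a toy model. This is only a sketch in Python, not LLVM or AMDGPU code: the names trip_counts, s and v are made up for the illustration, s stands in for a loop-variant SGPR (one value per wave), and v stands in for a VGPR (one value per lane).

```python
# Toy model of one wave with three lanes running a divergent loop.
# trip_counts, s and v are hypothetical names used only for this illustration.
trip_counts = [1, 3, 2]  # lane i leaves the loop after trip_counts[i] iterations


def run(sink_out_of_loop):
    lanes = range(len(trip_counts))
    v = [None] * len(trip_counts)  # "VGPR": one value per lane
    s = 0                          # "SGPR": one value per wave
    iteration = 0
    active = set(lanes)
    while active:
        iteration += 1
        s = iteration * 10         # loop-variant uniform value
        if not sink_out_of_loop:
            for lane in active:    # def stays in the loop: only active lanes write
                v[lane] = s
        # Divergent exit: lanes that reached their trip count leave the loop.
        active = {lane for lane in active if trip_counts[lane] > iteration}
    if sink_out_of_loop:
        for lane in lanes:         # def sunk below the loop: every lane sees the final s
            v[lane] = s
    return v


in_loop = run(sink_out_of_loop=False)
sunk = run(sink_out_of_loop=True)
print(in_loop)  # [10, 30, 20]
print(sunk)     # [30, 30, 30]
```

With the def inside the loop, each lane keeps the value of s from its own last iteration. After sinking, every lane reads the value of s from the wave's last iteration, so lanes that exited early get a different result.
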

Jul 18 2023

ruiling added inline comments to D155343: MachineSink: Fix sinking VGPR def out of a divergent loop.
Jul 18 2023, 9:00 AM · Restricted Project, Restricted Project
ruiling added inline comments to D155343: MachineSink: Fix sinking VGPR def out of a divergent loop.
Jul 18 2023, 7:58 AM · Restricted Project, Restricted Project

Jun 29 2023

ruiling accepted D153953: Revert "[AMDGPU] Mark mbcnt as convergent".
Jun 29 2023, 9:38 PM · Restricted Project, Restricted Project, Restricted Project

Jun 7 2023

ruiling accepted D152408: AMDGPU: Avoid endpgm in middle of block for fallback trap lowering..

I think we definitely should do this! Just one inline comment.

Jun 7 2023, 6:37 PM · Restricted Project, Restricted Project

Jun 6 2023

ruiling added inline comments to D151995: [AMDGPU] Add llvm.amdgcn.cs.chain intrinsic to IR & verifier.
Jun 6 2023, 12:52 AM · Restricted Project, Restricted Project

Jun 4 2023

ruiling added a comment to D147408: [AMDGPU] Iterative scan implementation for atomic optimizer..

Thanks for the new version, looks good to me.

Jun 4 2023, 6:03 PM · Restricted Project, Restricted Project

May 29 2023

ruiling added inline comments to D147408: [AMDGPU] Iterative scan implementation for atomic optimizer..
May 29 2023, 7:21 PM · Restricted Project, Restricted Project

May 18 2023

ruiling added inline comments to D150861: [AMDGPU][Uniformity] SI_IF and SI_ELSE pseudos are always divergent.
May 18 2023, 6:55 PM · Restricted Project, Restricted Project

May 15 2023

ruiling added inline comments to D150438: [LLVM][Uniformity] Improve detection of uniform registers.
May 15 2023, 7:27 PM · Restricted Project, Restricted Project
ruiling added a comment to D150246: AMDGPU: Fix issue in shl(or) combine.

I think this is OK, but wouldn't it be simpler to transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2) and remove the knownbits check? Or do you think that would make the generated code worse overall?

May 15 2023, 4:58 AM · Restricted Project, Restricted Project
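
The identity mentioned in the comment above, ((a | c1) << c2) == (a << c2) | (c1 << c2), holds without any known-bits condition because a left shift acts on each bit independently. A minimal check in Python, assuming 32-bit wrap-around; the helper names shl32 and distributes are made up for this sketch:

```python
import random

MASK32 = 0xFFFFFFFF


def shl32(x, n):
    """Left shift with 32-bit wrap-around, as on a 32-bit register."""
    return (x << n) & MASK32


def distributes(a, c1, c2):
    """Check ((a | c1) << c2) == (a << c2) | (c1 << c2) for one input triple."""
    return shl32(a | c1, c2) == (shl32(a, c2) | shl32(c1, c2))


random.seed(0)
all_ok = all(
    distributes(random.getrandbits(32), random.getrandbits(32), random.randrange(32))
    for _ in range(10000)
)
print(all_ok)  # True
```

This only shows that the rewrite is value-preserving. Whether it produces better or worse code is the open question raised in the comment.
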

May 12 2023

ruiling committed rG60d9010aaf0f: AMDGPU: Fix issue in shl(or) combine (authored by ruiling).
May 12 2023, 4:51 AM · Restricted Project, Restricted Project
ruiling closed D150246: AMDGPU: Fix issue in shl(or) combine.
May 12 2023, 4:51 AM · Restricted Project, Restricted Project
ruiling added a comment to D150246: AMDGPU: Fix issue in shl(or) combine.

Is there a negative test for the common bits case?

May 12 2023, 2:03 AM · Restricted Project, Restricted Project

May 11 2023

ruiling retitled D150246: AMDGPU: Fix issue in shl(or) combine from AMDGPU: remove an illegal transform for shl(or) to AMDGPU: Fix issue in shl(or) combine.
May 11 2023, 7:32 AM · Restricted Project, Restricted Project
ruiling updated the diff for D150246: AMDGPU: Fix issue in shl(or) combine.

fix the issue instead of removing the optimization.

May 11 2023, 7:30 AM · Restricted Project, Restricted Project

May 10 2023

ruiling added a comment to D150246: AMDGPU: Fix issue in shl(or) combine.

It should help because SelectionDAG::isBaseWithConstantOffset knows how to match OR (if the known bits do not overlap) as well as ADD.

Do you mean we can optimize shl(or) with the help of known bits? I think I agree with you, but I am not sure whether it really helps in practical cases. The isBaseWithConstantOffset you mentioned is specifically designed to work on stack slot accesses, and I am not sure whether such patterns can also be observed more broadly. I also think this kind of optimization should be done separately; maybe it should be added to the common LLVM code, and the lit test should be redesigned as well. I would rather fix the problematic transformation first. Does that sound OK to you?

May 10 2023, 12:30 AM · Restricted Project, Restricted Project
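
The known-bits condition discussed above comes down to one arithmetic fact: a | c equals a + c exactly when a and c have no set bits in common, since the addition then produces no carries. A small Python sketch of that fact (or_is_add is a made-up helper, not an LLVM API):

```python
def or_is_add(a, c):
    """True when OR and ADD give the same result for these two values."""
    return (a | c) == (a + c)


disjoint = or_is_add(0b10100000, 0b00000101)  # no common bits
overlap = or_is_add(0b0110, 0b0100)           # bit 2 is set in both: 6 | 4 = 6, 6 + 4 = 10
print(disjoint, overlap)  # True False
```
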

May 9 2023

ruiling requested review of D150246: AMDGPU: Fix issue in shl(or) combine.
May 9 2023, 8:47 PM · Restricted Project, Restricted Project

May 5 2023

ruiling added a comment to D149651: [UnreachableBlockElim] Don't remove LCSSA phi nodes.

I guess you are observing a code generation bug for AMDGPU? Is it replacing a phi like %1:vgpr = phi %0:sgpr with %1:vgpr = COPY %0:sgpr? If that is the case, I don't think this is a root fix.

May 5 2023, 12:30 AM · Restricted Project, Restricted Project

Apr 28 2023

ruiling added inline comments to D149348: RFD: Do not CSE convergent calls in different basic blocks.
Apr 28 2023, 7:08 AM · Restricted Project, Restricted Project

Apr 27 2023

ruiling added inline comments to D149348: RFD: Do not CSE convergent calls in different basic blocks.
Apr 27 2023, 7:12 PM · Restricted Project, Restricted Project

Apr 26 2023

ruiling accepted D149161: AMDGPU: Fix an assertion in SIOptimizeVGPRLiveRange.

LGTM

Apr 26 2023, 7:12 AM · Restricted Project, Restricted Project

Apr 25 2023

ruiling added a comment to D149161: AMDGPU: Fix an assertion in SIOptimizeVGPRLiveRange.

Why can't I reproduce the assertion? I have tried the latest version and a version from several days ago.

Apr 25 2023, 8:24 PM · Restricted Project, Restricted Project

Apr 13 2023

ruiling added inline comments to D147408: [AMDGPU] Iterative scan implementation for atomic optimizer..
Apr 13 2023, 6:11 PM · Restricted Project, Restricted Project

Apr 6 2023

ruiling added a comment to D147683: AMDGPU: mark SET_INACTIVE_* as convergent operation.

Does tablegen not infer convergent instructions from the source pattern? I assume not (but I always found this to be a dubious feature)

Apr 6 2023, 8:46 PM · Restricted Project, Restricted Project
ruiling committed rGfc06a429aae1: AMDGPU: Add a test to show bad compilation to set_inactive (authored by ruiling).
Apr 6 2023, 6:11 PM · Restricted Project, Restricted Project
ruiling committed rG2ab6835f285f: AMDGPU: mark SET_INACTIVE_* as convergent operation (authored by ruiling).
Apr 6 2023, 6:11 PM · Restricted Project, Restricted Project
ruiling closed D147683: AMDGPU: mark SET_INACTIVE_* as convergent operation.
Apr 6 2023, 6:11 PM · Restricted Project, Restricted Project
ruiling closed D147682: AMDGPU: Add a test to show bad compilation to set_inactive.
Apr 6 2023, 6:11 PM · Restricted Project, Restricted Project

Apr 5 2023

ruiling added a comment to D147408: [AMDGPU] Iterative scan implementation for atomic optimizer..

If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.

  // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
  s_mov s0, exec
  s_mov s3, 0 // accumulator
// repeat this section 32 or 64 times:
  s_ff1 s1, s0 // find lowest remaining active lane
  s_cmp_eq s1, -1
  s_cbranch_scc1 end
  s_bitset0 s0, s1
  v_readlane s2, v0, s1
  s_add s3, s2
  v_writelane v1, s3, s1
// end of repeated section
end:
Apr 5 2023, 11:52 PM · Restricted Project, Restricted Project
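
To make the data flow of the assembly sketch above easier to follow, here is a lane-by-lane simulation in Python. It is a model under assumptions, not generated code: the function name and the list-based "registers" are made up, and the comments map each step to the instruction it stands in for.

```python
def inclusive_plus_scan(values, exec_mask):
    """Simulate the scan over active lanes; return (per-lane result, reduction)."""
    result = [None] * len(values)  # v1; inactive lanes are never written
    remaining = exec_mask          # s0: lanes still to be handled
    acc = 0                        # s3: accumulator
    while remaining:               # stands in for s_cmp_eq / s_cbranch_scc1 end
        lane = (remaining & -remaining).bit_length() - 1  # s_ff1: lowest set bit
        remaining &= ~(1 << lane)  # s_bitset0
        acc += values[lane]        # v_readlane + s_add
        result[lane] = acc         # v_writelane
    return result, acc


scan, total = inclusive_plus_scan([1, 2, 3, 4, 5, 6, 7, 8], 0b10110101)
print(scan)   # [1, None, 4, None, 9, 15, None, 23]
print(total)  # 23
```

Lanes 1, 3 and 6 are inactive in this example, so they are skipped and their result slots stay unwritten.
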
ruiling requested review of D147683: AMDGPU: mark SET_INACTIVE_* as convergent operation.
Apr 5 2023, 11:45 PM · Restricted Project, Restricted Project
ruiling requested review of D147682: AMDGPU: Add a test to show bad compilation to set_inactive.
Apr 5 2023, 11:44 PM · Restricted Project, Restricted Project

Apr 3 2023

ruiling added inline comments to D147408: [AMDGPU] Iterative scan implementation for atomic optimizer..
Apr 3 2023, 11:35 PM · Restricted Project, Restricted Project

Mar 22 2023

ruiling added a comment to D146523: [AMDGPU]: Add new intrinsic llvm.amdgcn.convergent.copy.
VAL = ... // VGPR
RES = ... // FINAL result of scan, active lanes will write to this VGPR
sum = 0;                                     // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
    if (IsActive(lane)) {                    // check whether the lane is active
        elementToAdd = readlane(VAL, lane);  // SGPR, value read from VAL at this lane
        sum = sum + elementToAdd;            // SGPR, update the partial sum
        writelane(RES, sum, lane);           // write sum (SGPR) to VGPR RES at this lane
    }
}

The idea here is a dangerous way to program our GPU. Please check the comment below to see why we should not do this.
A possible safe way is to do something like:

// All active threads enter the loop.
bool hasUnprocessedLane = true;
do {
  get_first_active_lane()
  if (is_first_active_lane) {
    // Only the first active lane gets here; other threads skip to the loop end.
    do the work for this active lane
    hasUnprocessedLane = false;
  }
} while (hasUnprocessedLane);

hasUnprocessedLane is used to make the first active lane, once it has been processed in this iteration, exit the loop.

Mar 22 2023, 5:37 PM · Restricted Project, Restricted Project
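
The loop suggested in the comment above can be modelled as follows. This is a hypothetical Python simulation of the control flow, not compiler output: each lane has its own copy of the flag, and the "first active lane" in an iteration is the lowest lane that is still inside the loop.

```python
def process_lanes_one_at_a_time(exec_mask, num_lanes, work):
    """Run work(lane) once per active lane, lowest lane first."""
    # Per-lane copy of hasUnprocessedLane; only active lanes enter the loop.
    has_unprocessed = [bool((exec_mask >> lane) & 1) for lane in range(num_lanes)]
    order = []
    while any(has_unprocessed):             # some lane is still in the loop
        first = has_unprocessed.index(True)  # get_first_active_lane()
        work(first)                          # do the work for this lane only
        order.append(first)
        has_unprocessed[first] = False       # this lane exits the loop
    return order


seen = []
order = process_lanes_one_at_a_time(0b101001, 6, seen.append)
print(order)  # [0, 3, 5]
```

Each iteration retires exactly one lane, so the loop runs once per active lane and never touches inactive ones.
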

Mar 9 2023

ruiling added inline comments to D145323: AMDGPU: Fix LiveVariables verifier error for values defined before SI_END_CF.
Mar 9 2023, 7:52 PM · Restricted Project, Restricted Project
ruiling added inline comments to D145323: AMDGPU: Fix LiveVariables verifier error for values defined before SI_END_CF.
Mar 9 2023, 6:47 AM · Restricted Project, Restricted Project
ruiling accepted D145572: [llvm][Uniformity] consistently handle uniform instructions.

LGTM

Mar 9 2023, 3:41 AM · Restricted Project, Restricted Project
ruiling added a comment to D145572: [llvm][Uniformity] consistently handle uniform instructions.

I am not sure what your question is:

  1. Cycle is assumed to be divergent because it is irreducible. But operations that are always uniform need not be assumed to be divergent. That is this case.
  2. Cycle has divergent exit. Value that is always uniform may still be divergent at its use. That is separately handled by temporal divergence.

I am asking about the second. Thanks for the explanation.

Mar 9 2023, 12:59 AM · Restricted Project, Restricted Project

Mar 8 2023

ruiling added a comment to D145572: [llvm][Uniformity] consistently handle uniform instructions.

LGTM, as this resolves our downstream assertions seen in our CI

Why would this fix an assertion? The description sounds like this is just an improvement. I think we need a reproducer for the assertion.

Please see the comment regarding pushUsers().

Mar 8 2023, 11:14 PM · Restricted Project, Restricted Project
ruiling added a comment to D145572: [llvm][Uniformity] consistently handle uniform instructions.

LGTM, as this resolves our downstream assertions seen in our CI

Mar 8 2023, 10:31 PM · Restricted Project, Restricted Project
ruiling added inline comments to D145323: AMDGPU: Fix LiveVariables verifier error for values defined before SI_END_CF.
Mar 8 2023, 8:28 PM · Restricted Project, Restricted Project

Mar 7 2023

ruiling added inline comments to D145323: AMDGPU: Fix LiveVariables verifier error for values defined before SI_END_CF.
Mar 7 2023, 6:14 AM · Restricted Project, Restricted Project
ruiling added inline comments to D145323: AMDGPU: Fix LiveVariables verifier error for values defined before SI_END_CF.
Mar 7 2023, 5:01 AM · Restricted Project, Restricted Project

Feb 27 2023

ruiling accepted D144699: [llvm][Uniformity] provide overloads for Instruction* and Value*.

Thank you for the fix! LGTM

Feb 27 2023, 6:07 PM · Restricted Project, Restricted Project

Feb 21 2023

ruiling added a comment to D144162: [AMDGPU] Replace LegacyDA with Uniformity Analysis in AnnotateUniformValues.

I agree with @gandhi21299 ... it's best to work through the passes one at a time. But maybe we should not enable each change as it happens. We could put the changes behind a command-line option to switch between DA and UA. One single option that switches whatever passes have been updated, until all passes are updated and we are ready to flip the switch permanently.

Feb 21 2023, 5:24 PM · Restricted Project, Restricted Project, Restricted Project

Feb 19 2023

ruiling accepted D144254: [llvm][Uniformity] a phi with an undef argument is not always divergent.

Although I wanted the optimization for the undef case to be brought back in a better-defined way, I think it is acceptable to match the existing divergence analysis. This would help with the transition to uniform analysis, so LGTM. Please also address Jay's suggestion. Could you also add a FIXME near the code, something like: ideally, reporting uniform with an undef input should be done in a better-defined way; in a general CFG it might be broken.

Feb 19 2023, 10:27 PM · Restricted Project, Restricted Project
ruiling added a comment to D144162: [AMDGPU] Replace LegacyDA with Uniformity Analysis in AnnotateUniformValues.

I think it is reasonable to match the divergence analysis behavior regarding undef. The other problem, that isUniform() returns true for a divergent branch instruction, makes me wonder: is replacing the uses of divergence analysis with uniform analysis one by one the best way? Although I am optimistic about the quality of uniform analysis, I think it may be more helpful to replace all the occurrences of divergence analysis and fix all the bugs uncovered. Ideally, we would have very few test changes. The reason is that one specific pass may not have enough test coverage. Fixing the bugs after switching all the uses of divergence analysis to uniform analysis will make us more confident that we are less likely to cause regressions. Any other ideas?

Feb 19 2023, 10:17 PM · Restricted Project, Restricted Project, Restricted Project

Feb 17 2023

ruiling added a comment to D144162: [AMDGPU] Replace LegacyDA with Uniformity Analysis in AnnotateUniformValues.

I think for this specific case, we should report %8 as uniform, and the branch should also be uniform. But something seems wrong in the uniform analysis: if you try opt -passes='print<uniformity>' with the test here, both the condition %8 and the conditional branch br i1 %8,... are reported as divergent, yet the isUniform() query returns true for the branch instruction. If the condition is divergent, I think the branch should also be divergent. Another issue I want to point out is that the uniform analysis has a subtle difference from divergence analysis. There is a comment in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp#L13 that explains the issue. I think we need to close this gap before switching to uniform analysis, otherwise we will regress code generation. The last time we discussed this, I think we wanted a target-specific option in uniform analysis to match the behavior of divergence analysis.

Feb 17 2023, 1:11 AM · Restricted Project, Restricted Project, Restricted Project

Feb 5 2023

ruiling committed rGbe3f4591aff0: AMDGPU: Mark control flow intrinsics non-duplicable (authored by ruiling).
Feb 5 2023, 11:35 PM · Restricted Project, Restricted Project
ruiling closed D118250: AMDGPU: Mark control flow intrinsics non-duplicable.
Feb 5 2023, 11:34 PM · Restricted Project, Restricted Project

Feb 2 2023

ruiling added inline comments to D143154: [AMDGPU] Introduce divergence and uniform bit fields in tablegen .
Feb 2 2023, 7:45 AM · Restricted Project, Restricted Project
ruiling added inline comments to D141355: [AMDGPUUnifyDivergentExitNodes] Add NewPM support.
Feb 2 2023, 7:01 AM · Restricted Project, Restricted Project

Feb 1 2023

ruiling added a comment to D118250: AMDGPU: Mark control flow intrinsics non-duplicable.

I revisited the problem. I agree we are allowed to duplicate such intrinsics, given that the intrinsic lowering already supports their use in unstructured CFGs. But the case shows that duplicating such intrinsics might not help generate better code. By making them non-duplicable, we get simpler usage of the control flow intrinsics. In testing on a large set of graphics shaders, the change causes no code generation differences.

Feb 1 2023, 10:29 PM · Restricted Project, Restricted Project
ruiling updated the summary of D118250: AMDGPU: Mark control flow intrinsics non-duplicable.
Feb 1 2023, 10:25 PM · Restricted Project, Restricted Project
ruiling updated the diff for D118250: AMDGPU: Mark control flow intrinsics non-duplicable.

rebase with more comment

Feb 1 2023, 10:21 PM · Restricted Project, Restricted Project

Jan 10 2023

ruiling committed rG9119d9bfcef4: AMDGPU/SIInsertWait: Skip dummy tied source (authored by ruiling).
Jan 10 2023, 6:00 PM · Restricted Project, Restricted Project
ruiling committed rGcce24b6af099: AMDGPU: Remove IsSourceOfDivergence check (authored by ruiling).
Jan 10 2023, 6:00 PM · Restricted Project, Restricted Project
ruiling committed rG5d0ff923c3a7: AMDGPU: Promote array alloca if used by memmove/memcpy (authored by ruiling).
Jan 10 2023, 6:00 PM · Restricted Project, Restricted Project
ruiling closed D140537: AMDGPU/SIInsertWait: Skip dummy tied source.
Jan 10 2023, 6:00 PM · Restricted Project, Restricted Project
ruiling closed D141223: AMDGPU: Remove IsSourceOfDivergence check.
Jan 10 2023, 6:00 PM · Restricted Project, Restricted Project
ruiling closed D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.
Jan 10 2023, 6:00 PM · Restricted Project, Restricted Project
ruiling added inline comments to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.
Jan 10 2023, 6:39 AM · Restricted Project, Restricted Project
ruiling updated the diff for D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

address review comments.

Jan 10 2023, 6:33 AM · Restricted Project, Restricted Project

Jan 9 2023

ruiling updated the diff for D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

Use alignment from alloca.

Jan 9 2023, 10:30 PM · Restricted Project, Restricted Project
ruiling updated the diff for D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

use cast instead of dyn_cast.

Jan 9 2023, 4:48 PM · Restricted Project, Restricted Project
ruiling added a comment to D140537: AMDGPU/SIInsertWait: Skip dummy tied source.

I like this but it would be nice to have tests showing the effect on DS, FLAT and SCRATCH instructions.

Jan 9 2023, 4:37 PM · Restricted Project, Restricted Project

Jan 8 2023

ruiling added a comment to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

Thanks for the careful review, @arsenm. I have addressed most of the comments. Please take a second look. Thanks!

Jan 8 2023, 11:38 PM · Restricted Project, Restricted Project
ruiling updated the diff for D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

address review comments.

Jan 8 2023, 11:26 PM · Restricted Project, Restricted Project
ruiling added inline comments to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.
Jan 8 2023, 6:32 PM · Restricted Project, Restricted Project
ruiling updated the diff for D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

rebase and ping.

Jan 8 2023, 5:59 PM · Restricted Project, Restricted Project
ruiling added inline comments to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.
Jan 8 2023, 5:46 PM · Restricted Project, Restricted Project
ruiling retitled D140537: AMDGPU/SIInsertWait: Skip dummy tied source from SIInsertWait: Skip tied source of d16 buffer instruction to AMDGPU/SIInsertWait: Skip dummy tied source.
Jan 8 2023, 5:35 AM · Restricted Project, Restricted Project
ruiling abandoned D140536: AMDGPU: Mark global/scratch_load_d16 as D16Buf.
Jan 8 2023, 5:34 AM · Restricted Project, Restricted Project
ruiling updated the diff for D140537: AMDGPU/SIInsertWait: Skip dummy tied source.

Update the change to use a new bit in TSFlags

Jan 8 2023, 5:31 AM · Restricted Project, Restricted Project
ruiling requested review of D141223: AMDGPU: Remove IsSourceOfDivergence check.
Jan 8 2023, 5:28 AM · Restricted Project, Restricted Project

Jan 3 2023

ruiling accepted D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.

I think it is good to go. Thanks!

Jan 3 2023, 11:16 PM · Restricted Project, Restricted Project, Restricted Project
ruiling added inline comments to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.
Jan 3 2023, 10:07 PM · Restricted Project, Restricted Project, Restricted Project
ruiling added inline comments to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.
Jan 3 2023, 9:02 PM · Restricted Project, Restricted Project, Restricted Project
ruiling added a comment to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

ping

Jan 3 2023, 5:16 AM · Restricted Project, Restricted Project

Jan 2 2023

ruiling added a comment to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.
  • inserted a check for switch terminator, ...

I don't think just checking for a switch terminator would work. My suggestion is not to handle switch here. The pass was never designed to work with switch terminators, and supporting them would need non-trivial work. As we already lower switch terminators before this pass, I don't think it is important to support them in this pass now.

Jan 2 2023, 10:23 PM · Restricted Project, Restricted Project, Restricted Project

Dec 25 2022

ruiling added a comment to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

Shouldn't these have been expanded into loads and stores already?

We only split mem transfer intrinsics using a loop if its length is > 1024.

I think InstCombine and/or SROA do some too. Also that threshold is arbitrary, and we could start expanding the small ones in IR too

Dec 25 2022, 9:49 PM · Restricted Project, Restricted Project

Dec 22 2022

ruiling added a comment to D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.

Shouldn't these have been expanded into loads and stores already?

Dec 22 2022, 6:25 PM · Restricted Project, Restricted Project
ruiling requested review of D140599: AMDGPU: Promote array alloca if used by memmove/memcpy.
Dec 22 2022, 6:16 PM · Restricted Project, Restricted Project
ruiling added inline comments to D140537: AMDGPU/SIInsertWait: Skip dummy tied source.
Dec 22 2022, 5:49 AM · Restricted Project, Restricted Project
ruiling added inline comments to D140537: AMDGPU/SIInsertWait: Skip dummy tied source.
Dec 22 2022, 4:24 AM · Restricted Project, Restricted Project
ruiling added inline comments to D140537: AMDGPU/SIInsertWait: Skip dummy tied source.
Dec 22 2022, 4:15 AM · Restricted Project, Restricted Project
ruiling requested review of D140537: AMDGPU/SIInsertWait: Skip dummy tied source.
Dec 22 2022, 2:35 AM · Restricted Project, Restricted Project
ruiling requested review of D140536: AMDGPU: Mark global/scratch_load_d16 as D16Buf.
Dec 22 2022, 2:35 AM · Restricted Project, Restricted Project

Dec 21 2022

ruiling added inline comments to D140069: [DAGCombiner] Scalarize vectorized loads that are splatted.
Dec 21 2022, 4:34 AM · Restricted Project, Restricted Project

Dec 19 2022

ruiling added a comment to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.

I haven't looked closely but some of these test changes look worse

Dec 19 2022, 7:23 AM · Restricted Project, Restricted Project, Restricted Project

Dec 14 2022

ruiling added a comment to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.
Dec 14 2022, 11:07 PM · Restricted Project, Restricted Project, Restricted Project

Dec 12 2022

ruiling added a comment to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.

@ruiling PDT.getRoot() yields BB4. So for PDT with exactly one root, we should only allow the transformation when its terminator is a branch instruction?

Yes.

This causes the following tests to fail due to insertion of dummy blocks:

Dec 12 2022, 11:12 PM · Restricted Project, Restricted Project, Restricted Project

Dec 11 2022

ruiling added a comment to D139780: [AMDGPU] Unify divergent nodes if the PostDom tree has one root.

Thanks for working on this, but I think it is more reasonable to fix the issue earlier. The function exposes a problem in our CFG lowering passes for AMDGPU. The CFG lowering is done by several passes. I know it is very tricky and fragile at the moment, so it is not easy to determine what the right fix should be. In this specific case, the input IR does not have a return block and thus bypasses the structurizeCFG pass. AMDGPUUnifyDivergentExits, which runs before StructurizeCFG, already has some ability to handle infinite loops, but this case was skipped because of the check if (PDT.root_size() <= 1). Here, BB3 is a root in the PDT but it is not a true exit. I would suggest we change this. If there is only one root in the PostDominatorTree, please check whether it ends with a return/unreachable instruction before skipping the transformation. If the single root in the PDT ends with a branch instruction, we should continue the transformation: insert a dummy return block and a control flow edge that is never taken, with a trick like br i1 true, ..., DummyReturn. By doing this, we should be able to lower the CFG correctly.

Dec 11 2022, 11:11 PM · Restricted Project, Restricted Project, Restricted Project
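
The decision rule suggested in the comment above can be written down as a small predicate. This is only a sketch in Python with made-up names (should_skip_unification, and a dict mapping block names to a terminator kind); it is not the pass's actual code or API.

```python
def should_skip_unification(pdt_roots, terminator_kind):
    """Decide whether the exit-unification transform may be skipped.

    pdt_roots: list of block names that are roots of the post-dominator tree.
    terminator_kind: hypothetical dict, block name -> 'ret', 'unreachable' or 'br'.
    """
    if len(pdt_roots) > 1:
        return False  # several exits: the transform has work to do
    if len(pdt_roots) == 1:
        # A single root only counts as a real exit if it returns or is unreachable.
        return terminator_kind[pdt_roots[0]] in ("ret", "unreachable")
    return True       # no roots: kept as in the existing root_size() <= 1 check


print(should_skip_unification(["BB3"], {"BB3": "br"}))   # False
print(should_skip_unification(["BB4"], {"BB4": "ret"}))  # True
```

In the first call the single root ends in a branch, which corresponds to the infinite-loop case from the comment, so the transform is not skipped and a dummy return block would be inserted.
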

Dec 4 2022

ruiling abandoned D138186: InstCombine: Simplify vector load based on demanded elements.

I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so the fix could help more cases. A load in LLVM IR may still be split or combined later.

Dec 4 2022, 11:50 PM · Restricted Project, Restricted Project

Nov 22 2022

ruiling committed rG0eaf6759aee9: [AMDGPU][InsertWaits] No wait for WAW for global/scratch_load (authored by ruiling).
Nov 22 2022, 5:58 PM · Restricted Project, Restricted Project
ruiling closed D138476: [AMDGPU][InsertWaits] No wait for WAW for global/scratch_load.
Nov 22 2022, 5:58 PM · Restricted Project, Restricted Project
ruiling added inline comments to D138476: [AMDGPU][InsertWaits] No wait for WAW for global/scratch_load.
Nov 22 2022, 4:03 AM · Restricted Project, Restricted Project