Page MenuHomePhabricator

hliao (Michael Liao)
User

Projects

User does not belong to any projects.

User Details

User Since
Aug 7 2014, 12:01 PM (348 w, 3 d)

Recent Activity

Fri, Apr 9

hliao added inline comments to D99635: [SelectionDAG] Add extra check on asm operand legalization..
Fri, Apr 9, 5:24 PM · Restricted Project
hliao added a comment to D99635: [SelectionDAG] Add extra check on asm operand legalization..

Missing test

Fri, Apr 9, 6:50 AM · Restricted Project

Mon, Apr 5

hliao abandoned D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Apr 5, 8:06 AM · Restricted Project
hliao resigned from D96517: [AMDGPU] Optimize SGPR to scratch spilling.
Mon, Apr 5, 8:02 AM · Restricted Project
hliao added a comment to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on how we spill a SGPR by writing a fixed lane or write an active lane. The 1st one, without saving/restoring, we will overwrite the live values in the inactive lanes. HPC workloads are hit by that issue and cannot run correctly.

Let me rephrase to make sure I understood you correctly. You're saying that spilling an SGPR to a fixed lane of a VGPR may cause data of an inactive lane to be overwritten. This is a problem if the spill/reload happens in a called function, because VGPR save/reload doesn't save those inactive lanes. (HPC is irrelevant here.)

Mon, Apr 5, 7:06 AM · Restricted Project

Sat, Apr 3

hliao added a comment to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.
Sat, Apr 3, 8:16 AM · Restricted Project

Wed, Mar 31

hliao added inline comments to D99635: [SelectionDAG] Add extra check on asm operand legalization..
Wed, Mar 31, 6:03 PM · Restricted Project

Tue, Mar 30

hliao abandoned D92394: [amdgpu] Teach one more case for assumed global pointers..

the case is no longer valid considering concurrent kernel execution.

Tue, Mar 30, 9:05 PM · Restricted Project
hliao edited reviewers for D99635: [SelectionDAG] Add extra check on asm operand legalization., added: bogner; removed: JustinBorb.
Tue, Mar 30, 8:24 PM · Restricted Project
hliao requested review of D99635: [SelectionDAG] Add extra check on asm operand legalization..
Tue, Mar 30, 8:23 PM · Restricted Project
hliao added a comment to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..

It seems to me that we may need to revise CFG lowering to avoid updating EXEC directly and later revise it based on whether the restoring mask needs reloading or not. Here's the brief thought in my mind:

  • Instead of lowering CFG early before RA, lower it after RA. As a byproduct, it also remove the need of "terminator" version of exec mask manipulation instructions.
  • When CFG is being lowered, it could update EXEC eagerly if the merge point doesn't need to reload the mask; Otherwise, it just needs to translate as what we currently did.
Tue, Mar 30, 6:11 AM · Restricted Project

Mon, Mar 29

hliao added a comment to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

Mon, Mar 29, 12:57 PM · Restricted Project
hliao added inline comments to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Mar 29, 10:30 AM · Restricted Project
hliao added inline comments to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Mar 29, 10:03 AM · Restricted Project
hliao added inline comments to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Mar 29, 9:59 AM · Restricted Project
hliao added inline comments to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Mar 29, 9:56 AM · Restricted Project
hliao added a reviewer for D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask.: dstuttard.
Mon, Mar 29, 8:12 AM · Restricted Project
hliao added inline comments to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Mar 29, 7:54 AM · Restricted Project
hliao added a comment to D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..

This's the companion fix for D96980. That explain the myth why the original agnostic SGPR spill/reload is proposed to solve the issue SGRP spill/reload may be executed when exec mask goes to zero.
We did try to skip executing code when exec mask goes to zero by branch on EXECZ (to the target block) or EXECNZ (to the fallthrough block.) We may run instructions with zero exec mask. But, that's usually not an issue as we immediately restore the exec mask on the targeted block. Code following that exec mask restoration won't be executed in 0 mask. However, if that mask restoration needs to reload a spilled exec mask, we will run the SGPR reload with 0 mask, where v_readfristlane has undefined behavior when exec mask is zero.
This patch tries to mitigate that case by not evaluating exec mask that early or clearing the exec mask when the branch target has mask restoration following SGRP reload. Instead of checking EXECZ or EXECNZ, the exec mask evaluation is duplicated with a temporary SGRP as the destination (without updating exec mask directly), checking SCC0 is equivalent to EXECZ. Exec mask is only evaluated when the result won't be zero. For instance,

Mon, Mar 29, 7:52 AM · Restricted Project
hliao requested review of D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask..
Mon, Mar 29, 7:38 AM · Restricted Project

Feb 25 2021

hliao added inline comments to D97318: [clang][CodeGen] Allow fp16 arg pass by register.
Feb 25 2021, 8:28 AM
hliao added inline comments to D97318: [clang][CodeGen] Allow fp16 arg pass by register.
Feb 25 2021, 8:24 AM

Feb 24 2021

hliao committed rG0d4e12e3c110: [amdgpu] Atomic should be source of divergence. (authored by hliao).
[amdgpu] Atomic should be source of divergence.
Feb 24 2021, 12:29 PM
hliao closed D97392: [amdgpu] Atomic should be source of divergence..
Feb 24 2021, 12:29 PM · Restricted Project
hliao added inline comments to D97392: [amdgpu] Atomic should be source of divergence..
Feb 24 2021, 12:06 PM · Restricted Project
hliao added inline comments to D97392: [amdgpu] Atomic should be source of divergence..
Feb 24 2021, 11:10 AM · Restricted Project
hliao updated the diff for D97392: [amdgpu] Atomic should be source of divergence..

Update tests with more atomic ops.

Feb 24 2021, 11:10 AM · Restricted Project
hliao added reviewers for D97392: [amdgpu] Atomic should be source of divergence.: arsenm, rampitec, cfang.
Feb 24 2021, 8:15 AM · Restricted Project
hliao updated the diff for D97392: [amdgpu] Atomic should be source of divergence..

Update summary.

Feb 24 2021, 8:15 AM · Restricted Project
hliao requested review of D97392: [amdgpu] Atomic should be source of divergence..
Feb 24 2021, 8:13 AM · Restricted Project
hliao added a reviewer for D96980: [amdgpu] Revert agnostic SGPR spill.: sebastian-ne.
Feb 24 2021, 7:04 AM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

In addition, we already heavily use v_readfirstlane in our codegen due to some patterns benefits by using vector instructions when no corresponding scalar instructions could be used. I believed it's quite safe that it's guaranteed v_readfirstlane won't be executed when exec mask goes to 0.

Feb 24 2021, 7:03 AM · Restricted Project
hliao updated the diff for D96980: [amdgpu] Revert agnostic SGPR spill..

Remove that unnecessary change and add rationale why that's safe for the original concerns.

Feb 24 2021, 6:59 AM · Restricted Project

Feb 23 2021

hliao added inline comments to D96980: [amdgpu] Revert agnostic SGPR spill..
Feb 23 2021, 9:33 AM · Restricted Project

Feb 22 2021

hliao added a comment to D96517: [AMDGPU] Optimize SGPR to scratch spilling.

why exec mask = 0 case is a valid one, won't we already branch away once exec mask goes to zero?

That is the question it comes down to. If it is guaranteed that exec is never 0, i.e. at least one bit is always set, I’m in favor of your patch.

To have some numbers, I saw some functions spilling 30 SGPRs to scratch, so it can be more than just a one or two.

Feb 22 2021, 1:46 PM · Restricted Project
hliao updated the diff for D96980: [amdgpu] Revert agnostic SGPR spill..

Mark SI_SPILL_<n>_RESTORE having unwanted effect so that they would be executed under exec mask 0.

Feb 22 2021, 1:43 PM · Restricted Project
hliao added a comment to D96517: [AMDGPU] Optimize SGPR to scratch spilling.

why exec mask = 0 case is a valid one, won't we already branch away once exec mask goes to zero?

That is the question it comes down to. If it is guaranteed that exec is never 0, i.e. at least one bit is always set, I’m in favor of your patch.

To have some numbers, I saw some functions spilling 30 SGPRs to scratch, so it can be more than just a one or two.

Feb 22 2021, 10:36 AM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstline for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

Feb 22 2021, 9:04 AM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstline for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

I realized SIRemoveShortExecBranches.cpp has done correctness checks. So removing SIInsertSkips should work if we make sure there will always be a branching instruction for each control flow change.

Feb 22 2021, 7:56 AM · Restricted Project
hliao added a comment to D96517: [AMDGPU] Optimize SGPR to scratch spilling.

I think once we're spilling SGPRs to memory with these tricks, you've given up on performance. We just need *anything* that works, and put effort into guaranteeing we never need to do this (aside from the degenerate inline-asm case using all VGPRs). I'm not worried about getting the ideal SGPR-to-memory sequence

Feb 22 2021, 7:47 AM · Restricted Project
hliao added a comment to D96517: [AMDGPU] Optimize SGPR to scratch spilling.

why exec mask = 0 case is a valid one, won't we already branch away once exec mask goes to zero?

Feb 22 2021, 7:46 AM · Restricted Project

Feb 19 2021

hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstline for the restore will read lane 0, which contains some unknown value.
exec=0 is a corner case, but I don’t think we can build on that.

Other than that, it surely is more efficient than the (worst case) 4 memory operations introduced by saving all lanes of the used VGPR :)

Feb 19 2021, 11:08 AM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

HPC workloads mostly spill 1 or 2 SGPRs.

Can you explain this a bit more? Spilling SGPRs to memory is supposed to be a very rare case. Why is it common for HPC workloads? Is there a better way to fix it?

Feb 19 2021, 8:13 AM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

Is it at all possible to encapsulate your failure case in a lit test?

Is the problem not with the exec manipulation, but really that the register scavenger choosing an inappropriate register, w.r.t. wave level CFG?

Feb 19 2021, 8:12 AM · Restricted Project

Feb 18 2021

hliao requested changes to D96517: [AMDGPU] Optimize SGPR to scratch spilling.
Feb 18 2021, 4:30 PM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

Conflicts with D96336/D96517

Feb 18 2021, 4:28 PM · Restricted Project
hliao added a comment to D96517: [AMDGPU] Optimize SGPR to scratch spilling.

Shall we optimize the cases where only 1 or 2 SGPRs are to be spilled or reloaded when there's a VGPR scavenged? In this case, we only need one or two loads/stores to spill/reload that SGPR. From the number of LD/ST, that original one (based on broadcast and v_readfirstlane) is still OK but we need less code. Also, there's no latency due to the restore of that restore of that tmp VGPR.

Feb 18 2021, 4:25 PM · Restricted Project
hliao added a comment to D96980: [amdgpu] Revert agnostic SGPR spill..

Conflicts with D96336/D96517

Feb 18 2021, 3:59 PM · Restricted Project
hliao updated the summary of D96980: [amdgpu] Revert agnostic SGPR spill..
Feb 18 2021, 11:11 AM · Restricted Project
hliao requested review of D96980: [amdgpu] Revert agnostic SGPR spill..
Feb 18 2021, 11:04 AM · Restricted Project

Feb 5 2021

hliao committed rG01bf529db2cf: Recommit of a2fdf9d4d734732a6fa9288f1ffdf12bf8618123. (authored by hliao).
Recommit of a2fdf9d4d734732a6fa9288f1ffdf12bf8618123.
Feb 5 2021, 8:28 AM
hliao added a reverting change for rG4874ff024179: Revert "[hip][cuda] Enable extended lambda support on Windows.": rG01bf529db2cf: Recommit of a2fdf9d4d734732a6fa9288f1ffdf12bf8618123..
Feb 5 2021, 8:28 AM

Feb 3 2021

hliao committed rGa2fdf9d4d734: [hip][cuda] Enable extended lambda support on Windows. (authored by hliao).
[hip][cuda] Enable extended lambda support on Windows.
Feb 3 2021, 10:40 PM
hliao closed D69322: [hip][cuda] Enable extended lambda support on Windows..
Feb 3 2021, 10:39 PM · Restricted Project
hliao added a comment to D69322: [hip][cuda] Enable extended lambda support on Windows..

@rsmith, @rnk and @rjmccall, could you review the latest change addressing the memory usage issue?

Feb 3 2021, 12:57 PM · Restricted Project

Feb 2 2021

hliao added a comment to D69322: [hip][cuda] Enable extended lambda support on Windows..

PING for review

Feb 2 2021, 2:26 PM · Restricted Project

Feb 1 2021

hliao updated the diff for D69322: [hip][cuda] Enable extended lambda support on Windows..

Revise comments.

Feb 1 2021, 10:41 PM · Restricted Project
hliao updated the diff for D69322: [hip][cuda] Enable extended lambda support on Windows..
  • Remove DeviceManglingNumber to reduce memory usage.
  • Add a table in ASTContext for device lambda mangling numbers if that lamba requires.
Feb 1 2021, 12:58 PM · Restricted Project
hliao updated the diff for D69322: [hip][cuda] Enable extended lambda support on Windows..

Rebase to the main branch.

Feb 1 2021, 12:59 AM · Restricted Project

Jan 25 2021

hliao added a comment to D87858: [hip] Add HIP scope atomic ops..

PING for review.

Jan 25 2021, 8:05 AM · Restricted Project
hliao updated the diff for D87858: [hip] Add HIP scope atomic ops..

Rebase

Jan 25 2021, 8:04 AM · Restricted Project

Jan 20 2021

hliao committed rG7b5d7c7b0a24: [hip] Fix `<complex>` compilation on Windows with VS2019. (authored by hliao).
[hip] Fix `<complex>` compilation on Windows with VS2019.
Jan 20 2021, 1:44 PM
hliao closed D95075: [hip] Fix `<complex>` compilation on Windows with VS2019..
Jan 20 2021, 1:44 PM · Restricted Project
hliao updated the diff for D95075: [hip] Fix `<complex>` compilation on Windows with VS2019..

Fix typo.

Jan 20 2021, 12:10 PM · Restricted Project
hliao requested review of D95075: [hip] Fix `<complex>` compilation on Windows with VS2019..
Jan 20 2021, 11:56 AM · Restricted Project

Jan 7 2021

hliao committed rGf78d6af7319a: [hip] Enable HIP compilation with `<complex`> on MSVC. (authored by hliao).
[hip] Enable HIP compilation with `<complex`> on MSVC.
Jan 7 2021, 2:42 PM
hliao closed D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 7 2021, 2:41 PM · Restricted Project
hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Forget that C function could be overloaded on Clang with overloadable
extension. With that, we don't need to mark functions from <ymath.h> as HD.
Instead, we could provide their device-side implementation directly.

Jan 7 2021, 2:28 AM · Restricted Project
hliao added inline comments to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 7 2021, 1:46 AM · Restricted Project

Jan 6 2021

hliao added inline comments to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 6 2021, 5:54 PM · Restricted Project
hliao added inline comments to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 6 2021, 5:32 PM · Restricted Project
hliao added inline comments to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 6 2021, 2:50 PM · Restricted Project
hliao committed rG2a29ce303451: [hip] Fix HIP version parsing. (authored by hliao).
[hip] Fix HIP version parsing.
Jan 6 2021, 2:00 PM
hliao closed D93587: [hip] Fix HIP version parsing..
Jan 6 2021, 2:00 PM · Restricted Project
hliao added inline comments to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 6 2021, 1:59 PM · Restricted Project
hliao added inline comments to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Jan 6 2021, 10:55 AM · Restricted Project
hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Only mark HD attributes in ymath.h wrapper header when compiled with MSVC.

Jan 6 2021, 10:55 AM · Restricted Project
hliao updated the diff for D93587: [hip] Fix HIP version parsing..

Revise following reviewers' comments.

Jan 6 2021, 10:45 AM · Restricted Project
hliao added inline comments to D93587: [hip] Fix HIP version parsing..
Jan 6 2021, 10:45 AM · Restricted Project

Jan 5 2021

hliao added a comment to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

PING

Jan 5 2021, 8:47 AM · Restricted Project
hliao added a comment to D93587: [hip] Fix HIP version parsing..

PING

Jan 5 2021, 8:47 AM · Restricted Project

Dec 22 2020

hliao added inline comments to D92999: [amdgpu] Enhance load widening in the constant address space..
Dec 22 2020, 9:51 AM · Restricted Project

Dec 21 2020

hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Fix the cmake to distribute that header wrapper.

Dec 21 2020, 3:33 PM · Restricted Project
hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

These functions are pure C functions.

Dec 21 2020, 2:16 PM · Restricted Project
hliao updated the summary of D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Dec 21 2020, 2:10 PM · Restricted Project
hliao committed rGbb8d20d9f3bb: [cuda][hip] Fix typoes in header wrappers. (authored by hliao).
[cuda][hip] Fix typoes in header wrappers.
Dec 21 2020, 10:03 AM
hliao added a comment to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Disclaimer: I request changes because of the next sentence, other than that I have no objection but also cannot review this.
All cuda_wrapper headers say something about complex in the first row, copy & paste error. All have the wrong license text (I think).

Dec 21 2020, 9:59 AM · Restricted Project
hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Fix license.

Dec 21 2020, 9:58 AM · Restricted Project
hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Fix typo again.

Dec 21 2020, 9:53 AM · Restricted Project
hliao updated the diff for D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Fix typo.

Dec 21 2020, 8:28 AM · Restricted Project
hliao added a comment to D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..

Beyond the enabling of the compilation with <complex> on Windows, I really have the concern on the current approach supporting <complex> compilation in the device compilation. The device compilation should not relies on the host STL implementation. That results in inconsistent compilation results across various platforms, especially Linux vs. Windows.
BTW, the use of <complex> in CUDA cannot be compiled with NVCC directly even with --expt-relaxed-constexpr, c.f. https://godbolt.org/z/3f79co

Dec 21 2020, 8:27 AM · Restricted Project
hliao requested review of D93638: [hip] Enable HIP compilation with `<complex`> on MSVC..
Dec 21 2020, 7:59 AM · Restricted Project

Dec 19 2020

hliao requested review of D93587: [hip] Fix HIP version parsing..
Dec 19 2020, 3:56 PM · Restricted Project

Dec 14 2020

hliao added inline comments to D93174: [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified..
Dec 14 2020, 10:10 AM · Restricted Project
hliao committed rG1fd1f638b68c: [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified. (authored by hliao).
[amdgpu] Fix a crash case when `V_CNDMASK` could be simplified.
Dec 14 2020, 10:08 AM
hliao closed D93174: [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified..
Dec 14 2020, 10:08 AM · Restricted Project
hliao added inline comments to D93174: [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified..
Dec 14 2020, 8:11 AM · Restricted Project
hliao added inline comments to D93174: [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified..
Dec 14 2020, 6:58 AM · Restricted Project

Dec 13 2020

hliao added a comment to rG8904ee8ac7eb: [JITLink] Add JITLinkDylib type, thread through JITLinkMemoryManager APIs..

The build is broken due to the missing file.

Dec 13 2020, 6:15 PM

Dec 12 2020

hliao requested review of D93174: [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified..
Dec 12 2020, 9:41 PM · Restricted Project