User Details
- User Since
- Aug 7 2014, 12:01 PM (348 w, 3 d)
Fri, Apr 9
Mon, Apr 5
Sat, Apr 3
Wed, Mar 31
Tue, Mar 30
the case is no longer valid considering concurrent kernel execution.
It seems to me that we may need to revise CFG lowering to avoid updating EXEC directly, and later revise it based on whether the restoring mask needs reloading or not. Here's the brief thought I have in mind:
- Instead of lowering the CFG early, before RA, lower it after RA. As a byproduct, this also removes the need for the "terminator" versions of the exec mask manipulation instructions.
- When the CFG is being lowered, it could update EXEC eagerly if the merge point doesn't need to reload the mask; otherwise, it just needs to translate as we currently do.
Mon, Mar 29
This is the companion fix for D96980. It explains why the original agnostic SGPR spill/reload was proposed: to solve the issue that an SGPR spill/reload may be executed when the exec mask goes to zero.
We do try to skip executing code when the exec mask goes to zero by branching on EXECZ (to the target block) or EXECNZ (to the fallthrough block). We may still run instructions with a zero exec mask, but that's usually not an issue, as we immediately restore the exec mask in the target block. Code following that exec mask restoration won't be executed with a zero mask. However, if that mask restoration needs to reload a spilled exec mask, we will run the SGPR reload with a zero mask, where v_readfirstlane has undefined behavior.
This patch tries to mitigate that case by not evaluating the exec mask that early, or clearing it, when the branch target has a mask restoration following an SGPR reload. Instead of checking EXECZ or EXECNZ, the exec mask evaluation is duplicated with a temporary SGPR as the destination (without updating the exec mask directly); checking SCC0 is then equivalent to checking EXECZ. The exec mask itself is only evaluated when the result won't be zero. For instance,
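A minimal sketch of the idea in Python, not the patch's actual machine code: masks are modelled as plain integers, and the function name `try_enter_region` and its parameters are hypothetical. The new mask is first computed into a temporary (standing in for the temporary SGPR), its non-zero status stands in for SCC, and EXEC is only overwritten when the result is non-zero.

```python
from typing import Tuple


def try_enter_region(exec_mask: int, cond_mask: int) -> Tuple[int, bool]:
    """Model of the guarded exec-mask update (hypothetical helper).

    Returns (exec_mask_after, entered). When the would-be mask is zero,
    EXEC is left untouched and the region is skipped, so any SGPR reload
    in the branch target still runs with a non-zero exec mask.
    """
    # Duplicate the mask evaluation into a temporary instead of EXEC.
    tmp = exec_mask & cond_mask
    # Stand-in for SCC: set when the scalar result is non-zero.
    scc = tmp != 0
    if not scc:
        # SCC0 plays the role of EXECZ: branch away without clearing EXEC.
        return exec_mask, False
    # Only now commit the new mask to EXEC.
    return tmp, True


if __name__ == "__main__":
    print(try_enter_region(0b1111, 0b0101))  # some lanes remain active
    print(try_enter_region(0b1111, 0b0000))  # no lanes remain; EXEC unchanged
```

The point of the sketch is only the ordering: test the temporary first, commit to EXEC second.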
Feb 25 2021
Feb 24 2021
Update tests with more atomic ops.
Update summary.
In addition, we already use v_readfirstlane heavily in our codegen, because some patterns benefit from using vector instructions when no corresponding scalar instructions are available. I believe it's quite safe, as it's guaranteed that v_readfirstlane won't be executed when the exec mask goes to 0.
Remove that unnecessary change and add the rationale for why that's safe with respect to the original concerns.
Feb 23 2021
Feb 22 2021
Mark SI_SPILL_<n>_RESTORE as having unwanted effects so that they would be executed under exec mask 0.
Why is the exec mask = 0 case a valid one? Won't we already branch away once the exec mask goes to zero?
Feb 19 2021
Feb 18 2021
Shall we optimize the cases where only 1 or 2 SGPRs are to be spilled or reloaded when there's a VGPR scavenged? In this case, we only need one or two loads/stores to spill/reload those SGPRs. In terms of the number of loads/stores, the original approach (based on broadcast and v_readfirstlane) is still OK, but we would need less code. Also, there's no latency due to the restore of that tmp VGPR.
Feb 5 2021
Feb 3 2021
Feb 2 2021
PING for review
Feb 1 2021
Revise comments.
- Remove DeviceManglingNumber to reduce memory usage.
- Add a table in ASTContext for device lambda mangling numbers if the lambda requires one.
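The trade-off behind these two changes can be modelled as follows. This is only a Python illustration of a side table versus a per-object field, not Clang's actual API; `LambdaDecl`, `Context`, and both method names are hypothetical.

```python
class LambdaDecl:
    """Hypothetical stand-in for a lambda declaration.

    It carries no device mangling number field, so lambdas that never
    need one pay no per-object cost.
    """

    __slots__ = ("name",)

    def __init__(self, name):
        self.name = name


class Context:
    """Hypothetical stand-in for the owning context with a side table."""

    def __init__(self):
        # Only lambdas that require a device mangling number get an entry.
        self._device_lambda_numbers = {}

    def set_device_lambda_number(self, decl, number):
        self._device_lambda_numbers[decl] = number

    def get_device_lambda_number(self, decl):
        # None means no number was recorded for this lambda.
        return self._device_lambda_numbers.get(decl)


if __name__ == "__main__":
    ctx = Context()
    needs_number = LambdaDecl("device_lambda")
    plain = LambdaDecl("host_lambda")
    ctx.set_device_lambda_number(needs_number, 3)
    print(ctx.get_device_lambda_number(needs_number))
    print(ctx.get_device_lambda_number(plain))
```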
Rebase to the main branch.
Jan 25 2021
PING for review.
Rebase
Jan 20 2021
Fix typo.
Jan 7 2021
Forgot that C functions can be overloaded in Clang with the overloadable
extension. With that, we don't need to mark functions from <ymath.h> as HD.
Instead, we can provide their device-side implementations directly.
Jan 6 2021
Only mark HD attributes in ymath.h wrapper header when compiled with MSVC.
Revise following reviewers' comments.
Jan 5 2021
PING
PING
Dec 22 2020
Dec 21 2020
Fix the cmake to distribute that header wrapper.
These functions are pure C functions.
Fix license.
Fix typo again.
Fix typo.
Beyond enabling compilation with <complex> on Windows, I really have concerns about the current approach to supporting <complex> in device compilation. Device compilation should not rely on the host STL implementation. That results in inconsistent compilation results across platforms, especially Linux vs. Windows.
BTW, code using <complex> in CUDA cannot be compiled with NVCC directly, even with --expt-relaxed-constexpr, cf. https://godbolt.org/z/3f79co
Dec 19 2020
Dec 14 2020
Dec 13 2020
The build is broken due to the missing file.