User Details
- User Since
- Aug 7 2014, 12:01 PM (337 w, 2 d)
Wed, Jan 20
Fix typo.
Thu, Jan 7
Forgot that C functions can be overloaded in Clang with the overloadable
extension. With that, we don't need to mark functions from <ymath.h> as HD.
Instead, we can provide their device-side implementations directly.
Wed, Jan 6
Only mark HD attributes in ymath.h wrapper header when compiled with MSVC.
Revise following reviewers' comments.
Tue, Jan 5
PING
PING
Dec 22 2020
Dec 21 2020
Fix the cmake to distribute that header wrapper.
These functions are pure C functions.
Fix license.
Fix typo again.
Fix typo.
Beyond enabling compilation with <complex> on Windows, I have a real concern about the current approach to supporting <complex> in the device compilation. The device compilation should not rely on the host STL implementation. That results in inconsistent compilation results across platforms, especially Linux vs. Windows.
BTW, code using <complex> in CUDA cannot be compiled with NVCC directly, even with --expt-relaxed-constexpr, cf. https://godbolt.org/z/3f79co
Dec 19 2020
Dec 14 2020
Dec 13 2020
The build is broken due to the missing file.
Dec 12 2020
Dec 10 2020
LGTM if you revise the test based on Sam's suggestion on the test case.
Dec 9 2020
Dec 8 2020
LGTM if there's a regression test available.
Dec 4 2020
Dec 2 2020
Dec 1 2020
Even though there's no functional change, the original one breaks the kernel extraction script, which is designed to find the .hip_fatbin section. That internal tool is still required to extract kernels from objects generated by RDC linking.
Nov 22 2020
Nov 21 2020
It turns out that the simplest way is to skip generating the alloca once that byval argument is readonly. Since readonly is attributed when there's no write to that argument, it's safe to just cast that pointer to the parameter space if it has readonly. Basically, that argument lowering pass does something similar to D91590 but applies it in the backend instead. I verified that, for that simple test CUDA code, it generates the same SASS.
Nov 19 2020
Do you have permission to commit?
Nov 18 2020
As mentioned earlier, that's very experimental support. Even though the SASS looks reasonable, it still needs verifying on real systems. For non-kernel functions, it seems we share the path, so we should do a similar thing. The current approach fixes that in the codegen phase by adding back the alloca to match the parameter space semantics. Once that alloca is dynamically indexed, it won't be promoted in SROA. Only instcombine eliminates that alloca, and only when it is modified once by copying from constant memory. So as not to break certain patterns prepared in codegen preparation, instcombine doesn't run in the backend, so that dynamically indexed alloca won't be removed.
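A minimal sketch of the pattern discussed here: an aggregate argument passed by value that is only read and is indexed dynamically. The struct and function names are hypothetical, and it's written as a plain function so it also builds with a host compiler. In CUDA, the function would be a `__global__` kernel.

```cpp
// Hypothetical aggregate, standing in for a kernel argument passed by value.
struct Coeffs {
  float v[8];
};

// `c` is never written, so the argument qualifies as read-only.
// `idx` is not known at compile time, so the access below is a dynamic
// index into the by-value aggregate.
float pick(Coeffs c, unsigned idx) {
  return c.v[idx % 8];
}
```

With a compile-time constant index, the local copy can usually be split into scalars. It's the dynamic index that keeps the copy alive as an alloca.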
Nov 17 2020
BTW, please add a test case with that def in a back-edge with an acyclic dep.
Using post-order is quite straightforward and involves only a few lines of change. Please check the attachment.
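For reference, a generic post-order walk can be sketched as follows. This is not the code in the attachment. The graph representation and the names `Graph`, `postOrderVisit`, and `postOrder` are assumptions made for illustration only.

```cpp
#include <vector>

// Hypothetical graph: succs[n] lists the successors of node n.
using Graph = std::vector<std::vector<int>>;

// Emit a node only after all of its not-yet-visited successors.
static void postOrderVisit(const Graph &succs, int node,
                           std::vector<bool> &visited,
                           std::vector<int> &order) {
  visited[node] = true;
  for (int next : succs[node])
    if (!visited[next]) // already-visited targets (e.g. via a back-edge) are skipped
      postOrderVisit(succs, next, visited, order);
  order.push_back(node);
}

// Returns the nodes reachable from `entry` in post-order.
std::vector<int> postOrder(const Graph &succs, int entry) {
  std::vector<bool> visited(succs.size(), false);
  std::vector<int> order;
  postOrderVisit(succs, entry, visited, order);
  return order;
}
```

For a graph with edges 0->1, 1->2, 2->1 (a back-edge), and 2->3, starting at node 0 yields the order 3, 2, 1, 0, so each node comes after the successors it reaches through non-back edges.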
That test passed with this traversal order change.
Nov 16 2020
This is an experimental or demo-only patch, done in my spare time, on eliminating private memory usage in https://godbolt.org/z/EPPn6h. The attachment includes both the reference and new IR, PTX, and SASS (sm_60) output. For the new code, that aggregate argument is loaded through the LDC instruction in SASS instead of MOV due to the non-static address. I don't have sm_60 hardware to verify that. Could you try that on real hardware?
Kindly ping for review.
Could you elaborate more on why we need to run that iteratively? Since the original one runs bottom-up, it should supposedly find all of them.
Nov 13 2020
Revise the interface of that target hook.
Add a dedicated test case for value reading from parameter even though most cases are already covered in the clang test.
Revise the condition check.
- Add a note in the AMDGPU usage document on the assumption made here.
- Revise the test in clang.
Nov 12 2020
Add a test case for the single element struct.
Nov 11 2020
PING for review
PING for review
Nov 10 2020
Rebase
Fix clang-tidy warnings.
Revise the fix.
Revise the commit message.
Remove aggregate kernel argument coercion only.
Now that multiple MMOs are supported in the scheduler, this patch is no longer for performance.