Thanks for implementing this. I have added an inline comment.
Oct 17 2022
Sep 21 2022
Sep 13 2022
This is a good idea. Thanks Joseph.
Aug 31 2022
Aug 20 2022
That is a good point. Let us revise that, since the effect we actually want is the sequential creation of the task graph.
I agree on 2. Any recommendations? We can move some of the logic there.
CUDA streams are FIFO queues, but it’s true that this will not work if the device queue is not FIFO. With CUDA, the case you described works even without an explicit read-after-write dependency.
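To make the FIFO argument concrete, here is a minimal host-only sketch. It does not use the CUDA API: FifoStream and copyThenKernel are hypothetical stand-ins for an in-order device queue, and the only assumption is that work runs strictly in submission order.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for an in-order device queue (not the CUDA API).
class FifoStream {
public:
  // Enqueue work; nothing runs until synchronize() is called.
  void enqueue(std::function<void()> Op) { Pending.push(std::move(Op)); }

  // Drain the queue strictly in submission order.
  void synchronize() {
    while (!Pending.empty()) {
      Pending.front()();
      Pending.pop();
    }
  }

private:
  std::queue<std::function<void()>> Pending;
};

// Enqueue a "copy" followed by a "kernel" that reads the copied buffer.
// No explicit dependency links the two operations; FIFO order alone
// guarantees that the kernel observes the copied data.
inline int copyThenKernel(const std::vector<int> &Host) {
  std::vector<int> Device;
  int Sum = 0;
  FifoStream Stream;
  Stream.enqueue([&] { Device = Host; }); // models a host-to-device copy
  Stream.enqueue([&] {                    // models a kernel reading Device
    for (int V : Device)
      Sum += V;
  });
  Stream.synchronize();
  assert(Device.size() == Host.size());
  return Sum;
}
```

If the queue were allowed to reorder the two operations, the kernel could read an empty buffer, which is the non-FIFO case raised above.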
It is only valid under the assumption of non-unified shared memory. State shared between host and device only becomes visible during data movements, so changes to host or device data are not reflected on the other side until then. Assuming there are no external runtimes, it is possible to synchronize only on data movements while preserving the data dependencies between host and device.
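A small model of that assumption, as a sketch only: MappedInt, toDevice, fromDevice, and runMappedExample are hypothetical names, and the host and device copies are plain integers rather than real device memory. The point is that with separate copies, the transfers are the only places where one side can observe the other.

```cpp
#include <cassert>

// Hypothetical model of one mapped variable under non-unified memory:
// host and device each hold a private copy.
struct MappedInt {
  int HostCopy = 0;
  int DeviceCopy = 0;
  int SyncCount = 0; // number of synchronization points taken

  // A data movement is the only point where the other side sees a change,
  // so it is also the only point where we synchronize.
  void toDevice() {
    ++SyncCount;
    DeviceCopy = HostCopy;
  }
  void fromDevice() {
    ++SyncCount;
    HostCopy = DeviceCopy;
  }
};

struct ExampleResult {
  int HostViewBeforeCopyBack;
  int HostViewAfterCopyBack;
  int SyncCount;
};

inline ExampleResult runMappedExample() {
  MappedInt X;
  X.HostCopy = 1;
  X.toDevice();       // the host write becomes visible on the device
  X.DeviceCopy += 41; // models a first kernel
  X.DeviceCopy *= 2;  // models a second kernel; no host sync in between
  int Before = X.HostCopy; // the host still sees its own stale copy
  X.fromDevice();     // the device writes become visible on the host
  assert(X.SyncCount == 2);
  return {Before, X.HostCopy, X.SyncCount};
}
```

With unified shared memory, or with an external runtime touching the same data, the host could observe the intermediate device values, and synchronizing only on transfers would no longer be enough.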
Jun 29 2022
Rebasing to master
Jun 22 2022
I'm not sure what to do about the buildbot errors I am getting. I can access them, but there is no information that tells me what is going on.
Fixing var name to coding standard
Changing omp_get_num_devices() for omp_get_initial_device()
Ah! I missed that. Thanks
Jun 9 2022
Fixing the warning issue on the switch statement.
Changes are fine. I am not familiar with the progress in C++20 and 23, but I trust your judgement here.
Ah! Sorry about this. Quick fix. Working on it.
Jun 8 2022
Thanks Jon for all the comments.
Jun 7 2022
Cleaning up code to remove unused enums. Removed the commented-out code with cache_iteration and all related functions.
Jun 3 2022
I missed those. I will check all of them carefully.
Jun 2 2022
I have added the HSA_ISA_INFO_NAME, made the function static, and removed
the unused elements in the enums.
@saiislam I will look into that.
Jun 1 2022
Oct 21 2021
Sep 19 2021
Thanks for adding this Michael.
I saw you are adding the tests that are expected to pass in the cmake but made changes to allow expected fails in LIT. Wouldn’t it make sense to have expected failures in cmake as well? It seems more intuitive to me to list what is not expected to work. If new tests are added and they pass, then that’s fine. It is the newly added tests that do not pass that should trigger a warning.
Other than that it looks good.
Jul 30 2021
Jul 29 2021
Jul 27 2021
Sorry for the delay. Working on this
Removing branch dependency
Rebasing to main this time for real
Rebase to main
Rebase to main
Sync to main
Fixing @tianshilei1992 comments.
Jul 26 2021
Changing name and adding cstdio instead of iostream. Adding license headers. Other minor changes
Updating minor comments. Major re-design of a less verbose solution will be added later
Rebasing to main
Rebasing to main
Fixing final comments from @jdoerfert
Jul 25 2021
I thought about the -omp- in the name too. But I remembered that the ACC folks wanted to use the same runtime. I like both llvm-omp-device-info and llvm-omp-deviceinfo. Or we could drop the llvm- prefix, as the mlir- tools do.
My test is still failing, but it fails on an assertion on the changeToSPMD method:
Jul 24 2021
Not obvious to me that the functionality has much to do with the plugin. Could do a standalone tool instead?
I think there's a tool called nvidia-smi that does something similar. There's definitely one called rocminfo that does. The latter prints 'human readable' output, which gets in the way of scripting with it.
Jul 23 2021
- Created a single function, foldKernelFnAttribute, that receives a string instead of having one function per attribute.
- Removed the pessimistic fixpoint if no kernel is found.
- Added a check for the ReachingKernelEntries valid state.
- Removed unnecessary comments.
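As a rough illustration of the first item, the sketch below shows one lookup keyed by the attribute name replacing a family of per-attribute helpers. This is not the code from the patch: KernelInfo, the signature of foldKernelFnAttribute, and the attribute names used in any call are all hypothetical, and the real function operates on LLVM IR rather than on a map.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a kernel and its integer-valued attributes;
// this is not llvm::Function.
struct KernelInfo {
  std::map<std::string, int> Attrs;
};

// One function parameterized by the attribute name. Returns true and
// writes the value to Out if the kernel carries the attribute; returns
// false and leaves Out untouched otherwise.
inline bool foldKernelFnAttribute(const KernelInfo &Kernel,
                                  const std::string &Attr, int &Out) {
  assert(!Attr.empty() && "attribute name must not be empty");
  auto It = Kernel.Attrs.find(Attr);
  if (It == Kernel.Attrs.end())
    return false;
  Out = It->second;
  return true;
}
```

Adding support for another attribute then means passing a different string rather than writing another near-identical function.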
Jul 22 2021
Oops, missed one comment from Shilei.
Modifying to adapt to most of the comments already in here. Will provide more detail soon. Tests are still failing.
Merging upstream main
Jul 20 2021
Forgot to run git-clang-format
I've fixed all the clang tests. It is also not possible to provide a default
for omp target because the code relies on a nullptr being returned for generating
the right runtime call. Therefore I reverted that change and use -1 to flag
this case. I've also moved some elements to the emit function that were in the
Jul 19 2021
Adding test file /clang/test/OpenMP/target_num_teams_num_threads_attributes.cpp
Making the default num teams for omp target be 1. Also fixing clang-tidy error and missing initialization.
Changing the attribute names to those suggested by @jdoerfert
Jul 15 2021
Just a comment before reviewing the patch: please don't rebase the patch as D105787 has been reverted.
Fixing syntax, removing unnecessary code. Changing a break
to an indicatePessimisticFixpoint() call
Jul 14 2021
The test is breaking the compiler. I need to change it to see if it passes or not.
Jul 9 2021
We used this kind of codegen initially but later found out that it causes a large overhead when gathering pointers into a record. What about a hybrid scheme where the first args are passed as arguments and the others (if any) are gathered into a record?
I'm confused, maybe I misunderstand the problem. The parallel function arguments need to go from the main thread to the workers somehow, I don't see how this is done w/o a record. This patch makes it explicit though.
Pass it in a record for workers only? And use a hybrid scheme for all other parallel regions.
I still do not follow. What does it mean for workers only? What is a hybrid scheme? And, probably most importantly, how would we not eventually put everything into a record anyway?
On the host you don’t need to put everything into a record, especially for small parallel regions. Pass the first few args in registers and gather only the remaining args into the record. For workers, just pass all args in the record.
Could you please respond to my question so we make progress here. We *always* have to pass things in a record, do you agree?
On the GPU device, yes. And I'm absolutely fine with packing args for the GPU device. But the patch packs the args not only for the GPU devices but also for the host and other devices, which may not require packing/unpacking. For such devices and the host, it is better to avoid packing/unpacking, as it introduces overhead in many cases.
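For readers following the thread, the two schemes under discussion can be sketched as below. This is an illustration only, not the codegen from the patch: PackedCall, packHybrid, packAll, and the MaxDirectArgs threshold are hypothetical, and real codegen would pass the direct arguments in registers rather than in an array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical threshold: how many arguments are passed directly.
constexpr std::size_t MaxDirectArgs = 2;

// Hypothetical description of how the arguments of one parallel region
// reach the outlined function.
struct PackedCall {
  void *Direct[MaxDirectArgs] = {}; // passed as ordinary arguments
  std::vector<void *> Record;       // gathered into a record
};

// Hybrid scheme: the first MaxDirectArgs pointers are passed directly and
// only the overflow is gathered into the record.
inline PackedCall packHybrid(const std::vector<void *> &Args) {
  PackedCall Call;
  for (std::size_t I = 0; I < Args.size(); ++I) {
    if (I < MaxDirectArgs)
      Call.Direct[I] = Args[I];
    else
      Call.Record.push_back(Args[I]);
  }
  assert(Call.Record.size() + std::min(Args.size(), MaxDirectArgs) ==
         Args.size());
  return Call;
}

// Record-only scheme: every pointer goes through the record, as is needed
// when handing arguments from the main thread to workers on the GPU.
inline PackedCall packAll(const std::vector<void *> &Args) {
  PackedCall Call;
  Call.Record = Args;
  return Call;
}
```

Under the hybrid scheme, a small parallel region with at most MaxDirectArgs arguments never touches the record at all, which is the overhead the host-side objection is about.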
Mar 19 2021
Dec 22 2020
Modifying 3 more tests to reflect changes in this patch
Dec 8 2020
I'm working on the other tests right now.
Removing globalized record for parallel regions