The omp::ParallelOp is translated to a callback that is invoked for each thread. It used varargs, but parameter passing does not work properly with SSE(UP) parameter types. Hence the need to capture the parameters in an alloca-ed struct and pass that to the callback.
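The pattern can be illustrated outside of LLVM with a minimal C sketch. The names `capture_t`, `body_fn`, and `fork_call` are invented for illustration and are not the real OpenMP runtime API: instead of forwarding each captured value through a varargs list, the caller packs them into a stack-allocated struct and hands the callback a single pointer.

```c
#include <assert.h>

/* Hypothetical capture struct: one field per value that is
 * defined outside the parallel region but used inside it. */
typedef struct {
    int a;
    double b;
} capture_t;

/* Per-thread callback: receives the capture struct instead of a
 * varargs list, so no varargs/SSE(UP) ABI issues can arise. */
static double body_fn(void *arg) {
    capture_t *cap = (capture_t *)arg;
    return cap->a + cap->b; /* uses the reloaded values */
}

/* Stand-in for the runtime's fork call: invokes the callback
 * once per "thread" with the same capture pointer. */
static double fork_call(double (*fn)(void *), void *arg, int nthreads) {
    double last = 0.0;
    for (int i = 0; i < nthreads; ++i)
        last = fn(arg);
    return last;
}
```

The callback signature stays fixed no matter how many values are captured, which is what removes the per-target limit on the number of forwarded arguments.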
Failing test (90 ms, x64 windows): LLVM.CodeGen/XCore::threads.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\llc.exe -march=xcore < C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\CodeGen\XCore\threads.ll | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\CodeGen\XCore\threads.ll
Prefer CHECK over CHECK-NEXT unless the semantics of the IR change when two operations are not adjacent. The need to change another, unrelated test in this patch is a good illustration why :)
Addressed some CR feedback on this diff.
This pass operates on LLVM Dialect IR. I tried to implement it as a function pass from the very beginning, but a function pass doesn't take an LLVMFuncOp (maybe I am missing something, though). Thanks!
I ran git clang-format origin/master. I thought it should have fixed it. Thanks!
I have no idea why it didn't get caught.
Just finished significant testing of this with our AI codegen. The fix in https://reviews.llvm.org/D92189 shows some issues with the number of arguments passed as varargs. The number of varargs at which problems start to appear differs across hardware (it seems different for SPIR-V, Vulkan, and MultiCore). I will update this PR.
Implemented the closure approach in the OMPIRBuilder only.
The implementation is fully encapsulated in the OMPIRBuilder (no changes outside of this file).
It also addresses the limitation of the current implementation, which could not handle more than 15 upward-defined values passed to the parallel region as varargs.
I think the overall approach is good for solving the problem with the 16-argument maximum. Is this based on current master? I left remarks w.r.t. style and other things below; I will need to go over the logic again after those are addressed.
Nit: use SmallVectorImpl without the size. No const on the pointers (* const); it doesn't mean much anyway.
If it has to be an instruction, use cast. If it might not be, simply get the current insertion point from the builder.
This should make some of the code introduced by D92189 obsolete, right?
Use range-based loops above wherever possible. Please use LLVM naming style for variables, so first letter capitalized. Also no llvm:: prefix.
Addressed CR feedback.
Using the Impl now.
Using the current position from the Builder (moving one forward, so it is not end()).
Yes! Most of it. Now the data is just wrapped into a capture struct before calling the synthetic ..._fork function.
You are right ... That was one of the changes I wanted to make to the Diff. It now iterates over the parallel region blocks only.
Fixed a failing Windows test. The issue is that the order of the operands for an add operation has changed. I can't see how these changes could cause the issue, but it is a failing test that blocks landing this Diff.
Thanks for continuing to work on this.
Change the commit message, as it is working for non-integer types now, just not for "too many" arguments. Some more comments below.
Wasn't that a spelling error before? And one \param is enough ;)
I believe we need to keep the Inner parameter here. The reasons can be found in the discussion of D92476. Short story: The callback might have the original value in a map with information attached. If we pass in only the "reloaded" value the callback cannot determine what the original was.
How do we know that -- is valid here? Couldn't Loc point to the begin of a function? If possible, let's just use Loc.IP.
Thanks! I have a couple of comments, but I will defer to @jdoerfert for approval in any case.
Nit: it looks like this file uses IP rather than InsPoint for names related to insertion points
Nit: I think https://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop applies to .size() the same way it applies to .end()
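In C terms, the cited guideline amounts to evaluating the loop bound once and caching it, rather than re-evaluating it on every iteration (a minimal illustrative sketch, not code from the patch):

```c
#include <stddef.h>

/* Sum the elements, evaluating the length expression only once.
 * Caching the bound in 'e' mirrors the LLVM guideline of not
 * calling .size()/.end() on every trip through the loop. */
static int sum(const int *v, size_t n) {
    int total = 0;
    for (size_t i = 0, e = n; i != e; ++i) /* bound cached in 'e' */
        total += v[i];
    return total;
}
```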
if (CapturedValues.empty()) return;
Nit: trailing dot
Nit: please reserve before pushing back in a loop
Nit: Builder.restoreIP(CaptureAllocaInsPoint) looks shorter
I suppose you may want to have alloca inserted in a block (function entry) different from the one where you store into the memory. You need to store just before calling the fork function (or, at least, so that the store postdominates all stored values). Looking at the function API, I would have assumed CaptureAllocaInsPoint to be an insertion point at the function entry block specifically for allocas, where these insertvalues are invalid.
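The placement being asked for corresponds, in a C-level sketch, to reserving the capture storage at the top of the function while filling it in only right before the fork call. The names `capture_t`, `body_fn`, and `outlined_caller` are illustrative, not the real API:

```c
typedef struct { int a; int b; } capture_t;

static int body_fn(void *arg) {
    capture_t *cap = (capture_t *)arg;
    return cap->a * cap->b;
}

static int outlined_caller(int x) {
    /* "alloca" at function entry: storage is reserved up front... */
    capture_t cap;

    int a = x + 1;  /* captured values defined later in the function */
    int b = x * 2;

    /* ...but the stores happen just before the fork call, so they
     * postdominate the definitions of all the stored values. */
    cap.a = a;
    cap.b = b;
    return body_fn(&cap); /* stand-in for the fork call */
}
```

Splitting the two insertion points this way keeps all allocas in the entry block (where LLVM expects static allocas) without constraining where the captured values are computed.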
Can we rather take each captured value and enumerate its uses, replacing those within the parallel block set?
These look irrelevant to the patch, but seem to fix a breakage upstream. Would you mind committing this separately?
Changes to MLIR are no longer necessary
No need to store this value anymore. I now use InsertBB->getTerminator(), which guarantees the alloca and stores are emitted just before the fork call (they were before that call anyway, since the ThreadID call was generated last), so even if more codegen is introduced in the future the logic handles it.
There is always an instruction before: the ThreadID was always generated, and that is what -- points to. Changed it to use InsertBB->getTerminator(); it is much more robust this way. Even if the codegen changes, the alloca, insert, and stores will always be generated just before the forkCall.
Refactored. No need to store the InsertPoint.
Now it is guaranteed that the codegen of the alloca, insert, and stores happens just before the forkCall, even if the codegen changes in the future. It was already the case before, because the code was generated after the call that gets the ThreadID (which was just before the fork).
That was the first implementation I had. The issue was that uses() was not returning all the uses (particularly the ones introduced by the loop unroller; I spent a bunch of time debugging it). Iterating over all the instruction operands of the parallel regions just works.
Yes. This just exposes the original issue I had. I thought it would be useful to have a test that verifies the underlying functionality works for MLIR.
The alloca needs to go in the OuterAllocaIP passed in by the caller of CreateParallel.
I'm not too happy with this insert/extract value scheme. Without further optimization (-O0) this might not be lowered properly. Why don't we create a GEP and load/store to the appropriate location instead?
Instead of doing this, unpack/load the location in the PrivHelper like we did before. Also, pass the loaded value as Inner to the PrivCB so that the callback has both the original value V and the reload Inner.
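The suggested callback shape can be sketched in C. `priv_cb` and `priv_result_t` are hypothetical names standing in for PrivCB; the point is that the callback receives both the original value V (usable as a key into side tables) and the value reloaded from the capture struct (Inner):

```c
/* Sketch of a privatization callback receiving both the original
 * value and its reload from the capture struct. Keeping both lets
 * the callback look up metadata attached to the original while
 * operating on the reloaded value. */
typedef struct {
    const void *original; /* V: still usable as a map key */
    int         reloaded; /* Inner: the value loaded back out */
} priv_result_t;

static priv_result_t priv_cb(const void *V, int Inner) {
    priv_result_t r;
    r.original = V;
    r.reloaded = Inner;
    return r;
}
```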