
[OPENMP]Fix PR48851: the locals are not globalized in SPMD mode.
AbandonedPublic

Authored by ABataev on Mar 26 2021, 11:28 AM.

Details

Summary

Need to perform general analysis on SPMD kernels to correctly identify
the variables that should be globalized because they escape their
declaration context.

Diff Detail

Event Timeline

ABataev created this revision. Mar 26 2021, 11:28 AM
ABataev requested review of this revision. Mar 26 2021, 11:28 AM
Herald added a project: Restricted Project. Mar 26 2021, 11:28 AM
Herald added a subscriber: sstefan1.

Can we please always do the globalization? Even in the target teams distribute parallel for case you need it if a thread shares the address of a local variable with the team and another thread uses it.
There is no argument other than "doesn't escape" that Clang can make to disprove that globalization is needed, IMHO.

Could you give a small example so I could better understand the problem? Shall we globalize the variable in SPMD mode if we pass it by reference/take its address in any case?

jdoerfert added a comment. Edited Mar 26 2021, 8:30 PM

Can we please always do the globalization, even in the target teams distribute parallel for case you need it if a thread shares the address of a local variable with the team and another thread uses it.

Could you give a small example so I could better understand the problem?

I didn't find my old example; this should do though:
https://godbolt.org/z/b7axxzxEf

On the host or host offload I see:
Mine: 0, Other: 42

On a GPU I see:
CUDA error: Error when synchronizing stream. stream = 0x000000004294db40, async info ptr = 0x00007fffdd939838
CUDA error: an illegal memory access was encountered

Shall we globalize the variable in SPMD mode if we pass it by reference/take address in any case?

Yes. I think that is strictly speaking necessary. We should commit it together with the patches that "undo" globalization though.

There is no argument other than "doesn't escape" that Clang can make to disprove globalization is needed, IMHO.

It would be better to implement this in a separate patch. Let's fix the bug first and then implement the common functionality for globalizing locals in SPMD mode (probably controlled by a compiler option/flag).

In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do not see how this helps SPMD mode with team privatization of declarations in-between target teams and parallel regions.

Did you try the reproducer with the patch applied?

Yes, I still saw the test fail, although it was not with the latest llvm-project. Are you saying the reproducer passes for you?

I don't have CUDA installed, but from what I see in the LLVM IR it should pass. Do you have a debug log? Does it crash or produce incorrect results?

This is on an AMDGPU but I assume the behavior would be similar for NVPTX.

It produces incorrect/incomplete results in the dist[0] index after a manual reduction, and in turn the final global gpu_results array is incorrect.
When thread 0 does the reduction into dist[0], it has no knowledge of dist[1] having been updated by thread 1, which tells me the array is still thread-private.
Adding some printfs and looking at one team's output:

SPMD

Thread 0: dist[0]: 1
Thread 0: dist[1]: 0  // This should be 1
After reduction into dist[0]: 1  // This should be 2
gpu_results = [1,1]  // [2,2] expected

Generic Mode:

Thread 0: dist[0]: 1
Thread 0: dist[1]: 1   
After reduction into dist[0]: 2
gpu_results = [2,2]

Hmm, I would expect a crash if the array were allocated in local memory. Could you add some more printfs (with the data and addresses of the array) to check the results? Maybe there is a data race somewhere in the code?

As a reminder, each thread updates a unique index in the dist array and each team updates a unique index in gpu_results.

SPMD - shows each thread has a unique address for the dist array

Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc

Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4

Team 0 Thread 0: After reduction into dist[0]: 1
Team 0 Thread 0: gpu_results address: 0x7f92a5000000
--------------------------------------------------
Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c

Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184

Team 1 Thread 0: After reduction into dist[0]: 1
Team 1 Thread 0: gpu_results address: 0x7f92a5000000

gpu_results[0]: 1
gpu_results[1]: 1

Generic - shows each team shares the dist array address amongst its threads

Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
Team 0 Thread 1: dist[1]: 1, 0x7fac01938884

Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
Team 0 Thread 0: dist[1]: 1, 0x7fac01938884

Team 0 Thread 0: After reduction into dist[0]: 2
Team 0 Thread 0: gpu_results address: 0x7fabc5000000
--------------------------------------------------
Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14

Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14

Team 1 Thread 0: After reduction into dist[0]: 2
Team 1 Thread 0: gpu_results address: 0x7fabc5000000

Could you check if it works with the -fno-openmp-cuda-parallel-target-regions option?

Unfortunately that crashes:
llvm-project/llvm/lib/IR/Instructions.cpp:495: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

Hmm, could you provide a full stack trace?

At this point I am not sure I want to dig into that crash, as our llvm branch is not caught up to trunk.

I did build trunk and ran some tests on an sm_70:
- Without this patch: code fails with incomplete results
- Without this patch and with -fno-openmp-cuda-parallel-target-regions: code fails with incomplete results

- With this patch: code fails with incomplete results (thread-private array)
Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac

Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4

Team 0 Thread 0: After reduction into dist[0]: 1
Team 0 Thread 0: gpu_results address: 0x7c1ebc800000

Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc

Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4

Team 1 Thread 0: After reduction into dist[0]: 1
Team 1 Thread 0: gpu_results address: 0x7c1ebc800000

gpu_results[0]: 1
gpu_results[1]: 1
FAIL

- With this patch and with -fno-openmp-cuda-parallel-target-regions: Pass
Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c

Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c

Team 0 Thread 0: After reduction into dist[0]: 2
Team 0 Thread 0: gpu_results address: 0x7a5afc800000

Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c

Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c

Team 1 Thread 0: After reduction into dist[0]: 2
Team 1 Thread 0: gpu_results address: 0x7a5afc800000

gpu_results[0]: 2
gpu_results[1]: 2
PASS

I am concerned about team 0 and team 1 having the same address for the dist array here.

It is caused by a problem with the runtime. It should work with the -fno-openmp-cuda-parallel-target-regions option, I think (it uses a different runtime function for this case), and I just wanted to check that it really works. It looks like the runtime currently allocates a unique array for each thread.

In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do not see how this helps SPMD mode with team privatization of declarations in-between target teams and parallel regions.

Diв you try the reproducer with the applied patch?

Yes, I still saw the test fail, although it was not with latest llvm-project. Are you saying the reproducer passes for you?

I don't have CUDA installed but from what I see in the LLVM IR it shall pass. Do you have a debug log, does it crashes or produces incorrect results?

This is on an AMDGPU but I assume the behavior would be similar for NVPTX.

It produces incorrect/incomplete results in the dist[0] index after a manual reduction, and in turn the final global gpu_results array is incorrect.
When thread 0 does a reduction into dist[0], it has no knowledge of dist[1] having been updated by thread 1, which tells me the array is still thread-private.
Adding some printfs and looking at one team's output:

SPMD

Thread 0: dist[0]: 1
Thread 0: dist[1]: 0  // This should be 1
After reduction into dist[0]: 1  // This should be 2
gpu_results = [1,1]  // [2,2] expected

Generic Mode:

Thread 0: dist[0]: 1
Thread 0: dist[1]: 1   
After reduction into dist[0]: 2
gpu_results = [2,2]

Hmm, I would expect a crash if the array were allocated in local memory. Could you add some more printfs (with the data and addresses of the array) to check the results? Maybe there is a data race somewhere in the code?

As a reminder, each thread updates a unique index in the dist array and each team updates a unique index in gpu_results.

SPMD - shows each thread has a unique address for dist array

Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc

Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4

Team 0 Thread 0: After reduction into dist[0]: 1
Team 0 Thread 0: gpu_results address: 0x7f92a5000000
--------------------------------------------------
Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c

Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184

Team 1 Thread 0: After reduction into dist[0]: 1
Team 1 Thread 0: gpu_results address: 0x7f92a5000000

gpu_results[0]: 1
gpu_results[1]: 1

Generic - shows each team shares dist array address amongst threads

Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
Team 0 Thread 1: dist[1]: 1, 0x7fac01938884

Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
Team 0 Thread 0: dist[1]: 1, 0x7fac01938884

Team 0 Thread 0: After reduction into dist[0]: 2
Team 0 Thread 0: gpu_results address: 0x7fabc5000000
--------------------------------------------------
Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14

Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14

Team 1 Thread 0: After reduction into dist[0]: 2
Team 1 Thread 0: gpu_results address: 0x7fabc5000000

Could you check if it works with the -fno-openmp-cuda-parallel-target-regions option?

Unfortunately that crashes:
llvm-project/llvm/lib/IR/Instructions.cpp:495: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

Hmm, could you provide a full stack trace?

At this point I am not sure I want to dig into that crash, as our llvm branch is not caught up to trunk.

I did build trunk and ran some tests on an sm_70:
-Without this patch: the code fails with incomplete results.
-Without this patch and with -fno-openmp-cuda-parallel-target-regions: the code fails with incomplete results.

-With this patch: code fails with incomplete results (thread private array)
Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac

Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4

Team 0 Thread 0: After reduction into dist[0]: 1
Team 0 Thread 0: gpu_results address: 0x7c1ebc800000

Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc

Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4

Team 1 Thread 0: After reduction into dist[0]: 1
Team 1 Thread 0: gpu_results address: 0x7c1ebc800000

gpu_results[0]: 1
gpu_results[1]: 1
FAIL

-With this patch and with -fno-openmp-cuda-parallel-target-regions: Pass
Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c

Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c

Team 0 Thread 0: After reduction into dist[0]: 2
Team 0 Thread 0: gpu_results address: 0x7a5afc800000

Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c

Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c

Team 1 Thread 0: After reduction into dist[0]: 2
Team 1 Thread 0: gpu_results address: 0x7a5afc800000

gpu_results[0]: 2
gpu_results[1]: 2
PASS

I am concerned about team 0 and team 1 having the same address for the dist array here.

It is caused by a problem in the runtime. It should work with the -fno-openmp-cuda-parallel-target-regions option (I think), since it uses a different runtime function for this case; I just want to check that it really works. It looks like the runtime currently allocates a unique array for each thread.

Unfortunately, this patch + the flag does not work for the larger reproducer, the CPU check passes but GPU check fails with incorrect results.
https://github.com/zjin-lcf/oneAPI-DirectProgramming/blob/master/all-pairs-distance-omp/main.cpp


Ok, I'll drop this patch and prepare a more conservative one, where the kernel will be executed in non-SPMD mode instead. Later it can be improved by LLVM optimizations.

ABataev abandoned this revision.Wed, May 5, 7:35 AM