This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Fix mapping/prelock.cpp test
Needs Revision · Public

Authored by kevinsala on Jan 24 2023, 3:39 PM.

Details

Summary

Do not map a device-accessible pointer that cannot be accessed by the host; use is_device_ptr instead. Re-enable the test for the NextGen plugins.
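
A minimal sketch of the pattern the summary describes, assuming the llvm_omp_target_lock_mem/llvm_omp_target_unlock_mem entry points discussed later in this review (details of the actual prelock.cpp test may differ, and whether is_device_ptr is the right clause is debated further down):

#include <cstddef>
#include <cstdlib>

extern "C" {
void *llvm_omp_target_lock_mem(void *ptr, size_t size, int device_num);
void llvm_omp_target_unlock_mem(void *ptr, int device_num);
}

int main() {
  const int N = 1000;
  int *HostPtr = (int *)std::malloc(N * sizeof(int));

  // The locked pointer is device accessible but may not be host
  // accessible, so it must not appear in a map clause.
  int *LockedPtr = (int *)llvm_omp_target_lock_mem(HostPtr, N * sizeof(int), 0);

  // Pass it through is_device_ptr instead of mapping it.
#pragma omp target is_device_ptr(LockedPtr)
  for (int i = 0; i < N; ++i)
    LockedPtr[i] = i;

  llvm_omp_target_unlock_mem(HostPtr, 0);
  std::free(HostPtr);
  return 0;
}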

Diff Detail

Event Timeline

kevinsala created this revision. Jan 24 2023, 3:39 PM
Herald added a project: Restricted Project. Jan 24 2023, 3:39 PM
kevinsala requested review of this revision. Jan 24 2023, 3:39 PM
jdoerfert requested changes to this revision. Jan 24 2023, 3:49 PM

Now that I see it, I doubt my suggestion. I'll try to come up with some consensus on the API in 3 weeks. We don't really know what we want.

This revision now requires changes to proceed. Jan 24 2023, 3:49 PM

carlo.bertolli added a comment.

This test works correctly with the old plugin. A locked pointer is not a device pointer: it is a host pointer that has been pinned in memory and made accessible by ROCr to the device agent that will be assigned the data transfers implementing the map clause. In abstract programming terms, the second capability is AMD-specific and does not need a corresponding functionality on other targets.

I am guessing this is a problem only in the NextGen plugin, which needs fixing. Leads for potential issues (see the sketch below):

  • Is the relevant device agent passed to hsa_amd_memory_async_copy?
  • Is the device agent passed to the hsa_amd_memory_lock function?

You could also have a look at my patch for the old plugin as another potential lead:
https://reviews.llvm.org/D139208
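
For reference, a hedged sketch of where the device agent enters the two calls named above (signatures from hsa_ext_amd.h; error handling, agent discovery, and signal setup elided; function names are hypothetical):

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Pin a host buffer and grant the given GPU agent access to it. The
// pointer returned through AgentPtr is what the agent should use.
void *lockForAgent(void *HostPtr, size_t Size, hsa_agent_t GPUAgent) {
  void *AgentPtr = nullptr;
  // If the GPU agent is not passed here, the copy below may fail.
  hsa_amd_memory_lock(HostPtr, Size, &GPUAgent, /*num_agent=*/1, &AgentPtr);
  return AgentPtr;
}

// Copy the pinned buffer to device memory; the relevant device agent
// must be passed as the destination agent of the async copy.
void copyToDevice(void *DevPtr, hsa_agent_t GPUAgent, void *AgentPtr,
                  size_t Size, hsa_agent_t HostAgent,
                  hsa_signal_t CompletionSignal) {
  hsa_amd_memory_async_copy(DevPtr, GPUAgent, AgentPtr, HostAgent, Size,
                            /*num_dep_signals=*/0, /*dep_signals=*/nullptr,
                            CompletionSignal);
}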

kevinsala added a comment. Edited Jan 25 2023, 12:56 AM

@carlo.bertolli Just tested the example below with the old plugins and it also segfaults:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <omp.h>

// Prototypes for the libomptarget locking entry points.
void *llvm_omp_target_lock_mem(void *ptr, size_t size, int device_num);
void llvm_omp_target_unlock_mem(void *ptr, int device_num);

int main(int argc, char **argv) {
    const size_t size = 1000*sizeof(int);

    int *ptr1 = (int *) malloc(size);
    assert(ptr1 != NULL);

    memset(ptr1, 0, size);

    // Lock (pin) the host buffer; ptr2 receives the agent-accessible pointer.
    int *ptr2 = (int *) llvm_omp_target_lock_mem(ptr1, size, 0);
    assert(ptr2 != NULL);

    fprintf(stdout, "ptr1: %p, ptr2: %p\n", ptr1, ptr2);
    fprintf(stdout, "ptr1[0]: %d\n", ptr1[0]);
    fprintf(stdout, "ptr2[0]: %d\n", ptr2[0]);

    return 0;
}

When running the program, the access to ptr2[0] segfaults on the host side:

ptr1: 0x55b461f00ec0, ptr2: 0x7fc5576c8ec0
ptr1[0]: 0
Segmentation fault

So I don't think ptr2 is a locked host pointer; it's the pointer returned by hsa_amd_memory_lock through the agent_ptr output parameter, and presumably the pointer that the GPU agents should use for asynchronous memory transfers involving the buffer [ptr1, ptr1 + size). I assumed the locked host pointer is ptr1 itself once hsa_amd_memory_lock has executed. However, I'm not 100% sure, since the documentation of the HSA AMD lock operation is a bit confusing.

carlo.bertolli added a comment.

Agreed: a pointer returned by the lock function should not be used on the host. It can be used in a map clause, as it will be passed to the memory copy functions along with the device agent.
This test is different from prelock.cpp, and I was not expecting it to work.

Thanks!

@kevinsala If the issue is with returning to the host thread a pointer that cannot be used on the host (except in a map clause or a target region), then perhaps we should not return the locked pointer at all. The lock function could operate in a "side effect" mode, where the user asks the runtime to lock memory and ROCr keeps track of it. Does this make sense?

kevinsala added a comment. Edited Jan 25 2023, 6:49 AM

I don't have a strong opinion on whether to return the agent pointer or not; I don't know which use cases it may serve (perhaps use with is_device_ptr?). However, I have doubts about the correctness of passing a non-host pointer to a map clause so that it is treated as part of the host data environment.

The lock function could operate in a "side effect" mode, where the user asks the runtime to lock memory and ROCr keeps track of it. Does this make sense?

That makes sense to me; it's what we are assuming in the NextGen plugins. The user should map the host pointer, the plugin detects that it is pinned memory, and we internally use the associated agent pointer for the asynchronous transfer.
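
A hedged sketch of that detection step (all names hypothetical, not the actual plugin code): the runtime records each locked range and, when a mapped host pointer falls inside one, substitutes the associated agent pointer for the asynchronous transfer.

#include <cstddef>
#include <cstdint>
#include <map>

// One pinned host range and the agent pointer ROCr returned for it.
struct LockedRange {
  void *AgentPtr;
  size_t Size;
};

// Locked host base addresses -> their ranges (hypothetical bookkeeping).
static std::map<uintptr_t, LockedRange> LockedRanges;

// If HstPtr lies inside a locked range, return the matching agent
// pointer (at the same offset); otherwise return nullptr and let the
// plugin fall back to its regular staging path.
void *getAgentPtrIfLocked(void *HstPtr) {
  uintptr_t Addr = reinterpret_cast<uintptr_t>(HstPtr);
  auto It = LockedRanges.upper_bound(Addr);
  if (It == LockedRanges.begin())
    return nullptr;
  --It; // Greatest locked base address <= Addr.
  uintptr_t Base = It->first;
  if (Addr < Base + It->second.Size)
    return static_cast<char *>(It->second.AgentPtr) + (Addr - Base);
  return nullptr;
}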

ye-luo added a comment. Edited Jan 25 2023, 7:27 AM

My understanding from the weekly call is that "locked" memory is for a device agent, not the device itself.
It can be used in HSA memory copy calls, but is_device_ptr is for device (kernel) access and is thus incorrect here.