This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Fixed a crash when offloading to x86_64 with target nowait
ClosedPublic

Authored by tianshilei1992 on Feb 23 2021, 12:45 PM.

Details

Summary

PR#49334 reports a crash when offloading to x86_64 with target nowait,
which is caused by dereferencing a nullptr. The root cause of the issue is
that, when pushing a hidden helper task in __kmp_push_task, the runtime also
maps the gtid to its shadow gtid, which is wrong.

Diff Detail

Event Timeline

tianshilei1992 requested review of this revision.Feb 23 2021, 12:45 PM
Herald added a project: Restricted Project.

I didn't include the reproducer because it does not pass due to a computation error; the same code passes on the NVPTX target.
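
For reference, the pattern being exercised is a batch of concurrent target nowait regions, roughly like the sketch below (BS/N are placeholder sizes; this is not the exact reproducer from the PR nor the test added here):

#include <cassert>

// Placeholder sizes; the values under discussion later (BS, N) differ.
constexpr int N = 1024;
constexpr int BS = 256;

int main() {
  int a[N], b[N], c[N];
  for (int i = 0; i < N; ++i) {
    a[i] = i;
    b[i] = 2 * i;
    c[i] = 0;
  }

  // Launch one asynchronous target region per block; the implicit barrier at
  // the end of the parallel region waits for the deferred target tasks.
#pragma omp parallel for
  for (int i = 0; i < N; i += BS) {
#pragma omp target map(to : a[i : BS], b[i : BS]) map(from : c[i : BS]) nowait
    for (int j = i; j < i + BS; ++j)
      c[j] = a[j] + b[j];
  }

  // Verify the offloaded computation on the host.
  for (int i = 0; i < N; ++i)
    assert(c[i] == a[i] + b[i]);
  return 0;
}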

jdoerfert accepted this revision.Feb 23 2021, 1:00 PM

LG. Backport if possible and add the test (if we don't have one).

This revision is now accepted and ready to land.Feb 23 2021, 1:00 PM

Added the test, although it is expected to fail on x86_64.

update test case

This patch fixes the segfault in __kmp_push_task when executing the code with OMP_NUM_THREADS>1.
I accidentally ran the test case from this patch with OMP_NUM_THREADS=1 (which happens to be the default on our cluster) and could not even get a stack trace after the crash.

This is another regression from the 11 release.

By including openmp/runtime/test/ompt/callback.h, I could identify that the segfault seems to occur inside the OpenMP for loop. The last OMPT event printed by the crashing thread is ompt_event_loop_begin.

If you build the OpenMP runtime with debugging symbols and include callback.h to dump the OMPT events (I just add -include .../llvm-project/openmp/runtime/test/ompt/callback.h to the compile line), you can get the OMPT thread-id with:

(gdb) print __kmp_threads[__kmp_gtid].th.ompt_thread_info.thread_data.value

This is the same ID that appears at the beginning of all callback.h output from that thread.
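
For example, a compile line in that spirit could look like the following (illustrative only; adjust the paths, flags, and offload target for your setup):

clang++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu -g -include .../llvm-project/openmp/runtime/test/ompt/callback.h test.cpp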

Another observation:
clang/11 would execute all code in the test on the initial thread, avoiding overloading the system. The current implementation will execute an instance of the target region on each of the hidden helper threads, effectively oversubscribing the system when running on a dual-core desktop. Is this the intended behavior?

> This patch fixes the segfault in __kmp_push_task when executing the code with OMP_NUM_THREADS>1.
> I accidentally ran the test case from this patch with OMP_NUM_THREADS=1 (which happens to be the default on our cluster) and could not even get a stack trace after the crash.

I'll take a look and fix it in another patch.

> By including openmp/runtime/test/ompt/callback.h, I could identify that the segfault seems to occur inside the OpenMP for loop. The last OMPT event printed by the crashing thread is ompt_event_loop_begin.
>
> If you build the OpenMP runtime with debugging symbols and include callback.h to dump the OMPT events (I just add -include .../llvm-project/openmp/runtime/test/ompt/callback.h to the compile line), you can get the OMPT thread-id with:
>
> (gdb) print __kmp_threads[__kmp_gtid].th.ompt_thread_info.thread_data.value
>
> This is the same ID that appears at the beginning of all callback.h output from that thread.

Gotcha. Will take a look. Thanks!

> Another observation:
> clang/11 would execute all code in the test on the initial thread, avoiding overloading the system. The current implementation will execute an instance of the target region on each of the hidden helper threads, effectively oversubscribing the system when running on a dual-core desktop. Is this the intended behavior?

Yes. Most of the time those hidden helper threads are sleeping (either waiting for a job or waiting for a job to finish), so it should be fine; after all, there are already plenty of threads in the system. :-) However, I have to admit that this is probably not the case for x86_64 target offloading. In that case, reducing the number of hidden helper threads to just one could help. (There is currently a bug when setting the number to 1, but I'll fix it accordingly.)
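
(For reference, and if I remember the knobs correctly, the number of hidden helper threads can be controlled via the LIBOMP_NUM_HIDDEN_HELPER_THREADS environment variable, and the feature can be disabled entirely with LIBOMP_USE_HIDDEN_HELPER_TASK=0.)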

The test tends to fail for x86_64 offloading. When I reduce BS to 64, it also fails for nvptx offloading.

The test hangs on amdgpu offloading with these parameters; the suggestion in D102017 to use BS = 16; / N = 256; seems to replace the hang with a segfault. Valgrind says:

==392314== Conditional jump or move depends on uninitialised value(s)
==392314==    at 0x4BCE15E: __kmp_push_task(int, kmp_task*) (in /home/amd/llvm-build/llvm/runtimes/runtimes-bins/openmp/runtime/src/libomp.so)

A little while afterwards it dereferences 0x50 and faults. I'm not sure I have the time to set up a debug build and go race-chasing right now.