This is an archive of the discontinued LLVM Phabricator instance.

ye-luo retitled this revision from "[libomptarget] compile bc files with -O3" to "[libomptarget] compile DeviceRTL bc files with -O3". Jul 7 2022, 10:13 PM

ye-luo edited the summary of this revision.

ye-luo added a reviewer: tianshilei1992.

ye-luo edited the summary of this revision.

Harbormaster completed remote builds in B174311: Diff 443140. Jul 7 2022, 10:22 PM

LG, assuming this no longer breaks anything. We used to have problems with definitions getting optimized out, but that seems to be fixed. The plan is still to remove this in favor of the static library and LTO, but this should improve things until we make that change.

This revision is now accepted and ready to land. Jul 8 2022, 7:49 AM

Closed by commit rGfca79b78c49c: [libomptarget] compile DeviceRTL bc files with -O3 (authored by ye-luo). Jul 8 2022, 8:00 AM

This revision was automatically updated to reflect the committed changes.

ye-luo added a commit: rGfca79b78c49c: [libomptarget] compile DeviceRTL bc files with -O3.

@ye-luo Hi,
Could you share some benchmark results showing that turning on -O3 optimization is worthwhile?

In D129344#3649015, @domada wrote:

> @ye-luo Hi,
> Could you share some benchmark results showing that turning on -O3 optimization is worthwhile?

I don't have any graphs, but most applications will see some performance gain when using a more optimized runtime library. I've looked at XSBench, RSBench, miniQMC, and SU3Bench. Is there a reason O3 would be undesirable? It should only slightly increase the build times for LLVM, and avoiding that cost is hardly worth slower execution times.

When I compared miniQMC kernel performance with and without LTO, the difference came from the bc files (slower) being compiled with O1 while the static library used by LTO (faster) was compiled with O3. I saw about a 30% difference on a kernel I was monitoring.
To reduce the number of variants among compilation options, it is better to just use O3.
For a long time we could not switch to O3 because the backend rejected the kernel compiled with O3. That issue has been resolved, so I changed the bc compilation to O3.

@jhuber6 Your accepting comment sounded as if there is a risk associated with -O3 optimization. That's why I wanted to know whether turning on O3 is worthwhile. Thanks for the explanation.
@ye-luo Thanks for your response.

Revision Contents

Path: openmp/libomptarget/DeviceRTL/CMakeLists.txt
Size: 6 lines

Diff 443246

openmp/libomptarget/DeviceRTL/CMakeLists.txt

(123 lines not shown)
 set(src_files
   ${source_directory}/Reduction.cpp
   ${source_directory}/State.cpp
   ${source_directory}/Synchronization.cpp
   ${source_directory}/Tasking.cpp
   ${source_directory}/Utils.cpp
   ${source_directory}/Workshare.cpp
 )

-set(clang_opt_flags -O1 -mllvm -openmp-opt-disable -DSHARED_SCRATCHPAD_SIZE=512)
-set(link_opt_flags -O1 -openmp-opt-disable)
+set(clang_opt_flags -O3 -mllvm -openmp-opt-disable -DSHARED_SCRATCHPAD_SIZE=512)
+set(link_opt_flags -O3 -openmp-opt-disable)

 # Prepend -I to each list element
 set (LIBOMPTARGET_LLVM_INCLUDE_DIRS_DEVICERTL "${LIBOMPTARGET_LLVM_INCLUDE_DIRS}")
 list(TRANSFORM LIBOMPTARGET_LLVM_INCLUDE_DIRS_DEVICERTL PREPEND "-I")

 # Set flags for LLVM Bitcode compilation.
 set(bc_flags -S -x c++ -std=c++17 -fvisibility=hidden
   ${clang_opt_flags}
(98 lines not shown)
 endforeach()

 add_custom_target(omptarget.devicertl.amdgpu)
 foreach(mcpu ${amdgpu_mcpus})
   compileDeviceRTLLibrary(${mcpu} amdgpu -target amdgcn-amd-amdhsa -DLIBOMPTARGET_BC_TARGET -D__AMDGCN__ -nogpulib)
 endforeach()

 # Set the flags to build the device runtime from clang.
-set(clang_lib_flags -fopenmp -fopenmp-cuda-mode -foffload-lto -fvisibility=hidden -Xopenmp-target=nvptx64-nvidia-cuda --cuda-feature=+ptx61 -mllvm -openmp-opt-disable -nocudalib -nogpulib -nostdinc -DSHARED_SCRATCHPAD_SIZE=512 -O3)
+set(clang_lib_flags -fopenmp -fopenmp-cuda-mode -foffload-lto -fvisibility=hidden -Xopenmp-target=nvptx64-nvidia-cuda --cuda-feature=+ptx61 -nocudalib -nogpulib -nostdinc ${clang_opt_flags})
 foreach(arch ${nvptx_sm_list})
   set(clang_lib_flags ${clang_lib_flags} --offload-arch=sm_${arch})
 endforeach()
 foreach(arch ${amdgpu_mcpus})
   set(clang_lib_flags ${clang_lib_flags} --offload-arch=${arch})
 endforeach()

 # Build the static library version of the device runtime.
(40 lines not shown)
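
The change to clang_lib_flags makes the static library reuse clang_opt_flags, so the bitcode and static-library builds can no longer drift to different optimization levels. As a minimal sketch of the same idea taken one step further, the level itself could be exposed as a single configure-time knob. The variable name LIBOMPTARGET_DEVICERTL_OPT_LEVEL is hypothetical and is not part of this patch or of LLVM; the flag lists are copied from the diff.

```cmake
# Hypothetical cache variable (assumed name, not in the patch):
# a single knob for the DeviceRTL optimization level.
set(LIBOMPTARGET_DEVICERTL_OPT_LEVEL "-O3" CACHE STRING
    "Optimization level used for every DeviceRTL build variant")

# Shared flag lists, as in the diff, but derived from the single knob.
set(clang_opt_flags ${LIBOMPTARGET_DEVICERTL_OPT_LEVEL} -mllvm -openmp-opt-disable -DSHARED_SCRATCHPAD_SIZE=512)
set(link_opt_flags ${LIBOMPTARGET_DEVICERTL_OPT_LEVEL} -openmp-opt-disable)

# Report the resulting flags at configure time so a mismatch is easy to spot.
message(STATUS "DeviceRTL compile flags: ${clang_opt_flags}")
message(STATUS "DeviceRTL link flags: ${link_opt_flags}")
```

Under this assumption, configuring with -DLIBOMPTARGET_DEVICERTL_OPT_LEVEL=-O1 would move the bitcode and link steps back to O1 together, which could help when bisecting a backend problem like the one mentioned in the discussion.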
