This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Fix Slice Duplicate in Profiler
ClosedPublic

Authored by jarmusch on Aug 11 2023, 8:22 AM.

Details

Summary

Using LIBOMPTARGET_PROFILER, duplicates are created from timing both Kernel functions and Data update functions.
I commented out the duplicate timescope and left them in the targetkernel and the targetdataupdate functions. This
way the timescope calls will be closer to the launching of the kernel and the data moving.

Diff Detail

Event Timeline

jarmusch created this revision.Aug 11 2023, 8:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 11 2023, 8:22 AM
jarmusch requested review of this revision.Aug 11 2023, 8:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 11 2023, 8:22 AM
tianshilei1992 requested changes to this revision.Aug 11 2023, 10:57 AM
tianshilei1992 added a subscriber: tianshilei1992.

This doesn't look the right way to handle it. The initial design is to profile all functions in the call stack.

This revision now requires changes to proceed.Aug 11 2023, 10:57 AM
jarmusch added a comment.EditedAug 14 2023, 10:01 AM

Thanks, @tianshilei1992 for your input.

I adjusted the timescope calls to address a problem with duplicate profiling slices. You can see them on chrome://tracing. The duplication occurred when both kernel functions and all data-moving functions were timed separately. Focusing the timescope calls on 'targetKernel' and 'targetdataupdate' functions provides a solution to the duplicate slices, and more accurate profiling data, while keeping the important profiling.

My intention was to strike a balance between resolving the duplication issue and keeping the important profiling. I understand the importance of comprehensive profiling and I'm open to further discussion.

(Edited - change of file - the first file was after the fix - now you see the duplicate slices)

jdoerfert added inline comments.Aug 15 2023, 1:32 PM
openmp/libomptarget/src/interface.cpp
85

Keep this, put the RegionTypeMsg into the details field of the profile entry.

131

Remove.

147–148

Remove.

161–162

Remove.

177–178

Remove.

188–189

Remove.

204–205

Remove.

247

Keep this.

297

Remove, above is better / more true to the time spend "offloading"

322–323

Remove.

openmp/libomptarget/src/omptarget.cpp
533–534

Remove, but verify all callees go through targetDataMapper<...>

575

This is not tracked anymore.

1006–1007

Remove, but verify all callees go through targetDataMapper<...>

1099–1100

Remove, but verify all callees go through targetDataMapper<...>

1144

Remove, but verify all callees go through targetDataMapper<...>

Thanks, @tianshilei1992 for your input.

I adjusted the timescope calls to address a problem with duplicate profiling slices. You can see them on chrome://tracing. The duplication occurred when both kernel functions and all data-moving functions were timed separately. Focusing the timescope calls on 'targetKernel' and 'targetdataupdate' functions provides a solution to the duplicate slices, and more accurate profiling data, while keeping the important profiling.

My intention was to strike a balance between resolving the duplication issue and keeping the important profiling. I understand the importance of comprehensive profiling and I'm open to further discussion.

(Edited - change of file - the first file was after the fix - now you see the duplicate slices)

Okay. Then don't keep commented out code.

From my perspective, the issue is not in the measurement. Having nested time regions is a normal thing. Using the same Ident also doesn't seem to be the problem. For most nested measured regions, the visualization shows the nesting correctly. I think, the problem is in the visualization. Without verifying the regions where the nesting fails and the slices are printed into the same line, I think that chrome://tracing has an issue, if nested regions start at the same timestamp due to time resolution. An example is

{"pid":2590370,"tid":2590370,"ph":"X","ts":2394356,"dur":22893,"name":"targetKernel","args":{"detail":";main.cpp;main;60;7;;"}},
{"pid":2590370,"tid":2590370,"ph":"X","ts":2394356,"dur":22894,"name":"__tgt_target_kernel","args":{"detail":";main.cpp;main;60;7;;"}}

This should be fixed in the visualization, not in the measurement.

jarmusch updated this revision to Diff 550781.Aug 16 2023, 9:08 AM

I addressed @jdoerfert's recommendations while the discussion is ongoing.

@protze.joachim, from your description, I understand that the issue might not lie in the
measurement itself but rather in the visualization process. That was my initial thought as
well. Moreover, based on my brief understanding of the code, I noticed that __tgt_target_kernel
always calls targetKernel, resulting in the timescope being invoked twice at practically the
same time. Is this measurement necessary, or does it create a duplication that requires
chrome://tracing to remove one for proper visualization?

jdoerfert retitled this revision from Fix Slice Duplicate in Profiler to [OpenMP] Fix Slice Duplicate in Profiler.Aug 16 2023, 9:49 AM

Could you share how the json looks now?

Having nested time regions is a normal thing.

Yes, but our nesting did never provide real benefit. We did not have good reason for the placement. Let's go back to minimal placement and then see where it makes sense to add nested timers.

jarmusch marked 15 inline comments as done.Aug 16 2023, 10:04 AM

here you go @jdoerfert

- from the same benchmark, asmooth-omp, with the profiler fix

here you go @jdoerfert

- from the same benchmark, asmooth-omp, with the profiler fix

Looks good to me now. @tianshilei1992 @protze.joachim wdyt?

If the region doesn't add value, removing this duplicate seems reasonable. I just commented on the technical tracing aspect.

This revision is now accepted and ready to land.Aug 17 2023, 7:35 AM
jarmusch updated this revision to Diff 552067.Aug 21 2023, 10:11 AM

edited function messages with more information for profiler

jdoerfert accepted this revision.Aug 21 2023, 10:16 AM

Please, land the patch for me. Thanks

This revision was automatically updated to reflect the committed changes.

This seems to haven broken the AMPGPU buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/37266

FAILED: openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o 
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang++ --target=x86_64-unknown-linux-gnu -DGTEST_HAS_RTTI=0 -DOMPT_SUPPORT=1 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_LIBCPP_ENABLE_HARDENED_MODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/llvm/include -I/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/include -I/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -O3 -DNDEBUG -fPIC  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -MF openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o.d -o openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -c /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:85:3: error: invalid operands to binary expression ('const char *' and 'const char *')
   85 |   TIMESCOPE_WITH_RTM_AND_IDENT(RegionTypeMsg, Loc);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/private.h:434:72: note: expanded from macro 'TIMESCOPE_WITH_RTM_AND_IDENT'
  434 |   llvm::TimeTraceScope TimeScope(__FUNCTION__, SI.getProfileLocation() + RegionTypeMsg)
      |                                                ~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:7: error: unknown type name 'Loc'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |       ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:12: error: unknown type name 'DeviceId'; did you mean 'DeviceTy'?
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |            ^~~~~~~~
      |            DeviceTy
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/include/device.h:398:8: note: 'DeviceTy' declared here
  398 | struct DeviceTy {
      |        ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:22: error: unknown type name 'ArgNum'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |                      ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:30: error: unknown type name 'ArgsBase'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |                              ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:40: error: unknown type name 'Args'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |                                        ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:46: error: unknown type name 'ArgSizes'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |                                              ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:56: error: unknown type name 'ArgTypes'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |                                                        ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:196:66: error: unknown type name 'ArgNames'
  196 |       Loc, DeviceId, ArgNum, ArgsBase, Args, ArgSizes, ArgTypes, ArgNames,
      |                                                                  ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:197:7: error: unknown type name 'ArgMappers'
  197 |       ArgMappers, targetDataUpdate, "Updating data within the OpenMP data region with update_mapper", "update");
      |       ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:197:19: error: unknown type name 'targetDataUpdate'
  197 |       ArgMappers, targetDataUpdate, "Updating data within the OpenMP data region with update_mapper", "update");
      |                   ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:197:37: error: expected parameter declarator
  197 |       ArgMappers, targetDataUpdate, "Updating data within the OpenMP data region with update_mapper", "update");
      |                                     ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:197:37: error: expected ')'
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:195:32: note: to match this '('
  195 |   targetDataMapper<AsyncInfoTy>(
      |                                ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:195:3: error: parameter name cannot have template arguments
  195 |   targetDataMapper<AsyncInfoTy>(
      |   ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:195:3: error: a type specifier is required for all declarations
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:197:112: error: expected ')'
  197 |       ArgMappers, targetDataUpdate, "Updating data within the OpenMP data region with update_mapper", "update");
      |                                                                                                                ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:189:44: note: to match this '('
  189 | EXTERN void __tgt_target_data_update_mapper(ident_t *Loc, int64_t DeviceId,
      |                                            ^
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/openmp/libomptarget/src/interface.cpp:198:1: error: extraneous closing brace ('}')
  198 | }
      | ^
17 errors generated.

@jplehr thanks, I fixed the broken commit

fixed with this commit - 1ff0bdb86dbfb7c70bbbabcae3affff59a0e0806

Thanks for the timely fix! Build is now working.
Now there seem to be three test failures: https://lab.llvm.org/buildbot/#/builders/193/builds/37270

it will take some time to look into the errors, lyk as soon as possible

Depending on how much time you need, it may be better to revert the patch in the meantime.

Let me know if I can assist for the failing AMD case.