This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Remove global state in lldCommon
ClosedPublic

Authored by aganea on Aug 27 2021, 2:50 PM.

Details

Summary

As discussed in our last Windows/COFF call, and as proposed by Reid in [1], this patch implements a first iteration to make LLD free of global initializers & global state. Here we only remove global state in lldCommon, however the same concept can be applied to other LLD libs.

The idea is to move all variables at file-scope or function-static-scope into a hosting structure (lld::CommonLinkingContext) that lives at lldMain()-scope. Drivers will inherit from this structure and add their own global state, in the same way as for the existing LinkingContext and MachOLinkingContext.

Before, code was directly referencing state in the .data section. It now goes through an indirection to a global **LinkingContext returned by lld::context(). In the future, we plan on supporting several instances of lld::**LinkingContext that would coexist in memory (when the hosting application is similar to D86351).

[1] https://lists.llvm.org/pipermail/llvm-dev/2021-June/151184.html

Diff Detail

Event Timeline

aganea created this revision.Aug 27 2021, 2:50 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
aganea requested review of this revision.Aug 27 2021, 2:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 27 2021, 2:50 PM
aganea edited the summary of this revision. (Show Details)Aug 27 2021, 2:53 PM
int3 added a subscriber: int3.Aug 27 2021, 2:57 PM
int3 added inline comments.
lld/MachO/Driver.cpp
168

why not just saver()?

This is a larger change than I anticipated. I was surprised by the need to have the pointer be thread local (and then ensure that the thread local pointers for new threads are set up, too). If I'm understanding correctly, the idea is that a single program could have n threads doing n links at the same time. Is that right? I'm not arguing against that. I'm just trying to make sure I understand why I was surprised.

It also introduces more indirection than I expected. I believe Rui (and others) favored globals in order to go as fast as possible. Are there lld benchmarks to give us an idea of the cost of this? If there's pushback from the performance die-hards, it would be nice to have an actual measurement or two to show (I hope) that the cost for this flexibility is low.

Anyway, this looks pretty good to me. I'm not well-versed in lld, and--honestly--I didn't scrutinize all the changes that looked mostly mechanical.

lld/COFF/Chunks.cpp
13

I guess we're using ASCIIbetical rather than alphabetical ordering. :-)

lld/MachO/DriverUtils.cpp
263

In most cases, it's just saver(). Is there something special about these instances where you explicitly pass a pointer to the common globals?

lld/include/lld/Common/Globals.h
67 ↗(On Diff #369181)

s/beween/between/

lld/tools/lld/lld.cpp
151

Given that this is the not-exiting-early path, should this thread detach from the global state before deleting it? That might help catch instances where something in the chain of destructors inadvertently tries to reference a global.

MaskRay added inline comments.Aug 27 2021, 9:18 PM
lld/Common/ErrorHandler.cpp
44

add empty line

47

add empty line

lld/Common/Globals.cpp
74 ↗(On Diff #369181)
80 ↗(On Diff #369181)
lld/include/lld/Common/Globals.h
51 ↗(On Diff #369181)
aganea updated this revision to Diff 369778.Aug 31 2021, 1:22 PM
aganea marked 9 inline comments as done.

Rebase + changes as suggested.

aganea added a comment.EditedAug 31 2021, 1:23 PM

Thanks for taking a look!

This is a larger change than I anticipated.

It's a first basis for discussions. Please let me know if I should take this in a different direction.

I was surprised by the need to have the pointer be thread local (and then ensure that the thread local pointers for new threads are set up, too). If I'm understanding correctly, the idea is that a single program could have n threads doing n links at the same time. Is that right? I'm not arguing against that. I'm just trying to make sure I understand why I was surprised.

Yes that is correct.

The dream is, getting to a point where LLVM_PARALLEL_LINK_JOBS wouldn't be needed (if linking in-process). Currently each LLD process spawns its own thread pool, and the applications don't talk to each other. That increases the memory usage and (surprisingly) sometimes can starve the CPU, for example when doing ThinLTO. I feel that we could have better control over resources if we had a single "build/compile/link" application.

It also introduces more indirection than I expected.

Indeed, more memory loads are now required to get to the global. In the vast majority of cases it shouldn't matter, but in tight loops we will need to pass down the globals pointer.

I believe Rui (and others) favored globals in order to go as fast as possible. Are there lld benchmarks to give us an idea of the cost of this? If there's pushback from the performance die-hards, it would be nice to have an actual measurement or two to show (I hope) that the cost for this flexibility is low.

There seems to be a really tiny departure from baseline after this change, but it's hard to tell since LLD is mostly I/O-bound.
The data set is ~5 GB input .OBJs, approx. 2.5 GB .PDB, 475 MB executable.
Sometimes just running the same command line again exchanges the outcome, "before" comes before "after" or vice versa:

Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz (6 cores)
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp                                                                                     
  Time (mean ± σ):     13.740 s ±  0.578 s    [User: 0.0 ms, System: 53.5 ms]                                                         
  Range (min … max):   13.391 s … 15.321 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     13.737 s ±  0.198 s    [User: 4.4 ms, System: 55.8 ms]                                                         
  Range (min … max):   13.566 s … 14.262 s    10 runs

Summary
  'after\lld-link.exe @link.rsp' ran
    1.00 ± 0.04 times faster than 'before\lld-link.exe @link.rsp'


AMD Ryzen 9 5950X 16-Core Processor @ 3.40 GHz
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp                                                                      
  Time (mean ± σ):     10.178 s ±  0.091 s    [User: 3.0 ms, System: 5.2 ms]                                           
  Range (min … max):    9.978 s … 10.303 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     10.243 s ±  0.088 s    [User: 4.5 ms, System: 5.2 ms]                                           
  Range (min … max):   10.115 s … 10.357 s    10 runs

Summary
  'before\lld-link.exe @link.rsp' ran
    1.01 ± 0.01 times faster than 'after\lld-link.exe @link.rsp'


AMD Ryzen Threadripper PRO 3975WX 32-Cores @ 3.50 GHz
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp                                                                          
  Time (mean ± σ):     15.451 s ±  0.545 s    [User: 1.4 ms, System: 8.6 ms]                                               
  Range (min … max):   15.024 s … 16.849 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     15.243 s ±  0.087 s    [User: 0.0 ms, System: 8.6 ms]                                               
  Range (min … max):   15.109 s … 15.392 s    10 runs

Summary
  'after\lld-link.exe @link.rsp' ran
    1.01 ± 0.04 times faster than 'before\lld-link.exe @link.rsp'


2x Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (48 cores, dual CPU)
>hyperfine "before\lld-link.exe @link.rsp" "after\lld-link.exe @link.rsp"
Benchmark #1: before\lld-link.exe @link.rsp                                                                                                   
  Time (mean ± σ):     17.433 s ±  0.852 s    [User: 0.0 ms, System: 6.5 ms]                                                                        
  Range (min … max):   15.202 s … 18.267 s    10 runs

Benchmark #2: after\lld-link.exe @link.rsp
  Time (mean ± σ):     17.661 s ±  0.876 s    [User: 1.4 ms, System: 2.6 ms]                                                                        
  Range (min … max):   15.304 s … 18.442 s    10 runs

Summary
  'before\lld-link.exe @link.rsp' ran
    1.01 ± 0.07 times faster than 'after\lld-link.exe @link.rsp'

Profiling shows a difference in functions that seem to read from memory-mapped files, doing hard page faults (last column is milliseconds):


First line (the lambda) is the first loop in GSIHashStreamBuilder::finalizeBuckets.
The second line is the JamCRC calculation in createSectionContrib in lld/COFF/PDB.cpp.

I thought some cache lines could evicted by the read of CommonGlobls TLS. However two random runs of "before" builds are exhibiting the same difference as abolve. It seems really we're only seeing statistical fluctuations. I'd be happy to run tests on anything that you judge a representative data set.

lld/COFF/Chunks.cpp
13

;-)

lld/MachO/Driver.cpp
168

Changed. Please see my other comment at lld/MachO/DriverUtils.cpp, L263.

lld/MachO/DriverUtils.cpp
263

Only the fact that it's a loop and I wanted to clarify the intent, rather than relying on the optimizer to (maybe) prevent the call to commonGlobals(). But I switched to just saver(), since other operations in this function would dominate.

Ideally we should be passing state down the callstack and not rely on TLS. This could be done in time-critical functions.

lld/tools/lld/lld.cpp
151

It's detaching just at the end of ~CommonGlobals().

rnk added a comment.Aug 31 2021, 3:22 PM

Thanks!

Regarding performance, yes, the memory mapped I/O makes measurement very noisy. I'm not worried about this change regressing performance.

lld/include/lld/Common/Globals.h
30 ↗(On Diff #369778)

I feel like the LLVM-y naming convention for this would be to use the word Context somehow. Either Link(|ing|er)Context, LldContext, LldContextBase. I could imagine inheriting the COFF context from the linker context, or something like that.

I don't like bikeshedding, but I also feel like the best time to get the name right is at the beginning.

See also the prior art of MachOLinkingContext in the old MachO linker.

aganea updated this revision to Diff 369991.Sep 1 2021, 10:55 AM
aganea edited the summary of this revision. (Show Details)

Changed CommonGlobals to CommonLinkingContext. Use it along with MachOLinkingContext -- this demonstrates how it will be used for other drivers.

I have to say I think I will miss the simplicity of the current lld implementation: No relocations, no indirection, just global data. However I do see the advantages you are going for here .. and the discussion about the overall direction has mostly been had on the mailing list at this point.

akhuang added a subscriber: akhuang.Sep 1 2021, 3:27 PM

This is cool, I was also looking at refactoring out some of the globals in lldCOFF (for now just trying to move some globals into existing classes and passing the classes around). Looks good to me, although I'm not very familiar with structuring code in this way / making things work with threads.

lld/Common/Globals.cpp
45 ↗(On Diff #369991)

does static make the anonymous namespace unnecessary?

aganea updated this revision to Diff 370678.Sep 3 2021, 3:13 PM
aganea marked an inline comment as done.

Remove anon namespace.

aganea added a comment.Sep 3 2021, 3:14 PM

This is cool, I was also looking at refactoring out some of the globals in lldCOFF (for now just trying to move some globals into existing classes and passing the classes around). Looks good to me, although I'm not very familiar with structuring code in this way / making things work with threads.

Both concepts can coexist, passing references to objects across calls, or accesing them through a global (essentially what the mechanism this patch provides). In some cases it just isn't practical to pass references (lld::outs for example, or anything related to the lld::ErrorHandler).

lld/Common/Globals.cpp
45 ↗(On Diff #369991)

Yes indeed, fixed!

seems like everyone's happy with the direction and there are no more comments -- perhaps this is good to be committed now?

I think this should be staged into two patches:

  1. LLD-only: includes all of the Context changes, saver(), etc, but the context is still kept in a global pointer.
  2. LTO.h and Parallel.h: make the context pointer thread local, allowing the possibility of two linker instances in the same process, and add the task wrapper API

The goal is to make the patch that changes Support as small as possible. MLIR uses the parallel APIs, so we should ask them (@mehdi_amini, @rriddle, maybe others) to review the changes to llvm/Support/Parallel.h.

rnk added inline comments.Sep 14 2021, 2:54 PM
lld/tools/lld/lld.cpp
168

This probably needs to be initialized or static analysis tools may complain, even though the invalid case is handled above.

Nice work!

lld/Common/ErrorHandler.cpp
96–97

I'm curious about this change: you're protecting what seems to be global methods with a non-global mutex. The code comment for the mutex says `

// The functions defined in this file can be called from multiple threads,
// but lld::outs() or lld::errs() are not thread-safe. We protect them using a
// mutex.
This made sense when the mutex was global, less so now. Can you clarify this? (and update the comment)

The goal is to make the patch that changes Support as small as possible. MLIR uses the parallel APIs, so we should ask them (@mehdi_amini, @rriddle, maybe others) to review the changes to llvm/Support/Parallel.h.

Actually MLIR does not use Parallel.h anymore, especially to avoid relying on any global. We're using instead a ThreadPool owned by the MLIRContext, the client applications can inject a thread pool in the context to manage sharing threads how they see fit.

Our wrappers are in https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/Threading.h and operate around llvm/Support/ThreadPool.h.

The mechanism introduced in Parallel.h in this revision (wrapThreadState) seems mostly like some sort of "workaround" to be able to continue to rely on some global way to pass state around in lld instead of having something more explicit, which we favored in MLIR (where like in LLVM, the Context is fairly centric and not accessed through a global).

llvm/lib/Support/Parallel.cpp
87 ↗(On Diff #370678)

Seems like worth documenting.

aganea updated this revision to Diff 373349.Sep 17 2021, 3:04 PM
aganea marked 2 inline comments as done.
aganea edited the summary of this revision. (Show Details)

The mechanism introduced in Parallel.h in this revision (wrapThreadState) seems mostly like some sort of "workaround" to be able to continue to rely on some global way to pass state around in lld instead of having something more explicit, which we favored in MLIR (where like in LLVM, the Context is fairly centric and not accessed through a global).

It was totally a temporary workaround, until we explicitly pass contexts around. I've tried passing contexts around, but it is a big change and I'm not sure it's worth it. Removed this code for now.

I think this should be staged into two patches:

  1. LLD-only: includes all of the Context changes, saver(), etc, but the context is still kept in a global pointer.

Updated the patch to only do 1. like you suggested.

lld/Common/ErrorHandler.cpp
96–97

Before, the mutex and the functions were global. They both have the same lifetime as the call to lldMain(). I was planning on supporting several concurrent calls to lldMain() in the future. Updated the comment.

llvm/lib/Support/Parallel.cpp
87 ↗(On Diff #370678)

Removed.

rnk added a comment.Sep 17 2021, 3:41 PM

Actually MLIR does not use Parallel.h anymore, especially to avoid relying on any global. We're using instead a ThreadPool owned by the MLIRContext, the client applications can inject a thread pool in the context to manage sharing threads how they see fit.

Our wrappers are in https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/Threading.h and operate around llvm/Support/ThreadPool.h.

The mechanism introduced in Parallel.h in this revision (wrapThreadState) seems mostly like some sort of "workaround" to be able to continue to rely on some global way to pass state around in lld instead of having something more explicit, which we favored in MLIR (where like in LLVM, the Context is fairly centric and not accessed through a global).

I see, I guess that is new as of June, so news to me. :)

I think there are really two key operations done throughout the linker that need the context:

  • memory allocation (via BumpPtrAllocator)
  • diagnostics (warnings & errors)

Conventional C++ programs use a global heap because passing around a pool or arena is too much work.

I think I would like to see central LLD context classes, but I'm open to using thread locals on the way to that restructuring.


As far as this iteration of the patch goes, I think it looks good, but I think other linker port owners should sign off on it too.

aganea updated this revision to Diff 373420.Sep 18 2021, 8:55 AM

Fix linter suggestions.

Ping.

lld/Common/ErrorHandler.cpp
96–97

I meant now, they both have the same lifetime as the call to lldMain().

Sorry for the slow review from the Mach-O end.

This seems like a pretty clean solution overall. I was concerned that removing global state would mean we have to thread a ton of context through our entire call stack, but switching to function calls instead is easy enough.

I'll measure the performance of this change this week; we care heavily about linking speed because it dominates incremental build speeds (so fast links can be a huge developer experience boost). I'm not expecting any change from the global variable access changes (especially with LTO), but I'm curious if switching the type-specific bump allocators to using a DenseMap for lookup will have any measurable effect.

I see a bunch of discussions about thread-locals; was that for an earlier version of this patch, or am I just missing something? (I don't see anything thread-local related in the current version of the patch.)

To make sure I'm understanding this correctly, can the calls to freeArena be removed because the destructor for the context will take care of that now?

I'd like @int3 and/or @thakis to take a look on the Mach-O side as well (since they're far more active), and I imagine @MaskRay is the best person for the ELF side.

lld/COFF/InputFiles.cpp
1065

What was the motivation for saving to a local variable here? Is it the use of saver in the loop below? (It makes sense to me; I just wanted to confirm my understanding.)

lld/Common/CommonLinkingContext.cpp
22 ↗(On Diff #373420)

It'd be nice if there was a way to put this in the header, so that it could be inlined even in non-LTO builds, but that'd prevent lctx from being file-static, which seems undesirable. I imagine LTO or ThinLTO would be happy to inline this (and I'll confirm when I do performance measurements), so it shouldn't matter too much in the end.

23 ↗(On Diff #373420)

I'm a bit concerned about the potential performance implications of this assert, but I'll measure first.

lld/Common/Memory.cpp
14–24

Can these be put in the header, so that they can be inlined even in non-LTO builds?

lld/include/lld/Common/Memory.h
26–28

This comment should probably be updated? The tracking is happening in the context now.

45–51

This comment needs to be updated (or perhaps just removed).

46

The contents don't matter at all, just the address (which is used as a key for the instances map inside the context), right? Might be helpful to have a comment to that effect, otherwise the introduction of what seems like new global state is confusing.

aganea updated this revision to Diff 396348.Dec 27 2021, 3:33 PM
aganea marked 7 inline comments as done.

Rebase.
Integrate with @akhuang's COFFLinkerContext.
Address comments from @smeenai.

Thanks for taking a look!

I see a bunch of discussions about thread-locals; was that for an earlier version of this patch, or am I just missing something? (I don't see anything thread-local related in the current version of the patch.)

Yes, please take a look at "Diff 4" and before, lctx used to be a THREAD_LOCAL to accommodate for several instances of LLD running in the same process. But @rnk suggested to go in steps, and first move all scattered statics into a global static context.

To make sure I'm understanding this correctly, can the calls to freeArena be removed because the destructor for the context will take care of that now?

Yes.

lld/COFF/InputFiles.cpp
1065

Yes, precisely, to clarify the intention that saver shouldn't change in that loop.

lld/Common/CommonLinkingContext.cpp
22 ↗(On Diff #373420)

Did you have the chance to run any tests?

23 ↗(On Diff #373420)

Since this would only trigger in LLVM_ENABLE_ASSERTIONS=ON builds, the performance shouldn't matter that much? Or should it? Otherwise we could only have it in LLVM_ENABLE_EXPENSIVE_CHECKS builds?

Will llvm::DenseMap<void *, SpecificAllocBase *> instances; make make<XXX> too slow?

For ELF, I have measured that symbol/section initialization time matters. As an alternative, lld/ELF can do bulk initialization instead.
If an ELF .o file has N sections, 1/3 are discarded sections (SHT_RELA/SHT_SYMTAB/etc) while 1/7 are discarded COMDAT sections.
When linking chrome, perhaps 6/7 are discarded COMDAT sections (too many templates?). It wastes some memory if we don't compute the number of needed sections, but that may be fine.

Will llvm::DenseMap<void *, SpecificAllocBase *> instances; make make<XXX> too slow?

I measured a previous version of this patch on Windows with COFF ​and a Chromium-sized output, and I only saw a very tiny departure (the fluctuations could be attributed to mmap'ing). Please see: https://reviews.llvm.org/D108850#2975472
Would you be able to try this patch on Linux linking Chromium?

MaskRay accepted this revision as: MaskRay.EditedDec 28 2021, 2:57 PM

The patch makes my Linux x86-64 lld (-DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_ENABLE_PIC=off -DCMAKE_CXX_FLAGS='-fno-asynchronous-unwind-tables') 130+KiB smaller.

I tried linking chrome and clang release/relwithdebinfo with this patch. No significant difference. Here is chrome's --threads=8 output

% hyperfine --warmup 2 --min-runs 20 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8 -o lld.8"
Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld.8
  Time (mean ± σ):      6.699 s ±  0.032 s    [User: 8.815 s, System: 2.439 s]
  Range (min … max):    6.637 s …  6.745 s    20 runs
 
Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld.8
  Time (mean ± σ):      6.691 s ±  0.027 s    [User: 8.856 s, System: 2.396 s]
  Range (min … max):    6.647 s …  6.743 s    20 runs
 
Summary
  'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld.8' ran
    1.00 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld.8'

Checked a large internal program, no difference.

Hope that https://maskray.me/blog/2021-12-19-why-isnt-ld.lld-faster#parallel-initialization-of-sections and parallel initialization of symbol tables are not harder to refactor :)

This revision is now accepted and ready to land.Dec 28 2021, 2:57 PM
MaskRay requested changes to this revision.Dec 28 2021, 3:08 PM

-DLLVM_USE_SANITIZER=Address has some leaks.

This revision now requires changes to proceed.Dec 28 2021, 3:08 PM
aganea updated this revision to Diff 396735.Dec 30 2021, 5:07 PM

Fix memory leaks.

MaskRay accepted this revision as: MaskRay.Dec 30 2021, 8:07 PM

LGTM.

lld/tools/lld/lld.cpp
150
163

If an error occurs, link() return false.

After r becomes int, return may be ambitious on whether it applies to the function or to the r variable.

164

You can make r an int and avoid repeated ? 1 : 0 below.

166

Call _Exit to skip destructors if exitEarly is true.


The original comment was

// Exit immediately if we don't need to return to the caller.
// This saves time because the overhead of calling destructors
// for all globally-allocated objects is not negligible.

Is it worth copying some words to the new comment?

MaskRay accepted this revision.Dec 30 2021, 8:08 PM
This revision is now accepted and ready to land.Dec 30 2021, 8:08 PM
aganea updated this revision to Diff 396816.Dec 31 2021, 11:41 AM
aganea marked 4 inline comments as done and an inline comment as not done.

As suggested by @MaskRay

MaskRay added inline comments.Jan 5 2022, 7:41 PM
lld/tools/lld/lld.cpp
139

static

MaskRay added inline comments.Jan 5 2022, 7:42 PM
lld/COFF/Driver.cpp
65

Just raw_ostream

aganea updated this revision to Diff 397974.Jan 6 2022, 1:14 PM
aganea marked 3 inline comments as done.

As suggested by @MaskRay

aganea added a comment.Jan 6 2022, 1:17 PM

@smeenai @int3 @thakis Any further comments? Can I go ahead with this?

I wasn't able to get the perf testing done at the time, and I won't have access to one of my testing machines for the next few days, but I'm not anticipating anything dramatic. If @int3 / @thevinster / @oontvoo / @thakis can run tests on their end, that's great, but this looks good to me regardless.

I wasn't able to get the perf testing done at the time, and I won't have access to one of my testing machines for the next few days, but I'm not anticipating anything dramatic. If @int3 / @thevinster / @oontvoo / @thakis can run tests on their end, that's great, but this looks good to me regardless.

Thanks for the headsup. Yeah, I can run a few bms - will share the results in a couple of hours

oontvoo accepted this revision.Jan 7 2022, 11:07 AM

perf-impact seems minimal to me. LGTM!

x base.txt
+ with_patch.txt

SYSTEM CPU time: 
    N           Min           Max        Median           Avg        Stddev
x  10          0.99          1.01          1.01         1.003   0.008232726
+  10          0.98          1.04          1.02         1.014   0.022705848
No difference proven at 95.0% confidence

USER CPU time: 
    N           Min           Max        Median           Avg        Stddev
x  10          3.56          3.61          3.57         3.575   0.018408935
+  10          3.56          3.71          3.62         3.625   0.053176854
Difference at 95.0% confidence
	0.05 +/- 0.0373876
	1.3986% +/- 1.04581%
	(Student's t, pooled s = 0.0397911)

WALL time: 
    N           Min           Max        Median           Avg        Stddev
x  10          4.06          4.12          4.08         4.085   0.024152295
+  10          4.04          4.23          4.14          4.14   0.070867639
Difference at 95.0% confidence
	0.055 +/- 0.0497434
	1.34639% +/- 1.21771%
	(Student's t, pooled s = 0.0529413)

Thanks for testing @oontvoo!

This revision was automatically updated to reflect the committed changes.

`@aganea This change broke SerializeToHsaco in MLIR, can you help resolve this issue or revert? See https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L440

I can't quite tell what the new boolean arguments mean.

`@aganea This change broke SerializeToHsaco in MLIR, can you help resolve this issue or revert? See https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L440

I can't quite tell what the new boolean arguments mean.

Fixing now.

yaxunl added a subscriber: yaxunl.Feb 7 2022, 9:21 PM

This patch caused regressions for HIP on Windows. Basically, any HIP programs using device variables will fail to link on Windows. The issue does not happen on Linux.

To reproduce,

$ cat a-hip-amdgcn-amd-amdhsa-gfx1031.s
        .text
        .amdgcn_target "amdgcn-amd-amdhsa--gfx1031"
        .p2alignl 6, 3214868480
        .fill 48, 4, 3214868480
        .protected      xxx                     ; @xxx
        .type   xxx,@object
        .data
        .globl  xxx
        .p2align        2
xxx:
        .long   123                             ; 0x7b
        .size   xxx, 4

        .ident  "clang version 14.0.0 (git@github.amd.com:Compute-Mirrors/llvm-project.git ce53ba9aaee8d053604883d5befaf07bad60ba84)"
        .section        ".note.GNU-stack"
        .addrsig
        .amdgpu_metadata
---
amdhsa.kernels:  []
amdhsa.target:   amdgcn-amd-amdhsa--gfx1031
amdhsa.version:
  - 1
  - 1
...

        .end_amdgpu_metadata

$  "C:\\hip\\bin\\clang++.exe" -cc1as -triple amdgcn-amd-amdhsa -filetype obj -target-cpu gfx1031 -mrelocation-model pic -mincremental-linker-compatible --mrelax-relocations -o a-hip-amdgcn-amd-amdhsa-gfx1031.o a-hip-amdgcn-amd-amdhsa-gfx1031.s

$ "C:\\hip\\bin\\lld.exe" -flavor gnu -shared a-hip-amdgcn-amd-amdhsa-gfx1031.o
lld: error: duplicate symbol: xxx
>>> defined at a-hip-amdgcn-amd-amdhsa-gfx1031.o:(xxx)
>>> defined at a-hip-amdgcn-amd-amdhsa-gfx1031.o:(.data+0x0)
aganea added a comment.Feb 8 2022, 8:55 AM

@yaxunl I have a fix, I will send a patch later today.