Continuing the work from https://reviews.llvm.org/D33240, this change introduces an element unordered-atomic memset intrinsic. This intrinsic is essentially memset with the implementation requirement that all stores used for the assignment are done with unordered-atomic stores of a given element size.
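A rough sketch of the new intrinsic in use (typed-pointer syntax; the trailing i32 operand is the element size and must be a constant power of two, the length must be a multiple of it, and the destination must be at least element-size aligned):

```
declare void @llvm.memset.element.unordered.atomic.p0i8.i64(i8* nocapture writeonly, i8, i64, i32)

define void @atomic_bzero(i8* %dst, i64 %len) {
  ; Fill %len bytes with zero; every 4-byte element is written with a
  ; single unordered-atomic store.
  call void @llvm.memset.element.unordered.atomic.p0i8.i64(i8* align 4 %dst, i8 0, i64 %len, i32 4)
  ret void
}
```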
Event Timeline
Codewise, this looks fine and clearly implements the semantics you document. However, I'm wondering whether those are the most useful semantics. Once we answer the design question below, if we still want to go in this direction, the patch LGTM.
This can be used to represent any pattern memset currently can. In particular, it does handle larger elements with repeating bytewise values. It does not fully cover the Arrays.fill case, though, because a value such as "i32 15" can't be represented (see the sketch below). I could see an argument in favour of handling that case entirely separately, or of handling all of it in a single family of intrinsics. What do you think?
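To make the gap concrete, here's a sketch of the kind of loop meant above: an element-wise fill with i32 15. On a little-endian target the element's bytes are 0F 00 00 00, which is not a repetition of any single byte, so no i8 memset value can produce it:

```
define void @fill15(i32* %a, i64 %n) {
entry:
  %empty = icmp eq i64 %n, 0
  br i1 %empty, label %exit, label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %p = getelementptr inbounds i32, i32* %a, i64 %i
  ; One unordered-atomic store per 4-byte element.
  store atomic i32 15, i32* %p unordered, align 4
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```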
I'd had the same thought... that case feels more like a special case of memset_pattern to me, as it matches the semantics of that function exactly. So, I think that case would be better handled as a separate thing -- memset_pattern is a lib function instead of an intrinsic right now.
I'm also a little skeptical this is a good idea; the "obvious" intrinsic to provide is one where the size of the value is equal to the atomic element size. The only reason to prefer this version is to slightly reduce the amount of work required to port existing transforms, and that doesn't seem compelling.
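To spell that out, a purely hypothetical variant (not what this patch proposes) might look like the following, with the value operand as wide as one element so the element size is implied by the value's type:

```
; Hypothetical signature, for discussion only.
declare void @llvm.memset.element.unordered.atomic.i32.p0i8.i64(i8* nocapture writeonly, i32, i64)
```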
I don't like the idea of having an element atomic memset intrinsic that doesn't follow the semantics of memset -- memset specifically sets every byte of a block of memory to the same value. It does not have super wide usefulness; it's basically just useful for initialization-type stuff like zeroing out blocks of memory. An element atomic version of memset, defined as in this patch, allows us to use memset in IR sourced from, say, Java in the same ways that it would be used in IR sourced from C or Fortran. I don't think that the limited applicability of memset should be a strike against it.
If we want to be able to turn code like "for (int i=0; i<N; i++) a[i] = <some 32-bit integer constant>;" into an intrinsic call that can be optimized & reasoned about, then I think that's a different intrinsic from memset. In fact, it's memset_pattern; but memset_pattern currently exists only as a libcall in LLVM.
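For reference, Darwin's libc provides memset_pattern4/8/16; the 16-byte variant is `void memset_pattern16(void *b, const void *pattern16, size_t len)`, which at the IR level would be declared along these lines (a sketch; the pattern is passed indirectly, and len is in bytes):

```
declare void @memset_pattern16(i8* nocapture, i8* nocapture readonly, i64)
```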
Philip/Eli -- what are your thoughts? Should we try proposing adding memset_pattern as an LLVM intrinsic?
There's a lot of history going into this.
The memset API isn't really very good; memset is enshrined as an intrinsic mostly because it's available as part of the C library (and therefore well-optimized for most targets). This saves us a bunch of work, and it's good enough for the most common case of initializing memory with an all-zero pattern.
The problem with adding a memset_pattern intrinsic is that we can't really support it well on all targets, at least not without a bunch of implementation work. If we're going to emit a call, we need an implementation somewhere, which means we need to add one to compiler-rt, and probably write target-specific code to optimize it. And even if we did have a good implementation for a bunch of targets in compiler-rt, people trying to use clang with libgcc or a freestanding target would complain about not having that implementation available. So we have the current solution, where we don't have an intrinsic, and depend on vectorization/unrolling to generate reasonably fast code for initialization which can't be lowered to memset.
If you're writing a new API which isn't tied to the history of memset, the same interface doesn't really make sense; the only possible advantage is sharing code in the optimizer. But maybe that's what we're stuck with; the potential advantages aren't important compared to the burden of implementing/maintaining an intrinsic with a different API.
On a side-note, it's sort of a problem for the other unordered-atomic intrinsics that we currently don't have an implementation available in compiler-rt... but we can let it slide for now because most languages don't use unordered atomics anyway.
Daniel and I ended up talking about this one a decent amount offline with one of our other non-LLVM compiler folks. Our conclusion was that the more generic fill pattern isn't really any more common in Java than it is in C/C++. The overwhelmingly most important pattern we want to catch is just zero-initialization of a block of memory (i.e., an atomic bzero), which is covered by the current proposal.
My take here is that we should move forward with the current design. If we decide we want to generalize at some point, we can; there's no reason to block the addition of the element-wise atomic code on that generalization decision. It looks like we're probably not going to want to generalize this in the near future anyway, given the implementation work Eli outlined and the low perceived profitability.
Eli's point about not having a compiler-rt implementation for these is sound and something I hadn't previously considered. We should at a minimum file a bug for that. :) It might also be reasonable to provide a simple implementation just to make sure the whole system works end to end.
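For the 1-byte-element case, such a reference implementation could be as trivial as the following IR-level loop (the function name and exact libcall signature here are illustrative assumptions, not settled; a real compiler-rt version would presumably be written in C):

```
define void @__llvm_memset_element_unordered_atomic_1(i8* %dst, i8 %val, i64 %len) {
entry:
  %empty = icmp eq i64 %len, 0
  br i1 %empty, label %exit, label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %p = getelementptr inbounds i8, i8* %dst, i64 %i
  ; One unordered-atomic store per element.
  store atomic i8 %val, i8* %p unordered, align 1
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %len
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```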
Unless anyone objects to this, I intend to LGTM this and move forward with the original approach. Any concerns?
- rebase
- add entries to WebAssembly runtime lib lists due to change in RTLIB::Libcall enum.