This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Add generic TargetLowering::shouldAlignPointerArgs() implementation
Accepted · Public

Authored by arichardson on Sep 20 2022, 6:23 AM.

Details

Summary

This function was added for ARM targets, but aligning global/stack pointer
arguments passed to memcpy/memmove/memset can improve code size and
performance for all targets that don't have fast unaligned accesses.
This adds a generic implementation that adjusts the alignment to pointer
size if unaligned accesses are slow.
Review D134168 suggests that this significantly improves performance on
synthetic benchmarks such as Dhrystone on RV32 as it avoids memcpy() calls.

TODO: It should also improve performance for other benchmarks; it would be
good to get some numbers.
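
For reference, here is a minimal sketch of what the generic default described above could look like, assuming the hook keeps the existing shouldAlignPointerArgs(CallInst *, unsigned &MinSize, Align &PrefAlign) shape and that allowsMisalignedMemoryAccesses() is used as the "unaligned accesses are slow" check; the actual diff may differ in the details.

// Sketch only, not the exact code in this revision.
bool TargetLoweringBase::shouldAlignPointerArgs(CallInst *CI, unsigned &MinSize,
                                                Align &PrefAlign) const {
  // Only memcpy/memmove/memset calls are interesting here.
  if (!isa<MemIntrinsic>(CI))
    return false;
  const DataLayout &DL = CI->getModule()->getDataLayout();
  // If misaligned pointer-width accesses are already fast, leave things alone.
  unsigned Fast = 0;
  if (allowsMisalignedMemoryAccesses(getPointerMemTy(DL), /*AddrSpace=*/0,
                                     Align(1), MachineMemOperand::MONone,
                                     &Fast) &&
      Fast)
    return false;
  // Otherwise request pointer-size alignment, but only for objects that are
  // at least pointer-sized; smaller copies gain nothing from over-alignment.
  MinSize = DL.getPointerSize();
  PrefAlign = DL.getPointerPrefAlignment();
  return true;
}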

Diff Detail

Event Timeline

arichardson created this revision. Sep 20 2022, 6:23 AM
Herald added a project: Restricted Project. Sep 20 2022, 6:23 AM
arichardson requested review of this revision. Sep 20 2022, 6:23 AM
arichardson added inline comments.
llvm/test/Transforms/CodeGenPrepare/RISCV/adjust-memintrin-alignment.ll
3

Not sure if there is a way to get the default target datalayout from opt, so I've hardcoded the relevant bits here.

efriedma added inline comments. Sep 20 2022, 2:05 PM
llvm/lib/CodeGen/TargetLoweringBase.cpp
967

The question is what size load/store ops would we prefer to use for the memcpy, and whether those ops require alignment. Using the alignment of a pointer seems arbitrary; we aren't loading or storing a pointer.

I'd prefer not to call getPointerAlignment() here if we can avoid it; the caller already does the math to figure out the current alignment and the increase. shouldUpdatePointerArgAlignment just needs to know what alignment it wants, not whether the call currently satisfies that alignment.
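
For context, this is roughly what the existing caller in CodeGenPrepare::optimizeCallInst() does with the hook's result (simplified from the in-tree code; the hook may also end up renamed to shouldUpdatePointerArgAlignment by this patch): the hook only reports MinSize/PrefAlign, and the caller skips any object whose alignment is already sufficient.

unsigned MinSize;
Align PrefAlign;
if (TLI->shouldAlignPointerArgs(CI, MinSize, PrefAlign)) {
  for (auto &Arg : CI->args()) {
    if (!Arg->getType()->isPointerTy())
      continue;
    APInt Offset(DL->getIndexSizeInBits(
                     cast<PointerType>(Arg->getType())->getAddressSpace()),
                 0);
    Value *Val = Arg->stripAndAccumulateInBoundsConstantOffsets(*DL, Offset);
    uint64_t Off = Offset.getLimitedValue();
    if (!isAligned(PrefAlign, Off))
      continue;
    // Only raise the alignment of allocas/globals that are large enough and
    // currently under-aligned; otherwise nothing is touched.
    if (auto *AI = dyn_cast<AllocaInst>(Val))
      if (AI->getAlign() < PrefAlign &&
          DL->getTypeAllocSize(AI->getAllocatedType()) >= MinSize + Off)
        AI->setAlignment(PrefAlign);
    if (auto *GV = dyn_cast<GlobalVariable>(Val))
      if (GV->canIncreaseAlignment() &&
          GV->getPointerAlignment(*DL) < PrefAlign &&
          DL->getTypeAllocSize(GV->getValueType()) >= MinSize + Off)
        GV->setAlignment(PrefAlign);
  }
}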

llvm/lib/Target/ARM/ARMISelLowering.cpp
1927

You want to make this more aggressive by default? Maybe... but we probably want different heuristics for small copies. (For example, aligning a 3-byte copy to 8 bytes makes no sense; we can't take advantage of alignment greater than 2 bytes.)
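
(For reference, the current in-tree ARM override, paraphrased below with approximate comments, already uses MinSize to avoid over-aligning very small copies:)

bool ARMTargetLowering::shouldAlignPointerArgs(CallInst *CI, unsigned &MinSize,
                                               Align &PrefAlign) const {
  if (!isa<MemIntrinsic>(CI))
    return false;
  // Objects smaller than 8 bytes are never over-aligned.
  MinSize = 8;
  // On ARMv6 and later (excluding M-class), 8-byte-aligned LDM is typically
  // faster than 4-byte-aligned LDM.
  PrefAlign =
      (Subtarget->hasV6Ops() && !Subtarget->isMClass() ? Align(8) : Align(4));
  return true;
}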

JojoR added inline comments. Sep 20 2022, 8:24 PM
llvm/lib/CodeGen/TargetLoweringBase.cpp
967

As @efriedma said, we can set the default PrefAlign according to PointerSize,
but we should return the final PrefAlign from the backend due to the backend's requirements;
different ISAs may have different alignments :)

Address feedback

arichardson added inline comments. Sep 28 2022, 4:08 AM
llvm/lib/CodeGen/TargetLoweringBase.cpp
967

I agree that this is a backend-specific choice. I would assume loading an aligned pointer is an efficient operation on most/all targets, and for the ones where this is not true, they can override shouldUpdatePointerArgAlignment().

I believe this change should now match what you did for RISC-V: MinSize == XLEN, PrefAlign == XLEN.

llvm/lib/Target/ARM/ARMISelLowering.cpp
1927

I've reverted this part of the diff and added a test to show we don't adjust 3/7-byte objects.

arichardson added inline comments. Sep 28 2022, 4:11 AM
llvm/lib/CodeGen/TargetLoweringBase.cpp
966

@jrtc27 we may want to adjust these values for CHERI to not require 16-byte alignment & size, but I think even without an 8-byte fallback this should be a (minor) net win.

efriedma added inline comments. Sep 28 2022, 10:55 AM
llvm/include/llvm/CodeGen/TargetLowering.h
1933

It looks like the argument "Arg" is now unused?

Fix unused argument

efriedma added inline comments. Oct 4 2022, 12:31 PM
llvm/include/llvm/CodeGen/TargetLowering.h
1933

Still here? Did you mean to upload a different change?

arichardson added inline comments. Oct 4 2022, 12:36 PM
llvm/include/llvm/CodeGen/TargetLowering.h
1933

I was using it for the allowsMisalignedMemoryAccesses() alignment check, but dropped it in the previous diff.
I think having the argument that is currently being processed could be useful for overrides (since they could make decisions based on its current alignment).

I've added the use back now, but am also happy to drop it.

1933

Arg is used in the call to allowsMisalignedMemoryAccesses(getPointerMemTy(), ...) to determine whether it's already a fast operation.
I can drop it if you prefer; I just thought it is potentially useful to avoid additional alignment changes.

efriedma added inline comments. Oct 7 2022, 11:47 AM
llvm/include/llvm/CodeGen/TargetLowering.h
1933

I'm not sure how calling getPointerAlignment() avoids additional alignment, assuming PrefAlign is correct. shouldAlignPointerArgs is supposed to return the minimum "fast" alignment for the given call. If the actual alignment is already greater than or equal to that, the caller does nothing, even if shouldAlignPointerArgs returns true.

(If there's some alignment between 1 and PrefAlign that the target considers "fast", I guess you could run into an issue, but that implies PrefAlign is wrong.)

llvm/lib/CodeGen/TargetLoweringBase.cpp
967

Maybe TargetTransformInfo::getRegisterBitWidth() is a better default than getPointerPrefAlignment()? I guess that's the same thing for most targets, but it probably makes the intent a bit more clear...

I agree there isn't any default that's going to be correct for all targets.
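
(A rough sketch of that intent, purely illustrative: it assumes a TargetTransformInfo instance is reachable from wherever the default ends up living, which is not a given inside TargetLoweringBase.)

// Hypothetical: derive PrefAlign/MinSize from the widest scalar register
// rather than from the pointer's preferred alignment.
TypeSize RegBits = TTI.getRegisterBitWidth(TargetTransformInfo::RGK_Scalar);
uint64_t RegBytes = std::max<uint64_t>(1, RegBits.getFixedValue() / 8);
Align PrefAlign(RegBytes); // scalar register widths are powers of two
unsigned MinSize = RegBytes;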

arichardson planned changes to this revision. Oct 7 2022, 12:08 PM
arichardson added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
1933

Fair enough, I'll drop the argument and update the patch to avoid looking at the current alignment.

llvm/lib/CodeGen/TargetLoweringBase.cpp
967

I was not aware of that function. Thanks for pointing it out; that does indeed sound like the best default.

Address review feedback

efriedma added inline comments. Nov 16 2022, 4:07 PM
llvm/lib/CodeGen/TargetLoweringBase.cpp
967

Did you mean to address this?

Address missed feedback - adds two more affected tests due to pointer/scalar register size differences

Rebase, fix unused variable.

efriedma accepted this revision. Feb 7 2023, 6:14 PM

LGTM

This revision is now accepted and ready to land. Feb 7 2023, 6:14 PM

(I apologize for the delay here. I meant to get to this earlier.)

That error is probably going to be hard for anyone on a non-AIX machine to reproduce; a reduced testcase would be helpful.

This runs into the error "The symbol .rodata.str1.1L...str is not defined." if we specify an optimization flag greater than 0.

clang++ -O1 -c foo.cc

#include <string>
namespace benchmark {
class State {
public:
  void SkipWithError(char *);
};
} // benchmark
struct a {
  std::string b;
};
int c(char *, std::initializer_list< a >);
void e(benchmark::State f) { f.SkipWithError("error message"); }
int g = c("", {{"error message"}});

If it's too difficult to debug this test case without an AIX machine, please let me know.

We need a series of commands that could be run on a non-AIX machine to identify the problem. So explicitly specify the correct triple, and don't depend on external headers (you can use preprocessed source if necessary).

It fails with a different assembler error on different platforms (tried on AIX, linux, and mac). No error if you remove the AIX target.

clang++ -target powerpc64-ibm-aix -c foo.cc
namespace {
template <class a, class d> void aa(a, d);
template <class> struct e;
class g;
template <class ac, class = ac, class = g> class h;
template <class ad, class f> void i(ad *j, ad *k, f l) {
  long a(k - j);
  __builtin_memmove(l, j, a);
  aa(j, l);
}
template <class af, class ag, class ah> void ai(af j, ag k, ah l) {
  i(j, k, l);
}
template <class aj, class ak> void al(aj j, aj k, ak l) { ai(j, k, l); }
template <class aj, class am, class ak> ak an(aj j, am k, ak l) {
  al(j, j + k, l);
}
template <> struct e<char> {
  static char m(char *j, const char *k, long l) { an(k, l, j); }
};
template <class> struct as {
  as(int);
};
template <class d> class n : as<int>, as<d> {
public:
  using au = as;
  using av = as<d>;
  template <class aw, class ax> n(aw, ax) : au(0), av(0) {}
};
template <class, class, class> class h {
  n<g> ay;

public:
  h(char *j) : ay(int(), int()) {
    long b;
    o(j, b);
  }
  void o(const char *, int);
};
template <class ac, class bc, class bd>
void h<ac, bc, bd>::o(const char *j, int k) {
  e<char>::m(0, j, k);
}
} // namespace
struct {
  h<int> b;
} c{"error message"};

> It fails with a different assembler error on different platforms (tried on AIX, linux, and mac). No error if you remove the AIX target.
>
> clang++ -target powerpc64-ibm-aix -c foo.cc

@Jake-Egan,

I had to add -O before I got the bad assembly:

clang++ -target powerpc64-ibm-aix -O -S -o - foo.cc

The bad assembly does not have the 1-byte element/1-byte alignment read-only string section defined, but it has the TOC entry for it:

        .toc
L..C0:
        .tc .rodata.str1.1L...str[TC],.rodata.str1.1L...str[RO]
L..C1:
        .tc .rodata.str1.8L...str[TC],.rodata.str1.8L...str[RO]

Here's an IR-only example that exhibits the invalid AIX assembly after this change:

target datalayout = "E-m:a-i64:64-n32:64-S128-v256:256:256-v512:512:512"
target triple = "powerpc64-ibm-aix"

@.str = private unnamed_addr constant [14 x i8] c"error message\00"

define internal fastcc void @_ZN12_GLOBAL__N_12anIPKclPcEET1_T_T0_S4_() {
entry:
  store i64 0, ptr @.str, align 8
  unreachable
}

define internal fastcc void @_ZN12_GLOBAL__N_12alIPKcPcEEvT_S4_T0_(i64 %0) {
entry:
  tail call void @llvm.memcpy.p0.p0.i64(ptr null, ptr @.str, i64 %0, i1 false)
  ret void
}

; Function Attrs: nocallback nofree nounwind willreturn memory(argmem: readwrite)
declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg) #0

attributes #0 = { nocallback nofree nounwind willreturn memory(argmem: readwrite) }

We see the same TOC reference without a def:

.csect .rodata.str1.8L...str[RO],2
.tc .rodata.str1.1L...str[TC],.rodata.str1.1L...str[RO]
.tc .rodata.str1.8L...str[TC],.rodata.str1.8L...str[RO]

The problem seems to lie in the fact that we codegen the first function with no explicit alignment for @.str, resulting in TOC references to the .rodata.str1.1L section, but after this change CodeGenPrepare raises the alignment on @.str to 8, which changes the MCSection (i.e. csect) name for the final object we emit.
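
To make the mismatch concrete: the mergeable-string csect name encodes both the entry size and the alignment, so raising the alignment moves the string into a differently named csect. The helper below is illustrative only (it is not the actual XCOFF section-selection code), assuming the usual ".rodata.str<EntrySize>.<Alignment>" naming.

// Illustrative only: with this naming scheme, raising @.str's alignment from
// 1 to 8 moves it from a ".rodata.str1.1..." csect to a ".rodata.str1.8..."
// csect, while the TOC entry emitted earlier still names the old csect.
std::string mergeableStrSectionName(unsigned EntrySizeInBytes, Align A) {
  return ".rodata.str" + std::to_string(EntrySizeInBytes) + "." +
         std::to_string(A.value());
}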

efriedma reopened this revision. Feb 23 2023, 12:20 PM

Is there any way we can make AIX targets more resilient handling this kind of alignment change? If we have to, we can choose a point after which passes aren't allowed to increase the alignment of globals, and move this transform before that. But I'd like to avoid that if possible.

This revision is now accepted and ready to land. Feb 23 2023, 12:20 PM

> Is there any way we can make AIX targets more resilient handling this kind of alignment change? If we have to, we can choose a point after which passes aren't allowed to increase the alignment of globals, and move this transform before that. But I'd like to avoid that if possible.

Not that I know of without negative side-effects. For example, if we access using a TOC entry for a label for the individual string, we would end up with more TOC entries which, in turn, can lead to TOC overflow.

As info, the object generation with this patch leads to:

llvm-project/llvm/lib/MC/XCOFFObjectWriter.cpp:620: virtual void (anonymous namespace)::XCOFFObjectWriter::recordRelocation(llvm::MCAssembler &, const llvm::MCAsmLayout &, const llvm::MCFragment *, const llvm::MCFixup &, llvm::MCValue, uint64_t &): Assertion `SectionMap.find(SymASec) != SectionMap.end() && "Expected containing csect to exist in map."' failed.

> Is there any way we can make AIX targets more resilient handling this kind of alignment change? If we have to, we can choose a point after which passes aren't allowed to increase the alignment of globals, and move this transform before that. But I'd like to avoid that if possible.
>
> Not that I know of without negative side-effects. For example, if we access using a TOC entry for a label for the individual string, we would end up with more TOC entries which, in turn, can lead to TOC overflow.

I'm wondering if there is really a need to split these rodata csects by alignment the way we do for XCOFF. This behaviour seems to have its origin in the ELF mergeable-string handling, and we don't have the same linker features.

I have a draft of such a change, that when combined with this patch, seems to resolve the issue we are seeing. I'll do some more evaluation offline to see if this is a viable resolution and report back.

> Not that I know of without negative side-effects. For example, if we access using a TOC entry for a label for the individual string, we would end up with more TOC entries which, in turn, can lead to TOC overflow.

Looking at a case with more than one string, it seems we currently have separate csects and TOC entries for each string anyway. I am not sure that is a state we want to stay in though. I do not want to make changing that more difficult.

> Not that I know of without negative side-effects. For example, if we access using a TOC entry for a label for the individual string, we would end up with more TOC entries which, in turn, can lead to TOC overflow.
>
> Looking at a case with more than one string, it seems we currently have separate csects and TOC entries for each string anyway. I am not sure that is a state we want to stay in though.

Whether there are separate csects actually depends on the -fdata-sections setting, though I'm surprised about the extra TOC entries. Agree that's probably not a state we want to stay in.

> I do not want to make changing that more difficult.

I'm not sure removing the alignment from the csect identifier will complicate those orthogonal changes, but let me post the patch and we can continue the discussion there.

Did the AIX string issues discussed here ever get resolved?

> Did the AIX string issues discussed here ever get resolved?

https://reviews.llvm.org/D156202 removes the alignment-sensitive XCOFF csect determination, so it should resolve the original AIX issue with this patch.

There may be interactions between this patch and https://reviews.llvm.org/D155730. @stefanp, can you comment?

> There may be interactions between this patch and https://reviews.llvm.org/D155730

They shouldn't interact? Or at least, they shouldn't interact in a way that affects correctness; the proposed PPCMergeStringPool is a ModulePass.

> There may be interactions between this patch and https://reviews.llvm.org/D155730
>
> They shouldn't interact? Or at least, they shouldn't interact in a way that affects correctness; the proposed PPCMergeStringPool is a ModulePass.

Maybe there is no correctness issue, but pipeline ordering matters for the opportunities that this patch is meant to enable.

Also, (and I hope I am not mischaracterizing what was said) my understanding, from earlier "offline" discussion with @stefanp, is that modifying the properties of a global variable (as this patch enables more widely) is inappropriate to do in a function pass like CodeGenPrepare.

Perhaps there are good reasons to expect that increasing the alignment is "safe", but it has a good chance of causing timing issues (such as changing function codegen when functions are reordered within a TU).

Whether it's a FunctionPass doesn't really matter; despite the "layering violation", there aren't any data structures that actually care. getOrEnforceKnownAlignment has been doing similar modifications for a long time. I guess I can see an argument that we should try to avoid alignment modifications for all globals after isel has run for any global. Not sure what, exactly, that implies for the latest point it's legal to modify a global.

We could try to move the optimization much earlier, like into the InferAlignment pass proposed in D158529.

> getOrEnforceKnownAlignment has been doing similar modifications for a long time. I guess I can see an argument that we should try to avoid alignment modifications for all globals after isel has run for any global. Not sure what, exactly, that implies for the latest point it's legal to modify a global.
>
> We could try to move the optimization much earlier, like into the InferAlignment pass proposed in D158529.

I was just informed that @stefanp is away until some time next week. I am hoping to get his input on the importance of trying to get to a "better state" (both in general and for the specific case of the current optimization).

> getOrEnforceKnownAlignment has been doing similar modifications for a long time. I guess I can see an argument that we should try to avoid alignment modifications for all globals after isel has run for any global. Not sure what, exactly, that implies for the latest point it's legal to modify a global.
>
> We could try to move the optimization much earlier, like into the InferAlignment pass proposed in D158529.
>
> I was just informed that @stefanp is away until some time next week. I am hoping to get his input on the importance of trying to get to a "better state" (both in general and for the specific case of the current optimization).

Sorry for the late reply.

I am certainly a little nervous about changing global variables in a function pass. If two functions exist with access to the same global, is this going to cause a problem when the alignment is changed? Is it possible for one function to act on a changed alignment and the other to act on the original alignment? At this point I can't think of a situation like that, so it's probably fine.

In terms of interaction with the String Pooling pass, I would say that it mainly depends on the order in which the passes run. If the string pool pass runs first, then we may miss the opportunity to over-align something here since the pool is treated as a single variable. If this pass runs first, then the string pooling pass will miss pooling a global variable if it is over-aligned. This is actually a limitation of the string pooling pass at the moment, but I was hoping to fix it later. In either situation I don't see a functional issue.