This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize][PowerPC] Estimate int and float register pressure separately in loop-vectorize
ClosedPublic

Authored by wuzish on Sep 4 2019, 12:48 AM.

Details

Summary

In loop-vectorize, interleave count and vector factor depend on target register number. Currently, it does not estimate different register pressure for different register class separately(especially for scalar type, float type should not be on the same position with int type), so it's not accurate. Specifically, it causes too many times interleaving/unrolling, result in too many register spills in loop body and hurting performance.

So we need classify the register classes in IR level, and importantly these are abstract register classes, and are not the target register class of
backend provided in td file. It's used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types.

For POWER target, register num is special when VSX is enabled. When VSX is enabled, the number of int scalar register is 32(GPR), float is 64(VSR), but for int and float vector register both are 64(VSR). So there should be 2 kinds of register class when vsx is enabled, and 3 kinds of register class when VSX is NOT enabled.

I test it on POWER target, it makes big(+~30%) performance improvement in one specific bmk of spec2017 and no other obvious degressions. Could anyone help to adjust the register num and verify in other targets?

Diff Detail

Event Timeline

wuzish created this revision.Sep 4 2019, 12:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 4 2019, 12:49 AM
wuzish edited the summary of this revision. (Show Details)Sep 4 2019, 1:09 AM
wuzish edited the summary of this revision. (Show Details)Sep 4 2019, 1:23 AM
wuzish edited the summary of this revision. (Show Details)Sep 4 2019, 2:06 AM

Thanks for looking at this (it's been a problem for a long time). Let me suggest a different interface, which I believe will improve generality and reduce code duplication in the register-pressure estimator, and let me know what you think...

// Return the number of registers in the target-provided register class.
unsigned getNumberOfRegisters(unsigned ClassID = 0) const;

// Return the target-provided register class for the provided type.
unsigned getRegisterClassForType(Type *Ty) const;

The idea, then, is that we just calculate register usage for each register class separately (i.e., keep a hash table), and then when computing the interleaving factor, etc. we just iterate over all of the register classes returned by the target, and pick the smallest interleaving factor calculated over all of the register classes. There's probably even a nice way to construct a default implementation of this in the backend (although that we'd save for follow-up work).

Thanks for looking at this (it's been a problem for a long time). Let me suggest a different interface, which I believe will improve generality and reduce code duplication in the register-pressure estimator, and let me know what you think...

// Return the number of registers in the target-provided register class.
unsigned getNumberOfRegisters(unsigned ClassID = 0) const;

// Return the target-provided register class for the provided type.
unsigned getRegisterClassForType(Type *Ty) const;

The idea, then, is that we just calculate register usage for each register class separately (i.e., keep a hash table), and then when computing the interleaving factor, etc. we just iterate over all of the register classes returned by the target, and pick the smallest interleaving factor calculated over all of the register classes. There's probably even a nice way to construct a default implementation of this in the backend (although that we'd save for follow-up work).

Yes. Using ClassID is more general and fine-grained. But I think there is no need to iterate all kinds of register class because target register class is flexible and many register classes target provided are only different width with same position(such as gprc and g8rc), which makes too many kinds of register classes to maintain and some are just the same thing. It's not just use smallest interleaving factor calculated over all of the register classes and we also need to care about the overlapping relationship between different register class.

In all, it would be too fine-grained and little over-design, many register classes are just same behavior as separate with scalar and vector, int and float. What's your opinion?

Thanks for looking at this (it's been a problem for a long time). Let me suggest a different interface, which I believe will improve generality and reduce code duplication in the register-pressure estimator, and let me know what you think...

// Return the number of registers in the target-provided register class.
unsigned getNumberOfRegisters(unsigned ClassID = 0) const;

// Return the target-provided register class for the provided type.
unsigned getRegisterClassForType(Type *Ty) const;

The idea, then, is that we just calculate register usage for each register class separately (i.e., keep a hash table), and then when computing the interleaving factor, etc. we just iterate over all of the register classes returned by the target, and pick the smallest interleaving factor calculated over all of the register classes. There's probably even a nice way to construct a default implementation of this in the backend (although that we'd save for follow-up work).

Yes. Using ClassID is more general and fine-grained. But I think there is no need to iterate all kinds of register class because target register class is flexible and many register classes target provided are only different width with same position(such as gprc and g8rc), which makes too many kinds of register classes to maintain and some are just the same thing. It's not just use smallest interleaving factor calculated over all of the register classes and we also need to care about the overlapping relationship between different register class.

In all, it would be too fine-grained and little over-design, many register classes are just same behavior as separate with scalar and vector, int and float. What's your opinion?

I think that you misunderstood my suggestion. I did not mean that the backend register classes would be directly mapped into the classes returned by the TTI interface. These would, especially for non-trivial architectures, be abstracted for the purpose of the TTI interface. For PPC, Altivec, we'd have three class IDs for now (scalar float, scalar int, vectors), all with 32 registers. For VSX, we'd have two register class IDs, one for scalar ints (with 32 registers), and one for everything else (with 64 registers). (*) These don't need to have anything to do with the register classes actually defined in the backend. Do we need to capture finer-grained details than that in the heuristic (e.g. is your overlap suggestion capturing more than this)?

(*) Actually, for both cases, we can also have a separate class ID for scalar i1 types (with 8 registers).

(*) Actually, for both cases, we can also have a separate class ID for scalar i1 types (with 8 registers).

Oops, I mean with 32 registers.

Thanks for looking at this (it's been a problem for a long time). Let me suggest a different interface, which I believe will improve generality and reduce code duplication in the register-pressure estimator, and let me know what you think...

// Return the number of registers in the target-provided register class.
unsigned getNumberOfRegisters(unsigned ClassID = 0) const;

// Return the target-provided register class for the provided type.
unsigned getRegisterClassForType(Type *Ty) const;

The idea, then, is that we just calculate register usage for each register class separately (i.e., keep a hash table), and then when computing the interleaving factor, etc. we just iterate over all of the register classes returned by the target, and pick the smallest interleaving factor calculated over all of the register classes. There's probably even a nice way to construct a default implementation of this in the backend (although that we'd save for follow-up work).

Yes. Using ClassID is more general and fine-grained. But I think there is no need to iterate all kinds of register class because target register class is flexible and many register classes target provided are only different width with same position(such as gprc and g8rc), which makes too many kinds of register classes to maintain and some are just the same thing. It's not just use smallest interleaving factor calculated over all of the register classes and we also need to care about the overlapping relationship between different register class.

In all, it would be too fine-grained and little over-design, many register classes are just same behavior as separate with scalar and vector, int and float. What's your opinion?

I think that you misunderstood my suggestion. I did not mean that the backend register classes would be directly mapped into the classes returned by the TTI interface. These would, especially for non-trivial architectures, be abstracted for the purpose of the TTI interface. For PPC, Altivec, we'd have three class IDs for now (scalar float, scalar int, vectors), all with 32 registers. For VSX, we'd have two register class IDs, one for scalar ints (with 32 registers), and one for everything else (with 64 registers). (*) These don't need to have anything to do with the register classes actually defined in the backend. Do we need to capture finer-grained details than that in the heuristic (e.g. is your overlap suggestion capturing more than this)?

(*) Actually, for both cases, we can also have a separate class ID for scalar i1 types (with 8 registers).

Thank you. I see what you mean. The classID value can be defined by target itself and just use to distinguish. It's cool to use this method to express the overlapping so that it can be consistent to use smallest factor algorithm instead of separating considering.

wuzish updated this revision to Diff 219872.EditedSep 12 2019, 3:19 AM

Address comments. Add 3 function interfaces.

/// \return the number of registers in the target-provided register class.
unsigned getNumberOfRegisters(unsigned ClassID) const;

/// return the target-provided register class for the provided type.
unsigned getRegisterClassForType(Type *Ty, bool Vector) const;

/// return the target-provided register class name
const char* getRegisterClassName(unsigned ClassID) const;

Use register class enum to distinguish different llvm types that residing in different register positions. Every target can has its own register class mapping from llvm type to register class ID. For general backend implementation, there are 3 default register class, GenericIntScalarRC = 1, GenericFloatScalarRC = 2, GenericVectorRC = 3.

Thanks for exploring this direction.

llvm/include/llvm/Analysis/TargetTransformInfo.h
798

I'm not sure that these defaults make sense. Many targets won't even have these as distinct classes (e.g., PowerPC with VSX). I think that we should have the default implementation just return one register class, 0, with its current default (which I suppose is 8 registers), and the default implementation will put everything in that one class. Then, I don't think that we need this enum at all.

My impression is that you decided to do it this way so that you could write in the other targets:

bool Vector = (ClassID == TargetTransformInfo::GenericVectorRC);

but I think it's better to just give all of the other targets which did something with Vector two register classes, and return the second one for all types which are vector types. That should match the current behavior and then the targets can customize as they see fit. But I'd leave this all within each target (there's no need to expose generic classes because there's no need for a generic meaning).

llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
1390

This can be:

unsigned TTIRegNum = TTI->getNumberOfRegisters(TTI->getRegisterClassForType(F.getType(), false)) - 1;
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

I think that we can just make a separate function for this:

TTI->hasVectorRegisters()

(and then use that here and in the SLP vectorizer).

wuzish marked an inline comment as done.Sep 15 2019, 11:01 PM
wuzish added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

I think it's could be like TTI->getRegisterClassForType(F.getType(), true) above

Address comments to simplify the default implementation for targets.

Use 1 and 0 to represent default register classes for vector and scalar type to keep other targets behavior as before.
Targets can reimplement it like PowerPC.

hfinkel added inline comments.Sep 16 2019, 1:42 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

But above, F.getType() gives back the right scalar type because F is the LSR::Formula. Here F in the function, right? I don't think it makes sense to ask for the register class of the function type.

wuzish marked an inline comment as done.Sep 16 2019, 8:16 PM

Change the interface prototype of getRegisterClassForType and make second parameter with default value.

unsigned getRegisterClassForType(bool Vector, Type *Ty = nullptr)
wuzish marked an inline comment as done.Sep 16 2019, 8:23 PM
wuzish added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

Yes. And above case would return nullptr, so we need care about this situation. And here we can left type to be nullptr as default value argument.

arsenm added inline comments.Sep 16 2019, 8:26 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
801

I don't like spreading the concept of register classes corresponding to types.

I also don't think register classes as a concept should be leaking out to the IR

wuzish marked an inline comment as done.Sep 16 2019, 10:19 PM
wuzish added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
801

I think it's not the concept of register class in backend. It's an abstraction of register class in backend and just to classify and distinguish different kinds of data residing in different register position to help estimate register pressure.

hfinkel added inline comments.Sep 16 2019, 11:28 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
801

Yeah, I think that it's pretty important that these are *abstract* register classes - used to establish the mapping between the types of IR values and the number of simultaneous live ranges to which we'd like to limit for some set of those types - it's probably worth stating that explicitly in the description.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

I suppose that this makes sense if we assume that TTI->getRegisterClassForType is called only on legal types? I'm a bit worried here because, at the IR level, all types are supported and should be legalized into *some* register class. We should better document the expected behavior here one way or the other.

wuzish edited the summary of this revision. (Show Details)Sep 17 2019, 12:13 AM
wuzish marked 2 inline comments as done.Sep 17 2019, 12:31 AM
wuzish added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

If I am not misunderstanding what you mean, then I think the type should be isSingleValueType. We can document it at the comments in getRegisterClassForType prototype declare.

Add some comments at function interface prototype.

wuzish marked an inline comment as done.Sep 17 2019, 7:24 PM

Gentle pin...

@hfinkel Any more comments or advice?

Gentle pin...

I apologize for the delay; I've been contemplating what to recommend here...

(except for the comments below, I think this looks good)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

I agree, but I think that's insufficient. isSingleValueType is true for vector types, and so if we have an architecture with no vector registers, and we call TTI->getRegisterClassForType on a vector type, it would return the class associated with the scalarized type. I think that what you intend to say is something like:

getRegisterClassForType returns the register class associated with the provided type, accounting for type promotion and other type-legalization techniques that the target might apply, however, it specifically does not account for the scalarization or splitting of vector types. Should a vector type require scalarization or splitting into multiple underlying vector registers, that type should be mapped to a register class containing no registers.

In some sense, this seems reasonable, because the interface does not provide any way to figure out *how many* of a particular register in a register class the type might use, and so we can't return a sensible answer in cases where splitting or scalarization is required. It's a bit unfortunate, however, because the same consideration applies to scalar types too (e.g., i256 probably takes up multiple scalar registers). What we really should do, I think, is expand this interface to return, perhaps optionally, the number of registers of the provided class. In the implementation, in such a case, could start by running:

std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);

and then using the MVT for making the register-class decisions.

However, it does not seem possible to make this change while retaining the current behavior for other targets (because we'd change what happens for illegal types), and thus, I recommend that a comment be added along the lines of my suggestion above, and then we also add this:

// FIXME: It's not currently possible to determine how many registers are used by the provided type.

and we address this aspect later in a different patch.

wuzish marked an inline comment as done.Sep 26 2019, 7:52 PM
wuzish added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7768–7769

May the number issue be addressed by such code in LoopVectorize.cpp?

// A lambda that gets the register usage for the given type and VF.
  auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) {
    if (Ty->isTokenTy())
      return 0U;
    unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType());
    return std::max<unsigned>(1, VF * TypeSize / WidestRegister);
  };

And the WidestRegister is from TTI.getRegisterBitWidth which I think should be related with Type*.
Yes, the only one interface TLI->getTypeLegalizationCost would be more consistent and easy to maintain.

update the comment of getRegisterClassForType. Make it landed firstly and update the interface to leverage MVT in follow-up work.

wuzish marked an inline comment as done.Sep 27 2019, 12:27 AM

We can land it firstly and update the interface to use MVT as simplest type to getRegisterClassForType parameter and call TLI->getTypeLegalizationCost before.
This can apply to where GetRegUsage works

hfinkel accepted this revision as: hfinkel.Sep 27 2019, 8:18 AM

We can land it firstly and update the interface to use MVT as simplest type to getRegisterClassForType parameter and call TLI->getTypeLegalizationCost before.
This can apply to where GetRegUsage works

LGTM. Thanks!

This revision is now accepted and ready to land.Sep 27 2019, 8:18 AM

Rebase it and rerun bmk 2017 to make last check before launch.

nhaehnle added inline comments.Sep 30 2019, 10:54 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
801

Can we explicitly call out the fact that these are not the CodeGen register classes?

Also, how about making the interface a getRegisterClassForValue as opposed to getRegisterClassForType? This would allow a context-sensitive determination of register classes. Specifically, in the AMDGPU backend, it would potentially allow us to distinguish between uniform and divergent values (in the overall program sense).

hfinkel added inline comments.Oct 1 2019, 12:20 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
801

Can we explicitly call out the fact that these are not the CodeGen register classes?

I think that's a good idea. We can say that this is designed to provide a simple, high-level view of the register allocation later process later performed by the backend. These register classes don't necessarily map onto the register classes used by the backend.

Also, how about making the interface a getRegisterClassForValue as opposed to getRegisterClassForType? This would allow a context-sensitive determination of register classes. Specifically, in the AMDGPU backend, it would potentially allow us to distinguish between uniform and divergent values (in the overall program sense).

I don't object, but we should make sure to think about the overall computational complexity of the analysis the backend would need to use, and whether this per-value interface will allow that analysis to be performed efficiently.

wuzish marked an inline comment as done.Oct 1 2019, 10:48 PM
wuzish added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
801

if use getRegisterClassForValue, I find a scene to distinguish 2 different register class in PowerPC target. For the i1 type generated from icmp, there are 8 CRRC, but there are 32 CRBITRC for the i1 type generated from other operation. Is this required? @hfinkel

I'd like to upstream it first, and interface modification would be done in follow-up patches.

wuzish updated this revision to Diff 223724.Oct 7 2019, 7:39 PM

Add more comments and rebase to up-to-date.

This revision was automatically updated to reflect the committed changes.

Hi,

This is breaking our internal benchmarks, could you please take a look? Thanks!

Steven

add ; REQUIRES: asserts for test case

update test case

bjope added a subscriber: bjope.Oct 12 2019, 12:09 AM