This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Add new infrastructure to select load/store instructions, update P8/P9 load/store patterns.
ClosedPublic

Authored by amyk on Dec 15 2020, 10:37 PM.

Details

Summary

This patch introduces a new infrastructure that is used to select the load and store instructions in the PPC backend.

The primary motivation is that the current implementation of selecting load/stores is dependent on the ordering of patterns in TableGen.
Given this limitation, we are not able to easily and reliably generate the P10 prefixed load and stores instructions (such as when
the immediates that fit within 34-bits). This refactoring is meant to provide us with more control over the patterns/different forms to exploit,
as well as eliminating dependency of pattern declaration in TableGen.

The idea of this refactoring is that it introduces a set of addressing modes that correspond to different instruction formats
of a particular load and store instruction, along with a set of common flags that describes a load/store. Whenever a load/store
instruction is being selected, we analyze the instruction and compute a set of flags for it. The computed flags are then used to
select the most optimal load/store addressing mode.

This computation of flags is done in computeMOFlags(), while selecting the optimal addressing mode is done through
getAddrModeForFlags(); in which this functions searches for a set of address flags stored in a map that relates common flags to
addressing modes. Once the optimal addressing mode is determined, this information is given to SelectOptimalAddrMode(),
where we set the base and displacement of the load/store accordingly based on the addressing mode.

Another thing to note is the SelectForceXForm() function is similar to SelectAddressRegRegOnly(), with an updated naming
and a removed condition to better suit the refactoring that is being done on the loads/stores.

This patch is the first of a series of patches to be committed - it contains the initial implementation of the refactored load/store
selection infrastructure and also updates P8/P9 patterns to adopt this infrastructure. The idea is that incremental patches will
add more implementation and support, and eventually the old implementation will be removed.

Diff Detail

Event Timeline

amyk created this revision.Dec 15 2020, 10:37 PM
amyk requested review of this revision.Dec 15 2020, 10:37 PM

So, is this patch still incomplete as I didn't see the test change ?

llvm/lib/Target/PowerPC/PPCInstr64Bit.td
1070

Not sure if this is handled correctly as you are removing the align restrict for LWA. Technical speaking, we need to remove all such kind of alignment restrict in the load/store as far as we did some analysis in the source code. But I notice that we are removing the align restrict here but still keep it for LD.

lkail added a subscriber: lkail.Dec 16 2020, 4:55 AM
lkail added inline comments.
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
2543

Should it be a static function?

amyk added a comment.Dec 16 2020, 10:32 AM

@steven.zhang The idea is that this patch should be NFC. All existing load/store test cases should pass with this refactoring. I do think there should be more tests added, perhaps in a follow up patch. What do you think?

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
2543

Good point. I will fix this.

llvm/lib/Target/PowerPC/PPCISelLowering.h
721

Add comments to describe the functions.

llvm/lib/Target/PowerPC/PPCInstr64Bit.td
1070

You're right; I meant to remove the align/unalign in the other LD,LWA patterns, as well, as the new load/store infrastructure is meant to use load/sextload/zextload and compute alignment based on the flags.

I think it is better to explicitly give some reasons why we need this big refactoring, in other words, what's the disadvantage/limitation of legacy implementation? Thank you for the big effort.

llvm/lib/Target/PowerPC/PPCISelLowering.h
702

The following features should not be associated to one specific memory operation? Should we add them into the flag set of each memory operation in computeMOFlags? Is it possible to get the sub-target info when we do am selection after we get all flags?

amyk updated this revision to Diff 312325.Dec 16 2020, 5:01 PM

Add more comments to functions, make provablyDisjointOr() static, fix pattern in PPCInstr64Bit.td.

All existing tests in the backend run successfully with this patch.

amyk added a comment.Dec 16 2020, 5:14 PM

I think it is better to explicitly give some reasons why we need this big refactoring, in other words, what's the disadvantage/limitation of legacy implementation? Thank you for the big effort.

That is a good point. Thank you for bringing this up. The primary reason behind this refactoring is that it would allow us to easily exploit prefixed load and stores when the offset is large (such as for immediates that fit within 34-bits).

The current implementation of selecting load/stores is very dependent on the ordering of patterns within TableGen, and we aren't able to easily and reliably generated the prefixed load/stores (also since those patterns would also be written in a different file).
Essentially, the refactoring of how the loads/stores are handled is meant to provide us with more control over the patterns/different forms to exploit. Additionally, we want to eliminate the dependency of pattern declaration ordering as much as possible so regardless of the order of the patterns that the compiler tries to match, we can generate the best possible code.

amyk edited the summary of this revision. (Show Details)Dec 16 2020, 5:17 PM
amyk added inline comments.Dec 16 2020, 5:26 PM
llvm/lib/Target/PowerPC/PPCISelLowering.h
702

I apologize, I think I do not quite follow. If you could clarify, that would be great.

Are you suggesting that these should not be flags, but more of Subtarget checks after the flags are computed?

I think it can be useful to store the subtarget information in the set of flags is that we can easily know which instructions we can produce on P9 and P10.

I was thinking for instance, if we have flags set for SubtargetP9, ScalFlt and RPlusSImm16Mult4 (register + signed 16 bit immediate, multiple of 4), then we know we can generate the DS-Form (corresponding to DFLOAD pseudoinstruction. Or, when we have PPC::MOF_SubtargetP10 and PPC::MOF_RPlusSImm34 (signed 34-bit immediate), we can know we can generate prefixed load/stores.

The primary motivation is that the current implementation of selecting load/stores is dependent on the ordering of patterns in TableGen.

I think this also can be resolved by AddedComplexity in td files? Is it convenient to solve the issue in D91279 when the worst case happens in instruction selection for load/store. With legacy infra, we have to select a xform with one zero register operand for worst case. But seems on P10, we prefer a dform with 0 offset, it is good for some linker opt. That issue does not have a simple fix with legacy infra because the worst case is handled in select xform only function, we must select a xform, so we must use zero register. Could it be solved easier under the new infra? Thanks.

llvm/lib/Target/PowerPC/PPCISelLowering.h
702

Are you suggesting that these should not be flags, but more of Subtarget checks after the flags are computed?

Yes, that was my first thought. For normal load/store instructions before ISEL, we can get type, address info(zext/sext/imm) from the instruction itself. But sub-target is not loads/stores characteristic. we can check the sub-target info when we do the address mode selection when we have other flags. But I am ok if making the sub-target as a flag for load/store for implementation convenience.

@steven.zhang The idea is that this patch should be NFC. All existing load/store test cases should pass with this refactoring. I do think there should be more tests added, perhaps in a follow up patch. What do you think?

Yeah, you'd better tag the revision as NFC if it is.

Just my general $0.02 regarding this refactoring effort...

The existing infrastructure was fine for quite a while. We had D-Form, DS-Form and X-Form which made things reasonably simple. However, we started adding new addressing forms - DQ-Form, MLS:D-Form, 8LS:D-Form, etc. which really started adding complexity to the existing infrastructure that was difficult to understand. Ultimately, the patterns were asking the wrong questions that required an increasing amount of context to answer. Questions like "is this a reg+reg operation with an immediate displacement that is a multiple of 4/16" (SelectAddrIdxX4/SelectAddrIdxX16). A question like that doesn't even fundamentally make sense. If the address is represented as an addition of two registers, the question of whether one of them is a multiple of anything is meaningless. Of course, we needed that in order to avoid selecting an X-Form when a D[SQ]-Form is available, but that is still a weird way to structure a query.
Ultimately, selection of memory access instructions wants to know one thing - "What is the optimal addressing form for this access?" So when selecting an instruction, the question is "is my addressing form the optimal one?"
That is what this refactoring aims to accomplish. So in the future, if we get new D<whatever>-Form instructions that have different requirements for alignment/displacement size/displacement alignment/etc. we should be able to easily extend this.

Furthermore, this reduces/avoids the need for hacks such as AddedComplexity and CodeSize.

amyk retitled this revision from [PowerPC] Add new infrastructure to select load/store instructions, update P8/P9 load/store patterns. to [PowerPC][NFC] Add new infrastructure to select load/store instructions, update P8/P9 load/store patterns..Dec 17 2020, 7:04 AM

I just had some minor comments. I think it makes sense overall.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
2529

I find it odd that you are zero extending this (ie unsigned) and then casting it to a signed.
You can get into all kinds of issues with this kind of thing.
For example:

int16_t a = -1;  // This is 0xFFFF
int32_t b = zeroExtend(a); // This is 0x0000FFFF (Not -1 but 65535)

It is important to think about what the possible value types are for N. If the only possible types are MVT::i32 and MVT::i64 then we are fine.

2533

You can probably simplify this a little bit. See what others think too...

Imm = (int32_t)cast<ConstantSDNode>(N)->getZExtValue();
int64_t Imm64 = (int64_t)cast<ConstantSDNode>(N)->getZExtValue();
return isInt<32>(Imm64);
17034

For this case it may be easier to just work with the APInt instead of creating your own functions.
So, you may not need isIntS32Immediate and isIntS34Immediate. Especially since you don't use Imm34.

I should have said this in my first comments. I like this refactor. I just want to make sure I understand this refactoring more clear^_^. Thanks for your detailed explanation. @nemanjai

For legacy infra, I have one pain for the matching efficiency. When we match a worst-case in td files, we will
1: first check dform candidate and in dform candidate handling function SelectAddressRegImm, we check xform candidate. If it fails:
2: check xform candidate and in xform candidate handling function SelectAddressRegReg, we will check dform candidate. If it fails:
3: check xform only candidate and in xform only candidate handling function SelectAddressRegRegOnly, we will get the desired hardware instructions.

I think there must be logic redundant. We check again and again for the load/store address mode.

So for the new infra, I am thinking should we avoid this redundant logic? I still see redundant checks/flags collects for one load/store instruction. We first check SelectDSForm and then check SelectXForm and then SelectForceXForm according to the order in td files. In each SelectXXXForm(), we will collect the flags once. There should also be some redundant logics?

Can we select the address mode not starting from td files? Instead, we start the selection from cpp file for IR level load? For example in function PPCDAGToDAGISel::Select(), change case ISD::LOAD and inside the case, we call SelectOptimalAddrMode and then select the PPC instruction directly in cpp files.

So for the new infra, I am thinking should we avoid this redundant logic? I still see redundant checks/flags collects for one load/store instruction. We first check SelectDSForm and then check SelectXForm and then SelectForceXForm according to the order in td files. In each SelectXXXForm(), we will collect the flags once. There should also be some redundant logics?

Can we select the address mode not starting from td files? Instead, we start the selection from cpp file for IR level load? For example in function PPCDAGToDAGISel::Select(), change case ISD::LOAD and inside the case, we call SelectOptimalAddrMode and then select the PPC instruction directly in cpp files.

I think this may be an interesting follow-up to optimize this slightly. However, I would imagine we would do this in PreprocessISelDAG(). Namely, we can replace ISD::LOAD with something like PPCISD::LOAD_W_ADDRMODE (and similarly for the stores). That way we can still specify more complex patterns in .td files rather than reimplementing everything in C++ code. Namely, we can still add things like:

def : Pat<(f128 (sint_to_fp (i64 (load xaddrX4:$src)))),
          (f128 (XSCVSDQP (LXSDX xaddrX4:$src)))>;

except it would be something like:

def : Pat<(f128 (sint_to_fp (i64 (PPCloadWAddForm DSForm:$src)))),
          (f128 (XSCVSDQP (LXSD DSForm:$src)))>;

However, considering the number of loads/stores in a typical DAG and how quickly these flags, etc. can be computed, this isn't a huge compile time improvement.

amyk updated this revision to Diff 315122.Jan 7 2021, 6:26 AM
amyk retitled this revision from [PowerPC][NFC] Add new infrastructure to select load/store instructions, update P8/P9 load/store patterns. to [PowerPC] Add new infrastructure to select load/store instructions, update P8/P9 load/store patterns..
  • Updated names of the selection functions in the td patterns
  • Address the comment of using APInt when computing address flags
  • Removed NFC from the title as there is one test case update that we expect a DSForm in (instead of an XForm instruction)
amyk marked an inline comment as done.Jan 7 2021, 6:28 AM

Removed the addition of isIntS32Immediate() - it is no longer needed after addressing the comment of using APInt.

Thanks a lot for working on that Amy ! I have some comments on the patch.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
16959

The address flags are are stored -> The address flags are stored

16966–16967

I think you can directly use the Subtarget field of the PPCTargetLowering class instead of using a static_cast here.

16980

This line should be moved below just before the assert.

17027

consstants -> constants

17056–17057

Should we name this flag MOF_NotAddNorCst then ?

17105–17107

I know this is code we already have in SelectAddressRegRegOnly, but isn't that condition weird ? Or at least it doesn't match the comment above. If I understand correctly, this is the case where we get rid of the add, but looking at the condition, we get rid of it only if it is not an add of a value and a 16-bit signed constant or if one of the operands doesn't have a single use.
Shouldn't the condition be (N.getOpcode() == ISD::ADD && !isIntS16Immediate(N.getOperand(1), ForceXFormImm) && N.getOperand(1).hasOneUse() && N.getOperand(0).hasOneUse()) instead ?

llvm/lib/Target/PowerPC/PPCISelLowering.h
695–700

The flag names are not really uniform, I guess DWInt is for DoubleWordInt ? I think we should drop the abbreviation here so it is more clear what the flags represent. Something like:

MOF_SubWordInt = 1 << 15,
MOF_WordInt = 1 << 16,
MOF_DoubleWordInt = 1 << 17,
MOF_ScalarFloat = 1 << 18,
MOF_Vector = 1 << 19,
MOF_Vector256 = 1 << 20,
703

Maybe we should name this flag MOF_SubtargetBeforeP9 or something like that, to make it clear that the flag is not set on P10.

722–723

I think the map and the function declaration should be moved with the others private fields below.

1089

This method should be private.

1094

This method should be private.

amyk updated this revision to Diff 318115.Jan 20 2021, 10:54 PM

Addressed review comments:

  • Rename some of the MemOpFlags
  • Fixed typo in comments
  • Moved variables near their use
  • Made some of the methods private
amyk marked 10 inline comments as done.Jan 21 2021, 3:12 PM
amyk added inline comments.
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
17105–17107

That's a good point, Baptiste. You're right in that this is the case where we eliminate the add.

I thought about it a bit I believe the this condition might match the comment more (only get rid of the add if we do not have an add of a value and a signed 16-bit immediate, and the two operands don't have a single use)

if (N.getOpcode() == ISD::ADD &&
      (!(isIntS16Immediate(N.getOperand(1), imm) &&
         N.getOperand(1).hasOneUse() && N.getOperand(0).hasOneUse())))

which I believe, should then be equivalent to the condition in the code.

But yes, the condition was previously in SelectAddressRegRegOnly() which is why I kept it.
If there are any more concerns on the condition and/or comment, I can probably adjust it.

bsaleil accepted this revision as: bsaleil.Jan 22 2021, 3:10 PM

Thanks for addressing the comments, LGTM now.

This revision is now accepted and ready to land.Jan 22 2021, 3:10 PM
amyk updated this revision to Diff 320765.Feb 2 2021, 5:59 AM

Update patch to rebase with latest trunk, and rearrange/clean up code slightly.

amyk updated this revision to Diff 322972.Feb 11 2021, 5:58 AM

Update patch to move the check for if we have a value with an offset that fits into a 34-bit immediate into it's own condition instead of inside an else if.

amyk updated this revision to Diff 323854.Feb 15 2021, 6:00 PM

Update patch to fix a small issue when setting the Base and Disp for DForms when we have constant that fits in 32-bits.
Previously I used a uint64_t when it should have been a uint16_t.

amyk updated this revision to Diff 330636.Mar 15 2021, 6:41 AM

Rebase this patch to the latest changes.

nemanjai requested changes to this revision.Mar 20 2021, 3:42 PM

Thank you for handling this much needed refactoring. My comments are mostly related to readability so this is very close to approval, but let's have another look when you address the comments.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1432

s/instruction formats/addressing modes

1437

I think you should add some comments regarding sample instructions that correspond to these sets of flags. For example:

PPC::MOF_ZExt | PPC::MOF_RPlusSImm16 | PPC::MOF_WordInt,    // LWZ
PPC::MOF_ZExt | PPC::MOF_RPlusSImm16 | PPC::MOF_SubWordInt, // LBZ, LHZ
PPC::MOF_SExt | PPC::MOF_RPlusSImm16 | PPC::MOF_SubWordInt, // LHA
...
2521

I think we no longer need to put the name of the function in Doxygen comments.

2523–2524

Replace:

This is for when we have an OR of disjoint bitfields, we can codegen it as an add (for better address arithmetic).

with

An OR of two provably disjoint values is equivalent to an ADD. Most PPC load/store instructions compute the effective address as a sum, so doing this conversion is useful.
16952

s/return an X-Form instructions can always be matched/return an X-Form as it is the most general addressing mode

16959–16960

Remove this part of the comment. That is not done here AFAICT.

16970

I think this should be checking for prefixed instructions. It is entirely possible to have
-mcpu=pwr9 -Xclang -target-feature -Xclang -prefix-instrs (or llc -mattr=-prefix-instrs).

16983

This is a large lambda that is only called once. AFAICT, it only captures FlagSet. I think it is probably better off as a static function and FlagSet can be passed by reference.

Removing it will also remove SetAlignFlagsForImm from this function and the whole thing should become significantly more readable.

Also as a minor nit, you are less likely to have these mismatches in naming/capitalization such as between SetAlignFlagsForImm and computeFlagsForAddressComputation. Both are lambdas, one is capitalized as a variable and the other as a function - understandable omission since a lambda is both.

17044

s/Vectors only/Integer vectors.

17055

Seems like we should have another unreachable here in case we end up with an illegal floating point type such as half precision or some weird FP type (such as Intel's 80-bit FP).

17078

Add this to the comment:

// We set the extension mode to zero extension so we don't have
// to add separate entries in AddrModesMap for stores and loads.
17084–17085
// If we don't have prefixed instructions, 34-bit constants should be
// treated as PPC::MOF_NotAddNorCst so they can match D-Forms.
17086–17089
bool Is34BitConstNoP10 =
    (PPC::MOF_RPlusSImm34 | PPC::MOF_AddrIsSImm32 | PPC::MOF_SubtargetP10) &
    FlagSet == PPC::MOF_RPlusSImm34;
if (N.getOpcode() != ISD::ADD && N.getOpcode() != ISD::OR &&
    IsNonP1034BitConst)
  FlagSet |= PPC::MOF_NotAddNorCst;
17156

This condition seems superfluous. Why do we check that the operand is a signed 16-bit constant (and then create another constant with the same value) when we have PPC::MOF_RPlusSImm16 set? Doesn't the flag already assure us that this is so?

Can we not just assert that this is so?

17191

Why can't this be
(CNType == MVT::i32 || isInt<32>(CNImm))

17193

For purposes where the width of the value matters, please refrain from using types with an implicit size and use explicitly sized types (i.e. int32_t vs. int, int16_t vs. short).

llvm/lib/Target/PowerPC/PPCISelLowering.h
678

// Extension mode for integer loads.

684–692
MOF_NotAddNorCst = 1 << 5,      // Not const. or sum of ptr and scalar.
MOF_RPlusSImm16 = 1 << 6,       // Reg plus signed 16-bit constant.
MOF_RPlusLo = 1 << 7,           // Reg plus signed 16-bit relocation
MOF_RPlusSImm16Mult4 = 1 << 8,  // Reg plus 16-bit signed multiple of 4.
MOF_RPlusSImm16Mult16 = 1 << 9, // Reg plus 16-bit signed multiple of 16.
MOF_RPlusSImm34 = 1 << 10,      // Reg plus 34-bit signed constant.
MOF_RPlusR = 1 << 11,           // Sum of two variables.
MOF_PCRel = 1 << 12,            // PC-Relative relocation.
MOF_AddrIsSImm32 = 1 << 13,     // A simple 32-bit constant.
698–699
MOF_ScalarFloat = 1 << 18, // Scalar single or double precision.
MOF_Vector = 1 << 19,      // Vector types and quad precision scalars.
1383

s/are are/are

This revision now requires changes to proceed.Mar 20 2021, 3:42 PM

I understand this refactoring was done to simplify adding instructions both now and in the future. Would it be possible to add comments in the code to instruct the next developer that attempts to add instructions on the how to do it.

amyk marked 24 inline comments as done.Mar 23 2021, 7:51 AM
amyk added inline comments.
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
16970

I actually added a flag for prefixed instructions in https://reviews.llvm.org/D96075. Is this approach still acceptable, or would you prefer it to be done here instead in the approach you mentioned?

amyk updated this revision to Diff 333291.Mar 25 2021, 6:57 AM
  • Create a static function to compute flags for address computation
  • Create static function to set alignment flags for FrameIndex
  • Update comments
amyk updated this revision to Diff 338417.Apr 18 2021, 7:48 PM
amyk marked 2 inline comments as done.
  • Rebase patch
  • Add documentation regarding refactored load and store implementation
  • Add the PPC::MOF_SubtargetP10 flag if the subtarget has prefixed instructions
nemanjai accepted this revision.Apr 23 2021, 2:29 PM

The rest of the comments I have are minor comment and code restructuring nits. Those can be addressed on the commit. Otherwise, LGTM.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
16963

Please reduce nesting by flipping this and early-exiting:

FrameIndexSDNode *FI =
  dyn_cast<FrameIndexSDNode>(IsAdd ? N.getOperand(0) : N);
if (!FI)
  return;
// The rest of the code is not nested in this if
16966

It might be clear to call this something like FrameIndexAlign since it seems to refer to alignment rather than a value.

17194
// This is a register plus a 16-bit immediate. The base will be the
// register and the displacement will be the immediate unless it
// isn't sufficiently aligned.
17201–17206
Disp = DAG.getTargetConstant(Imm, DL, N.getValueType());
Base = Op0;
if (FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(Op0)) {
  Base = DAG.getTargetFrameIndex(FI->getIndex(), N.getValueType());
  fixupFuncForFI(DAG, FI->getIndex(), N.getValueType());
}
17210

This is not necessarily a load I think.

// This is a register plus the @lo relocation. The base is the register
// and the displacement is the global address.
17220
// This is a constant address at most 32 bits. The base will be
// zero or load-immediate-shifted and the displacement will be
// the low 16 bits of the address.
This revision is now accepted and ready to land.Apr 23 2021, 2:29 PM
This revision was landed with ongoing or failed builds.Apr 30 2021, 7:53 AM
This revision was automatically updated to reflect the committed changes.