Page MenuHomePhabricator

[Analysis] TTI: Add CastContextHint for getCastInstrCost
ClosedPublic

Authored by dmgreen on Apr 30 2020, 2:38 AM.

Details

Summary

Currently, getCastInstrCost has limited information about the cast it's rating, often just the opcode and types.
Sometimes there is an instruction as well, but it isn't trustworthy: for instance, when the vectorizer is rating a plan, it calls getCastInstrCost with the old instructions when, in fact, it's trying to evaluate the cost of the instruction post-vectorization.
Thus, the current system isn't good enough as the cost of a cast can vary greatly based on the context in which it's used.

For example, if the vectorizer queries getCastInstrCost to evaluate the cost of a sext (load) with tail predication enabled, getCastInstrCost will think it's free most of the time, but it's not always free. On ARM MVE, if the destination type doesn't fit in a 128 bits register, those kind of casts can be very expensive (2 or more instructions per vector element: each element is converted individually!). (Note: the fix for this is the child revision).

To fix that, this path adds a new parameter to getCastInstrCost to give it a hint about the context of the cast. It adds a CastContextHint enum which contains the type of the load/store being created by the vectorizer - one for each of the types it can produce.

Diff Detail

Event Timeline

Pierre-vh created this revision.Apr 30 2020, 2:38 AM
Pierre-vh created this object with visibility "No One".
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2020, 2:38 AM
Pierre-vh edited the summary of this revision. (Show Details)Apr 30 2020, 3:32 AM
Pierre-vh added reviewers: RKSimon, spatel, craig.topper.

This feels like a hack to me too, I think we need to move away from passing snippets of information to the instruction cost hooks. There are many other places in the vectorizer where instruction costs are calculated too. What information do we have at this point and what do we need to know? I like the sound of TTI taking something like the LoopVectorizationLegality object, but doing it at a higher level than on a per-instruction basis, allowing TTI to look at the loop.

This feels like a hack to me too, I think we need to move away from passing snippets of information to the instruction cost hooks. There are many other places in the vectorizer where instruction costs are calculated too. What information do we have at this point and what do we need to know? I like the sound of TTI taking something like the LoopVectorizationLegality object, but doing it at a higher level than on a per-instruction basis, allowing TTI to look at the loop.

What information do we have at this point and what do we need to know?

For D79163, we'd need to know whether a LoadInstr or a StoreInstr will become a masked load/store or not (and LoopVectorizationLegality can answer that question).

I like the sound of TTI taking something like the LoopVectorizationLegality object, but doing it at a higher level than on a per-instruction basis, allowing TTI to look at the loop.

What do you mean by "doing it at a higher level"?
LoopVectorizationLegality already contains the Loop and LoopInfo objects. We'd just need to add more (const) getters to the class so TTI can access them (because they're private right now).

What do you mean by "doing it at a higher level"?

I mean that trying to attach extra bits of information to individual instruction costs hooks doesn't scale. Also, if we want to add in-depth information, such as the legalization object, then we'd probably have to create a whole new API for the vectorizer to use too. Instead, we could enable TTI to calculate the cost of the loop, not just each instruction. This would give us the freedom to evaluate all the memory operations, evaluating their extends/truncs together, and enable us to make a good decision.

samparker added reviewers: Ayal, kbarton.

What do you mean by "doing it at a higher level"?

I mean that trying to attach extra bits of information to individual instruction costs hooks doesn't scale. Also, if we want to add in-depth information, such as the legalization object, then we'd probably have to create a whole new API for the vectorizer to use too. Instead, we could enable TTI to calculate the cost of the loop, not just each instruction. This would give us the freedom to evaluate all the memory operations, evaluating their extends/truncs together, and enable us to make a good decision.

This sounds like a much better solution than what I proposed, but wouldn't that "getLoopCost" method also need to access the legalization object? How would it know which load/stores will be masked? Would it guess it by looking at the loop as a whole?

Pierre-vh edited the summary of this revision. (Show Details)Apr 30 2020, 6:21 AM

but wouldn't that "getLoopCost" method also need to access the legalization object?

Indeed. But if were have a getLoopVectorizationCost API, then it would seem reasonable enough to pass that object there.

I don't think I agree that this is a hack, exactly. At least if it was cleaned up. It follows the same method as getArithmeticInstrCost where the type of the parameter is passed through, allowing us to get the information we need but not forcing us to pin this to a potentially incorrect or non-existent instructions. This separation seems like a good thing if we can do it well.

We (ARM/MVE) need to do the same thing for the cost of gather and interleaved loads. Whether the sext/zext is free there is equally variable. The way that I would have imagined this is a enum that can be one of the types of loads that the vectorizer produces (Normal, Masked, Interleave, Gather, Expanded?). There probably needs to be an option for None or Unknown too. I understand that you tried this before but ran into trouble? Can you speak to what kinds of problems you ran into doing things that way?

Indeed. But if were have a getLoopVectorizationCost API, then it would seem reasonable enough to pass that object there.

Although this might well be something that we do need in the long run, it will be very tied into how vplan ends up doing costmodelling, should probably not be limited the loopvectorization and is probably a much bigger task to design and implement than this. Not something that an intern with less than a month left should be asked to do. Being able to cost blocks of code at a time feels like it's quite important to MVE, but is not something we should rush into here.

The other option if all this is unworkable is to put the cost into getMaskedMemOpCost. Or just bypass the cost modelling and force the vectorizer to not consider wide vectors when tail predicating, which I think is something that you've suggested before. If we do try to cost it properly, where we choose to put the high cost is up to us in a way. It's the masked load we are choosing not to split into something where the the extend would end up as free, even if it is extend which is expanded. The actual cost, when you get down to it, comes from choosing to not split the masked load and not being able to sensible split VCTP's, leaving the predicate shuffle between the vctp and the load as high cost.

We may run into the same kinds of problems in getMaskedMemOpCost, but if we pass an 'I' through we can try and tell if it needs to be extended there.

We (ARM/MVE) need to do the same thing for the cost of gather and interleaved loads. Whether the sext/zext is free there is equally variable. The way that I would have imagined this is a enum that can be one of the types of loads that the vectorizer produces (Normal, Masked, Interleave, Gather, Expanded?). There probably needs to be an option for None or Unknown too. I understand that you tried this before but ran into trouble? Can you speak to what kinds of problems you ran into doing things that way?

Yes, I originally tried an enum with one entry per "kind" of load/store, as you said. It was working fine, but I felt that it was a bit confusing, as each entry had a different meaning based on the opcode. For instance, I had the MaskedVector entry, which, for most casts, meant that the operand was a masked load, but for truncs meant that the single user of the cast is a masked store. ,What if another target needs to deal with truncs of masked loads or something like that ? They wouldn't be able to use that API, they'd have to hack it. (I don't know if this will ever come up, I'm just trying to think about all of the use cases for this enum)
With the current format of the enum, it's a clearer IMHO, there is no ambiguity, and any target can add their specific case in there and deal with it in the backend.

Of course, both versions work equally well, I'm fine with both.

Yes, I originally tried an enum with one entry per "kind" of load/store, as you said. It was working fine, but I felt that it was a bit confusing, as each entry had a different meaning based on the opcode. For instance, I had the MaskedVector entry, which, for most casts, meant that the operand was a masked load, but for truncs meant that the single user of the cast is a masked store. ,What if another target needs to deal with truncs of masked loads or something like that ? They wouldn't be able to use that API, they'd have to hack it. (I don't know if this will ever come up, I'm just trying to think about all of the use cases for this enum)
With the current format of the enum, it's a clearer IMHO, there is no ambiguity, and any target can add their specific case in there and deal with it in the backend.

Of course, both versions work equally well, I'm fine with both.

I feel like a truncating load is not a very common thing. We should try and get the common case working first. Although I see your point about the other types of casts, it might be enough at the moment to only look at sext, zext and trunc. Trunc is a little different because we are looking at the operands, might not be the prettiest, but otherwise I think should be OK.

I think that we in MVE need a way to distinguish all the types of loads/stores that the vectorizer produces, even if the context instruction is incorrect and cannot be trusted. But it may be better not to think of this from an individual backend perspective exactly. We are trying to pass the information that the midend knows through to the costmodel. In that way it makes sense to me to add a parameter that has vales {None, Normal, Masked, Interleaved and Gather}, maybe reversed too as the vectorizer can produce it. Get them to be passed through to the correct places, or calculated from the context if nothing else is known. That sounds like it should be able to be done cleanly and simply enough.

I believe that extending load and truncating stores are at least the most common ones we need to worry about, (if not the only one's). Can give that a go, see how it looks?

Yes, I originally tried an enum with one entry per "kind" of load/store, as you said. It was working fine, but I felt that it was a bit confusing, as each entry had a different meaning based on the opcode. For instance, I had the MaskedVector entry, which, for most casts, meant that the operand was a masked load, but for truncs meant that the single user of the cast is a masked store. ,What if another target needs to deal with truncs of masked loads or something like that ? They wouldn't be able to use that API, they'd have to hack it. (I don't know if this will ever come up, I'm just trying to think about all of the use cases for this enum)
With the current format of the enum, it's a clearer IMHO, there is no ambiguity, and any target can add their specific case in there and deal with it in the backend.

Of course, both versions work equally well, I'm fine with both.

I feel like a truncating load is not a very common thing. We should try and get the common case working first. Although I see your point about the other types of casts, it might be enough at the moment to only look at sext, zext and trunc. Trunc is a little different because we are looking at the operands, might not be the prettiest, but otherwise I think should be OK.

I think that we in MVE need a way to distinguish all the types of loads/stores that the vectorizer produces, even if the context instruction is incorrect and cannot be trusted. But it may be better not to think of this from an individual backend perspective exactly. We are trying to pass the information that the midend knows through to the costmodel. In that way it makes sense to me to add a parameter that has vales {None, Normal, Masked, Interleaved and Gather}, maybe reversed too as the vectorizer can produce it. Get them to be passed through to the correct places, or calculated from the context if nothing else is known. That sounds like it should be able to be done cleanly and simply enough.

I believe that extending load and truncating stores are at least the most common ones we need to worry about, (if not the only one's). Can give that a go, see how it looks?

If I understand correctly, I should:

  • Use {None, Normal, Masked, Interleaved and Gather} instead of the current enum values. (No Scatter? e.g. for trunc to scatter store? Do I need the "reversed" one as well?)
  • Only set CastContextHint for zext, sext and trunc.

Is that correct? If yes, I can give it a go and we'll see how it looks.

If I understand correctly, I should:

  • Use {None, Normal, Masked, Interleaved and Gather} instead of the current enum values. (No Scatter? e.g. for trunc to scatter store? Do I need the "reversed" one as well?)
  • Only set CastContextHint for zext, sext and trunc.

Is that correct? If yes, I can give it a go and we'll see how it looks.

I think the vectorizer calls it GatherScatter. That sounds like a good thing to mirror.

Reverse end up as extend(reverseshuffle(load))), and it doesn't look like the extend will usually get pushed through the shuffle. You could either make it a None or add and extra Reverse option for completeness. Either way, it should probably not be free by default.

Pierre-vh updated this revision to Diff 263134.May 11 2020, 3:20 AM
  • Moved CastContextHint to TargetTransformInfo
  • Moved the logic that calculates the CastContextHint from an Instruction* from getCastInstrCost (in TargetTransformInfo.cpp) to a static function in TargetTransformInfo named getCastContextHint).
  • CastContextHint is no longer an optional parameter. Callers have to choose between using a None for the context, using a custom context or calling TTI::getCastContextHint
  • Removed my change in BasicTTIImpl.h - (Should I restore it? It seemed to have no effects on tests)

This should much better than the previous version, and more in-line with other similar enums used by getArithmeticInstrCost for instance.

Note that this patch doesn't support Interleave and Reversed in getCastContextHint - I personally don't know how to do that, so feedback is welcome.

Yeah, I think this is looking better. Thanks for the update. But I would be interested in the opinion of others too. It seems to get this information through to the costmodel and be more accurate than just presuming 'I' will be correct.

Note that this patch doesn't support Interleave and Reversed in getCastContextHint - I personally don't know how to do that, so feedback is welcome.

I think that's fine of the moment. They are implemented as shufflevectors, so I don't think the extend and the load will usually combine. Essentially treating them as None should be OK, as far as I understand. And we can adjust that in the future if we need to.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1041

Considering how this is used, it looks like None would mean "This is not a load/strore", not "No context". I guess if we don't have I and we are not provided with anything better, we don't have any context. But in all other situations we at least try to calculate it from I.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
765

I feel like this should now be CCH == TTI::CastContextHint::Normal.

The other types of loads won't apply for the logic below.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6481

If the VF ==1 and the operand is a load/store, this should likely use Normal.

6482

You either don't need unsigned Opcode = I->getOpcode(), or the variable can be used more throughout this function?

Pierre-vh updated this revision to Diff 263646.May 13 2020, 1:56 AM
Pierre-vh marked 4 inline comments as done.

Had a first quick look, and here's a drive-by comment from my side: I can't say it was love at first sight for me with this patch. It indeed feels like a very narrow approach, and am not entirely sure if it is similar to getArithmeticInstrCost. I see the challenge here though, and haven't given it any time to think about an alternative. Not sure, would it be worth to write to the dev list to get some more exposure/views/opinion on this?

The general idea of passing down information about the instruction makes sense.

The one thing I would say here is that we probably shouldn't pass down the "Instruction" from the vectorizer at all: in general, the cost computation doesn't know what sort of transforms the vectorizer is going to do, so any information computed from its operands/uses will be misleading in general.

The general idea of passing down information about the instruction makes sense.

The one thing I would say here is that we probably shouldn't pass down the "Instruction" from the vectorizer at all: in general, the cost computation doesn't know what sort of transforms the vectorizer is going to do, so any information computed from its operands/uses will be misleading in general.

I agree, I'll remove the instruction from the calls to getCastInstrCost in the vectorizer.
It was mostly useless with those changes anyway. I removed it, and tests are still passing just fine.

Had a first quick look, and here's a drive-by comment from my side: I can't say it was love at first sight for me with this patch. It indeed feels like a very narrow approach, and am not entirely sure if it is similar to getArithmeticInstrCost. I see the challenge here though, and haven't given it any time to think about an alternative. Not sure, would it be worth to write to the dev list to get some more exposure/views/opinion on this?

I can't think of many alternatives though. We also thought about implementing getMaskedMemoryOpCost in the ARM backend, but it doesn't solve the issue nicely:

  • The cost would be in the wrong place: it's not the load that's expensive, it's the extend/trunc. If you remove the cast, the problem disappears.
  • We'd also need to change getMaskedMemoryOpCost. At the very least, it'll need to know the type after/before extend/trunc, which will need to be calculated in the vectorizer to account for the minimal bitwidths (e.g. sometimes, instructions such as zext i32 can become zext i16 after vectorization as the vectorizer know that the value doesn't need more than 16 bits)

I agree that sending a message on the mailing list (with a link to this patch) would be a good idea so we can get more feedback.

  • Removing instruction from calls to getCastInstrCost in the LoopVectorizer.

I agree that sending a message on the mailing list (with a link to this patch) would be a good idea so we can get more feedback.

Well, I was happy that @efriedma replied and shared his opinion, as it looks like we are on the right track and not doing something really strange. Don't know, perhaps that's enough confirmation.

On its own, this doesn't seem like the sort of change that would need an llvm-dev thread; it's a localized change to one specific function on TargetTransformInfo. If you're looking for feedback on the general approach of adding "context" hints to get*InstrCost, it might make sense to send a message to llvm-dev outlining the general direction.

If you're looking for feedback on the general approach of adding "context" hints to get*InstrCost, it might make sense to send a message to llvm-dev outlining the general direction.

Yeah, so I was getting worried people might want to see a more generic approach here. We might want to do that anyway at some point, but for now I of course don't want to torpedo or delay this work, so was happy that this would be acceptable as a local change to getCastInstrCost.

  • Removing instruction from calls to getCastInstrCost in the LoopVectorizer.

Although I would agree that in theory this would make a lot of sense, there are other places that are currently using the context instruction for things that are not modeled here. And just removing them from the vectorier will likely lead to regressions in practice. They won't be tested because the costmodel is usually tested through non-vectorizer tests (testing the costmodel directly), and this change will now treat the vectorizer differently to those tests.

From a quick look:

  • aarch64 looks for "isWideningInstruction".
  • arm/neon can now do the same as of a recent patch.
  • systemz seems to use it for loads and.. something to do with compares?

I would recommend that for the moment we keep the context instruction in place from the vectorizer. The alternative would be to try and replace all needed modelling with hints or extra parameters, but that sounds like it will get very messy quite quickly. If only the opcode of the surrounding instructions is used, it will likely be "correct enough" for most cases (I think). In the long run as the vectorizer learns to transform code more, and vplan starts to learn new tricks this is more likely to break down, but I think that will need larger changes to the costmodelling anyway. What we have here is at least well defined (the type of the load/store) and is known to be fixing something that is incorrectly used at the moment.

In the long run I would like to see something that really tries to cost multiple instructions at the same time. If we have trunc(shl(mul(sext, sext))) and we know in the backend that we can convert that to a vmulh, it's going to be next to impossible to sensible costmodel that without something that looks at the entire tree and gives it a single cost. You don't want to have a sextend looking at it's uses uses uses to see if the whole thing together makes something that is cheap vs something that is expensive. Maybe that's not a great example but hopefully you can see my point. I imagine it might need a better way to get context instructions for things that don't exist too. From vplan recipes or runtime unrolled loops or the like. It would be good to be able to get a fake instruction analogue without being tied specifically to the original IR. That will all need a lot of careful design though.

Pierre-vh updated this revision to Diff 264892.May 19 2020, 7:04 AM

Re-adding the instructions in the calls to TTI::getCastInstrCost in the LoopVectorizer.

Thanks. I would be happy with this. But I'd like to get a second opinion on that.

(I guess you argue that Masked shouldn't be a type, and should be applied as some sort of modifier to the other types. I don't know of any architecture where this would matter though, so may not be important in practice. Gathers are always masked and Interleaved are never, from the architectures I know of that make use of them.)

@Pierre-vh Sam has made some potentially awkward changes here. Can you rebase this and see if it's still looking OK?

Pierre-vh updated this revision to Diff 265463.May 21 2020, 2:33 AM

Rebasing the patch - there's now one more call side for getCastInstrCost

dmgreen added inline comments.May 21 2020, 7:28 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
765

Actually looking at the code again, this needn't check I? Because we are not using it's value directly, and we know the operand it a load now.

That would imply that the if (!I) condition above can change to just guard the TLI call too.

Pierre-vh marked an inline comment as done.
Pierre-vh updated this revision to Diff 265677.May 22 2020, 1:54 AM

Rebasing the patch. (Changed code in TargetTransformInfoImpl.hpp, around line 864).

dmgreen commandeered this revision.May 26 2020, 2:49 AM
dmgreen edited reviewers, added: Pierre-vh; removed: dmgreen.
dmgreen removed a reviewer: dmgreen.

Pierre is now off to do more important things, so I'll take this over. I'll commandeer the patch and rebase it.

dmgreen updated this revision to Diff 266143.May 26 2020, 2:50 AM
dmgreen updated this revision to Diff 271695.Jun 18 2020, 6:14 AM

Rebase onto trunk and attempt to get some FP converts working in the same way too.

dmgreen edited the summary of this revision. (Show Details)Jun 18 2020, 6:19 AM
bmahjour removed a subscriber: bmahjour.Jun 18 2020, 7:06 AM

Ping. Anyone happy/unhappy with this?

fhahn added inline comments.Jul 2 2020, 1:20 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1046

Did you consider also excluding non load/store cast uses? For example, it might be worth to include arithmetic instructions into which casts can be folded, e.g. USUBL & co on AArch64.

dmgreen marked an inline comment as done.Jul 2 2020, 8:11 AM
dmgreen added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1046

Yes.. That might be a possibility.

AArch64TTIImpl::getCastInstrCost already has some code to check for isWideningInstruction, which should already catch a lot of these situations I think. That relies on the context instruction being correct, but I believe that will be correct most of the time currently. It's really the type of the load or store that is most often incorrect.

I'm honestly not sure how well this would scale to many different operand kinds. I think in the long run we need something that makes costing multiple nodes together easier, if you imagine something like a trunc(shift(mul(sext(x), sext(y), 16), all being a single instruction! And with vplan making larger changes during vectorization it would ideally handle "hypothetical" instructions better without relying on "real" context instructions.

But this patch is at least a little step that gets the load/store kinds more correct.

fhahn added inline comments.Jul 15 2020, 10:54 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1046

AArch64TTIImpl::getCastInstrCost already has some code to check for isWideningInstruction, which should already catch a lot of these situations I think. That relies on the context instruction being correct, but I believe that will be correct most of the time currently. It's really the type of the load or store that is most often incorrect.

Yep, but I think it only works if the original IR does the widening already as part of the context instruction.

I'm honestly not sure how well this would scale to many different operand kinds. I think in the long run we need something that makes costing multiple nodes together easier, if you imagine something like a trunc(shift(mul(sext(x), sext(y), 16), all being a single instructio

yeah, the isolated helpers reach a limit of usefulness. A way to cost arbitrary instruction trees would be very useful in many contexts (with some limits to the size I guess).

But this patch is at least a little step that gets the load/store kinds more correct.

Sounds good to me, it might be good to just mention the reason for only limiting to memory cases initially somewhere (sorry if I missed that this is already said somewhere.

dmgreen updated this revision to Diff 278441.Jul 16 2020, 5:34 AM

Update the comment with a fixme about possibly changing to costing multiple instructions.

SjoerdMeijer accepted this revision.Jul 27 2020, 9:31 AM

I think there's consensus this patch is a reasonable fix. So, this LGTM, but please wait a day in case there are more comments.

llvm/lib/Analysis/TargetTransformInfo.cpp
733

Can this and computeTruncCastContextHint be merged into 1 function? Essentially only the cases in the switch are different, which can be merged.

This revision is now accepted and ready to land.Jul 27 2020, 9:31 AM
This revision was automatically updated to reflect the committed changes.