Page MenuHomePhabricator

[Intrinsic] Signed Saturation Intrinsic
Needs ReviewPublic

Authored by leonardchan on Sep 19 2018, 5:28 PM.

Details

Summary

Add a new intrinsic that saturates a signed integer. If a value X provided is larger than the maximum value that can be represented by the provided bit width W, the this intrinsic returns that maximum value. If the X is smaller than the minimum value that can be represented in W bits, the intrinsic returns that minimum value. Otherwise, X is returned.

This is a part of implementing fixed point arithmetic in clang where some of the more complex operations will be implemented as intrinsics.

Diff Detail

Repository
rL LLVM

Event Timeline

leonardchan created this revision.Sep 19 2018, 5:28 PM

I think we need to decide where these intrinsics should be 'lowered' if they aren't legal, and define how to let the target specify legality.

Lowering during isel is probably the 'right' place, but I think it gets tricky when you have to deal with lowering illegal operations on legal types where the legalized operation needs to be done in an illegal type. It's easier to do it in IR instead, but that goes against the way legalization is typically done.

include/llvm/CodeGen/ISDOpcodes.h
264

With the way the rest is written, it doesn't seem like this DAG opcode is ever used.

The way you would normally do this in DAG would be to:

  • Lower the intrinsic to this node in SelectionDAGBuilder, regardless of type or operation legality
  • Deal with illegally typed nodes in LegalizeTypes
  • Deal with illegal opcodes in LegalizeDAG

I think that doing the legalization in DAG for some of the fixed-point operations could be hard. Opcode legalization occurs after type legalization, so the legalization for opcodes that have legal types but where the target doesn't actually support the operation must be able to legalize to valid types. I think this could get unnecessarily tricky. For example, an N-bit, M-scale fixsmul must actually be done as N+M-bits or greater, and if that type (or anything greater than it) isn't legal it gets hairy.

There's also the issue of how a target specifies legality of these operations. A target might support ISD::FIXSMUL on MVT::i16, but only with a scale of 15. However, there's currently no way of expressing legality of an operation based on anything than the type (in general). Something would probably have to be added to TargetLowering/TargetLoweringBase that lets targets specify the legality of these operations. (The same issue applies to ssaturate; it might be legal to do 'i32 ssaturate (X, 15)', but not 'i32 ssaturate (X, 19)')

lib/CodeGen/IntrinsicLowering.cpp
416 ↗(On Diff #166200)

This will replace the intrinsic with a call to 'ssaturate'. Given that llvm.ssaturate is overloaded, I'm not sure how that will work. If anything, you would want to replace the intrinsic with regular IR here instead.

This is the first time I've seen this 'IntrinsicLowering' component though. It seems like it's only part of the IR interpreter. Not sure how important that is.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5764

This should probably be building an SSAT node rather than actually converting the intrinsic to normal DAG operations.

lib/Transforms/FixedPoint/SaturationPass.cpp
13 ↗(On Diff #166200)

I think this would be more futureproof if it didn't strictly deal with ssaturate, but also lowered other fixed-point operations as well.

However, signed saturation is not strictly a fixed-point operation, so I'm not really sure how to go about it without it seeming odd.

It should probably also ask the target somehow if the operations should be converted at all. If this were considered a 'fixed-point lowering pass' then it could be inserted into the pass pipeline towards the end to lower operations that are not legal according to the target.

  • Removed IntructionLowering code since that was probably just for lli

I think we need to decide where these intrinsics should be 'lowered' if they aren't legal, and define how to let the target specify legality.

Lowering during isel is probably the 'right' place, but I think it gets tricky when you have to deal with lowering illegal operations on legal types where the legalized operation needs to be done in an illegal type. It's easier to do it in IR instead, but that goes against the way legalization is typically done.

My bad. This is my first time touching the LLVM backend so a lot of this is still new to me and I'm still trying to grasp how the IR lowering works, but I think I understand it a little better now and everything's coming together.

So if I understand this correctly, we can handle this intrinsic in 2 ways:

  1. The intrinsic gets converted to IR in a pass before touching the DAG.
  2. The intrinsic gets passed to the DAG and gets legalized and lowered to machine code there.

It seems that from the thread, the consensus was for going with (2), but I also see now how things would be much easier just having this run through a pass in (1). If I recall correctly, the main reason for offsetting the intrinsic "expansion" to the DAG was to determine if the intrinsic was supported by the hardware during the legalization steps. If so, couldn't this also be done in a pass when converting the intrinsic to IR in (1)? That is, we determine early in the pass if this intrinsic call is supported by this target. Those that aren't legal get expanded into IR whereas those that are legal get passed to DAG instruction selection.

I think this might avoid the trouble of having to check types and operations in the DAG, and ISel can choose whatever appropriate instructions for whatever supported intrinsic (if that makes sense).

include/llvm/CodeGen/ISDOpcodes.h
264

I think that doing the legalization in DAG for some of the fixed-point operations could be hard. Opcode legalization occurs after type legalization, so the legalization for opcodes that have legal types but where the target doesn't actually support the operation must be able to legalize to valid types. I think this could get unnecessarily tricky. For example, an N-bit, M-scale fixsmul must actually be done as N+M-bits or greater, and if that type (or anything greater than it) isn't legal it gets hairy.

I'm not sure if I follow still. (Still learning how the instruction selection process works). So for the fixsmul example, is the problem that during the operation legalization step, the target may not be able to support an int that has N+M bits? Wouldn't this legalization occur during the first type legalization step before operation legalization, like we check if the target can support a N+M bit int before operation legalization? Then during the following optimization stage, have a DAG pass that will expand an illegal intrinsic operation into other DAG nodes.

There's also the issue of how a target specifies legality of these operations. A target might support ISD::FIXSMUL on MVT::i16, but only with a scale of 15. However, there's currently no way of expressing legality of an operation based on anything than the type (in general). Something would probably have to be added to TargetLowering/TargetLoweringBase that lets targets specify the legality of these operations. (The same issue applies to ssaturate; it might be legal to do 'i32 ssaturate (X, 15)', but not 'i32 ssaturate (X, 19)')

Also having trouble understanding this one. Is the problem that you don't have access to operand types during the instruction legalization step?

lib/CodeGen/IntrinsicLowering.cpp
416 ↗(On Diff #166200)

Ah, ok. This is also my first time touching the LLVM backend so I'm not sure if this was the correct place to work on. I also noticed that this code seems to only be touched when using lli, so this probably doesn't need to be touched.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5764

Ah, ok, I think I'm starting to understand the DAG process better. I tried this earlier, but wasn't sure if that was the correct way to approach this. I initially had setValue(&I, DAG.getNode(ISD::SSAT, sdl, DAG.getVTList(...), Op1, Op2)) (which I believe creates an SSAT node), but always ran into this fatal error:

fatal error: error in backend: Cannot select: t4: i32 = ssaturate t2, Constant:i32<4>
  t2: i32,ch = CopyFromReg t0, Register:i32 %2
    t1: i32 = Register %2
  t3: i32 = Constant<4>
In function: main

But I think this is because I didn't make this node type illegal during either legalization step.

If I understand the purpose of this file correctly: this is where the initial build of the DAG is done, then legalization occurs, then for targets where specific intrinsic operations are illegal, this DAG should be "expanded" into other DAG nodes before instruction selection into machine instructions.

If this is the case, where should the actual expansion (which I think is what this code also does) occur? I imagine it would be during the optimization step between the operation legalization and actual machine instruction selection from the DAG.

lib/Transforms/FixedPoint/SaturationPass.cpp
13 ↗(On Diff #166200)

I think this would be more futureproof if it didn't strictly deal with ssaturate, but also lowered other fixed-point operations as well.

However, signed saturation is not strictly a fixed-point operation, so I'm not really sure how to go about it without it seeming odd.

I see, in a sense we have 2 different sets of intrinsics: the saturation and add/sub ones that work generically for integers, and the mul/div ones that care about the scale of the fixed point operands. Do you think it would be ok to propose this as 2 separate passes then where one is specifically applied to fixed point types that take a scale? This would just be creating another directory and renaming this one.

  • Moved DAG expansion from the builder to the legalization
leonardchan added inline comments.Sep 20 2018, 4:51 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

@ebevhan Do you know if this is where checking if this intrinsic is supported by the target should be?

So if I understand this correctly, we can handle this intrinsic in 2 ways:

  1. The intrinsic gets converted to IR in a pass before touching the DAG.
  2. The intrinsic gets passed to the DAG and gets legalized and lowered to machine code there.

It seems that from the thread, the consensus was for going with (2), but I also see now how things would be much easier just having this run through a pass in (1). If I recall correctly, the main reason for offsetting the intrinsic "expansion" to the DAG was to determine if the intrinsic was supported by the hardware during the legalization steps. If so, couldn't this also be done in a pass when converting the intrinsic to IR in (1)? That is, we determine early in the pass if this intrinsic call is supported by this target. Those that aren't legal get expanded into IR whereas those that are legal get passed to DAG instruction selection.

I think this might avoid the trouble of having to check types and operations in the DAG, and ISel can choose whatever appropriate instructions for whatever supported intrinsic (if that makes sense).

Yes, exactly. Typically legalization occurs on the isel level, but I think it's a lot easier to do some fixed-point operations on a higher level instead.

Though, I think this mostly applies to the operations that must be performed in a wider type, like multiplication and division. For operations like saturation, and saturated addition/subtraction, it's probably not as important since saturation can be done in the operation type just fine, and addition/subtraction hopefully have the overflow operations. It feels a bit odd to do some in DAG and others in IR, though... but if all of the DAG operations are also the ones that work on integer values as well, maybe that works fine?

include/llvm/CodeGen/ISDOpcodes.h
264

I'm not sure if I follow still. (Still learning how the instruction selection process works). So for the fixsmul example, is the problem that during the operation legalization step, the target may not be able to support an int that has N+M bits?

Correct. Operation legalization (as I understand it; I could be wrong!) can not introduce illegal types into the DAG since type legalization occurs first. Say that you have a 'i16 fixsmul(X, Y, 15)' on a target where i16 is legal and i32 is not, and the fixsmul is not legal. This passes type legalization just fine, because i16 is legal. Then we get to operation legalization. It determines that the fixsmul node is not legal, so it needs to either be expanded or promoted according to what the target says (generally from TLI.getOperationAction).

However, we know how a fixsmul should be lowered. It needs to be done as just a regular mul in a wider type; but we can't make this wider type, because it's not legal! If the target specified Expand (which it probably did in this case, since i32 is not legal), we need to do it as a bunch of i8 operations*. This just seems really messy to me; it would have been much cleaner if it had just been properly promoted from the start and we let isel legalize the higher level lowering instead.

*I'm unsure about this. It might be the case that if we need to expand an illegal node with a legal type, it is allowed to do the lowered operations in the legal type anyway, even though we said expand.

Wouldn't this legalization occur during the first type legalization step before operation legalization, like we check if the target can support a N+M bit int before operation legalization? Then during the following optimization stage, have a DAG pass that will expand an illegal intrinsic operation into other DAG nodes.

If the type of the node is legal, type legalization won't do anything. And N+M necessarily being legal for an N-bit fixsmul is not a requirement, since the fixsmul might itself be a legal, actual target operation, in which case the legality of N+M doesn't matter. I suppose you could ask about N+M if the node isn't legal as N, but that mixes type and operation legalization (since type legalization would then depend on whether an operation was legal in a type that we haven't even seen) and I'm not sure if the existing implementation ever does this.

Also having trouble understanding this one. Is the problem that you don't have access to operand types during the instruction legalization step?

Generally, DAG will call TLI.getOperationAction to determine the legality of operations. It passes the opcode and the type, but we need more than that to determine if one of these operations is legal. A new hook would have to be added to TargetLowering that takes both the opcode, the type and the scaling factor/saturating bitwidth of the node. I think I mentioned it in another comment.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

Yes, this would be it. For nodes like ssat and fixed-point ops, you could ask a custom target hook, maybe TLI.getScaledOperationAction(<opcode>, <type>, <scale/sat_bw>) to determine the legality of a given node.

When type-legalizing you also need to handle things for specific nodes as well, since things behave differently when you legalize operands/results on different nodes.

(Of course, this all depends on whether we do the legalization here or in IR instead.)

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5764

Yes, that line looks correct. The reason you got that error was because the final isel pattern matching step failed (i.e. the node was completely legal all the way through) but as there are no patterns for x86 to select that node, it couldn't finish.

If I understand the purpose of this file correctly: this is where the initial build of the DAG is done, then legalization occurs, then for targets where specific intrinsic operations are illegal, this DAG should be "expanded" into other DAG nodes before instruction selection into machine instructions.

Correct. The initial DAG is built here, then type-legalized (through expansion/promotion), then operation-legalized (through expansion/promotion) with some optimization steps in between both. Then it's passed to the pattern matcher to finish the selection.

Technically, it is possible to select intrinsics directly if they survive all the way down. They show up as 'int_XXX' nodes in the .td patterns. If we decide to circumvent the typical path for determining the legality of these intrinsics (such as adding a special target hook to TargetLowering or legalizing as IR instead), then we might not need a DAG node at all. Any such intrinsic that makes it to the isel stage would be 'legal'. I don't know what upstream thinks of that design, though.

It also seems that some nodes have special hooks to handle a bit extra even on the DAG level. That might be an option if we still decide to go with doing something in DAG.

If this is the case, where should the actual expansion (which I think is what this code also does) occur? I imagine it would be during the optimization step between the operation legalization and actual machine instruction selection from the DAG.

The actual expansion occurs during legalization. Typically, nodes can either be expanded (the type is divided in half and the operation is done in the halves) or promoted (the type is rounded up to the next legal type and done in the larger type instead). All of this occurs in LegalizeDAG.cpp, LegalizeTypes.cpp, and a number of other files (LegalizeIntegerTypes, LegalizeFloatTypes, LegalizeTypesGeneric etc). If you run a compilation with -debug-only=isel you'll get dumps of the DAG for each step of DAG selection:

Initial selection DAG: %bb.0 
...
Optimized lowered selection DAG: %bb.0 
...
Type-legalized selection DAG: %bb.0
...
Legalized selection DAG: %bb.0
...
Optimized legalized selection DAG: %bb.0
...
===== Instruction selection begins:
...

'Optimized' is done by DAGCombiner.

lib/Transforms/FixedPoint/SaturationPass.cpp
13 ↗(On Diff #166200)

It's true that the intrinsics do operate on different domains, but I think I would prefer them to be one and the same, since saturation is something that some fixed-point operations will want to do anyway. The expansion of a fixsmul_sat would contain a ssaturate, so it makes more sense to keep them in the same pass since this means you can iterate the expansion until you hit a fixed point (fixsmul_sat might not be legal but ssaturate might be, for example).

I guess you could have two passes and just run the fixed-point pass before the saturating integer pass...

  • Removed passes since ssaturate can be expanded in the DAG
leonardchan added a reviewer: rjmccall.
leonardchan changed the repository for this revision from rC Clang to rL LLVM.

Of the intrinsics we plan to implement, it seems that the ordering of legalization affects specifically the fixed point mul/div intrinsics since if they get expanded into other nodes, we will need to check again for legal types since performing scaled mul/div requires larger integers for a buffer. The remaining ones though I don't think require any extra legalization. Both saturation and saturated add/sub can be done without different sized operands.

For ssat specifically, I think this patch has most of the necessary code to have this ready for a formal review. I removed the passes for this one since it seems all expansion can be done safely in the DAG without any extra type or operation legalization.

include/llvm/CodeGen/ISDOpcodes.h
264

However, we know how a fixsmul should be lowered. It needs to be done as just a regular mul in a wider type; but we can't make this wider type, because it's not legal! If the target specified Expand (which it probably did in this case, since i32 is not legal), we need to do it as a bunch of i8 operations*. This just seems really messy to me; it would have been much cleaner if it had just been properly promoted from the start and we let isel legalize the higher level lowering instead.

I see. I'll add John on this patch also to get his opinion and see if anyone else has more suggestions on llvm-dev.

If the type of the node is legal, type legalization won't do anything. And N+M necessarily being legal for an N-bit fixsmul is not a requirement, since the fixsmul might itself be a legal, actual target operation, in which case the legality of N+M doesn't matter. I suppose you could ask about N+M if the node isn't legal as N, but that mixes type and operation legalization (since type legalization would then depend on whether an operation was legal in a type that we haven't even seen) and I'm not sure if the existing implementation ever does this.

Ah, so it seems what we really need is a way to check if the operation is legal first before the types, because if the operation is supported, then we don't need to care about any other requirements on the expanded operations (like the N+M bit int for an expanded fixsmul). In this case, it seems like this would be another reason to have the LLVM pass that also does some special legalization for the intrinsic.

Alternatively, it seems that we actually have all the information we need in the LegalizeOp step, that is we know which intrinsic we have, if this target supports this operation, and the bit widths of the operands and result in the event it is unsupported and needs to be expanded into other operations.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

Made the default action Expand for this in TargetLoweringBase and targets can override this themselves.

I think ssat may not need a custom hook since I don't think it requires anything extra like fixsmul if it has to be expanded. But I see what you mean now by how different nodes may require different steps for legalization.

ebevhan added inline comments.Sep 25 2018, 1:20 AM
include/llvm/CodeGen/ISDOpcodes.h
264

For the multiplications, I now realized there are the UMUL_LOHI and SMUL_LOHI nodes. Technically we want to be doing the operation as iN+M, but i2*N works fine too (and is better; it's more likely to be a legal type) since the scale cannot be larger than the bitwidth. Those nodes should be valid even for an iN type, even if i2*N isn't valid. I don't think there's an equivalent node for divisions, though.

Also, for saturating multiplication, you need to do saturation on the wider type, meaning that you need to do saturation on two separate parts of an integer. This would be the same as doing expanding type legalization on ssat, but since type legalization has already been performed, you can't reuse that part of the code for operation legalization.

Alternatively, it seems that we actually have all the information we need in the LegalizeOp step, that is we know which intrinsic we have, if this target supports this operation, and the bit widths of the operands and result in the event it is unsupported and needs to be expanded into other operations.

If you reach LegalizeOp, then the types of the operands and result must be legal since we have passed type legalization. We don't know if the target considers that operation, with that type, with that scale to be legal, though. Type legalization won't catch this, and getOperationAction does not let you pass enough information to determine it.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

You don't need an extra hook to determine anything extra for expansion, but you do need the hook to determine if it's legal in the first place. As I mentioned in another comment, an i16 ssat with a saturating width of 8 might be legal, but an i16 ssat with saturating width 12 might not. The former must not be expanded, but the latter must.

leonardchan added inline comments.Sep 25 2018, 10:59 AM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

Got it. Just to clarify though, legalizations regarding the scale would only apply for our mul/div intrinsics right? Operations like saturation and sat add/sub which don't care about the scale do not need this hook right, since it would also be useful for these operations to work on regular integers?

ebevhan added inline comments.Sep 26 2018, 7:18 AM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

Saturating add/sub won't need anything, but I still think saturation requires some extra check to determine if an operation with a specific saturating bitwidth should be legal or expanded. As I mentioned, a saturation with a type might be legal for one saturating bitwidth but not for another.

leonardchan marked an inline comment as done.
leonardchan removed a subscriber: cfe-commits.
leonardchan added inline comments.
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

Added the extra case to LegalizeOp and a getSaturationOperationAction() method.

ebevhan added inline comments.Sep 27 2018, 6:42 AM
include/llvm/CodeGen/ISDOpcodes.h
259

This comment doesn't mention that W must be constant.

include/llvm/CodeGen/TargetLowering.h
808

This says 'scaled' but currently only looks at saturating operations (and is named as such).

If you want to make it generic, maybe it should be 'getSatOrScaledOperationAction' and corresponding for the hook above? Not sure what to call the parameter, though.

817

'is supported in this type'?

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1119

Looks good!

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
146

Just a nit but this is short enough to go on one line.

540

Is this correct in the face of extension to a wider type? I think you want SExtPromotedInteger.

lib/CodeGen/SelectionDAG/LegalizeTypes.h
333

What about expansion? If you have an i64 ssat and i64 isn't a legal type, then it will likely decide to expand it during type legalization.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5765

Do we know if Op2 here works properly if the type of the operand (i32) is not legal? I suspect it will try to legalize it even though it shouldn't (since it's not a real value per se, just a parameterization of the operation).

Perhaps it should be a TargetConstant (which the legalizer won't touch), but I don't know if there's precedent for this.

leonardchan marked 8 inline comments as done.
leonardchan added inline comments.
include/llvm/CodeGen/TargetLowering.h
808

I wanted to separate them both into 2 functions that could be called after each other on intrinsics that use both saturation and scale. I initially wrote this as the scaled method when I meant saturation, but forgot to redo some of the comments.

lib/CodeGen/SelectionDAG/LegalizeTypes.h
333

Added a test case for illegal type.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5765

Didn't think about this either, but I also imagine the type will be legalized somehow also after exploring the type legalization code. Regardless, I still made it a TargetConstant.

I think this looks pretty good to me now. It's probably a good time to add someone who is familiar with DAG as reviewer at this point.

include/llvm/CodeGen/TargetLowering.h
808

Hm, okay. I'm assuming that the opcode (e.g. fixsmul_sat) would be passed down into the hooks in that case? Since the hook already knows what opcode is being examined, why not just have one hook? The legality of 'fixsmul_sat' for a certain type is not predicated on whether 'fixsmul' and 'ssat' are legal. And presumably, if the fixsmul_sat is legal, both hooks would return true anyway.

leonardchan added inline comments.Sep 28 2018, 10:22 AM
include/llvm/CodeGen/TargetLowering.h
808

Yes, fixsmul_sat would be passed to both hooks. Are you saying that an intrinsic that uses both scale and saturation like fixsmul_sat should have its own hook that checks both at the same time?

Something like

getScaledSaturationOperationAction(unsigned Op, EVT VT, unsigned Scale, unsigned SatBitWidth)

I imagine that for our purposes, 2 separate functions would work and be independent from each other because it seems cleaner, but if a user down the line requires this combined function, it could be reworked then in the future.

Adding @spatel. This feels like it could be implemented with a min+max sequence in IR using 2 icmps and 2 selects. We've already spent quite a lot of effort optimizing these common patterns in both IR and SelectionDAG with the SMAX/SMIN ISD opcodes.

RKSimon added a subscriber: RKSimon.
RKSimon added inline comments.
include/llvm/CodeGen/ISDOpcodes.h
264

Instead of constant nodes, we might be better off with a ValueType node here - similar to SIGN_EXTEND_INREG etc.?

  • Changed second argument of SSAT node from ConstantSDNode to VTSDNode
leonardchan marked an inline comment as done.Oct 2 2018, 10:53 AM

Adding @spatel. This feels like it could be implemented with a min+max sequence in IR using 2 icmps and 2 selects. We've already spent quite a lot of effort optimizing these common patterns in both IR and SelectionDAG with the SMAX/SMIN ISD opcodes.

This was one of the initial design choices @ebevhan and I discussed before with either expanding this intrinsic into IR or in the DAG. We could have a pass that expands this into IR, but wouldn't be legalizing the instruction the "normal" way of checking if the this intrinsic is legal in the DAG.

I'm not sure why you would need an intrinsic at all. Why can't you just emit the IR equivalent from the beginning. This looks the equivalent of doing something like "x = std::min(std::max(x, -2048), 2047)" in C++ which would be implemented in IR with 2 compares and 2 selects. This is not an uncommon pattern to find in real code and we spend a lot of effort to make sure its handled well. Adding an intrinsic means you'll miss out on the constant folding, knownbits, etc that we've already put in for min/max IR patterns.

I'm completely in favor of add/sub/mul with saturation intrinsics. I want them to replace the the target specific x86 intrinsics for vector PADDSB/PADDSW/PSUBSB/PSUBSW/etc. I just don't see the need for an intrinsic to just clip the range of a number to a certain number of bits.

I'm not sure why you would need an intrinsic at all. Why can't you just emit the IR equivalent from the beginning. This looks the equivalent of doing something like "x = std::min(std::max(x, -2048), 2047)" in C++ which would be implemented in IR with 2 compares and 2 selects. This is not an uncommon pattern to find in real code and we spend a lot of effort to make sure its handled well. Adding an intrinsic means you'll miss out on the constant folding, knownbits, etc that we've already put in for min/max IR patterns.

The issue with using pure IR for these operations is that if the optimizer so much as sneezes on the representation of the operation, there's a high risk that you won't properly select a native operation. This is extremely important on targets for which these operations -- fixed-point in particular, but also saturation -- are legal, like our downstream one.

It is pretty unfortunate that it means we can't take advantage of a lot of the work put into min-max patterns, which is one of the reasons why I thought an early lowering-to-IR pass would be interesting to consider on targets for which these operations are not legal, rather than doing the expansion as late as isel.

I'm not sure why you would need an intrinsic at all. Why can't you just emit the IR equivalent from the beginning. This looks the equivalent of doing something like "x = std::min(std::max(x, -2048), 2047)" in C++ which would be implemented in IR with 2 compares and 2 selects. This is not an uncommon pattern to find in real code and we spend a lot of effort to make sure its handled well. Adding an intrinsic means you'll miss out on the constant folding, knownbits, etc that we've already put in for min/max IR patterns.

The issue with using pure IR for these operations is that if the optimizer so much as sneezes on the representation of the operation, there's a high risk that you won't properly select a native operation. This is extremely important on targets for which these operations -- fixed-point in particular, but also saturation -- are legal, like our downstream one.

It is pretty unfortunate that it means we can't take advantage of a lot of the work put into min-max patterns, which is one of the reasons why I thought an early lowering-to-IR pass would be interesting to consider on targets for which these operations are not legal, rather than doing the expansion as late as isel.

I don't know anything about your downstream target. For this particular intrinsic, if the middle end optimizers mess it up how bad would it be. How bad is 2 compares and 2 selects?

include/llvm/CodeGen/TargetLowering.h
824

This case needs a "break;" so you don't surprise the next person who adds a case.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1121

Should we jsut pass the SatBitsNode->getVT() into getSaturationOperationAction and isSupportedSaturationOperation

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8937 ↗(On Diff #167979)

I think maybe this should be in TargetLowering if I understand anything about our division of labor today. It feels similar to expandMUL_LOHI, expandMUL, expandFP_TO_SINT, scalarizeVectorLoad, scalarizeVectorStore.

8938 ↗(On Diff #167979)

Asserts need messages.

8946 ↗(On Diff #167979)

This can just be getSizeInBits can't it?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5766

I think we should also verify that the second argument is contant in lib/IR/Verifier.cpp as well.

Ka-Ka added a subscriber: Ka-Ka.Oct 4 2018, 12:57 AM
leonardchan marked 5 inline comments as done.
leonardchan added inline comments.
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1121

As far as I know, saturation does not really need to depend on anything other than the saturation width, which is the main reason for why we only need a constant integer or value type to hold this width.

In this sense, I feel like passing anything else other than the width seems unnecessary, but if a user down the line needs more than the saturation width for checking sat legality, they could change these methods to accept the EVT instead.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8937 ↗(On Diff #167979)

This function is essentially for preventing duplicating code. There are different instances where we want to expand the node in the event we find the type or operation is illegal, so for convenience we call this when expanding.

craig.topper added inline comments.Oct 4 2018, 4:13 PM
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8937 ↗(On Diff #167979)

TargetLowering is available to all of the places that want to do an expansion. Those other functions i mention also exist to share code. So moving this would seem consistent.

*Ping*

@ebevhan @craig.topper Any more comments on this patch?

I'm also unfamiliar with @ebevhan's target, but I imagine there are different CPUs that have their own operations for performing saturation that could use this as an alternative to selects and compares.

*Ping*

@ebevhan @craig.topper Any more comments on this patch?

I'm also unfamiliar with @ebevhan's target, but I imagine there are different CPUs that have their own operations for performing saturation that could use this as an alternative to selects and compares.

Agreed, but that doesn't mean you can't pattern match the select+cmp sequence or a pair of ISD::MIN/MAX to a native operation with a target specific DAG combine. See for example detectSSatPattern and detectUSatPattern in X86ISelLowering.cpp. But maybe middle end IR can mess it up ever so slightly and make it not emit quite the right pattern and then you're left with something close to the pattern, but not quite. Maybe the MIN got removed or the MAX got removed because we could prove the number was positive or something. So then you end up with an incomplete pattern. I just want to understand what the worst case incomplete pattern would look like for @ebevhans's target.

lib/CodeGen/TargetLoweringBase.cpp
1881

Should have caught this earlier. LLVM style prefers to use auto only when its super obvious like on the result of dyn_cast or cast. So these should be APInt and SDValue. Though it could be argued that APInt is pretty obvious since its mentioned in the class on the right hand side, but APInt is only one character longer than auto since doesn't really save much.

I've been experimenting a bit with early expansion of our sat intrinsic to see how the code generation is affected by it. In general, it doesn't actually seem like there's much of an effect. In fact, neither our benchmarks nor the user code I'm looking at seem to be terribly affected by expanding the intrinsic. This is probably due to most of the instances of saturation being 'locked down' by surrounding intrinsics.

However, it's definitely possible to construct quite simple cases that are worsened by expansion. Here's an example. Obviously I'm the only one who can compile this, but I think the idea gets across:

__fixed ac(__accum a, unsigned int n) {
  a = a < 0.0a ? 0.0a : a;
  __accum s = 0.0a;
  for (unsigned i = 0; i < n; i++)
    s += (__sat __fixed)a;
  return (__sat __fixed)s;
}

If we expand the saturation intrinsic (used for the casts from __accum to __sat __fixed) right after IR emission, our final optimized IR becomes:

define i16 @ac(i24 %a, i16 %n) #0 {
entry:
  %0 = icmp sgt i24 %a, 0
  %cond = select i1 %0, i24 %a, i24 0
  %cmp19 = icmp eq i16 %n, 0
  br i1 %cmp19, label %.thread13, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %entry
  %1 = icmp slt i24 %cond, 32767
  %2 = select i1 %1, i24 %cond, i24 32767
  %3 = and i24 %2, 65535
  %4 = add i16 %n, -1
  %5 = zext i16 %4 to i24
  %6 = add nuw nsw i24 %5, 1
  %7 = mul i24 %6, %3
  %8 = icmp sgt i24 %7, -32768
  br i1 %8, label %9, label %.thread13

; <label>:9:                                      ; preds = %for.cond.cleanup
  %10 = icmp slt i24 %7, 32767
  %extract.t15 = trunc i24 %7 to i16
  br i1 %10, label %.thread13, label %11

.thread13:                                        ; preds = %9, %for.cond.cleanup, %entry
  %.off014 = phi i16 [ %extract.t15, %9 ], [ 0, %entry ], [ -32768, %for.cond.cleanup ]
  br label %11

; <label>:11:                                     ; preds = %.thread13, %9
  %.off0 = phi i16 [ %.off014, %.thread13 ], [ 32767, %9 ]
  ret i16 %.off0
}

For our target, this doesn't select any saturation instructions, and consists of 19 static cycles.

If we keep the saturation intrinsic instead:

define i16 @ac(i24 %a, i16 %n) {
entry:
  %cmp17 = icmp eq i16 %n, 0
  br i1 %cmp17, label %for.cond.cleanup, label %for.body.lr.ph

for.body.lr.ph:                                   ; preds = %entry
  %0 = icmp sgt i24 %a, 0
  %cond = select i1 %0, i24 %a, i24 0
  %1 = tail call i24 @llvm.sat.i24(i24 %cond, i32 16)
  %2 = add i16 %n, -1
  %3 = zext i16 %2 to i24
  %4 = shl i24 %1, 8
  %5 = ashr exact i24 %4, 8
  %6 = add nuw nsw i24 %3, 1
  %7 = mul i24 %6, %5
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.body.lr.ph, %entry
  %s.0.lcssa = phi i24 [ 0, %entry ], [ %7, %for.body.lr.ph ]
  %8 = tail call i24 @llvm.sat.i24(i24 %s.0.lcssa, i32 16)
  %resize3 = trunc i24 %8 to i16
  ret i16 %resize3
}

This builds to 13 cycles.

I suppose you could consider this case to be a bit contrived, since the loop is eliminated, but it's still the case that the optimizer mangled the original saturation operations. Essentially anything that results in the optimizer moving the max and min selects away from each other will accomplish this.

@ebevhan, what does the IR look like out of the frontend or after your early expansion. I want to run it through the optimizers and see it changing.

Here is a module with the function right after expansion. This is generated for x86 and not our target so both this IR and the result will probably be slightly different than what I posted.

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: nounwind uwtable
define signext i16 @ac(i24 signext %a, i32 %n) #0 {
entry:
  %a.addr = alloca i24, align 4
  %n.addr = alloca i32, align 4
  %s = alloca i24, align 4
  %i = alloca i32, align 4
  store i24 %a, i24* %a.addr, align 4
  store i32 %n, i32* %n.addr, align 4
  %0 = load i24, i24* %a.addr, align 4
  %cmp = icmp slt i24 %0, 0
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  br label %cond.end

cond.false:                                       ; preds = %entry
  %1 = load i24, i24* %a.addr, align 4
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i24 [ 0, %cond.true ], [ %1, %cond.false ]
  store i24 %cond, i24* %a.addr, align 4
  store i24 0, i24* %s, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %cond.end
  %2 = load i32, i32* %i, align 4
  %3 = load i32, i32* %n.addr, align 4
  %cmp1 = icmp ult i32 %2, %3
  br i1 %cmp1, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %4 = load i24, i24* %a.addr, align 4
  %too_large = icmp sgt i24 %4, 32767
  %too_small = icmp slt i24 %4, -32768
  %5 = select i1 %too_small, i24 -32768, i24 %4
  %6 = select i1 %too_large, i24 32767, i24 %5
  %resize = trunc i24 %6 to i16
  %resize2 = sext i16 %resize to i24
  %7 = load i24, i24* %s, align 4
  %add = add i24 %7, %resize2
  store i24 %add, i24* %s, align 4
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %8 = load i32, i32* %i, align 4
  %inc = add i32 %8, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  %9 = load i24, i24* %s, align 4
  %too_large4 = icmp sgt i24 %9, 32767
  %too_small5 = icmp slt i24 %9, -32768
  %10 = select i1 %too_small5, i24 -32768, i24 %9
  %11 = select i1 %too_large4, i24 32767, i24 %10
  %resize3 = trunc i24 %11 to i16
  ret i16 %resize3
}

attributes #0 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

If the use of a special intrinsic for saturation is rare (few targets that supports it natively), then perhaps we should put this particular patch on hold for now?

I guess the target that @bevinh (and I) is working on can override the codegen in clang to select our target specific intrinsic just like we do today. Or is it hard to do target specific overrides in clang codegen?

The generic support for fixed point types could, at least initially, use the min/max clamping. It would let us move forward on the more interesting parts of fixed point support (such as saturated add, multiplication etc), instead of getting stuck in heavy discussions regarding something that IMHO has lower priority. Then we can revisit this patch later. Perhaps it will be easier to demonstrate the need for an ssaturate intrinsic when we come further and clang is generating fixed point conversions.

If the use of a special intrinsic for saturation is rare (few targets that supports it natively), then perhaps we should put this particular patch on hold for now?

I guess the target that @bevinh (and I) is working on can override the codegen in clang to select our target specific intrinsic just like we do today. Or is it hard to do target specific overrides in clang codegen?

One of the reasons for moving to an intrinsic or new type based solution was to avoid having to implement a custom codegen hook in Clang for the operations for targets that wanted to do it differently. If we wanted to do it that way we would probably have to maintain a patch.

The generic support for fixed point types could, at least initially, use the min/max clamping. It would let us move forward on the more interesting parts of fixed point support (such as saturated add, multiplication etc), instead of getting stuck in heavy discussions regarding something that IMHO has lower priority. Then we can revisit this patch later. Perhaps it will be easier to demonstrate the need for an ssaturate intrinsic when we come further and clang is generating fixed point conversions.

You could be right. I was aiming for a solution that was generic enough to work for anyone's use case (and also worked for us out of the box), but maybe it's simply not important enough to warrant it.

leonardchan marked an inline comment as done.Oct 9 2018, 11:33 AM

If the use of a special intrinsic for saturation is rare (few targets that supports it natively), then perhaps we should put this particular patch on hold for now?

I guess the target that @bevinh (and I) is working on can override the codegen in clang to select our target specific intrinsic just like we do today. Or is it hard to do target specific overrides in clang codegen?

The generic support for fixed point types could, at least initially, use the min/max clamping. It would let us move forward on the more interesting parts of fixed point support (such as saturated add, multiplication etc), instead of getting stuck in heavy discussions regarding something that IMHO has lower priority. Then we can revisit this patch later. Perhaps it will be easier to demonstrate the need for an ssaturate intrinsic when we come further and clang is generating fixed point conversions.

Understandable. I'll leave this on the backburner then and make patches for the other intrinsics.

craig.topper retitled this revision from [Intrinsic] Signed Saturation Intirnsic to [Intrinsic] Signed Saturation Intrinsic.Oct 9 2018, 9:41 PM