This patch was originally the CodeGen portion of https://reviews.llvm.org/D44528.
Namely, it exploits the instructions that store a single element from a vector to preform a (store (extract_elt)). We already have code that does this with ISA 3.0 instructions that were added to handle i8/i16 types. However, we had never exploited the existing ones that handle f32/f64/i32/i64 types. This patch does that.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
118 | One of the unanswered comments from the original patch was along the lines of: The short answer is yes. The reason we don't need this conversion is that single precision vector elements are represented the same way in registers and memory. If you inspect the original code sequence carefully, you'll see that it does the following
The new sequence just skips all conversion and stores the single-precision vector element as a 4-byte single-precision value. |
Thanks for updating!
lib/Target/PowerPC/PPCInstrVSX.td | ||
---|---|---|
3343 | These are identical to little endian patterns above? | |
3881 | Can we add some comments about the "magic" indexes here. | |
test/CodeGen/PowerPC/extract-and-store.ll | ||
118 | Thanks for checking! But I am more concern with abnormal values. Since conversions will change abnormal values, so I think we may get different results for out of range values eg: NaN, Inf, Denormals if we remove those conversions. Of course we may argue that the new behavior without conversion maybe be better? So I still think the semantic is actually different, but it is a good idea to remove the conversions if we can accept the risk of changing some floating point application's behavior. | |
118 | Also, looks like some typos in above comments, so it looks confusing to me. My understandings from ISA are:
|
lib/Target/PowerPC/PPCInstrVSX.td | ||
---|---|---|
3343 | Oops. The extract indices were supposed to be reversed. Good catch. Thank you. | |
3881 | Sure, will do. | |
test/CodeGen/PowerPC/extract-and-store.ll | ||
118 | I'm not really sure what typos you're referring to, but we can think of it as there being two different single precision representations (i.e. bit patterns). The "in-scalar-register" representation is 64-bits wide and is equivalent to the double precision bit pattern for the same value. The in-memory representation is 32-bits wide and conforms to IEEE specifications for binary 32-bit floating point. Vectors of single precision values use the latter for each element. Each instruction that operates on scalar floating point registers and "produces a single precision result" actually produces a double precision result that has equivalent precision to the single precision value that would be produced by performing that operation on its inputs. Now for the issue at hand (i.e. convert-to-extract-then-convert-to-store vs. store-with-no-conversion):
So unless I am missing something subtle in the conversion that stfs does, I believe that this change does not modify the semantics. |
Updated the wrong indices for big endian systems. Added comments for magic numbers for indices/shift amounts.
test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
118 | The typos I was referring is the "8 byte single precision", as I don't think there is any "8 byte single precision" ... Anyway, if this can help for your "understanding", it is fine to "think of it as there being two different single precision". I think our key divergency here is whether convert-convert will change semantic for NaN/INFs. From my point of view, conversions might change the values. eg: ConvertSPtoDP_NS used by xscvspdpn if (x.bit[1:8] == 255) then do exponent <- 2047 end else if (x.bit[1:8] == 0) && (fraction == 0) the do exponent <- 0 end The exponent might be overridden. So convert-to-extract-then-convert-to-store vs. store-with-no-conversion might get different results. But I also agree that both results should be valid. Just that we may want to be aware or accept the risk of changing some floating point application's behavior. |
lib/Target/PowerPC/PPCInstrVSX.td | ||
---|---|---|
3343 | Thanks for updating! One question to me is why this not exposed by testcases below? Are we missing some coverage there? |
lib/Target/PowerPC/PPCInstrVSX.td | ||
---|---|---|
3343 | I'll add it to the respective test. | |
test/CodeGen/PowerPC/extract-and-store.ll | ||
118 | But that's precisely what I'm saying is *not* the case. Setting the exponent to 2047 is precisely the "keeping NaNs NaNs and INFs INFs" that I was referring to. A NaN has the exponent set to all 1's. All 1's in single precision is 255 and in double precision, it's 2047. INF has the same property with the additional property that the fraction bits are all zeros. |
test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
118 | There is of course obviously a very real possibility that I am missing some corner case or am misunderstanding something. |
Thanks! Yes, you are right. Sorry, my mistake of pasting conversion code without going through them carefully.
test/CodeGen/PowerPC/extract-and-store.ll | ||
---|---|---|
118 | Ah, yes, you are right. After going through the two algorithm carefully, I agree that for Inf/NaNs, they should be fine. eg: 0_11111111_10101010101010101010101 => (ConvertSPtoDP_NS) 0_11111111111_10101010101010101010101_0_0000_0000_0000_0000_0000_0000_0000 => (SINGLE) 0_11111111_10101010101010101010101 ` | |
118 | What I meant to discuss should be the case that involve denormalization, Here, the 32-bit pattern *may* be different from what it was without conversion, That is what I meant both results should be valid, but bit pattern *may* be different. And since we have flax-vector-conversions, if some application rely on bit patterns with conversion, |
LGTM, as long as we add the missing BE test. Thanks for exploiting this, and also great patience while discussion.
Not at all, thank you for the close scrutiny of the changes. I always feel much more at ease committing something after a thorough discussion in the review than something that just gets an automatic approval. Also yup, I'll add the BE test on the commit.
Also, some SPEC numbers with this transformation for motivation:
blender - 6.5%/9.5% improvement on base/peak rate
gcc - 2% improvement on both base/peak rate
deepsjeng - 1.31% improvement on base rate
xz - 15.9% improvement on base
lbm - 1.38% degradation on peak rate, no observable change in base rate
All other changes are below 1% with the majority of those in the improvement category.
These are identical to little endian patterns above?