This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Exploit store instructions that store a single vector element
ClosedPublic

Authored by nemanjai on Dec 31 2018, 11:34 AM.

Details

Summary

This patch was originally the CodeGen portion of https://reviews.llvm.org/D44528.
Namely, it exploits the instructions that store a single element from a vector to preform a (store (extract_elt)). We already have code that does this with ISA 3.0 instructions that were added to handle i8/i16 types. However, we had never exploited the existing ones that handle f32/f64/i32/i64 types. This patch does that.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai created this revision.Dec 31 2018, 11:34 AM
nemanjai updated this revision to Diff 179776.Dec 31 2018, 11:59 AM

Indentation was off in the td file.

nemanjai marked an inline comment as done.Dec 31 2018, 12:09 PM
nemanjai added inline comments.
test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

One of the unanswered comments from the original patch was along the lines of:
"The stfs instruction performs a conversion from 8-byte single precision (that PPC uses for single precision representation in registers) to 4-byte single precision (the in-memory single precision representation). The updated code no longer involves such a conversion, is that semantically correct?"

The short answer is yes. The reason we don't need this conversion is that single precision vector elements are represented the same way in registers and memory. If you inspect the original code sequence carefully, you'll see that it does the following

  • xxsldwi to line up the element into the correct location in the register
  • xscvspdpn to convert vector single precision to scalar single precision
  • stfs to implicitly convert the value back and store it

The new sequence just skips all conversion and stores the single-precision vector element as a 4-byte single-precision value.

jsji added a comment.Jan 7 2019, 2:50 PM

Thanks for updating!

lib/Target/PowerPC/PPCInstrVSX.td
3343 ↗(On Diff #179776)

These are identical to little endian patterns above?

3881 ↗(On Diff #179776)

Can we add some comments about the "magic" indexes here.

test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

Thanks for checking!
yes, you are right, the new sequence should work the same way at the old sequences for normal floating point values.

But I am more concern with abnormal values.

Since conversions will change abnormal values, so I think we may get different results for out of range values eg: NaN, Inf, Denormals if we remove those conversions.

Of course we may argue that the new behavior without conversion maybe be better?

So I still think the semantic is actually different, but it is a good idea to remove the conversions if we can accept the risk of changing some floating point application's behavior.

118 ↗(On Diff #179776)

Also, looks like some typos in above comments, so it looks confusing to me.

My understandings from ISA are:

  1. Power ISA defined two data formats for floating point:
    • 32-bit (4-byte) single format:
    • 64-bit (8-byte) double format:
  1. Each FPR contains 64 bits that support the floating-point double format. Every instruction that interprets the contents of an FPR as a floating-point value uses the floating-point double format for this interpretation.
  1. Vector Floating-Point (VMX) instruction interprets the contents of a Vector Register (VR) as a sequence of equal-length elements, each element use the floating-point single data format.
  1. Vector-Scalar Floating-Point (VSX) instruction interprets the contents of a Vector Register (VR) as a sequence of equal-length elements, each element use the floating-point single data format (4 bytes) or floating-point double format (8 bytes) depending on the length.
  1. Single Precision Floating-Point takes operands from the FPRs in double format, performs the operation, and then coerces this intermediate result to fit in single format. Status bits, in the FPSCR and optionally in the Condition Register, are set to reflect the single-precision result. The result is then converted to double format and placed into an FPR.
  1. xscvspdpn is used to convert the element from single data format to double format, without setting FPSCR etc.
  1. stfs will converting x in FPR from floating-point double format to floating-point single format, then store.
  1. stfiwx will store the 32-63 bits of FPR to memory directly, without conversion.
nemanjai marked 3 inline comments as done.Jan 11 2019, 6:58 AM
nemanjai added inline comments.
lib/Target/PowerPC/PPCInstrVSX.td
3343 ↗(On Diff #179776)

Oops. The extract indices were supposed to be reversed. Good catch. Thank you.

3881 ↗(On Diff #179776)

Sure, will do.

test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

I'm not really sure what typos you're referring to, but we can think of it as there being two different single precision representations (i.e. bit patterns). The "in-scalar-register" representation is 64-bits wide and is equivalent to the double precision bit pattern for the same value. The in-memory representation is 32-bits wide and conforms to IEEE specifications for binary 32-bit floating point. Vectors of single precision values use the latter for each element.

Each instruction that operates on scalar floating point registers and "produces a single precision result" actually produces a double precision result that has equivalent precision to the single precision value that would be produced by performing that operation on its inputs.

Now for the issue at hand (i.e. convert-to-extract-then-convert-to-store vs. store-with-no-conversion):

  • The current implementation does the conversions
  • The implementation in this patch does not do the conversions, just store
  • The first conversion (xscvspdpn) will perform normalization, NaNs and INFs just produce double precision versions of the same
  • The second conversion (stfs) will perform denormalization (NaNs, INFs and zero remain)
  • Now the memory contains an unmodified or denormalized version of the vector element
    • If it's unmodified (the conversion simply extracts the correct number of bits from exponent/mantissa), nothing to discuss here
    • If it's denormalized, it will produce a 32-bit bit pattern that is equivalent to the original denormal value in the vector

So unless I am missing something subtle in the conversion that stfs does, I believe that this change does not modify the semantics.

nemanjai updated this revision to Diff 181272.Jan 11 2019, 7:43 AM

Updated the wrong indices for big endian systems. Added comments for magic numbers for indices/shift amounts.

jsji added inline comments.Jan 11 2019, 7:54 AM
test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

The typos I was referring is the "8 byte single precision", as I don't think there is any "8 byte single precision" ...
There are only two data format: 4 byte single format , 8 byte double format.

Anyway, if this can help for your "understanding", it is fine to "think of it as there being two different single precision".

I think our key divergency here is whether convert-convert will change semantic for NaN/INFs.

From my point of view, conversions might change the values.

eg: ConvertSPtoDP_NS used by xscvspdpn

if (x.bit[1:8] == 255) then do 
     exponent <- 2047
end
else if (x.bit[1:8] == 0) && (fraction == 0) the do 
     exponent <- 0
end

The exponent might be overridden.

So convert-to-extract-then-convert-to-store vs. store-with-no-conversion might get different results.

But I also agree that both results should be valid.

Just that we may want to be aware or accept the risk of changing some floating point application's behavior.

jsji added inline comments.Jan 11 2019, 7:57 AM
lib/Target/PowerPC/PPCInstrVSX.td
3343 ↗(On Diff #179776)

Thanks for updating! One question to me is why this not exposed by testcases below? Are we missing some coverage there?

nemanjai marked 2 inline comments as done.Jan 14 2019, 1:50 PM
nemanjai added inline comments.
lib/Target/PowerPC/PPCInstrVSX.td
3343 ↗(On Diff #179776)

I'll add it to the respective test.

test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

But that's precisely what I'm saying is *not* the case. Setting the exponent to 2047 is precisely the "keeping NaNs NaNs and INFs INFs" that I was referring to.
The if in the RTL you have pasted accomplishes that very thing. The else if just says that a zero stays a zero.

A NaN has the exponent set to all 1's. All 1's in single precision is 255 and in double precision, it's 2047. INF has the same property with the additional property that the fraction bits are all zeros.
https://en.wikipedia.org/wiki/IEEE_754-1985#NaN

nemanjai marked an inline comment as done.Jan 14 2019, 2:10 PM
nemanjai added inline comments.
test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

There is of course obviously a very real possibility that I am missing some corner case or am misunderstanding something.

jsji added a comment.Jan 14 2019, 3:05 PM

Thanks! Yes, you are right. Sorry, my mistake of pasting conversion code without going through them carefully.

test/CodeGen/PowerPC/extract-and-store.ll
118 ↗(On Diff #179776)

Ah, yes, you are right.

After going through the two algorithm carefully, I agree that for Inf/NaNs, they should be fine.

eg:

0_11111111_10101010101010101010101
=> (ConvertSPtoDP_NS)
0_11111111111_10101010101010101010101_0_0000_0000_0000_0000_0000_0000_0000
=> (SINGLE)
0_11111111_10101010101010101010101
`
118 ↗(On Diff #179776)

What I meant to discuss should be the case that involve denormalization,
as you mentioned that it will produce a 32-bit bit pattern that is equivalent to the original denormal value in the vector.

Here, the 32-bit pattern *may* be different from what it was without conversion,
although it is equivalent to the original denormal value.

That is what I meant both results should be valid, but bit pattern *may* be different.

And since we have flax-vector-conversions, if some application rely on bit patterns with conversion,
they may get different result without conversion.

jsji accepted this revision.Jan 14 2019, 3:21 PM

LGTM, as long as we add the missing BE test. Thanks for exploiting this, and also great patience while discussion.

This revision is now accepted and ready to land.Jan 14 2019, 3:21 PM

LGTM, as long as we add the missing BE test. Thanks for exploiting this, and also great patience while discussion.

Not at all, thank you for the close scrutiny of the changes. I always feel much more at ease committing something after a thorough discussion in the review than something that just gets an automatic approval. Also yup, I'll add the BE test on the commit.

Also, some SPEC numbers with this transformation for motivation:
blender - 6.5%/9.5% improvement on base/peak rate
gcc - 2% improvement on both base/peak rate
deepsjeng - 1.31% improvement on base rate
xz - 15.9% improvement on base

lbm - 1.38% degradation on peak rate, no observable change in base rate

All other changes are below 1% with the majority of those in the improvement category.

This revision was automatically updated to reflect the committed changes.
bzEq added a subscriber: bzEq.Apr 30 2019, 2:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2019, 2:28 AM
bzEq removed a subscriber: bzEq.Apr 30 2019, 2:28 AM