[RISCV] Use stack temporary to splat two GPRs into SEW=64 vector on RV32.
Closed · Public

Authored by craig.topper on Apr 21 2021, 3:38 PM.

Details

Summary

Rather than splatting each GPR separately and using bit manipulation
to merge them in the vector domain, copy the data to the stack
and splat it using a strided load with an x0 stride. On at least
some implementations, this vector load is optimized so that it does
not perform a load for each element.

This is equivalent to how we move i64 to f64 on RV32.
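
As an illustration only, the stack-temporary splat can be modeled in a few lines of Python. This is a behavioral sketch, not code from the patch: the function name `splat_i64_via_stack`, the argument names `lo`/`hi`, and the little-endian slot layout (low word at offset 0) are assumptions made for the example.

```python
import struct


def splat_i64_via_stack(lo: int, hi: int, vl: int) -> list[int]:
    """Behavioral model of splatting a 64-bit value held in two 32-bit GPRs.

    Hypothetical helper for illustration; not part of the patch.
    """
    # Two 32-bit stores into an 8-byte stack slot
    # (assumed little-endian: low word at offset 0, high word at offset 4).
    slot = bytearray(8)
    struct.pack_into("<I", slot, 0, lo & 0xFFFFFFFF)
    struct.pack_into("<I", slot, 4, hi & 0xFFFFFFFF)

    # Strided SEW=64 load with a stride of 0 (the x0 register):
    # element i reads from base + i * 0, i.e. always the same 8 bytes.
    stride = 0
    return [struct.unpack_from("<Q", slot, i * stride)[0] for i in range(vl)]
```

Every element of the result is the same 64-bit value, which is why an implementation can in principle satisfy the whole load with a single memory access.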

I've only implemented this for the intrinsic fallbacks in this
patch. I think we do similar splatting/shifting/oring in other
places. If this is approved, I'll refactor the others to share
the code.

Event Timeline

craig.topper created this revision. Apr 21 2021, 3:38 PM
craig.topper requested review of this revision. Apr 21 2021, 3:38 PM
Herald added a project: Restricted Project. Apr 21 2021, 3:38 PM
Herald added a subscriber: MaskRay.

I wouldn't have thought this would be any faster; that's interesting. As I said in D100815, I can't really speak for performance on any particular implementation, so I'm happy if others are.

I am also wondering whether we could be faster by avoiding the ld/st unit,
like:

vsetivli x0, 1, e32, m1
vmv.s.x v1, a0            # high 32b
vsetivli x0, 2, e32, m1
vslide1up.vx v2, v1, a1   # low 32b; vd must not overlap vs2
vsetvli x0, a2, e64, m1   # AVL in a register, so vsetvli rather than vsetivli
vrgather.vi v3, v2, 0
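
To make the data movement in this register-only sequence easier to follow, here is a rough Python model of it. It is a sketch under assumptions: the function name `splat_i64_in_registers` and the list-based register model are invented for illustration, and only the element values are modeled, not `vl`/`vtype` state or tail behavior.

```python
def splat_i64_in_registers(lo: int, hi: int, vl: int) -> list[int]:
    """Behavioral model of the vmv.s.x / vslide1up.vx / vrgather.vi idea.

    Hypothetical helper for illustration; not part of the patch.
    """
    mask32 = 0xFFFFFFFF

    # vmv.s.x at SEW=32: the high word is written to element 0.
    v = [hi & mask32, 0]

    # vslide1up.vx with vl=2: existing elements move up by one position
    # and the scalar (the low word) is inserted at element 0.
    v = [lo & mask32, v[0]]

    # Reinterpreting the two SEW=32 elements as one SEW=64 element
    # (little-endian element order: element 0 is the low half).
    elem0 = v[0] | (v[1] << 32)

    # vrgather.vi with index 0 at SEW=64: every destination element
    # receives source element 0.
    return [elem0] * vl
```

The model yields the same splat as the stack-based approach; whether it is faster depends on the implementation, which the model says nothing about.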
This revision was not accepted when it landed; it landed in state Needs Review. Apr 22 2021, 9:50 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.