This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Load FP immediates via a GPR instead of the constant pool.
Needs Review · Public

Authored by jonpa on May 30 2022, 8:12 AM.

Details

Reviewers
uweigand
Summary

Performance results so far show minor changes both ways, with a slight overall regression that may be worth looking into. However, I got several nice improvements with my first version of this patch, which (somewhat by mistake) loaded the FP constants into a full VR128 instead of a VR32/VR64. With that first patch I got a lot more spilling overall (the vector regs are never callee-saved, right?), but also some nice improvements:

(VR128 version)
Improvements:  
0.938: f507.cactuBSSN_r 
0.975: f544.nab_r 
0.985: f511.povray_r 

Regressions:
1.016: i523.xalancbmk_r

I am trying to look into why cactus is so much better. I know for one thing that in that particular file there is a lot *less* spilling for some reason when constants are loaded into full vector regs instead of VR32/VR64 ones (this is the same file that was improving previously with fp-contract=off, with a huge function that is not that easy to analyze...). This could be something "random", or there may be an explanation / heuristic to be found. I am not sure why VR128 should work better... The VR128 version gave fewer spills than main or the VR32/VR64 version, which were about the same.

Not quite sure if it is necessary to have the new SystemZ::INSERT_FP_LANE - perhaps it would be easier to just emit the machine nodes directly there instead.

With the handling in foldMemoryOperandImpl() in place, I don't see much more to improve, except to try to figure out more about the cactus difference.
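As a rough illustration of what "loading an FP immediate via a GPR" relies on: the constant's IEEE-754 bit pattern is first formed as a plain 64-bit integer, which can then be moved into an FP/vector register instead of being loaded from the constant pool. The sketch below only shows the bit-pattern step in portable C++; the helper names (`fpImmediateBits`, `bitsToFP`, `selfCheck`) are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the patch): return the IEEE-754 bit pattern
// of a double as a 64-bit integer, i.e. the value that would be materialized
// in a GPR before being transferred to an FP register.
inline std::uint64_t fpImmediateBits(double Value) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "this sketch assumes a 64-bit double");
  std::uint64_t Bits;
  std::memcpy(&Bits, &Value, sizeof(Bits));
  return Bits;
}

// Hypothetical inverse: reinterpret a 64-bit integer as a double.
inline double bitsToFP(std::uint64_t Bits) {
  double Value;
  std::memcpy(&Value, &Bits, sizeof(Value));
  return Value;
}

// Small self-check with well-known bit patterns.
inline void selfCheck() {
  assert(fpImmediateBits(1.0) == 0x3FF0000000000000ULL);
  assert(fpImmediateBits(-2.0) == 0xC000000000000000ULL);
  assert(bitsToFP(fpImmediateBits(0.1)) == 0.1);
}
```

Whether this is cheaper than a constant-pool load depends on how many instructions are needed to build the integer and on the register pressure it causes, which is what the measurements above are probing.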

Event Timeline

jonpa created this revision. May 30 2022, 8:12 AM
Herald added a project: Restricted Project. May 30 2022, 8:12 AM
Herald added a subscriber: hiraditya.
jonpa requested review of this revision. May 30 2022, 8:12 AM
jonpa added a comment. May 31 2022, 5:26 AM

I think I have found at least an interesting explanation: the reason the spilling decreased significantly when VR128 regs were used (VR128 = vlgg; extract subreg => use of VR128:subreg) is that those registers now have all 32 vector registers to choose from. If, however, a VR64 containing the immediate is produced, it will be constrained to FP64 whenever there is a user that folds a memory operand using an FP opcode. This makes many of those constants occupy FP64 regs instead of VR64, which increases register pressure significantly!
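A toy model of the effect described above, under stated assumptions: SystemZ has 16 FP registers (f0-f15) and 32 vector registers (v0-v31), so a value constrained to FP64 can pick from only half as many registers as one left in a vector class. The function `minSpills` and the figure of 24 live constants are hypothetical and purely illustrative; the model ignores that the FP registers overlap v0-v15 and that other values compete for the same registers.

```cpp
#include <cassert>

// Register counts on SystemZ: 16 FP registers and 32 vector registers.
constexpr unsigned NumFPRegs = 16;
constexpr unsigned NumVecRegs = 32;

// Hypothetical toy model: if LiveValues values are live at the same time and
// must all be assigned from a class with NumRegs registers, at least this
// many of them cannot stay in registers.
constexpr unsigned minSpills(unsigned LiveValues, unsigned NumRegs) {
  return LiveValues > NumRegs ? LiveValues - NumRegs : 0;
}

// Illustrative numbers only: 24 simultaneously live FP constants.
inline void example() {
  assert(minSpills(24, NumFPRegs) == 8);   // constrained to FP64
  assert(minSpills(24, NumVecRegs) == 0);  // free to use any vector register
}
```

A real register allocator is far more involved than this, but the sketch shows why constraining many long-lived constants to the smaller class can turn into extra spills.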

I have some ideas I will look into:

  • rematerialize a load from constant pool
  • avoid using FP reg/mem opcodes in case of high register pressure generally.

Comparing with fp-contract "off" (which previously ran faster), it looks like with "off" there are still fewer spills, but performance is the same. I am not sure yet why there are fewer spills with "off", but IIRC that is a huge function (with huge blocks) with many FP constants that for some reason go through regalloc with a better result, even though I do not know why. It seems very promising if it is possible to eliminate the extra spills per above...