LGTM, thanks!
Oct 18 2022
Jun 29 2022
Jun 23 2022
LGTM.
Jun 2 2022
Jun 1 2022
In D126389#3549129, @arsenm wrote:
> I wouldn’t really expect there to be a performance difference between gpr indexing and movrel sequences. Gfx8 has both, so you could compare them directly there.
May 31 2022
Rebased. Kept the current version for GFX9 and only apply this patch to architectures with movrel.
May 30 2022
In D126389#3540486, @rampitec wrote:
> In D126389#3540098, @jpages wrote:
>> In D126389#3539881, @foad wrote:
>>> In D126389#3538346, @rampitec wrote:
>>>> Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.
>>>
>>> I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.
>>> Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.
>>
>> I will try to get some performance numbers on specific games. Do you know if this performance problem was specific to an architecture?
>> Like Jay said, I could tweak the heuristic for this and only generate it for GFX10+.
>
> All the tests were done on gfx9 flavors. It is also worth noting gfx9 does not have movrel, but uses s_set_gpr_idx_on instead, which may be the difference.
> I.e. it may be linked to FeatureMovrel.
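The following is a minimal Python sketch of a per-target expansion heuristic that makes the trade-off above concrete. It rests on several assumptions. The cost model (roughly one compare per element plus one select per 32-bit piece of each element) and the limits 16 and 15 are chosen so that an 8 x 32-bit vector keeps the cmp/select expansion on targets without movrel, such as GFX9, but uses movrel where FeatureMovrel is available. The function names are hypothetical and are not LLVM's API.

```python
def cmp_select_cost(num_elem: int, elt_size_bits: int) -> int:
    """Hypothetical cost model: about one compare per element plus one
    select (v_cndmask-like) per 32-bit piece of each element."""
    dwords_per_elem = (elt_size_bits + 31) // 32
    return num_elem + num_elem * dwords_per_elem


def should_expand_to_cmp_select(num_elem: int, elt_size_bits: int,
                                has_movrel: bool,
                                limit_without_movrel: int = 16,
                                limit_with_movrel: int = 15) -> bool:
    """Return True to expand a dynamic index into cmp/select, False to keep
    indexed register access (movrel, or s_set_gpr_idx_on without it).
    Both limits are hypothetical tuning knobs, not measured values."""
    cost = cmp_select_cost(num_elem, elt_size_bits)
    limit = limit_with_movrel if has_movrel else limit_without_movrel
    return cost <= limit


# 8 x 32-bit: cost 16, so it is expanded without movrel but not with it.
print(should_expand_to_cmp_select(8, 32, has_movrel=False))
print(should_expand_to_cmp_select(8, 32, has_movrel=True))
```

Keying the limit on movrel availability, rather than on a generation number, follows the observation that the difference may be linked to FeatureMovrel rather than to GFX10+ as such.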
May 26 2022
In D126389#3539881, @foad wrote:
> In D126389#3538346, @rampitec wrote:
>> Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.
>
> I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.
> Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.
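Here is a small Python model of the cmp/select alternative being compared with movrel, for readers who have not seen it. The dynamically indexed element is chosen by comparing the index against each constant position and selecting. This model is an illustration only. The helper name is hypothetical. Element 0 serves as the default and needs no compare, and an out-of-range index simply yields element 0, whereas in LLVM IR such an index produces poison.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def extract_via_cmp_select(vec: Sequence[T], idx: int) -> T:
    """Model of expanding a dynamic extractelement into a chain of
    compare/select pairs, one pair per element after the first."""
    result = vec[0]  # element 0 is the default value
    for i in range(1, len(vec)):
        is_match = idx == i  # models a compare of idx against constant i
        result = vec[i] if is_match else result  # models a select
    return result


v = [10, 20, 30, 40, 50, 60, 70, 80]
print(extract_via_cmp_select(v, 5))
```

The instruction count grows linearly with the number of elements, which is why a heuristic has to cap the vector size at which this expansion is preferred over indexed register access.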
May 25 2022
Updated a testcase
Feb 11 2022
Feb 10 2022
Feb 8 2022
Feb 7 2022
Rebased following Jay's comments.
Feb 3 2022
Updated a failing test on AArch64.
Feb 1 2022
Rebased following Jay's comments.
Jan 31 2022
Rebased; as suggested, I added a target-independent ISD opcode.
Jan 13 2022
The intrinsic has the IntrNoMem attribute, and is represented by INTRINSIC_WO_CHAIN in the DAG.
Jan 11 2022
I changed the patch to have only one intrinsic with a rounding mode parameter as suggested. This required many modifications in the frontend parts of LLVM.
Dec 2 2021
Dec 1 2021
Added G_FPTRUNC_ROUND_UPWARD/G_FPTRUNC_ROUND_DOWNWARD opcodes between the intrinsics and the pseudo instructions. Added a custom ISD node for the DAG version.
Oct 20 2021
In D110579#3054424, @foad wrote:
> In D110579#3051790, @jpages wrote:
>> The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.
>
> Good point. s_round_mode/s_denorm_mode are new in GFX10, so they did not exist when this pass was written. Do you think the pass could be improved to emit s_round_mode/s_denorm_mode instead of s_setreg whenever it only needs to change the rounding/denormal bits of the mode register? That could be a separate patch.
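Here is a straight-line Python sketch of the suggested improvement, offered as one possible shape rather than the pass's actual logic. It tracks the currently known mode, skips redundant writes, and picks s_round_mode or s_denorm_mode when only one field changes on GFX10+. The `Mode` encoding, the function name, and the fallback to a single s_setreg when both fields change are hypothetical. A real pass must also handle control flow between blocks.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mode:
    round_mode: int   # rounding field value (hypothetical encoding)
    denorm_mode: int  # denormal field value (hypothetical encoding)


def plan_mode_changes(required: List[Optional[Mode]], initial: Mode,
                      has_gfx10_mode_insts: bool) -> List[Tuple[int, str]]:
    """Return (instruction index, mode-setting opcode) pairs to insert.
    None in `required` means the instruction does not depend on the mode."""
    current = initial
    changes: List[Tuple[int, str]] = []
    for i, need in enumerate(required):
        if need is None or need == current:
            continue  # mode already correct: no redundant write
        round_differs = need.round_mode != current.round_mode
        denorm_differs = need.denorm_mode != current.denorm_mode
        if not has_gfx10_mode_insts or (round_differs and denorm_differs):
            changes.append((i, "s_setreg"))  # one write covering the fields
        elif round_differs:
            changes.append((i, "s_round_mode"))
        else:
            changes.append((i, "s_denorm_mode"))
        current = need
    return changes


default, upward = Mode(0, 0), Mode(1, 0)
print(plan_mode_changes([None, upward, upward, None, default], default, True))
```

When both fields change, emitting s_round_mode followed by s_denorm_mode would be an equally valid choice. The sketch falls back to one s_setreg only to keep the instruction count minimal.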
Oct 14 2021
Oct 13 2021
It was possible without the script; the diff is smaller that way, which is probably better.
Rebased: updated the test with the script instead of manually.
Rebased without -LABEL in --check-prefixes
Oct 12 2021
Oct 8 2021
Rebased, addressing Jay's comments.
Oct 4 2021
Rebased, addressing the previous comments.
Sep 27 2021
Jun 3 2021
In D103344#2793853, @arsenm wrote:
> Probably should also add a globalisel test just in case
Jun 2 2021
Renamed the test file with a better name.
Jun 1 2021
Rebased; reduced the test with bugpoint/opt to a much simpler version.
May 31 2021
May 28 2021
May 11 2021
May 10 2021
Added the VOP3Mods in the pattern and associated tests, thanks for the suggestion!
May 6 2021
Rebased to do it in TableGen; it should be cleaner now. I removed the overly complex pattern with the conversions from i16 to f16.
Instead, v_pack_b32_f16 will only be used for f16 when we are sure the two inputs have already been flushed, if needed.
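As an illustration of what the selected instruction computes, here is a minimal Python model of packing two f16 values into one 32-bit register, including the neg/abs source modifiers that VOP3Mods matches. It assumes the first operand goes to bits [15:0] and that abs is applied before neg. The function names are hypothetical. The model does not simulate denormal flushing, which the note above says must already have happened.

```python
import struct


def f16_bits(x: float) -> int:
    """IEEE binary16 encoding of x (round to nearest)."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def apply_src_mods(bits: int, neg: bool = False, abs_: bool = False) -> int:
    """Model f16 source modifiers: abs clears the sign bit, then neg flips it."""
    if abs_:
        bits &= 0x7FFF
    if neg:
        bits ^= 0x8000
    return bits


def pack_b32_f16(lo: float, hi: float, lo_neg: bool = False, lo_abs: bool = False,
                 hi_neg: bool = False, hi_abs: bool = False) -> int:
    """Pack two f16 values into a 32-bit word: lo in [15:0], hi in [31:16]."""
    lo_bits = apply_src_mods(f16_bits(lo), lo_neg, lo_abs)
    hi_bits = apply_src_mods(f16_bits(hi), hi_neg, hi_abs)
    return lo_bits | (hi_bits << 16)


print(hex(pack_b32_f16(1.0, 2.0)))
```

Folding the modifiers into the pack is what makes matching VOP3Mods worthwhile: a separate negate or fabs on either input no longer needs its own instruction.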
May 4 2021
Thanks for the review. I don't have commit access to the repo, could someone do it for me?
May 3 2021
Updated for code style
Apr 30 2021
Thank you for your inputs.
Apr 28 2021
Mar 23 2021
It appears that this instruction is harder to select than expected on integers.