PowerPC subtargets prior to Power9 use the 'legacy' itinerary way to provide scheduling information. This patch re-writes the tablegen file to define the scheduling information in the new SchedModel way. This can bring improvements to some benchmarks.
Details
- Reviewers
nemanjai shchenz - Group Reviewers
Restricted Project - Commits
- rGb922a3621116: [PowerPC] Define SchedModel for Power8
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for the big effort.
I think moving from old ProcessorItineraries form to new ProcessorModel form should be correct. Some comments from me. And also please note that there are some failures in the pre-merge checks.
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
18 | why the MicroOpBufferSize is 60? The pwr8 UM seems showing this value as 32 * 2? | |
19 | We need to comment here why on pwr8 this field is set as 0. | |
47 | minor: _1, _2, _4 does not sound like a good name... | |
48 | Why this unit number for type "2xdouble" is set to 4? Its parent P8_VMX only has 2. | |
61 | Same with the name issue _1, ..._6. | |
llvm/test/CodeGen/PowerPC/BreakableToken-reduced.ll | ||
244 | Seems positive, now we use less CSRs. | |
llvm/test/CodeGen/PowerPC/P10-stack-alignment.ll | ||
31 | low ILP but less registers? | |
llvm/test/CodeGen/PowerPC/aix32-p8-scalar_vector_conversions.ll | ||
30 | lower ILP with less register, I think it is fine only if we get better perf in benchmarks. | |
llvm/test/CodeGen/PowerPC/atomics-i128.ll | ||
510 | negative change, now we have more mr which is not free on pwr8. | |
llvm/test/CodeGen/PowerPC/atomics-i16-ldst.ll | ||
184 | We may need another name for check prefix PREP10? And also double check if prefixes P8 and P9 are still used in this file? | |
4519 | ||
llvm/test/CodeGen/PowerPC/atomics-i32-ldst.ll | ||
4779 | Ditto | |
llvm/test/CodeGen/PowerPC/atomics-i8-ldst.ll | ||
4219 | Ditto | |
llvm/test/CodeGen/PowerPC/atomics-regression.ll | ||
456 | Aren't the atomic related instructions are barriers to the scheduler? Can you check why there are changes around them? Thanks. | |
llvm/test/CodeGen/PowerPC/bool-math.ll | ||
120 | Weird. Now there is no check for this function? | |
llvm/test/CodeGen/PowerPC/branch_coalesce.ll | ||
18 | lower ILP and less registers | |
llvm/test/CodeGen/PowerPC/ctrloop-constrained-fp.ll | ||
27 | Positive | |
llvm/test/CodeGen/PowerPC/elf64-byval-cc.ll | ||
35 | postive | |
llvm/test/CodeGen/PowerPC/int128_ldst.ll | ||
1100 | Same with above | |
llvm/test/CodeGen/PowerPC/testComparesi32ltu.ll | ||
37 | This looks positive. We had a request to move the lr store instruction be away from mflr |
Please notice the 2 failures on two different bot in the premerge test.
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
32 | Sorry, this still confuses me. The super is P8_FPU and P8_FPU has 4 execution units which can be executed as 2 way SIMD operation for double. So why we set the ProcResource value to 4 here? | |
33 | Setting P8_VMX as super of "4xSingle" also seems weird. In ISA of pwr8, I think most instructions that handle 4xSingle type are VSX related and from the UM "10.3.2 Instructional Latencies and Throughputs", I saw most of them are using pipeline FPU. So I guess, we should use P8_FPU as Super instead? | |
llvm/test/CodeGen/PowerPC/atomics-i16-ldst.ll | ||
4517 | There are still unused check prefix in the run lines. | |
llvm/test/CodeGen/PowerPC/atomics-i32-ldst.ll | ||
4777 | Please delete the unused check prefixes, so that we won't have this note. | |
llvm/test/CodeGen/PowerPC/atomics-i8-ldst.ll | ||
4219 | Please try to remove this NOTE. |
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
32 | Thanks, updated. | |
33 |
I think the reason is before VSX, Altivec already had instructions for 4xSingle vectors and they are implemented within VMX units. So their VSX equivalents uses the same execution units. | |
48 | Its parent P8_FPU has 4 units. | |
llvm/test/CodeGen/PowerPC/atomics-i16-ldst.ll | ||
184 | Like CHECK-NONP10 or CHECK-P8P9? But it actually has the same meaning as CHECK-PREP10. | |
llvm/test/CodeGen/PowerPC/bool-math.ll | ||
120 | Because the test RUN line did not specify -mcpu, so that Linux uses pwr8 while AIX uses pwr7. Fixed. |
llvm/test/CodeGen/PowerPC/atomics-i128.ll | ||
---|---|---|
510 | The BB with negative result is quite simple: SU(0): %29:gprc = COPY $r3 SU(1): %34:gprc = LWZ 12, %stack.0 SU(2): %33:gprc = LWZ 8, %stack.0 SU(3): %32:gprc = LWZ 4, %stack.0 SU(4): %31:gprc = LWZ 0, %stack.0 SU(5): %30:crrc = CMPLWI %29:gprc, 0 The old scheduling model puts SU(5) just after SU(0), so that %29:gprc is assigned with r3. With the new scheduling model, the BB is unchanged after pre-isel scheduling. I think it's more likely an issue of the scheduler exposed by new model and I'll further look at why regpressure is not respected here. But here I'm curious why the register allocator not assign r3 to %29. |
llvm/test/CodeGen/PowerPC/atomics-regression.ll | ||
---|---|---|
456 | lwsync (SYNC) is treated as global memory barrier. But clrlwi (RLWINM) neither writes nor reads any memory, so scheduling such instruction around SYNC is okay. BTW, the new codegen is the original order. |
I'd like to give extra comments on some definitions:
2.1.3 Speculative Superscalar Inner Core Organization
Out-of-order issue of up to 10 operations into the following 10 issue ports:
– Two ports to do loads or fixed-point operations.
– Two ports to do stores, fixed-point loads, or fixed-point operations.
– Two fixed-point operations
– Two issue ports shared by two floating-point, two VSX, two VMX, one crypto, and one DFP operations
– One branch operation
– One condition register operationSixteen execution units:
– Two symmetric load/store units (LSU), capable of executing stores, fixed-point loads, and simple fixed-point operations
– Two load-only units (LU) also capable of executing simple fixed-point operations
– Two symmetric fixed-point units (FXU)
– Four floating-point units (FPU), implemented as two 2-way SIMD operations for double- and single-precision. Scalar binary floating-point instructions can only use two FPUs.
– Two VMX execution units capable of executing simple FX, permute, complex FX, and 4-way SIMD single-precision floating-point operations
– One Crypto unit
– One decimal floating-point unit (DFU)
– One branch execution unit (BR)
– One condition register logical execution unit (CRL)
So correspondent units and ports are defined. But
- For ports, we may need to 'combine' them. For example, port group 1, 2, 3 are all acceptable for fixed-point operations. So we need to create a 'parent' group for fixed-point ops and make group 1, 2, 3 all its children. Similar rule applies for fixed-point loads.
- Some instructions uses 'LSU or FXU', a similar parent-child definition are also needed.
Compared to Power9 model, dispatch rules and pipeline forwarding definition are missing in this patch, because they're vague in Power8 manual.
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
12–13 | This seems like resources for issue not dispatch? | |
28 | Don't know how do you arrange these instructions. But from the UM: vaddfp pipeline is FPU that should be using P8_FP_4x32 and the execution unit should be P8_FPU? | |
33 | I believe we should use P8_VSX as super of P8_FP_4x32. According to the instructions which uses P8_FP_4x32, all of them are with pipeline FPU that should be for the 4 FPU units. | |
89 | instruction like STB seems occupies two pipelines LSU and LU while seems here it only occupies LSU pipiline? |
- Rename P8_DISP to P8_ISSUE
- Add scheduling info for DFP instructions
- Mark resource usage of 4-way SIMD FP instructions from VMX to FPU
- Add resource usage of FXU to basic load instructions
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
89 | It occupies both: def P8_LS_LU : SchedWriteRes<[P8_LU, P8_LS]>; |
Looks almost good to me. Some other comments related to the instruction definitions.
Please note the two failures in the premerge check on X86 target.
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
68 | nit: This issue port is not used? | |
88 | should be def P8_LS_FP : SchedWriteRes<[P8_LSU, P8_FPU]>;? | |
169 | For cracked instructions, should we only use one issue port? | |
171 | ADD_rec/SUBF_rec/NEG_rec seems can use both FXU or LSU? |
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
210 | hmm, this change seems still not accurate, below is from PWR8 UM(17 March 2016) addi addis add add. subf subf. addic subfic adde addme subfme subfze neg neg. nego FXU (or LU or LSU for non-dot forms) add. should be able to use FXU or LU (4 hardware units). Maybe we need to define a new unit group like P8_FX_LU for this kind of instructions? I suggest you check other instructions as well compared with the UM. Thanks very much for the big effort. |
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
210 | Per my understanding, only non-dot forms use LU/LSU? |
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
210 | OK, so we have different understanding about FXU (or LU or LSU for non-dot forms). Let us use yours for now until we find a clear doc about what units dot form instructions can use. | |
224 | Seems I can not find mflr, is it expected? | |
244 | lswi lswx stswi stswx (unaligned) LSU,LU lwarx ldarx LSU or LU |
- CRBit instructions are not defined in User Manual, remove
- Fixed point compare use FX instead of CR
- Reorder and rename definitions to make it cleaner
- Mark tw instructions as using P8_FX_NONE
- Leave TODO note in logical instructions, since the modified codegen is suspicious
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
244 |
We can't differentiate whether it's aligned or not, assume naturally aligned here. |
LGTM. Thanks very much again for the very big effort.
llvm/lib/Target/PowerPC/PPCScheduleP8.td | ||
---|---|---|
144 | I really like the idea about group the instructions by the pipeline. It will be much easier for later change for this model. |
This seems like resources for issue not dispatch?