This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Define SchedModel for Power8
ClosedPublic

Authored by qiucf on Jul 5 2023, 2:56 AM.

Details

Reviewers
nemanjai
shchenz
Group Reviewers
Restricted Project
Commits
rGb922a3621116: [PowerPC] Define SchedModel for Power8
Summary

PowerPC subtargets prior to Power9 use the 'legacy' itinerary way to provide scheduling information. This patch re-writes the tablegen file to define the scheduling information in the new SchedModel way. This can bring improvements to some benchmarks.

Diff Detail

Event Timeline

qiucf created this revision.Jul 5 2023, 2:56 AM
qiucf requested review of this revision.Jul 5 2023, 2:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 2:56 AM

Thanks for the big effort.
I think moving from old ProcessorItineraries form to new ProcessorModel form should be correct. Some comments from me. And also please note that there are some failures in the pre-merge checks.

llvm/lib/Target/PowerPC/PPCScheduleP8.td
18

why the MicroOpBufferSize is 60? The pwr8 UM seems showing this value as 32 * 2?

19

We need to comment here why on pwr8 this field is set as 0.

47

minor: _1, _2, _4 does not sound like a good name...

48

Why this unit number for type "2xdouble" is set to 4? Its parent P8_VMX only has 2.

61

Same with the name issue _1, ..._6.

llvm/test/CodeGen/PowerPC/BreakableToken-reduced.ll
244

Seems positive, now we use less CSRs.

llvm/test/CodeGen/PowerPC/P10-stack-alignment.ll
31

low ILP but less registers?

llvm/test/CodeGen/PowerPC/aix32-p8-scalar_vector_conversions.ll
30

lower ILP with less register, I think it is fine only if we get better perf in benchmarks.

llvm/test/CodeGen/PowerPC/atomics-i128.ll
510

negative change, now we have more mr which is not free on pwr8.

llvm/test/CodeGen/PowerPC/atomics-i16-ldst.ll
184

We may need another name for check prefix PREP10? And also double check if prefixes P8 and P9 are still used in this file?

4519

Ah, it show P8 and P9 are not used, we should delete them in the run lines.

llvm/test/CodeGen/PowerPC/atomics-i32-ldst.ll
4779

Ditto

llvm/test/CodeGen/PowerPC/atomics-i8-ldst.ll
4219

Ditto

llvm/test/CodeGen/PowerPC/atomics-regression.ll
456

Aren't the atomic related instructions are barriers to the scheduler? Can you check why there are changes around them? Thanks.

llvm/test/CodeGen/PowerPC/bool-math.ll
120

Weird. Now there is no check for this function?

llvm/test/CodeGen/PowerPC/branch_coalesce.ll
18

lower ILP and less registers

llvm/test/CodeGen/PowerPC/ctrloop-constrained-fp.ll
27

Positive

llvm/test/CodeGen/PowerPC/elf64-byval-cc.ll
35

postive

llvm/test/CodeGen/PowerPC/int128_ldst.ll
1100

Same with above

llvm/test/CodeGen/PowerPC/testComparesi32ltu.ll
37

This looks positive. We had a request to move the lr store instruction be away from mflr

qiucf updated this revision to Diff 543374.Jul 23 2023, 10:08 PM
qiucf marked 5 inline comments as done.

Please notice the 2 failures on two different bot in the premerge test.

llvm/lib/Target/PowerPC/PPCScheduleP8.td
32

Sorry, this still confuses me. The super is P8_FPU and P8_FPU has 4 execution units which can be executed as 2 way SIMD operation for double. So why we set the ProcResource value to 4 here?

33

Setting P8_VMX as super of "4xSingle" also seems weird. In ISA of pwr8, I think most instructions that handle 4xSingle type are VSX related and from the UM "10.3.2 Instructional Latencies and Throughputs", I saw most of them are using pipeline FPU. So I guess, we should use P8_FPU as Super instead?

llvm/test/CodeGen/PowerPC/atomics-i16-ldst.ll
4517

There are still unused check prefix in the run lines.

llvm/test/CodeGen/PowerPC/atomics-i32-ldst.ll
4777

Please delete the unused check prefixes, so that we won't have this note.

llvm/test/CodeGen/PowerPC/atomics-i8-ldst.ll
4219

Please try to remove this NOTE.

qiucf updated this revision to Diff 543788.Jul 24 2023, 7:27 PM
qiucf marked 10 inline comments as done.
qiucf added inline comments.
llvm/lib/Target/PowerPC/PPCScheduleP8.td
32

Thanks, updated.

33

2.1.3 Speculative Superscalar Inner Core Organization:

  • Two VMX execution units capable of executing simple FX, permute, complex FX, and 4-way SIMD single-precision floating-point operations

I think the reason is before VSX, Altivec already had instructions for 4xSingle vectors and they are implemented within VMX units. So their VSX equivalents uses the same execution units.

48

Its parent P8_FPU has 4 units.

llvm/test/CodeGen/PowerPC/atomics-i16-ldst.ll
184

Like CHECK-NONP10 or CHECK-P8P9? But it actually has the same meaning as CHECK-PREP10.

llvm/test/CodeGen/PowerPC/bool-math.ll
120

Because the test RUN line did not specify -mcpu, so that Linux uses pwr8 while AIX uses pwr7. Fixed.

qiucf added inline comments.Jul 25 2023, 1:11 AM
llvm/test/CodeGen/PowerPC/atomics-i128.ll
510

The BB with negative result is quite simple:

SU(0):   %29:gprc = COPY $r3
SU(1):   %34:gprc = LWZ 12, %stack.0
SU(2):   %33:gprc = LWZ 8, %stack.0
SU(3):   %32:gprc = LWZ 4, %stack.0
SU(4):   %31:gprc = LWZ 0, %stack.0
SU(5):   %30:crrc = CMPLWI %29:gprc, 0

The old scheduling model puts SU(5) just after SU(0), so that %29:gprc is assigned with r3. With the new scheduling model, the BB is unchanged after pre-isel scheduling.

I think it's more likely an issue of the scheduler exposed by new model and I'll further look at why regpressure is not respected here. But here I'm curious why the register allocator not assign r3 to %29.

qiucf added inline comments.Jul 25 2023, 1:28 AM
llvm/test/CodeGen/PowerPC/atomics-regression.ll
456

lwsync (SYNC) is treated as global memory barrier. But clrlwi (RLWINM) neither writes nor reads any memory, so scheduling such instruction around SYNC is okay. BTW, the new codegen is the original order.

qiucf added a comment.Jul 27 2023, 7:29 PM

I'd like to give extra comments on some definitions:

2.1.3 Speculative Superscalar Inner Core Organization
Out-of-order issue of up to 10 operations into the following 10 issue ports:
– Two ports to do loads or fixed-point operations.
– Two ports to do stores, fixed-point loads, or fixed-point operations.
– Two fixed-point operations
– Two issue ports shared by two floating-point, two VSX, two VMX, one crypto, and one DFP operations
– One branch operation
– One condition register operation

Sixteen execution units:
– Two symmetric load/store units (LSU), capable of executing stores, fixed-point loads, and simple fixed-point operations
– Two load-only units (LU) also capable of executing simple fixed-point operations
– Two symmetric fixed-point units (FXU)
– Four floating-point units (FPU), implemented as two 2-way SIMD operations for double- and single-precision. Scalar binary floating-point instructions can only use two FPUs.
– Two VMX execution units capable of executing simple FX, permute, complex FX, and 4-way SIMD single-precision floating-point operations
– One Crypto unit
– One decimal floating-point unit (DFU)
– One branch execution unit (BR)
– One condition register logical execution unit (CRL)

So correspondent units and ports are defined. But

  • For ports, we may need to 'combine' them. For example, port group 1, 2, 3 are all acceptable for fixed-point operations. So we need to create a 'parent' group for fixed-point ops and make group 1, 2, 3 all its children. Similar rule applies for fixed-point loads.
  • Some instructions uses 'LSU or FXU', a similar parent-child definition are also needed.

Compared to Power9 model, dispatch rules and pipeline forwarding definition are missing in this patch, because they're vague in Power8 manual.

shchenz added inline comments.Aug 1 2023, 10:32 PM
llvm/lib/Target/PowerPC/PPCScheduleP8.td
12–13

This seems like resources for issue not dispatch?

28

Don't know how do you arrange these instructions. But from the UM: vaddfp pipeline is FPU that should be using P8_FP_4x32 and the execution unit should be P8_FPU?

33

I believe we should use P8_VSX as super of P8_FP_4x32. According to the instructions which uses P8_FP_4x32, all of them are with pipeline FPU that should be for the 4 FPU units.
Or we can create two P8_FP_4x32 resources, one is child of P8_FPU and one is child of P8_VMX. However the child of P8_VMX seems have no use instructions in the "Instruction Latencies and Throughputs" sheet.

89

instruction like STB seems occupies two pipelines LSU and LU while seems here it only occupies LSU pipiline?

qiucf updated this revision to Diff 546332.Aug 1 2023, 11:58 PM
qiucf marked 3 inline comments as done.
  • Rename P8_DISP to P8_ISSUE
  • Add scheduling info for DFP instructions
  • Mark resource usage of 4-way SIMD FP instructions from VMX to FPU
  • Add resource usage of FXU to basic load instructions
llvm/lib/Target/PowerPC/PPCScheduleP8.td
89

It occupies both:

def P8_LS_LU : SchedWriteRes<[P8_LU, P8_LS]>;
qiucf added a comment.Aug 30 2023, 1:15 AM

Gentle ping

Looks almost good to me. Some other comments related to the instruction definitions.

Please note the two failures in the premerge check on X86 target.

llvm/lib/Target/PowerPC/PPCScheduleP8.td
68

nit: This issue port is not used?

88

should be def P8_LS_FP : SchedWriteRes<[P8_LSU, P8_FPU]>;?

169

For cracked instructions, should we only use one issue port?

171

ADD_rec/SUBF_rec/NEG_rec seems can use both FXU or LSU?

qiucf updated this revision to Diff 555300.Sep 1 2023, 1:34 AM
qiucf marked 3 inline comments as done.
qiucf added inline comments.
llvm/lib/Target/PowerPC/PPCScheduleP8.td
169

I tried modelling cracked instructions, but not got good performance results, which can be future improvement item. I added a TODO note.

171

Thanks, moved to use only FX instead.

shchenz added inline comments.Sep 1 2023, 3:48 AM
llvm/lib/Target/PowerPC/PPCScheduleP8.td
210

hmm, this change seems still not accurate, below is from PWR8 UM(17 March 2016)

addi addis add add. subf subf. addic subfic adde addme subfme subfze neg neg. nego
FXU (or LU or LSU for non-dot forms)

add. should be able to use FXU or LU (4 hardware units).

Maybe we need to define a new unit group like P8_FX_LU for this kind of instructions?

I suggest you check other instructions as well compared with the UM.

Thanks very much for the big effort.

qiucf added inline comments.Sep 3 2023, 10:03 PM
llvm/lib/Target/PowerPC/PPCScheduleP8.td
210

Per my understanding, only non-dot forms use LU/LSU?

shchenz added inline comments.Sep 4 2023, 2:06 AM
llvm/lib/Target/PowerPC/PPCScheduleP8.td
210

OK, so we have different understanding about FXU (or LU or LSU for non-dot forms).

Let us use yours for now until we find a clear doc about what units dot form instructions can use.

224

Seems I can not find mflr, is it expected?

244
lswi lswx stswi stswx (unaligned)
LSU,LU
lwarx ldarx
LSU or LU
qiucf updated this revision to Diff 556226.Sep 7 2023, 9:07 PM
  • CRBit instructions are not defined in User Manual, remove
  • Fixed point compare use FX instead of CR
  • Reorder and rename definitions to make it cleaner
  • Mark tw instructions as using P8_FX_NONE
  • Leave TODO note in logical instructions, since the modified codegen is suspicious
qiucf added inline comments.Sep 7 2023, 9:14 PM
llvm/lib/Target/PowerPC/PPCScheduleP8.td
244
  • lswi (naturally aligned) uses LSU or LU
  • lswi lswx stswi stswx (unaligned) uses LSU,LU
  • stswi (naturally aligned) uses LSU, LU

We can't differentiate whether it's aligned or not, assume naturally aligned here.

shchenz accepted this revision as: shchenz.Sep 8 2023, 12:04 AM

LGTM. Thanks very much again for the very big effort.

llvm/lib/Target/PowerPC/PPCScheduleP8.td
144

I really like the idea about group the instructions by the pipeline. It will be much easier for later change for this model.

This revision is now accepted and ready to land.Sep 8 2023, 12:04 AM
This revision was landed with ongoing or failed builds.Sep 8 2023, 12:45 AM
This revision was automatically updated to reflect the committed changes.