This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Inline memcpy() as a sequence of ldp-stp with 64-bit registers
Abandoned · Public

Authored by sdmitrouk on Oct 31 2014, 7:25 AM.

Details

Summary

Here is the result of my attempts to inline memcpy() in an "optimal" way, meaning interleaved load/store pair instructions that use 64-bit registers.

It was suggested to implement this in the AArch64LoadStoreOptimizer pass, which worked until the post-RA machine instruction scheduler was enabled for the AArch64 target, so it became a separate pass that runs after the post-RA MISched. The pass is disabled by default, and the changes in the tests make them pass both with and without it.

When an ldr/str appears in the middle of the sequence, it is reordered as well, except for cases like:

ldr
ldp
stp
str

which occur only when copying a small amount of data. I'm not sure whether it's worth reordering them to

ldr
str
ldp
stp

but that can be done.

Unfortunately, I don't have AArch64 hardware to run performance tests on yet, so I can't back this up with numbers, but such a sequence was claimed to be preferred. At least this gives a way to test it, or the patch can simply sit here for now.
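
For anyone who wants to try it, a trivial way to exercise the path is a small fixed-size copy, which the backend expands inline. This is my own illustration, not a test from the patch, and the size of 64 bytes is just an assumption (any size small enough to be expanded inline will do):

#include <cstring>

// With the pass enabled, the inline expansion of this copy should come out as
// interleaved ldp/stp pairs on 64-bit GPRs rather than all loads followed by
// all stores.
void copy64(char *dst, const char *src) {
  std::memcpy(dst, src, 64);
}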

Diff Detail

Event Timeline

sdmitrouk updated this revision to Diff 15614. Oct 31 2014, 7:25 AM
sdmitrouk retitled this revision from to [AArch64] Inline memcpy() as a sequence of ldp-stp with 64-bit registers.
sdmitrouk updated this object.
sdmitrouk edited the test plan for this revision.
sdmitrouk set the repository for this revision to rL LLVM.
sdmitrouk added a subscriber: Unknown Object (MLST).
jmolloy requested changes to this revision. Nov 3 2014, 4:04 AM
jmolloy edited edge metadata.

Hi,

I think the principle here is OK. It would have been nicer, though, if we could convince the scheduler to do this rather than going behind its back. Have you talked to Andy Trick or Dave Estes to work out if this is possible?

Comments inline.

I'd also like Tim's signoff before this goes in.

James

lib/Target/AArch64/AArch64ISelLowering.cpp
474 ↗(On Diff #15614)

s/af/as

lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
26

The important thing is that we have ldp/stp in that order, ideally with increasing addresses. We don't need to cluster them all together - it's the ordering of memory operations that counts, I think.

So we can have:

ldp
stp
add # unrelated operation
ldp
stp

This should be fine, and may be a good thing, depending on the microarchitecture.

49

This is a fairly generic statistic name. Something more A64 specific perhaps?

145

Wouldn't isSafeToSpeculate() conservatively do the same job here?

206

This was not mentioned in the comment; why should all loads come before all stores?

291

Here and elsewhere: single-line if's should have their {}'s removed.

This revision now requires changes to proceed. Nov 3 2014, 4:04 AM
sdmitrouk updated this revision to Diff 15691. Nov 3 2014, 5:42 AM
sdmitrouk edited edge metadata.

Addressed some of the comments (will reply to the rest):

  1. Fixed the s/af/as typo.
  2. NumSequences => NumLdStSequencesUpdated.
  3. Fixed a rather confusing comment.
  4. Removed extra curly braces around the bodies of single-line if statements.

Hi James,

It would have been nicer, though, if we could convince the scheduler to do this rather than going behind its back.

That's what I tried initially, but I wasn't able to make it work.

Have you talked to Andy Trick or Dave Estes to work out if this is possible?

I don't think so. There were a couple of related threads on llvm-dev, and doing
this similarly to the load/store optimizer was the only proposed solution. The
scheduler doesn't seem to have extension points where one could provide hints
about instructions; at least, I didn't find a way other than subclassing it.

Comments inline.

Those not answered inline here are addressed in the newer revision.

Sergey

lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
26

I'll try that; it shouldn't require a lot of changes.

ideally with increasing addresses

Actually, the input to the pass is already in reverse order:

ldp     x10, x11, [x8, #48]
stp     x10, x11, [x9, #48]
ldp     x10, x11, [x8, #32]
stp     x10, x11, [x9, #32]
ldp     x10, x11, [x8, #16]
stp     x10, x11, [x9, #16]
ldp      x10, x8, [x8]
stp      x10, x8, [x9]

which might come from getMemcpyLoadsAndStores() in SelectionDAG.cpp, which
doesn't specify an order.

145

I don't see any function with exactly this name; functions with similar names don't seem to fit, and some are also static.

206

The comment was wrong; I meant that the first load should go before the first store.

t.p.northover edited edge metadata. Nov 3 2014, 1:44 PM

I'm really not sure about this one. I agree with James that hacking around with the instructions after the scheduler seems really iffy. It sounds much more like we're hitting a scheduler defect that we want to fix properly instead, unless it's a constraint that's just impossible to represent.

Perhaps some kind of forwarding from a load to a dependent store has been omitted?

I've also got some other issues with the actual implementation.

Cheers.

Tim.

lib/Target/AArch64/AArch64ISelLowering.cpp
6615–6620 ↗(On Diff #15691)

How general is this? We should be writing for future cores as well as existing ones, and always preferring 64-bit operations seems like it'll be more and more of an oddity in future.

It also seems like it belongs in a completely separate patch to the interleaving one.

lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
243–246

This seems like a really fragile way to do this. It's only ever going to work on a basic block with a single memcpy operation and no other loads/stores.

It sounds much more like we're hitting a scheduler defect that we want to fix properly instead, unless it's a constraint that's just impossible to represent.

The scheduler seems to do its job correctly in the generic case, but it seems to be missing
information about instruction operands. In this case it could ignore the latency of an ldp when
it's followed by an stp with the same operands.

Perhaps some kind of forwarding from a load to a dependent store has been omitted?

I tried gluing and/or combining nodes in a lot of ways; the scheduler doesn't care about any
of them. Another way would be to introduce a pseudo-instruction and expand it after
scheduling, but that requires temporary registers, and it's too late to allocate registers
at that point.

Regards,
Sergey

lib/Target/AArch64/AArch64ISelLowering.cpp
6615–6620 ↗(On Diff #15691)

How general is this? We should be writing for future cores as well as existing ones, and always preferring 64-bit operations seems like it'll be more and more of an oddity in future.

If I understand correctly, the problem with 128-bit registers is that they are floating-point registers rather than general-purpose ones, so as long as there are no 128-bit GP registers, this should hold.

It also seems like it belongs in a completely separate patch to the interleaving one.

The pass is there to allow better code generation for memcpy(), but technically they can be separated (the pass could go first, perhaps with slightly changed tests).

lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
243–246

It's only ever going to work on a basic block with a single memcpy operation and no other loads/stores.

The condition isn't exactly that one, but it does have a similar constraint. The next revision, which works on pairs of instructions, should change this.

sdmitrouk updated this revision to Diff 15881. Nov 6 2014, 8:57 AM
sdmitrouk edited edge metadata.

Changed the code to process pairs of loads and stores rather than all of them inside a basic block.
Removed changes that are not directly related to the new pass.

Tim, James,

I've updated the diff, but I'll also ask Andy Trick and/or Dave Estes whether there
is a better way that involves the instruction scheduler, using this review as an
example of what I want to achieve.

Cheers,
Sergey

In this case it could ignore the latency of an ldp when it's followed by an stp with the same operands.

I don't think that's right. We're not magically going to make the stp less quick; we can just issue them back to back in the same cycle. Potentially a ScheduleHazardRecognizer might be the right thing here?

In this case it could ignore the latency of an ldp when it's followed by an stp with the same operands.

I don't think that's right. We're not magically going to make the stp less quick; we can just issue them back to back in the same cycle.

Well, I didn't assume there is some magic; what I meant is that when the scheduler looks for the next instruction after the stp, the ldp should be the best match among all predecessors.

Potentially a ScheduleHazardRecognizer might be the right thing here?

From its description, I'd say it does the opposite: it allows postponing execution of some instruction until the next cycle.
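
For reference, the relevant hooks look roughly like this (paraphrased from memory from llvm/CodeGen/ScheduleHazardRecognizer.h; simplified, not a complete declaration). The recognizer is only consulted about the candidate the scheduler has already picked, so it can stall an instruction but has no way to promote a specific predecessor such as the matching ldp:

class ScheduleHazardRecognizer {
public:
  enum HazardType { NoHazard, Hazard, NoopHazard };
  // Queried for the instruction about to be issued; returning Hazard only
  // delays it to a later cycle.
  virtual HazardType getHazardType(SUnit *SU, int Stalls = 0);
  // Notification hooks; neither offers a way to suggest a better candidate.
  virtual void EmitInstruction(SUnit *SU);
  virtual void AdvanceCycle();
  // ...
};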

Dave's advice to look at clustering in the scheduler, applied through DAG mutations, almost worked; the only issue is that some "free" instructions can still be inserted between ldp and stp (previously these were the instructions that compute addresses), and I'm not sure this can be solved using clustering. The next thing to try is a custom scheduling strategy; it might be an option, but I have only just started adding it.
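
For readers unfamiliar with DAG mutations, a minimal sketch of this kind of mutation looks something like the following. This is not the actual code from my experiments; the class and variable names are made up, the LDPXi/STPXi matching is purely illustrative, the signatures are the in-tree ones as far as I remember them, and registration with the post-RA scheduler via addMutation() is omitted. For every 64-bit ldp it adds a weak Cluster edge to the stp that consumes its results, so the scheduler tries to keep the pair adjacent:

#include "AArch64InstrInfo.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/CodeGen/ScheduleDAG.h"
#include <utility>

using namespace llvm;

namespace {
struct LdpStpClusterMutation : ScheduleDAGMutation {
  void apply(ScheduleDAGMI *DAG) override {
    SmallVector<std::pair<SUnit *, SUnit *>, 8> Pairs;
    for (SUnit &SU : DAG->SUnits) {
      MachineInstr *MI = SU.getInstr();
      if (!MI || MI->getOpcode() != AArch64::LDPXi)
        continue;
      // Find an stp that is fed by this ldp through a data dependence.
      for (const SDep &Succ : SU.Succs) {
        if (Succ.getKind() != SDep::Data)
          continue;
        SUnit *StSU = Succ.getSUnit();
        if (StSU->getInstr() && StSU->getInstr()->getOpcode() == AArch64::STPXi)
          Pairs.push_back(std::make_pair(&SU, StSU));
      }
    }
    // Add the edges after the scan so the successor lists are not mutated
    // while being iterated.
    for (auto &P : Pairs)
      DAG->addEdge(P.second, SDep(P.first, SDep::Cluster));
  }
};
} // end anonymous namespace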

sdmitrouk abandoned this revision. Jan 19 2015, 1:46 AM

A custom instruction scheduler (actually, both a pre-RA and a post-RA one might be needed) might be a solution, but it replaces the generic scheduler, which would only make scheduling worse.

I have no more ideas; it looks like this kind of instruction interleaving can't be achieved in LLVM with reasonable effort.