Page MenuHomePhabricator

[ARM] Implement TTI::getMemcpyCost

Authored by SjoerdMeijer on Mar 25 2019, 11:12 AM.



This implements TargetTransformInfo method getMemcpyCost, which estimates the number of instructions to which a memcpy instruction expands to.

This patch depends on D59785 and D59766.

Diff Detail


Event Timeline

SjoerdMeijer created this revision.Mar 25 2019, 11:12 AM
SjoerdMeijer retitled this revision from [ARM] WIP: implement TTI:getMemcpyCost to [ARM] Implement TTI:getMemcpyCost.
SjoerdMeijer edited the summary of this revision. (Show Details)
SjoerdMeijer added reviewers: dmgreen, efriedma, fhahn.

Ignoring for a moment how I would like to use this, I think this is a nice self-contained improvement (or perhaps even bug fix) so that getIntstructionCost/getIntrinsicCost returns something an awful lot more reasonable for memcpy's. I think the dependent refactoring patches are general improvements too. So, a friendly ping, any opinions on this?

olista01 added inline comments.Mar 29 2019, 3:28 AM
404 ↗(On Diff #192345)

I can't find the old definition or any uses of this function in the existing code or linked patches (D59785, D59766), have I missed a patch?

Ah, oops, that was a experiment in my local branch. I will get rid of this, and upload a new diff soon. So, thanks for spotting this, you haven't missed anything!

SjoerdMeijer updated this revision to Diff 192802.EditedMar 29 2019, 4:30 AM
SjoerdMeijer retitled this revision from [ARM] Implement TTI:getMemcpyCost to [ARM] Implement TTI::getMemcpyCost.
SjoerdMeijer edited the summary of this revision. (Show Details)
  • cleaned up the diff,
  • tweaked the cost for generating a library call.
SjoerdMeijer updated this revision to Diff 193888.EditedApr 5 2019, 8:45 AM

Run code-size benchmark CSiBe, triggered an assert in function FindOptimalMemOpLowering, and fixed that.

This are the code-size results:


Overall, small differences and the changes are neutral, but I will look into libpng to see what is happening there.

SjoerdMeijer added a reviewer: samparker.

Removed the change in findOptimalMemOpLowering, because that belongs to D59766 which I have just updated.

About the regression: in my local tree had a logic error in findOptimalMemOpLowering, i.e. the condition to bail early was wrong. It was right here in the upstream diffs, so somehow I managed to mess that up).

Now, the good news is that the regressions disappear, In fact, there are no changes at all anymore. Because of the logic error, is was triggering more often, also for other intrinsics. Thus, this is now almost a non-functional change, which is what I was expecting as this is just the initial plumbing to model the cost correctly.

Investigating the "regressions" was a useful exercise though. I learned that in libPNG it was actually making the right decisions, but we were unlucky with register allocation allocating high registers like R9 and R10 not allowing narrow encodings for instructions LDRB and STRB. The other regression, in TEEM, was because it was also transforming memset intrinsics to eabi_memclr4 calls. This was just wrong, and was an result of not bailing early in findOptimalMemOpLowering. But this shows I definitely want to look at removing this early bail in a follow-up patch, because with the memclr things fixed, it should be a win overall and reduce code-size.

But summarising: this is not causing any changes in codesize for two code-bases CSiBe and Mbed, and thus is almost a non-functional change, which was my intention.

dmgreen added inline comments.Apr 24 2019, 1:58 PM
412 ↗(On Diff #193888)

Can you explain why this is more than 6? I would imagine it would be something like 4; 1 (for the call) + a few for argument setup.

436 ↗(On Diff #193888)

Does this need to be clang-formatted?

373 ↗(On Diff #196444)

Is it worth adding a few strict-align tests too?

Hi Dave, many thanks for taking a look!

Can you explain why this is more than 6? I would imagine it would be something like 4; 1 (for the call) + a few for argument setup.

6 was arbitrary, I wanted to make it slightly more costly than 'TCC_Expensive', but don't have good reasons at the moment, so I've changed it back to 4.

Does this need to be clang-formatted?

I did and made a few changes. But to keep the inline comments near the arguments readable, I kept them on separate lines (clang-format puts them after each other).

Is it worth adding a few strict-align tests too?

Yes, thanks for the suggestion. Done!

dmgreen added inline comments.Apr 26 2019, 3:16 AM
420 ↗(On Diff #196632)

How come this is needed now? I was under the impression that findOptimalMemOpLowering handled allows unaligned cases. And if the memory is aligned (but size >= 8), we can still expand the memcpy.

434 ↗(On Diff #196632)

Should we be using the same value as getMaxStoresPerMemcpy here?

That way we may not need the "Size >= 32 && ... " checks above.

Thanks for the suggestion to use getMaxStoresPerMemcpy; that greatly simplified and cleaned up things. Have added more tests too (all functions are now tested with/without strict align).

dmgreen accepted this revision.Apr 29 2019, 8:17 AM

Nice. LGTM

422 ↗(On Diff #197094)


180 ↗(On Diff #197094)

This can be removed?

This revision is now accepted and ready to land.Apr 29 2019, 8:17 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2019, 3:28 AM