This is an archive of the discontinued LLVM Phabricator instance.

[GlobalISel] Improving InstructionSelect's performance by reducing MatchTable
ClosedPublic

Authored by rtereshin on Mar 20 2018, 2:11 PM.

Details

Reviewers
qcolombet
dsanders
bogner
aemerson
javed.absar
Commits
rGa4c410d50d78: [GlobalISel][InstructionSelect] Switching over root LLTs, perf patch 10
rG5f5e55008f70: [GlobalISel][InstructionSelect] Moving Reg Bank Checks forward, perf patch 9
rG152fc1605e70: [GlobalISel][InstructionSelect] Maximizing # of Group's common conditions, perf…
rG9a9fa49cd271: [GlobalISel][InstructionSelect] Sorting MatchTable's 2nd level by root LLT…
rGb1ba127aa87f: [GlobalISel][InstructionSelect] Moving type checks forward, perf patch 6
rGfedae33efacb: [GlobalISel][InstructionSelect] MatchTable second level grouping, perf patch 5
rG0ee082f3b9b0: [GlobalISel][InstructionSelect] Switching MatchTable over opcodes, perf patch 4
rG770136030806: [GlobalISel][InstructionSelect] Sorting MatchTable's first level by opcodes and…
rG19da66759944: [GlobalISel][InstructionSelect] Removing redundant num operands and nested def…
rGf0dc9fa934fa: [GlobalISel] Improving InstructionSelect's performance by reducing MatchTable…
rL333146: [GlobalISel][InstructionSelect] Switching over root LLTs, perf patch 10
rL333144: [GlobalISel][InstructionSelect] Moving Reg Bank Checks forward, perf patch 9
rL333139: [GlobalISel][InstructionSelect] Maximizing # of Group's common conditions, perf…
rL333131: [GlobalISel][InstructionSelect] Sorting MatchTable's 2nd level by root LLT…
rL333114: [GlobalISel][InstructionSelect] Moving type checks forward, perf patch 6
rL333053: [GlobalISel][InstructionSelect] MatchTable second level grouping, perf patch 5
rL333017: [GlobalISel][InstructionSelect] Switching MatchTable over opcodes, perf patch 4
rL333001: [GlobalISel][InstructionSelect] Sorting MatchTable's first level by opcodes and…
rL332945: [GlobalISel][InstructionSelect] Removing redundant num operands and nested def…
rL332907: [GlobalISel] Improving InstructionSelect's performance by reducing MatchTable…
Summary

This patch decreases time spent by GlobalISel in its InstructionSelect pass by roughly 60% for -O0 builds for large inputs as measured on sqlite3-amalgamation (http://sqlite.org/download.html) targeting AArch64.

The performance improvement is achieved solely by reducing the number of matching GIM_* opcodes executed by the MatchTable's interpreter during the selection by approx. a factor of 30, which also brings contribution of this particular part of the selection process to the overall runtime of InstructionSelect pass down from approx. 60-70% to 5-7%, thus making further improvements in this particular direction not very profitable.

The patch loosely depends on Testgen patches and relies on additional testing provided by it (see https://reviews.llvm.org/D43962)

Diff Detail

Repository
rL LLVM

Event Timeline

rtereshin created this revision.Mar 20 2018, 2:11 PM
rtereshin updated this revision to Diff 139196.Mar 20 2018, 2:22 PM

fixing small formatting issue, NFC

qcolombet accepted this revision.Apr 23 2018, 2:01 PM

Hi Roman,

LGTM.
When you commit, please split the patch between, formatting NFC, fixing of wrong assertion :), adding switch opcode, adding switch types.

Cheers,
-Quentin

include/llvm/CodeGen/GlobalISel/InstructionSelector.h
154 ↗(On Diff #139196)

Formatting patches, in a different NFC patch please.

include/llvm/CodeGen/GlobalISel/InstructionSelectorImpl.h
66 ↗(On Diff #139196)

Ditto

273 ↗(On Diff #139196)

Separate patch

This revision is now accepted and ready to land.Apr 23 2018, 2:01 PM

Hi Roman,

LGTM.
When you commit, please split the patch between, formatting NFC, fixing of wrong assertion :), adding switch opcode, adding switch types.

Cheers,
-Quentin

Hi Quentin,

Thank you for looking into this!

Good suggestions, will do.

Thanks,
Roman

I've split this patch up into an initial and rather large mostly NFC commit that does necessary preparation work and refactoring, and a following series of small patches introducing a specific optimization. The relative performance impact of each commit could be seen here:

Release build, assertions off, llc sqlite3.bc -mtriple aarch64-- -O0 -time-passes -time-compilations=5

BaselineMain Perf Patch: NFC + not checking topology first + moving reg bank checks to the end+ Removing redundant num operands and nested def operands checks+ Sorting first level by opcodes and num operands+ Switching over opcodes+ Second level grouping+ Moving type checks forward+ Sorting second level by root LLT+ Maximizing # of Group's common conditions+ Switching over LLTs+ Moving Reg Bank Checks forward+ Dropping redundant type checks
secondsdiff vs Baselinesecondsdiff vs Baselinesecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previoussecondsdiff vs previous
Instruction Select0.94950.97270.91780.75170.60480.52150.48240.37760.37620.35060.34280.3444
0.94690.97400.91680.75060.60210.52200.48290.37670.37650.35050.34280.3443
0.95030.97200.91960.75570.60550.52030.48220.37600.37580.35020.34340.3445
Min0.9470.9722.7%0.917-3.2%0.751-18.1%0.602-19.8%0.520-13.6%0.482-7.3%0.376-22.0%0.376-0.1%0.350-6.8%0.343-2.1%0.3440.4%
Avg0.9490.9732.5%0.918-3.2%0.753-18.0%0.604-19.7%0.521-13.7%0.483-7.4%0.377-21.9%0.376-0.2%0.350-6.8%0.343-2.1%0.3440.4%
Err0.4%0.2%0.3%0.7%0.6%0.3%0.1%0.4%0.2%0.1%0.2%0.1%
+ Making get type faster+ Dropping redundant type checks
diff vs Baselinesecondsdiff vs previoussecondsdiff vs previous
0.32640.3266
0.32630.3264
0.32660.3277
Min-65.5%0.326-4.8%0.3260.0%
Avg-65.6%0.326-4.8%0.3270.1%
Err0.1%0.4%

Making get type faster refers to https://reviews.llvm.org/D46809

Dropping redundant type checks has been closely investigated by dumping full traces of opcodes executed by the selector for sqlite3 amalgamation: the traces are identical down to extra type checks and additional nesting levels. The negative performance impact (though, nearly indistinguishable from error) is quite surprising. Especially the fact that making type checks faster brings noticeable improvement, while removing them altogether does not. The only theory I have is that checking types on all consecutive operands keeps something important hot in cache, while not checking them / checking in a non-local way gets important data evicted.

I'm not planning on committing that patch yet.

This revision was automatically updated to reflect the committed changes.