This pass is always added to the pipeline, but it is skipped unless the optimization level is O0 or the function has the optnone attribute. With -O0, the definitions of the shapes for AMX intrinsics sit right next to the AMX intrinsic code, so we cannot find a point that post-dominates all the shape definitions while dominating all the AMX intrinsics. To break the dependency on the shapes, we transform the AMX intrinsics into scalar operations so that compilation does not fail. In the long term, we should improve fast register allocation so it can allocate AMX registers.
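To make "transform AMX intrinsics into scalar operations" concrete, here is a rough standalone model of what a scalarized tile load boils down to. The function name, the element-wise stride, and the explicit zero-fill are illustrative and not lifted from the patch (the zero-fill corresponds to the spec point raised later about dst's unused area).

```cpp
#include <cstdint>
#include <cstring>

// Illustrative scalar model (not the pass's actual IR output) of lowering a
// tile load: copy Row x Col elements from a strided buffer into a fixed
// 16x16 i32 tile, zeroing everything outside the Row x Col effective area.
void scalarTileLoad(int32_t Tile[16][16], const int32_t *Base,
                    unsigned Stride, unsigned Row, unsigned Col) {
  std::memset(Tile, 0, 16 * 16 * sizeof(int32_t)); // keep the unused area 0
  for (unsigned I = 0; I < Row; ++I)               // outer loop over rows
    for (unsigned J = 0; J < Col; ++J)             // inner loop over columns
      Tile[I][J] = Base[I * Stride + J];
}
```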
Event Timeline
| llvm/lib/Target/X86/X86TargetMachine.cpp | |
|---|---|
| 420 | I don't think you can detect O0 this way. A function can have the optnone attribute in a non-O0 pipeline and won't be optimized by the middle end. This can occur if you mix an O0 translation unit and an O3 translation unit in LTO and use O3 for the LTO pipeline. |
| llvm/lib/Target/X86/X86TargetMachine.cpp | |
|---|---|
| 420 | @craig.topper, thank you! How about checking the "optnone" attribute in X86LowerAMXIntrinsicsPass and deciding there whether the AMX intrinsics in the function need to be scalarized? |
Strange. llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll passes on my local machine.
| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll | |
|---|---|
| 78 | Sorry, there is a bug here. According to the AMX spec, the remaining part of dst should be all zeros. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 434 | bool C = false |
| 440 | We can use a forward order to iterate it. |
| 471 | Remove the {} for a single-line loop. |
| 507 | You can just return LAT.visit() directly. |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll | |
|---|---|
| 61 | Maybe we can use a zero-mask load in a future optimization. |
| llvm/lib/Target/X86/X86TargetMachine.cpp | |
|---|---|
| 421 | We may add both passes anyway and have each pass skip itself based on the optimization level and the optnone attribute. |
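A minimal sketch of the gating being discussed, assuming the pass has access to the TargetMachine; the helper name and its placement are illustrative, not part of the patch.

```cpp
#include "llvm/IR/Function.h"
#include "llvm/Target/TargetMachine.h"
using namespace llvm;

// Sketch: scalarize only when nothing later in the pipeline will lower the
// AMX intrinsics, i.e. at -O0 or when this particular function is optnone.
static bool shouldScalarizeAMX(const Function &F, const TargetMachine &TM) {
  return TM.getOptLevel() == CodeGenOpt::None ||
         F.hasFnAttribute(Attribute::OptimizeNone);
}
```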
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 212–213 | In fact, there is no need to handle Row, Col, and K here; just use a fixed 16x16 size. The result of the calculation is the same within the effective area (we just need tileload to keep the "unused" area at 0). |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 212–213 | We should keep the code here. For bf16, since +0.0 (0x0000) * a negative float equals -0.0 (0x8000), your approach cannot guarantee that the outer edge stays all-zero. |
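The signed-zero point can be checked with a tiny standalone program (plain float rather than bf16, but the sign rule is the same):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  float Z = +0.0f * -3.5f;  // IEEE 754: +0.0 times a negative value is -0.0
  std::printf("signbit(Z) = %d\n", std::signbit(Z)); // prints 1
  // So a tile padded with +0.0 does not stay bit-for-bit zero after the
  // multiply, which is why the Row/Col bounds are kept.
  return 0;
}
```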
| llvm/include/llvm/CodeGen/Passes.h | |
|---|---|
| 504 | Add comments to describe what the pass does? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 2 | This seems to be the wrong file name. |
| 12 | Typo: 'able'. |
| 160 | Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops so that it can be shared by both, and by some other functions. |
| 251 | Delete the dead code. |
| 268 | It should be on another line. |
| 281 | Better to put it on a new line. |
| 293 | Better to put it on a new line. |
| 373 | The name doesn't seem good. Is "PreBuilder" better? And why do we need two builders in this function? |
| 378 | Maybe use a right-shift instruction, which is more efficient. I don't think the following passes can optimize the operation. |
| 412 | Is "PreBuilder" better? |
| 416 | Shift? |
| 449 | PreBuilder? |
| 488 | Do we iterate the instructions in topological order or in post order? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 488 | It should be pre-order, since we need to handle cases without bitcasts, such as amx-low-intrinsics-no-bitcast.ll. |
| llvm/include/llvm/CodeGen/Passes.h | |
|---|---|
| 504 | transforms |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 2 | We usually write the header comment as ===--- filename - description ---=== |
| 52 | Ctx |
| 89 | Can we just use template <bool IsLoad>? I think it can also reduce the branching. |
| 100 | Not sure about the arithmetic intrinsics, but at least for the load and store intrinsics we can use the LLVM intrinsics llvm.masked.load/store to reduce the inner loop. |
| 167 | Maybe we can just use cast<> to raise the assertion for us. |
| 224 | You can use cast<> to check for failure so that VecA/B/C won't be left uninitialized. |
| 230 | ditto |
| 232 | Should we check that it is V256I32? |
| 233 | ditto |
| 289 | EltC? |
| 312 | Is it necessary to insert ResElt into VecC? |
| 341 | TileLoadStore |
| 342 | Forgot to remove? |
| 344 | ditto |
| 388 | ditto |
| 392 | ditto |

| llvm/lib/Target/X86/X86LowerAMXType.cpp | |
|---|---|
| 337 | ditto |
| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll | |
|---|---|
| 1 ↗ (On Diff #325964) | Better to name it amx-low-intrinsics-no-amx-bitcast.ll |
| 13 ↗ (On Diff #325964) | It seems the body block is not necessary. |
| 19 ↗ (On Diff #325964) | ditto. The label TILELOAD_SCALARIZE_COLS_BODY is not even used. |
| 31 ↗ (On Diff #325964) | I think cols.latch is not necessary either. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | Yes, it is necessary, since you should use the updated EltC (i.e. Cij) when you are doing the matrix dot product. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | But you don't need to update both C and D. Pseudocode like the following should be enough: for (k : K) Dij += Aik * Bkj; Dij += Cij |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 89 | Why do we need a template instead of passing a parameter bool IsLoad? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 89 | Bing thought template instantiation could keep the condition from turning into branch instructions. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 89 | It's arguable which benefits more: saving code size or avoiding branch instructions. :) |
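A standalone illustration of that trade-off, independent of the patch: the template parameter turns the load/store direction into a compile-time constant, so each instantiation carries no branch, while the runtime-bool version keeps one copy of the code but keeps the branch (which nothing will clean up at -O0).

```cpp
#include <cstdint>

// Compile-time flag: each instantiation contains only one side of the
// check, so no branch is emitted, but the helper exists twice in the binary.
template <bool IsTileLoad>
void copyElement(int32_t *Memory, int32_t *Tile, unsigned Idx) {
  if (IsTileLoad)
    Tile[Idx] = Memory[Idx]; // tileload direction
  else
    Memory[Idx] = Tile[Idx]; // tilestore direction
}

// Runtime flag: a single copy of the code, but the branch stays in the
// emitted code unless an optimizer removes it (it won't at -O0).
void copyElement(int32_t *Memory, int32_t *Tile, unsigned Idx,
                 bool IsTileLoad) {
  if (IsTileLoad)
    Tile[Idx] = Memory[Idx];
  else
    Memory[Idx] = Tile[Idx];
}
```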
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 100 | I think we can put together a follow-up patch for this optimization. |
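For context on the masked load/store idea, here is a scalar model (plain C++, not the LLVM intrinsic itself) of what a masked load provides; the lane count and element type are assumed to match the 16 x i32 rows this pass works with.

```cpp
#include <cstdint>

// Scalar semantics of a masked vector load over 16 x i32 lanes: lanes whose
// mask bit is clear take the pass-through value instead of reading memory.
// With a mask of Col leading ones, one such load covers a whole tile row,
// which is what would let the inner column loop disappear.
void maskedLoad16xi32(int32_t Dst[16], const int32_t *Src,
                      const bool Mask[16], const int32_t PassThru[16]) {
  for (unsigned Lane = 0; Lane < 16; ++Lane)
    Dst[Lane] = Mask[Lane] ? Src[Lane] : PassThru[Lane];
}
```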
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | I changed the code into the following style, which also reduces the inner loop's size: for (k : K) Cij += Aik * Bkj; Dij = Cij. Besides, I hoisted the computation of (i, j)'s linear index above the inner loops. |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll | |
|---|---|
| 13 ↗ (On Diff #325964) | In fact, the ISel pass can merge basic blocks together. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract, and insert instructions for vector C. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | But your solution still needs to update D, so D's phi will be kept in the inner loop. |
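To make the accumulation orders in this thread concrete, here is a standalone scalar sketch of the variant the author describes (a local accumulator seeded from Cij, a single store into Dij). It is simplified to i32 x i32 products rather than the real tdpbssd byte-pair dot product, and all names are illustrative rather than taken from the patch.

```cpp
#include <cstdint>

// Scalar model of the discussed scheme: vector C is only read, vector D is
// written once per (i, j), and the per-element index work is hoisted out of
// the K loop. Simplified element types; not the pass's actual IR.
void tileDotProduct(int32_t D[16][16], const int32_t C[16][16],
                    const int32_t A[16][16], const int32_t B[16][16],
                    unsigned Row, unsigned Col, unsigned K) {
  for (unsigned I = 0; I < Row; ++I)
    for (unsigned J = 0; J < Col; ++J) {
      int32_t Acc = C[I][J];            // read Cij once, outside the K loop
      for (unsigned Ki = 0; Ki < K; ++Ki)
        Acc += A[I][Ki] * B[Ki][J];     // accumulate the dot product
      D[I][J] = Acc;                    // single update of Dij
    }
}
```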
LGTM with some nitpicks 😊
| llvm/include/llvm/CodeGen/Passes.h | |
|---|---|
| 504 | transforms |

| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 10 | operations |
| 11 | We always enable it. Also need to mention optnone. |
| 47 | The curly brackets are not necessary here. |
| 83 | Do we need to remove the successor? Isn't it still being dominated? |
| 118 | Use 1 directly? |
| 122 | Better to use the same naming convention, i.e. ColLoopHeader. |
| 126–127 | Better to change the order, e.g. Type *EltTy = B.getInt32Ty(); FixedVectorType *V256I32Ty = FixedVectorType::get(EltTy, 256); |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll | |
|---|---|
| 2 | I think we should move these files to llvm/test/Transforms/. |

| llvm/test/CodeGen/X86/AMX/amx-type.ll | |
|---|---|
| 2 | Why add this? Is it O2 by default? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 83 | I think this is to remove the edge from the preheader to tmp, because we insert a loop between them. |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll | |
|---|---|
| 2 | Not sure about that. Our .cpp code is under the lib/Target/X86/ folder. |

| llvm/test/CodeGen/X86/AMX/amx-type.ll | |
|---|---|
| 2 | I think this is to test that at opt level 2 this pass does nothing. |
This seems to break the build: https://buildkite.com/mlir/mlir-core/builds/12026#91ec4dfe-542f-4312-92db-7d555f05ce06.
I could reproduce it locally, and reverting it locally fixes the build.
Please address, thanks!
@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867
Hi @RKSimon @nicolasvasilache, it seems we haven't told libLLVMX86CodeGen.so.13git to link against TransformUtils in llvm/lib/Target/X86/CMakeLists.txt. That's why we hit the build failure.
But there is a strange thing that can be observed in build.ninja:
When I run cmake with "-DBUILD_SHARED_LIBS=OFF", libLLVMX86CodeGen.a still links lib/libLLVMTransformUtils.a.
When I run cmake with "-DBUILD_SHARED_LIBS=ON", libLLVMX86CodeGen.so.13git does not link TransformUtils.
Is there any difference in the build system between static libraries and shared libraries?
It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions
The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).
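A rough sketch of what that could look like; the function name, the exact intrinsic list, and the elided lowering call are assumptions for illustration, not the actual follow-up patch.

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsX86.h"
using namespace llvm;

static bool lowerAMXIntrinsics(Function &F) {
  // Cheap scan first: bail out before building any analyses, so functions
  // without AMX intrinsics (the common case at -O0) pay almost nothing.
  bool HasAMX = false;
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *II = dyn_cast<IntrinsicInst>(&I))
        switch (II->getIntrinsicID()) {
        case Intrinsic::x86_tileloadd64_internal:
        case Intrinsic::x86_tilestored64_internal:
        case Intrinsic::x86_tdpbssd_internal:
          HasAMX = true;
          break;
        default:
          break;
        }
  if (!HasAMX)
    return false;
  DominatorTree DT(F); // built locally instead of via getAnalysis<...>()
  LoopInfo LI(DT);     // only when lowering will actually happen
  // ... run the existing X86LowerAMXIntrinsics logic with DT and LI ...
  return true;
}
```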
Thanks @nikic, I will fix it ASAP. Besides, how can I reproduce the regression?
I'm asking because I'd like to verify that the regression can no longer be reproduced with my upcoming bugfix.
@yubing In this case I would recommend building sqlite3.c from the test-suite under perf stat and looking at the instructions metric. For me the command looks like this:
perf stat CLANG_BINARY -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -I. -MD -MT MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -MF MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o.d -o MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -c ../MultiSource/Applications/sqlite3/sqlite3.c
You can generally get a build command using ninja -v sqlite3 in test-suite.