This is an archive of the discontinued LLVM Phabricator instance.

[X86] Optimize getImpliedDisabledFeatures & getImpliedEnabledFeatures
ClosedPublic

Authored by MaskRay on Aug 4 2020, 4:29 PM.

Details

Summary

Previously the time complexity is O(|number of paths from the root to an
implied feature| * CPU_FWATURE_MAX) where CPU_FEATURE_MAX is 92.

The number of paths can be large (theoretically exponential).

For an inline asm statement, there is a code path
clang::Parser::ParseAsmStatement -> clang::Sema::ActOnGCCAsmStmt -> ASTContext::getFunctionFeatureMap
leading to potentially many calls of getImpliedEnabledFeatures (41 for my -march=native case).

We should improve the performance a bit in case the number of inline asm
statements is large (Linux kernel builds).

Diff Detail

Event Timeline

MaskRay created this revision.Aug 4 2020, 4:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 4 2020, 4:29 PM
MaskRay requested review of this revision.Aug 4 2020, 4:29 PM

Thanks for the patch, I'm happy to test it out.

llvm/lib/Support/X86TargetParser.cpp
98–101

are these more concise to return llvm::any_of here and above?

576

can we not use != here?

Also, I don't think the results of the queries to getImpliedDisabledFeatures and getImpliedEnabledFeatures is likely to change, though we repeatedly check the same values (at least for this input I'm testing). Is it worth throwing some FeatureBitSets in a map of some sort?

Also, I don't think the results of the queries to getImpliedDisabledFeatures and getImpliedEnabledFeatures is likely to change, though we repeatedly check the same values (at least for this input I'm testing). Is it worth throwing some FeatureBitSets in a map of some sort?

For example, applying:

diff --git a/llvm/lib/Support/X86TargetParser.cpp b/llvm/lib/Support/X86TargetParser.cpp
index 572d1203aaf2..ae351b6c4563 100644
--- a/llvm/lib/Support/X86TargetParser.cpp
+++ b/llvm/lib/Support/X86TargetParser.cpp
@@ -572,9 +572,11 @@ static void getImpliedDisabledFeatures(FeatureBitset &Bits, unsigned Value) {
   }
 }
 
+#include <llvm/Support/raw_ostream.h>
 void llvm::X86::getImpliedFeatures(
     StringRef Feature, bool Enabled,
     SmallVectorImpl<StringRef> &ImpliedFeatures) {
+  errs() << __func__ << ": " << Feature.str() << " : " << Enabled << "\n";
   auto I = llvm::find_if(
       FeatureInfos, [&](const FeatureInfo &FI) { return FI.Name == Feature; });
   if (I == std::end(FeatureInfos)) {

then building one file in the kernel (manually invoking clang, not make) spews enough output to fill my scrollback buffer. Looks like the same features being enabled/disabled all over again.

MaskRay marked an inline comment as done.Aug 4 2020, 4:52 PM
MaskRay added inline comments.
llvm/lib/Support/X86TargetParser.cpp
98–101

llvm::any_of is usually used on one collection. Here we are comparing two collections..

MaskRay updated this revision to Diff 283064.Aug 4 2020, 4:53 PM

operator== -> operator!=

nickdesaulniers added inline comments.Aug 4 2020, 4:55 PM
llvm/lib/Support/X86TargetParser.cpp
98–101

Sure for operator !=, but operator bool looks like one collection though.

I suspect setFeatureEnabled is being called for every global.

I suspect setFeatureEnabled is being called for every global.

Can you check how often ASTContext::getFunctionFeatureMap is called. I think most of these calls should originate through that.

craig.topper added inline comments.Aug 4 2020, 5:10 PM
llvm/lib/Support/X86TargetParser.cpp
98–101

Instead of operator bool can we do any() to match std::bitset and the similar class in include/llvm/MC/SubtargetFeature.h

MaskRay marked an inline comment as done.Aug 4 2020, 5:11 PM
MaskRay added inline comments.
llvm/lib/Support/X86TargetParser.cpp
98–101

llvm::any_of can't be used because it is not constexpr: error: constexpr function never produces a constant expression [-Winvalid-constexpr]

craig.topper added inline comments.Aug 4 2020, 5:12 PM
llvm/lib/Support/X86TargetParser.cpp
98–101

We don't need constexpr for this usage do we?

MaskRay updated this revision to Diff 283068.Aug 4 2020, 5:13 PM

constexpr operator bool -> non-constexpr any

MaskRay marked an inline comment as done.Aug 4 2020, 5:13 PM

I suspect setFeatureEnabled is being called for every global.

Ah, it looks like clang::Parser::ParseAsmStatement, so each inline asm statement is problematic. I guess that's no surprise for the kernel. There's another call from ParseAsmStatement that then calls clang::targets::X86TargetInfo::initFeatureMap, but I can't see it (the type signature overflows in perf report).

I suspect setFeatureEnabled is being called for every global.

Ah, it looks like clang::Parser::ParseAsmStatement, so each inline asm statement is problematic. I guess that's no surprise for the kernel. There's another call from ParseAsmStatement that then calls clang::targets::X86TargetInfo::initFeatureMap, but I can't see it (the type signature overflows in perf report).

We can probably do something better there if the function doesn't have a target attribute on it.

craig.topper added inline comments.Aug 4 2020, 5:25 PM
llvm/lib/Support/X86TargetParser.cpp
573

Is there something special about going in reverse here?

MaskRay marked an inline comment as done.Aug 4 2020, 5:31 PM
MaskRay added inline comments.
llvm/lib/Support/X86TargetParser.cpp
573

Backward iteration is more efficient.

If most "implies" edges are forward (i.e. from a smaller index to a larger index; if we organize features from older to newer), one iteration is going to finalize Bits. The second iteration will break the loop.

MaskRay added inline comments.Aug 4 2020, 5:33 PM
llvm/lib/Support/X86TargetParser.cpp
573

Backward iteration is more efficient.

If most "implies" edges are backward (i.e. from a larger index to a smaller index; if we organize features from older to newer), one sweep is going to finalize Bits. The second sweep will make Prev == Bits.

attaching gdb, setting a breakpoint in clang::targets::X86TargetInfo::initFeatureMap, skipping the first instance which looks legit (clang::TargetInfo::CreateTargetInfo -> clang::targets::X86TargetInfo::initFeatureMap -> clang::targets::X86TargetInfo::initFeatureMap), the rest all look like:

clang::Parser::ParseAsmStatement -> clang::Sema::ActOnGCCAsmStmt -> clang::targets::X86TargetInfo::initFeatureMap

Probably the call to Context.getFunctionFeatureMap(FeatureMap, FD); in clang::Sema::ActOnGCCAsmStmt.

craig.topper added inline comments.Aug 4 2020, 5:43 PM
llvm/lib/Support/X86TargetParser.cpp
573

The feature list is currently sorted with the first portion in an order defined by libgcc/compiler-rt. And the rest in alphabetical order. I believe libgcc recently made a change to share some code with their driver and now all bits are features are defined by libgcc. With a bunch in alphabetical order. I'm not sure if that means all features are officially supported by __builtin_cpu_supports in gcc now or not.

But I think the majority of the features that have dependencies are in the libgcc compatibility section. Certainly the longest portion of any chain of dependencies is there.

This revision is now accepted and ready to land.Aug 4 2020, 5:44 PM
MaskRay updated this revision to Diff 283077.Aug 4 2020, 5:48 PM
MaskRay retitled this revision from [X86] Optimize getImpliedDisabledFeatures & getImpliedEnabledFeatures' to [X86] Optimize getImpliedDisabledFeatures & getImpliedEnabledFeatures.
MaskRay edited the summary of this revision. (Show Details)

Add a fast path as in many cases Implies is empty

This revision was landed with ongoing or failed builds.Aug 4 2020, 5:50 PM
This revision was automatically updated to reflect the committed changes.
nickdesaulniers added a comment.EditedAug 4 2020, 5:53 PM

I'll try to get some numbers comparing overall build time, and full build profiles with this applied, tomorrow. Thanks for the hard work! (And I'll work with @Nathan-Huckleberry on the suggestion for Sema::ActOnGCCAsmStmt. We can likely save a pointer to the llvm::StringMap<bool> FeatureMap within Sema and only recompute it if we have a target function attribute).

MaskRay added a comment.EditedAug 4 2020, 6:25 PM

I'll try to get some numbers comparing overall build time, and full build profiles with this applied, tomorrow. Thanks for the hard work! (And I'll work with @Nathan-Huckleberry on the suggestion for Sema::ActOnGCCAsmStmt. We can likely save a pointer to the llvm::StringMap<bool> FeatureMap within Sema and only recompute it if we have a target function attribute).

For my trivial -march=native example, Target->getTargetOpts().Features.size() is 82:

{"+64bit", "+adx", "+aes", "+avx", "+avx2", "+avx512bw", "+avx512cd", "+avx512dq", "+avx512f", "+avx512v
l", "+bmi", "+bmi2", "+clflushopt", "+clwb", "+cmov", "+cx16", "+cx8", "+f16c", "+fma", "+fsgsbase", "+fxsr", "+invpcid", "+lzcnt", "+mmx", "+movbe",
"+pclmul", "+pku", "+popcnt", "+prfchw", "+sahf", "+sse", "+sse2", "+sse3", "+sse4.1", "+sse4.2", "+ssse3", "+x87", "+xsave", "+xsavec", "+xsaves", "-
amx-bf16", "-amx-int8", "-amx-tile", "-avx512bf16", "-avx512bitalg", "-avx512er", "-avx512ifma", "-avx512pf", "-avx512vbmi", "-avx512vbmi2", "-avx512v
nni", "-avx512vp2intersect", "-avx512vpopcntdq", "-cldemote", "-clzero", "-enqcmd", "-fma4", "-gfni", "-lwp", "-movdir64b", "-movdiri", "-mwaitx", "-p

You may check whether there is improvement room. I'd be wary of creating a cache as latent caching invalidation bugs would be difficult to debug :(

4c9ed3ed3d2fc7622acf5fc0d80ad20b44cf376a :
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:56.05
Top methods in bottom up trace:

+    9.17%  clang-12                   [.] getImpliedDisabledFeatures
-    3.04%  clang-12                   [.] llvm::StringMapImpl::LookupBucketFor
   - 3.01% llvm::StringMapImpl::LookupBucketFor
      - 1.10% llvm::StringMap<bool, llvm::MallocAllocator>::try_emplace<>
         + 1.10% clang::targets::X86TargetInfo::setFeatureEnabled
+    1.67%  clang-12                   [.] llvm::X86::getImpliedFeatures

0c7af8c83bd1acb0ca78f35ddde29b6fde4363a0:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:53.39
Top methods in bottom up trace:

+    3.59%  clang-12                   [.] llvm::X86::getImpliedFeatures
-    3.36%  clang-12                   [.] llvm::StringMapImpl::LookupBucketFor
   - 3.32% llvm::StringMapImpl::LookupBucketFor
      - 1.24% llvm::StringMap<bool, llvm::MallocAllocator>::try_emplace<>
         + 1.24% clang::targets::X86TargetInfo::setFeatureEnabled

So that's good; getImpliedDisabledFeatures now falls out of the top slot. Looks like memoizing llvm::X86::getImpliedFeatures will still help, a lot.

I'll try to get some numbers comparing overall build time, and full build profiles with this applied, tomorrow. Thanks for the hard work! (And I'll work with @Nathan-Huckleberry on the suggestion for Sema::ActOnGCCAsmStmt. We can likely save a pointer to the llvm::StringMap<bool> FeatureMap within Sema and only recompute it if we have a target function attribute).

You may check whether there is improvement room. I'd be wary of creating a cache as latent caching invalidation bugs would be difficult to debug :(

But I don't expect that the "implied features" would change, so there would be no need to invalidate any cache. Caching without invalidation is just memoization. Or is llvm::X86::getImpliedFeatures not pure? I suspect it is.

4c9ed3ed3d2fc7622acf5fc0d80ad20b44cf376a :
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:56.05
Top methods in bottom up trace:

+    9.17%  clang-12                   [.] getImpliedDisabledFeatures
-    3.04%  clang-12                   [.] llvm::StringMapImpl::LookupBucketFor
   - 3.01% llvm::StringMapImpl::LookupBucketFor
      - 1.10% llvm::StringMap<bool, llvm::MallocAllocator>::try_emplace<>
         + 1.10% clang::targets::X86TargetInfo::setFeatureEnabled
+    1.67%  clang-12                   [.] llvm::X86::getImpliedFeatures

0c7af8c83bd1acb0ca78f35ddde29b6fde4363a0:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:53.39
Top methods in bottom up trace:

+    3.59%  clang-12                   [.] llvm::X86::getImpliedFeatures
-    3.36%  clang-12                   [.] llvm::StringMapImpl::LookupBucketFor
   - 3.32% llvm::StringMapImpl::LookupBucketFor
      - 1.24% llvm::StringMap<bool, llvm::MallocAllocator>::try_emplace<>
         + 1.24% clang::targets::X86TargetInfo::setFeatureEnabled

So that's good; getImpliedDisabledFeatures now falls out of the top slot. Looks like memoizing llvm::X86::getImpliedFeatures will still help, a lot.

I'll try to get some numbers comparing overall build time, and full build profiles with this applied, tomorrow. Thanks for the hard work! (And I'll work with @Nathan-Huckleberry on the suggestion for Sema::ActOnGCCAsmStmt. We can likely save a pointer to the llvm::StringMap<bool> FeatureMap within Sema and only recompute it if we have a target function attribute).

You may check whether there is improvement room. I'd be wary of creating a cache as latent caching invalidation bugs would be difficult to debug :(

But I don't expect that the "implied features" would change, so there would be no need to invalidate any cache. Caching without invalidation is just memoization. Or is llvm::X86::getImpliedFeatures not pure? I suspect it is.

It's pure. But I think we should focus on getting clang to not rebuild the feature map for every inline asm statement. That will also remove a string map creation and destruction.

I should also mention that the code that reason clang is doing this is to check the ABI compatibility for vector arguments to produce a nice error message about feature mismatch. The backend is also able to generate an error for the same case. Its just not as nice as it won't tell you that avx512 is missing. So we're paying a high cost for nicer error message. The inline asm should have source location so I actually wonder if the backend could be made to print a better error message itself.

llvm/lib/Support/X86TargetParser.cpp
573

I might be missing something with how the data is sorted, but breaking the loop early feels like we might miss cases.

C -> A,B
D -> A
E -> A,C
F -> D

If we had the above dependencies and enabled {E,F} wouldn't we not enable B?

craig.topper added inline comments.Aug 5 2020, 3:58 PM
llvm/lib/Support/X86TargetParser.cpp
573

What loop are we breaking early?