Page MenuHomePhabricator

[RISCV] Lazily add RVV C intrinsics.
Needs ReviewPublic

Authored by HsiangKai on Oct 12 2021, 1:01 AM.

Details

Summary

Leverage the method OpenCL uses that adds C intrinsics when the lookup
failed. There is no need to define C intrinsics in the header file any
more. It could help to avoid the large header file to speed up the
compilation of RVV source code. Besides that, only the C intrinsics used
by the users will be added into the declaration table.

This patch is based on https://reviews.llvm.org/D103228 and inspired by
OpenCL implementation.

Authored-by: Kito Cheng <kito.cheng@sifive.com>
Co-Authored-by: Hsiangkai Wang <kai.wang@sifive.com>

Diff Detail

Event Timeline

HsiangKai created this revision.Oct 12 2021, 1:01 AM
HsiangKai requested review of this revision.Oct 12 2021, 1:01 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptOct 12 2021, 1:01 AM
rogfer01 added inline comments.Oct 12 2021, 2:18 AM
clang/lib/Sema/SemaLookup.cpp
923

Not for this patch: I think this table may be a bit large so all lookups (including those that will fail) will be slower after a #pragma riscv intrinsic vector is found.

Filtering them as fast as possible (looking at the spec shows that currently all RVV intrinsics start with v) or using some hash table (if too difficult to build at compile time we could build it the first time we get here?) might be something we want to do.

1054

This seems a bit fragile if there are more predefines than just this one. I understand the intent is to avoid looking up the RVV builtin every time, only do that if we have found the pragma, right?

Several pragma handlers receive a reference to Sema (in an object called Action or Actions) and then they notify Sema (via a member function that would have to be added to it) about having parsed the pragma. That could be used to set some flag to true in Sema itself and also emit diagnostics if we want (e.g. what if the pragma is used twice? can it be used anywhere?).

Do you think this would be workable?

kito-cheng added inline comments.Oct 12 2021, 2:35 AM
clang/lib/Sema/SemaLookup.cpp
923

OpenCL using a tablegen-based generator to generate a big swtich table to speed up the lookup rather than linear scan, here is generated file:

https://gist.github.com/kito-cheng/46616c82c0f25e5df31ff5eaa14914ba#file-openclbuiltins-inc-L8055

I think we could using same approach to prevent the slow down.

HsiangKai added inline comments.Oct 12 2021, 6:53 PM
clang/lib/Sema/SemaLookup.cpp
923

Indeed, OpenCL generates a checking function, isOpenCLBuiltin, as the filter. I will use the similar approach to filter the queries.

1054

I should add a flag somewhere as the checking condition. I will try to find a better way to do it. Thanks for your suggestion. I think it is workable.

The purpose of the pragma is to enable the lazily insertion. If users do not include riscv_vector.h and they use vadd as the function call, we will not treat it as a vector generic intrinsic.

Although it reduces the header size, this patch will increase the binary size of clang.

Debug build:
Before this patch:

text            data            bss             dec             hex             filename
263892591       10838284        500232          275231107       1067b183        clang-14

After this patch:

text            data            bss             dec             hex             filename
263909721       12085116        500232          276495069       107afadd        clang-14

Release build:
Before this patch;

text            data            bss             dec             hex             filename
382952171       88029764        10264736        481246671       1caf3dcf        clang-14

After this patch:

text            data            bss             dec             hex             filename
387629483       94652582        10264736        492546801       1d5baaf1        clang-14

As discussed in D110684, developers complain not only compile time, but also binary size & memory usage caused by RVV intrinsics. We need to consider binary size, too. Is there other way to handle it? Or we should go back to think about the design of RVV intrinsics?

Although it reduces the header size, this patch will increase the binary size of clang.

Options we can consider to mitigate this:

  • See if we can be more economical reprenting the table itself
  • Store it in the binary in a format unsuitable for lookups (e.g. compressed) and then transform it (in memory) the first time is used so queries can be solved in a reasonable amount of time.

Or we should go back to think about the design of RVV intrinsics?

Certainly the explosion of cases is an issue.

Maybe an alternative is using overloaded builtins (similar to what atomic builtins can do when the type is not specified), but I understand this means implementing the typechecking itself. Which perhaps it can be generated too.

This looks like it might bring a lot of reduction in cases and tables. Corner cases will appear when the arguments of a call are not enough to determine the precise intrinsic desired, e.g. loads as they are currently defined would be ambiguous between different LMUL types (though there may be ways to mitigate this, e.g. overload only within LMULs).

Although it reduces the header size, this patch will increase the binary size of clang.

Options we can consider to mitigate this:

  • See if we can be more economical reprenting the table itself

Yes, I agree. This primitive work has lots of space to improve. Most of the C intrinsics have the same argument lists. We could encode them in a table and reuse the same argument lists. In addition, we could use StringMatcher in the TableGen utilities to generate the switch cases to improve the performance. I am trying to implement this patch in a better way.

  • Store it in the binary in a format unsuitable for lookups (e.g. compressed) and then transform it (in memory) the first time is used so queries can be solved in a reasonable amount of time.

Or we should go back to think about the design of RVV intrinsics?

Certainly the explosion of cases is an issue.

Maybe an alternative is using overloaded builtins (similar to what atomic builtins can do when the type is not specified), but I understand this means implementing the typechecking itself. Which perhaps it can be generated too.

This looks like it might bring a lot of reduction in cases and tables. Corner cases will appear when the arguments of a call are not enough to determine the precise intrinsic desired, e.g. loads as they are currently defined would be ambiguous between different LMUL types (though there may be ways to mitigate this, e.g. overload only within LMULs).

HsiangKai updated this revision to Diff 387472.Mon, Nov 15, 6:46 PM

Restructure the data structure to reuse information between C intrinsics.
In this way, we can have a smaller binary size and speed up the lookup
for the C intrinsics.

HsiangKai updated this revision to Diff 387488.Mon, Nov 15, 9:13 PM

Check required extensions when adding the declarations.

The clang binary size increases about +1.4%. Compared to +2.3% previously, it is better in the implementation.

The test time of lit testing under clang/test/CodeGen/RISCV/ is 25.59s. It spends 112.07s without this patch in my local environment.

HsiangKai updated this revision to Diff 388881.Mon, Nov 22, 4:32 AM

Update the implementation of getMangledName().

craig.topper added inline comments.Wed, Nov 24, 12:15 PM
clang/lib/Parse/ParsePragma.cpp
511

Since this is a clang specific pragma should it be #pragma clang riscv intrinsic? gcc's pragma for SVE is #pragma GCC aarch64 "arm_sve.h"

3842

Do we need to check that there are no tokens left to parse? Quick look at other handles I see a check for tok::eod

clang/lib/Sema/SemaLookup.cpp
904

Can this be SmallVectorImpl?

921

SmallVectorImpl

929

Use a fixed size type like uint32_t since it is a bit vector.

949

const on integer arguments doesn't make a much sense. It just prevents you from assigning over the variable in the function body.

966

Use a fixed size type like uint32_t here or uint8_t to match RequiredExts,

clang/utils/TableGen/RISCVVEmitter.cpp
1513

Use StringSet?

1516

With StringSet you can use

if (Seen.insert(RVVTy->getShortStr()).second)

The .second field from the pair will tell if you the insert happened or not.

1551

I don't think you need to insert anything here. The FctOverloadMap[IName].push_back( line below will construct an empty entry before the push_back if it doesn't already exist.

1575

I don't think you need to insert anything. The push_back line will take care of it.

1616

Can this be a vector of unique_ptrs?

1622

Can this be a unique_ptr?

1639

I think this condition can be an additional && on the previous if?

1640

Don't call SignatureListMap.find(Candidate) twice.

1677

I'd use uint16_t rather than unsigned short.

1721

I believe this will create a static global constructor which is not allowed by coding standards https://llvm.org/docs/CodingStandards.html#do-not-use-static-constructors

1760

isRVVBuiltin isn't a good name if it doesn't return a bool.

1784

StringSet?

HsiangKai updated this revision to Diff 390198.Sun, Nov 28, 5:53 AM

Address a part of comments.

HsiangKai marked 16 inline comments as done.Sun, Nov 28, 5:56 AM