This is an archive of the discontinued LLVM Phabricator instance.

[Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type
ClosedPublic

Authored by wmi on Aug 9 2017, 5:04 PM.

Details

Summary

Currently, all consecutive bitfields are wrapped into one large integer unless there is an unnamed zero-sized bitfield in between. This patch makes a bitfield accessible as a separate memory location if it has the width of a legal integer type and its bit offset is naturally aligned for that type. This can significantly improve the access efficiency of such bitfields.
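
For illustration only (this struct is hypothetical, not taken from the patch), the kind of layout the change targets could look like this on a typical 32/64-bit target:

struct Example {
  unsigned int a : 3;   // bit offset 0, width 3: not a legal integer width
  unsigned int b : 5;   // bit offset 3, width 5: not a legal integer width
  unsigned int c : 8;   // bit offset 8, width 8, naturally aligned
  unsigned int d : 16;  // bit offset 16, width 16, naturally aligned
};

Without the change, any access to a, b, c, or d is a wide 32-bit load/store plus masking; with it, c and d could each be loaded and stored directly as an 8-bit or 16-bit location.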

https://reviews.llvm.org/D30416 tries to achieve the same goal in LLVM, but it is much more difficult there because it has to deal with IR that has already been significantly transformed by all sorts of optimizations. This patch does it in Clang instead.

With the patch, we can remove most of D30416 except for the illegal-memory-access shrinking.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Aug 9 2017, 5:04 PM
wmi edited the summary of this revision. (Show Details)Aug 9 2017, 5:05 PM
wmi edited the summary of this revision. (Show Details)Aug 9 2017, 5:13 PM
chandlerc edited edge metadata.Aug 9 2017, 7:42 PM

This has been discussed before and I still pretty strongly disagree with it.

This cripples the ability of TSan to find race conditions between accesses to consecutive bitfields -- and these bugs have actually come up.

We also have had cases in the past where LLVM missed significant bitfield combines and optimizations due to loading them as separate integers. Those would become problems again, and I think they would be even harder to solve than narrowing the access is going to be because we will have strictly less information to work with.

Ultimately, while I understand the appeal of this approach, I don't think it is correct and I think we should instead figure out how to optimize these memory accesses well in LLVM. That approach will have the added benefit of optimizing cases where the user has manually used a large integer to simulate bitfields, and making combining and canonicalization substantially easier.
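
(For reference, a hypothetical sketch of the manually simulated bitfields mentioned above; the names and masks are made up for illustration:)

#include <cstdint>

// Hand-rolled "bitfields" packed into one 64-bit integer. Ideally LLVM should
// optimize extracts and inserts like these as well as real bitfield accesses.
uint64_t flags;

inline unsigned getMiddle16() {
  return static_cast<unsigned>((flags >> 16) & 0xffff);  // 16-bit field at an aligned offset
}

inline void setMiddle16(unsigned v) {
  flags = (flags & ~(0xffffULL << 16)) | (uint64_t(v & 0xffff) << 16);
}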

wmi added a comment.Aug 9 2017, 10:42 PM

This has been discussed before and I still pretty strongly disagree with it.

This cripples the ability of TSan to find race conditions between accesses to consecutive bitfields -- and these bugs have actually come up.

I guess you mean that accessing different bitfields in a consecutive run simultaneously can cause a data race, because the order of bitfields in a consecutive run is implementation-defined. With the change, TSan may miss reporting that, but such a data race can still be exposed by a different compiler.

This can be solved by detecting TSan mode in the code: if TSan is enabled, we can stop splitting the bitfields.
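
For context, a minimal sketch of the kind of race being discussed (hypothetical code, not from the review):

#include <thread>

struct S {
  unsigned int a : 16;  // bit offset 0, naturally aligned
  unsigned int b : 16;  // bit offset 16, naturally aligned
};

S s;

int main() {
  // a and b share one memory location in the C++ memory model, so these two
  // unsynchronized writes are a data race. When both fields go through one
  // wide 32-bit load/store, TSan sees overlapping accesses and can report
  // the race; if each field becomes its own 16-bit access, the machine-level
  // accesses no longer overlap and TSan may stay silent.
  std::thread t1([] { s.a = 1; });
  std::thread t2([] { s.b = 2; });
  t1.join();
  t2.join();
  return 0;
}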

We also have had cases in the past where LLVM missed significant bitfield combines and optimizations due to loading them as separate integers. Those would become problems again, and I think they would be even harder to solve than narrowing the access is going to be because we will have strictly less information to work with.

How about only separating legal-integer-width bitfields at the beginning of a consecutive run? Then it won't hinder bitfield combining. This way, it can still help in some cases, including the important case we saw.

Ultimately, while I understand the appeal of this approach, I don't think it is correct and I think we should instead figure out how to optimize these memory accesses well in LLVM. That approach will have the added benefit of optimizing cases where the user has manually used a large integer to simulate bitfields, and making combining and canonicalization substantially easier.

wmi updated this revision to Diff 110641.Aug 10 2017, 3:34 PM

Don't separate a bitfield in the middle of a run, because that can hinder the combining of bitfield accesses. Only separate a bitfield at the beginning of a run.

wmi added a comment.Aug 10 2017, 3:45 PM

In the last update I limit the bitfield separation to the beginning of a run, so no bitfield combining will be blocked.

I think I need to explain more about why I changed direction and started working on the problem in the frontend. Keeping the information by generating a wide type and letting an LLVM pass do the narrowing originally looked like a good solution to me. However, I saw in real cases that the narrowing approach at a late LLVM stage has more difficulties than I originally thought. Some of them were solved, though at the cost of code complexity, but others are harder.

  • Store-forwarding issue: To extract legal-integer-width bitfields from the large integer generated by the frontend, we need to split both the stores and the loads related to those bitfields. If a store is narrowed but a load is not, the load may be wider than the store and the target may have difficulty forwarding the store to the load, which hurts performance. Note that we found cases where the related load and store are in different functions, so when deciding whether to narrow a store, we may have no idea whether the related loads will be narrowed. If we cannot know that all the related loads will be narrowed, it is better not to narrow the store (see the sketch after this list).
  • After instcombine, some bitfield access information will be lost. The case we saw is:

    unsigned long bf1 : 16;
    unsigned long bf2 : 16;
    unsigned long bf3 : 16;
    unsigned long bf4 : 8;

    bool cond = "bf3 == 0 && bf4 == -1";

    Before instcombine, bf3 and bf4 are extracted from an i64 separately. From the extraction code pattern we can tell that bf3 is a 16-bit access and bf4 is an 8-bit access. After instcombine, bf3 and bf4 are merged into a 24-bit access: the comparison above is changed to extract 24 bits of data from the i64 (%bf.load = wide load i64, %extract = and %bf.load, 0xffffff00000000) and compare %extract with 0xffff00000000. The information that there were two legal-integer bitfield accesses is lost, and we won't narrow the load here.

    Because we cannot split the load here, we trigger the store-forwarding issue.
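
A minimal sketch of the store-forwarding scenario from the first bullet (the struct and functions are hypothetical, reusing the field widths above):

struct BF {
  unsigned long bf1 : 16;
  unsigned long bf2 : 16;
  unsigned long bf3 : 16;
  unsigned long bf4 : 8;
};

BF g;

// If only this store is narrowed to a 16-bit store of bf3, the compiler may
// not know whether the related loads elsewhere will also be narrowed.
void set_bf3(unsigned long v) { g.bf3 = v; }

bool check() {
  // If this read stays a wide 64-bit load of the whole run, it consumes bytes
  // just written by the narrow 16-bit store above, and the CPU may fail to
  // forward the store to the load, causing a stall.
  return g.bf3 == 0 && g.bf4 == 0xff;
}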

That is why I am exploring the bitfield access issue in multiple directions.

wmi updated this revision to Diff 112271.Aug 22 2017, 6:31 PM

Try another idea suggested by David.

All the bitfields in a single run are still wrapped inside a large integer according to CGBitFieldInfo. For bitfields with legal integer widths and natural alignment, we change their access manner when generating bitfield loads/stores in LLVM IR (in EmitLoadOfBitfieldLValue and EmitStoreThroughBitfieldLValue). All other bitfields are still accessed using wide loads/stores plus masking operations. Here is an example:

class A {

unsigned long f1:2;
unsigned long f2:6;
unsigned long f3:8;
unsigned long f4:4;

};
A a;

f1, f2, f3, and f4 will still be wrapped as a large integer. f1, f2, and f4 will have the same access code as before. f3 will be accessed as if it were a separate unsigned char variable.

In this way, we reduce the chance of blocking bitfield access combining. The a.f1 and a.f4 accesses can be combined as long as no a.f3 access stands between them. We will generate two fewer instructions for foo, and one more instruction for goo. So it is better, but not perfect.

void foo (unsigned long n1, unsigned long n3, unsigned long n4) {

a.f1 = n1;
a.f4 = n4;
a.f3 = n3;

}

void goo (unsigned long n1, unsigned long n3, unsigned long n4) {

a.f1 = n1;
a.f3 = n3;    // a.f3 will still block the combining of a.f1 and a.f4 because a.f3 is accessed independently.
a.f4 = n4;

}

I'm really not a fan of the degree of complexity and subtlety that this introduces into the frontend, all to allow particular backend optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM and we should instead find a way to fix this in LLVM itself.

As has been pointed out before, user code can synthesize large integers that small bit sequences are extracted from, and Clang and LLVM should handle those just as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to Clang here? I understand that there remain challenges on the LLVM side, but I don't think those challenges take *all* of the LLVM improvements off the table, I don't think we've exhausted all ways of improving the LLVM changes being proposed, and I think we should still land all of those and re-evaluate how important these issues are when all of that is in place.

wmi updated this revision to Diff 116232.Sep 21 2017, 11:29 AM

Changes following the discussion:

  • Put the bitfield-splitting logic under an option, off by default.
  • When a sanitizer is enabled, the bitfield-splitting option is ignored and a warning message is emitted.

In addition, a test is added.

You seem to be only changing the behavior for the "separatable" fields, but I suspect you want to change the behavior for the others too. The bitfield would be decomposed into shards, separated by the naturally-sized-and-aligned fields. Each access only loads its shard. For example, in your test case you have:

struct S3 {
  unsigned long f1:14;
  unsigned long f2:18;
  unsigned long f3:32;
};

and you test that, with this option, loading/storing to a3.f3 only accesses the specific 4 bytes composing f3. But if you load f1 or f2, we're still loading all 8 bytes, right? I think we should only load/store the lower 4 bytes when we access a3.f1 and/or a3.f2.

Otherwise, you can again end up with the narrow-store/wide-load problem for nearby fields under a different set of circumstances.
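
As a rough sketch of the suggested sharding (conceptual only, not the actual CGRecordLayout representation), S3 would then behave as if it were laid out like this:

// f1 and f2 share a 32-bit storage unit (shard 0); the naturally aligned f3
// gets its own 32-bit unit (shard 1). Accessing f1 or f2 then touches only
// the lower 4 bytes, and accessing f3 touches only the upper 4 bytes.
struct S3_sharded {
  unsigned int f1 : 14;
  unsigned int f2 : 18;
  unsigned int f3 : 32;
};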

include/clang/Driver/Options.td
1032 ↗(On Diff #116232)

I'm not opposed to -fsplit-bitfields, but I'd prefer if we find something more self-explanatory. It's not really clear what "splitting a bitfield" means. Maybe?

-fsplit-bitfield-accesses
-fdecomposed-bitfield-accesses
-fsharded-bitfield-accesses
-ffine-grained-bitfield-accesses

(I think that I prefer -ffine-grained-bitfield-accesses, although it's the longest)

1034 ↗(On Diff #116232)

How about?

Use separate access for bitfields with legal widths and alignments.

I don't think that "in LLVM" is needed here (or we could put "in LLVM" on an awful lot of these options).

lib/CodeGen/CGExpr.cpp
1679 ↗(On Diff #116232)

var -> variable

wmi added a comment.Sep 26 2017, 4:20 PM

You seem to be only changing the behavior for the "separatable" fields, but I suspect you want to change the behavior for the others too. The bitfield would be decomposed into shards, separated by the naturally-sized-and-aligned fields. Each access only loads its shard. For example, in your test case you have:

struct S3 {
  unsigned long f1:14;
  unsigned long f2:18;
  unsigned long f3:32;
};

and you test that, with this option, loading/storing to a3.f3 only accesses the specific 4 bytes composing f3. But if you load f1 or f2, we're still loading all 8 bytes, right? I think we should only load/store the lower 4 bytes when we access a3.f1 and/or a3.f2.

This is intentional. If struct S3 were like the following:
struct S3 {
  unsigned long f1:14;
  unsigned long f2:32;
  unsigned long f3:18;
};

and if there is no write of a.f2 between a.f1 and a.f3, the loads of a.f1 and a.f3 can still be shared. The intent is to keep the combining opportunity as much as possible while reducing the cost of accessing naturally-sized-and-aligned fields.
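
A small sketch of the sharing being described, assuming the reordered S3 above and that only f2 is accessed as a separate field (hypothetical function):

unsigned long sum(const S3 &a) {
  // Reading f1 and f3 can still share one wide load of the underlying 64-bit
  // storage unit, provided no store to a.f2 occurs between the two reads.
  return a.f1 + a.f3;
}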

Otherwise, you can again end up with the narrow-store/wide-load problem for nearby fields under a different set of circumstances.

Good catch. The problem is indeed possible. Considering the big performance impact and the triaging difficulty of the store-forwarding problem, I have to sacrifice the combining opportunity above and take the suggestion just as you describe.

Thanks,
Wei.

include/clang/Driver/Options.td
1032 ↗(On Diff #116232)

Ok.

1034 ↗(On Diff #116232)

Sure.

wmi updated this revision to Diff 116896.Sep 27 2017, 3:54 PM

Address Hal's comment. Decompose the bitfields into shards separated by the naturally-sized-and-aligned fields.

hfinkel added inline comments.Oct 5 2017, 3:34 PM
lib/CodeGen/CGRecordLayoutBuilder.cpp
411

betterBeSingleFieldRun -> IsBetterAsSingleFieldRun

440–447

The logic here is not obvious. Can you please add a comment? SingleFieldRun here is only not equal to betterBeSingleFieldRun(Field) if we've skipped 0-length bitfields, right? Please explain what's going on, and also please make sure there's a test case.

wmi marked an inline comment as done.Oct 5 2017, 6:22 PM
wmi added inline comments.
lib/CodeGen/CGRecordLayoutBuilder.cpp
440–447

I restructured the code a little and hope the logic is clearer. A test case for it is already added.

wmi updated this revision to Diff 117947.Oct 5 2017, 6:22 PM

Address Hal's comment.

hfinkel added inline comments.Oct 7 2017, 11:08 PM
include/clang/Basic/DiagnosticDriverKinds.td
335 ↗(On Diff #117947)

with a sanitizer

include/clang/Driver/Options.td
1041 ↗(On Diff #117947)

access -> accesses

1044 ↗(On Diff #117947)

Use large-integer access for consecutive bitfield runs.

include/clang/Frontend/CodeGenOptions.def
182 ↗(On Diff #117947)

These lines are too long.

lib/CodeGen/CGRecordLayoutBuilder.cpp
447

Why do you have the IsBetterAsSingleFieldRun(Run) check here, where we'll evaluate it multiple times (for all of the fields in the run)? Can't you check the predicate above directly?

// Any non-zero-length bitfield can start a new run.
if (Field->getBitWidthValue(Context) != 0 &&
     !IsBetterAsSingleFieldRun(Field)) {
  Run = Field;
  StartBitOffset = getFieldBitOffset(*Field);
  ...
wmi updated this revision to Diff 118181.Oct 8 2017, 10:16 PM
wmi marked 5 inline comments as done.

Address Hal's comments.

hfinkel accepted this revision.Oct 12 2017, 9:35 AM

LGTM

include/clang/Frontend/CodeGenOptions.def
182 ↗(On Diff #118181)

finegrained -> fine-grained

(I suppose we're not sticking to 80 cols in this file anyway)

This revision is now accepted and ready to land.Oct 12 2017, 9:35 AM
This revision was automatically updated to reflect the committed changes.
chill added a subscriber: chill.Aug 21 2019, 8:18 AM

Shouldn't we disable OPT_ffine_grained_bitfield_accesses only if TSAN is active?

Herald added a project: Restricted Project.Aug 21 2019, 8:18 AM
wmi added a comment.Aug 22 2019, 3:13 PM

Shouldn't we disable OPT_ffine_grained_bitfield_accesses only if TSAN is active?

I don't remember why it is disabled for all sanitizer modes. It seems you are right that disabling the option is only necessary for TSan. Do you have an actual need for the option to function with other sanitizer modes?

chill added a comment.Aug 23 2019, 2:05 AM
In D36562#1641930, @wmi wrote:

Shouldn't we disable OPT_ffine_grained_bitfield_accesses only if TSAN is active?

I don't remember why it is disabled for all sanitizer modes. It seems you are right that disabling the option is only necessary for TSan. Do you have an actual need for the option to function with other sanitizer modes?

Well, yes and no. We have the option enabled by default, and it causes a warning when we use it together with -fsanitize=memtag (we aren't really concerned with other sanitizers). That warning broke a few builds (e.g. CMake running tests and not wanting to see *any* diagnostics). We can work around that in a number of ways, e.g. we can leave the default off for AArch64.

I'd prefer, though, to have an upstream solution if that's considered beneficial for all LLVM users, and this seems like such a case: let anyone use the option with sanitizers, unless it's known that some sanitizer's utility is negatively affected (as with TSan).

wmi added a comment.Aug 25 2019, 10:37 PM
In D36562#1641930, @wmi wrote:

Shouldn't we disable OPT_ffine_grained_bitfield_accesses only if TSAN is active?

I don't remember why it is disabled for all sanitizer modes. It seems you are right that disabling the option is only necessary for TSan. Do you have an actual need for the option to function with other sanitizer modes?

Well, yes and no. We have the option enabled by default, and it causes a warning when we use it together with -fsanitize=memtag (we aren't really concerned with other sanitizers). That warning broke a few builds (e.g. CMake running tests and not wanting to see *any* diagnostics). We can work around that in a number of ways, e.g. we can leave the default off for AArch64.

I'd prefer, though, to have an upstream solution if that's considered beneficial for all LLVM users, and this seems like such a case: let anyone use the option with sanitizers, unless it's known that some sanitizer's utility is negatively affected (as with TSan).

Thanks for providing the background in detail. I sent out a patch for it: https://reviews.llvm.org/D66726