This is an archive of the discontinued LLVM Phabricator instance.

Optimize IRGen for load/stores of bitfields
Needs RevisionPublic

Authored by mehdi_amini on Nov 25 2018, 10:33 PM.

Details

Reviewers
chandlerc
rsmith
Summary

The previous strategy was very naive: it would always load the entire
bitfield storage before extracting the desired element.
This was very pessimizing with very large bitfields, for instance
when accessing a HW memory mapping.
Consider for example:

struct Device {
  long long a:64;
  long long b:64;
  long long c:64;
  ...
  long long z:64;
};
long long f(volatile Device *dev, char field) {
  switch (field) {
    case 'a': return dev->a;
    case 'b': return dev->b;
    case 'c': return dev->c;
    ...
    case 'z': return dev->z;
  }
}

With -O2, clang would generate a switch containing a repetition of this sequence:

%125 = load volatile i1664, i1664* %124, align 8
%126 = lshr i1664 %125, 1536
%127 = trunc i1664 %126 to i32

Since this wide load cannot be optimized away, the generated assembly
involves a significant amount of code.

After this patch, the IR looks like this instead (the byte offset of 192 corresponds to the bit offset of 1536 above):

%47 = bitcast %struct.Device* %dev to i8*
%bf.elt_offset94 = getelementptr i8, i8* %47, i64 192
%48 = bitcast i8* %bf.elt_offset94 to i64*
... = load volatile i64, i64* %48, align 8

In one specific case, a routine used to initialize a particular HW device
through a volatile pointer to a very large bitfield struct (clang would
operate on a bunch of i65536 values) went from building for many hours
(I interrupted clang after 12 hours) to under 2 seconds, simply because
the IR is no longer so large that it cannot be optimized efficiently.

Diff Detail

Event Timeline

mehdi_amini created this revision.Nov 25 2018, 10:33 PM
mehdi_amini edited the summary of this revision. (Show Details)Nov 25 2018, 10:39 PM

At least for TSan, this is undesirable -- we want a chance to observe these as potential data races.

I'm happy with narrowing to some sane size relatively early, but I'm not sure it should be done in Clang?

Mehdi pointed out that his primary concern is *volatile* access.

I think having volatile access to bitfields differ from "normal" access is actually very reasonable. I wouldn't change non-volatile access; that was already the subject of a prolonged discussion.

However, Clang needs to specifically document what its mapping of volatile semantics onto bitfields is going to be. And we should check closely with GCC to understand the degree to which they differ, if at all, and whether that is reasonable.

Without looking at GCC, I see two semantics that seem plausible (sketched in code after the list):

  1. We try to emit loads sized exactly to match the bitfield *member* itself. This is, of course, awkward to impossible if the bitfield is not aligned at a byte boundary.
  2. We try to emit loads as volatile memcpy of the underlying bytes where the bitfield member sits.
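To make the difference concrete, here is a minimal sketch (the struct and field names are hypothetical, not taken from the patch) of what each option would mean for a bitfield that straddles a byte boundary:

struct Regs {
  unsigned a : 3;   /* occupies bits [0, 3) of the storage unit */
  unsigned b : 13;  /* occupies bits [3, 16): starts mid-byte */
};

unsigned read_b(volatile struct Regs *r) {
  /* Semantic 1: emit a load sized exactly to the member. b is 13 bits
   * wide and starts at bit 3, so no whole-byte integer load covers
   * exactly those bits; the "exact" load is impossible here.
   * Semantic 2: emit a volatile memcpy of the underlying bytes. b lives
   * in bytes 0 and 1 of the storage unit, so copy those two bytes and
   * then shift/mask to extract the value. */
  return r->b;
}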

But honestly, both are a bit surprising.

I wonder if we should just warn on volatile access to bitfields generally. It seems hard to give useful memory-mapped I/O semantics to these in full generality and with good portability.

We should at least warn (*aggressively* IMO) on volatile bitfields with non-multiple-of-8 start/end. Maybe even non-power-of-two start/end.
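For example, this (purely illustrative) layout is the kind of declaration such a warning would flag:

struct Mmio {
  unsigned status : 5;  /* ends at bit 5: not a multiple of 8 */
  unsigned count : 11;  /* starts at bit 5: start is not byte-aligned */
};

/* Any volatile access to status or count would trigger the warning:
 * each field has a start or end off a byte boundary, so there is no
 * way to read one without also touching bits of the other. */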

chandlerc requested changes to this revision.Dec 4 2018, 1:55 AM
chandlerc added a reviewer: rsmith.

Marking as waiting for an update from Mehdi and adding Richard as a reviewer, since this will need to deal with what our defined volatile semantics for bitfields are, which seems like the kind of thing he would want to be very aware of...

This revision now requires changes to proceed.Dec 4 2018, 1:55 AM