The previous strategy was naive: it would always load the entire
bitfield before extracting the desired elements.
This is very pessimizing for very large bitfields, for instance
when accessing a HW memory mapping.
Consider for example:
  struct Device {
    long long a:64;
    long long b:64;
    long long c:64;
    ...
    long long z:64;
  };
  long long f(volatile Device *dev, char field) {
    switch (field) {
    case 'a': return dev->a;
    case 'b': return dev->b;
    case 'c': return dev->c;
    ...
    case 'z': return dev->z;
    }
  }
With -O2, clang would generate a switch in which this sequence is repeated
for every case:

  %125 = load volatile i1664, i1664* %124, align 8
  %126 = lshr i1664 %125, 1536
  %127 = trunc i1664 %126 to i32
Because the volatile load cannot be optimized away, the assembly generated
here involves a significant amount of code.
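For readers less familiar with the IR above, here is a rough C++ picture of
what the old lowering does (a sketch, not literal clang output; the function
name is made up, and it assumes the Device struct above on a little-endian
target):

  #include <cstring>

  long long old_style_read(volatile Device *dev) {
    unsigned char wide[208];                          // the whole 26 x 64-bit storage unit
    volatile unsigned char *src = reinterpret_cast<volatile unsigned char *>(dev);
    for (unsigned i = 0; i != sizeof(wide); ++i)      // volatile: none of these reads may be dropped
      wide[i] = src[i];
    long long field;                                  // "lshr by 1536 bits" == skip 192 bytes,
    std::memcpy(&field, wide + 192, sizeof(field));   // then keep the next 8 bytes
    return field;
  }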
After this patch the IR would look like:
  %47 = bitcast %struct.Device* %dev to i8*
  %bf.elt_offset94 = getelementptr i8, i8* %47, i64 192
  %48 = bitcast i8* %bf.elt_offset94 to i64*
  ... = load volatile i64, i64* %48, align 8
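For comparison, the narrowed access corresponds roughly to the following C++
(again a sketch with a made-up function name, assuming the Device struct above
and that byte offset 192 addresses the element being read):

  long long new_style_read(volatile Device *dev) {
    volatile char *base = reinterpret_cast<volatile char *>(dev);                  // i8* bitcast
    volatile long long *elt = reinterpret_cast<volatile long long *>(base + 192);  // GEP by 192 bytes
    return *elt;                                                                   // a single 8-byte volatile load
  }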
In one specific case, a routine used to initialize a particular HW device
through a volatile pointer to a very large bitfield struct (clang would
operate on a bunch of i65536 values) went from building for many hours
(I interrupted clang after 12 hours) to under 2 seconds, simply because the
IR is no longer so large that it cannot be optimized efficiently.