
[Bitfield] Add more cases to making the bitfield a separate location
ClosedPublic

Authored by spetrovic on Oct 18 2017, 6:57 AM.

Details

Summary

This patch ensures that bitfield runs are split even when the current field is not a legal integer type. For example:

struct S3 {
  unsigned long a1 : 28;
  unsigned long a2 : 4;
  unsigned long a3 : 12;
};

struct S4 {
  unsigned long long b1 : 61;
  unsigned long b2 : 3;
  unsigned long b3 : 3;
};

At the moment, S3 is lowered as an i48 type and S4 as an i72 type; with this change, S3 is treated as i32 + i16, and S4 as i32 + i32 + i8.
With this patch we avoid unaligned loads and stores, at least on the MIPS architecture.

Diff Detail

Repository
rL LLVM

Event Timeline

spetrovic created this revision.Oct 18 2017, 6:57 AM
hfinkel edited edge metadata.Oct 23 2017, 5:41 PM

With this patch we avoid unaligned loads and stores, at least on the MIPS architecture.

Can you please explain the nature of the problematic situations?

Also, this patch does not update the comments that describe the splitting algorithm. That should be improved.

If this is just addressing the classic problem that, once we narrow a load we can't re-widen it, it might be best to solve this another way (e.g., by adding metadata noting the dereferenceable range).

Well, basically I'm just extending the existing algorithm: why should we split fields only when the current field is a legal integer type? I'm not solving a MIPS-specific problem with unaligned loads/stores.

In this example:

typedef struct {
  unsigned int f1 : 28;
  unsigned int f2 : 4;
  unsigned int f3 : 12;
} S5;

S5 *cmd;

void foo() {
  cmd->f3 = 5;
}

With this patch, code size improves not only on MIPS; it also improves on X86 and ARM. If structure S5 is treated as an i48 type, extra instructions are needed to access it, and not just on MIPS. Take the MIPS results as an example:

Output without the patch:

0000000000000000 <foo>:

 0:	3c010000 	lui	at,0x0
 4:	0039082d 	daddu	at,at,t9
 8:	64210000 	daddiu	at,at,0
 c:	dc210000 	ld	at,0(at)
10:	dc210000 	ld	at,0(at)
14:	68220000 	ldl	v0,0(at)
18:	6c220007 	ldr	v0,7(at)
1c:	64030005 	daddiu	v1,zero,5
20:	7c62fd07 	dins	v0,v1,0x14,0xc
24:	b0220000 	sdl	v0,0(at)
28:	03e00008 	jr	ra
2c:	b4220007 	sdr	v0,7(at)

Output with the patch:

0000000000000000 <foo>:

 0:	3c010000 	lui	at,0x0
 4:	0039082d 	daddu	at,at,t9
 8:	64210000 	daddiu	at,at,0
 c:	dc210000 	ld	at,0(at)
10:	dc210000 	ld	at,0(at)
14:	94220004 	lhu	v0,4(at)
18:	24030005 	li	v1,5
1c:	7c62f904 	ins	v0,v1,0x4,0x1c
20:	03e00008 	jr	ra
24:	a4220004 	sh	v0,4(at)

This is a simple example; more complicated cases show greater improvement.


I think this is part of the slippery slope we didn't want to go down. We introduced this mode in the first place only to resolve a store-to-load forwarding problem that is theoretically unsolvable by any local lowering decisions. This, on the other hand, looks like a local code-generation problem that we can/should fix in the backend. If I'm wrong, we should consider this as well.

I looked into whether I can resolve this problem in the backend. Example:

C code:

typedef struct {
  unsigned int f1 : 28;
  unsigned int f2 : 4;
  unsigned int f3 : 12;
} S5;

void foo(S5 *cmd) {
  cmd->f3 = 5;
}

The .ll file (without the patch):

%struct.S5 = type { i48 }

define void @foo(%struct.S5* nocapture %cmd) local_unnamed_addr #0 {
entry:
  %0 = bitcast %struct.S5* %cmd to i64*
  %bf.load = load i64, i64* %0, align 4
  %bf.clear = and i64 %bf.load, -4293918721
  %bf.set = or i64 %bf.clear, 5242880
  store i64 %bf.set, i64* %0, align 4
  ret void
}

The load (%bf.load = load i64, i64* %0, align 4) is a NON_EXTLOAD, and the backend later lowers it as an unaligned load. So one possible backend solution would be for each architecture to match this node pattern and replace unaligned loads with normal loads in specific cases, but I'm not sure how feasible that is. At the moment, the frontend solution seems better to me. What do you think?

wmi edited edge metadata.Nov 13 2017, 11:16 AM

I think it may be hard to fix the problem in the backend. It would face the same store-to-load forwarding issue if the transformation happens in some places but somehow not in others.

But I am not sure whether it is acceptable to add more fine-grained bitfield access transformations in the frontend. It is a tradeoff. From my experience trying your patch on x86-64, I saw some instructions saved, because with the transformation we now use shorter immediate constants, which can be folded into other instructions instead of being loaded in separate moves. But the bit operations are not saved (I may be wrong), so I have no idea whether it is critical for performance. (To provide some background: we introduced fine-grained field access because the original problem caused a serious performance issue in some critical libraries. But doing this kind of bitfield transformation in the frontend can potentially harm other applications. That is why it is off by default, and why Hal had concerns about adding more such transformations.) Do you have performance numbers on some benchmarks to justify its importance? That may help folks make a better tradeoff here.

I compiled some important libraries for X86 and MIPS64 within Chromium with clang/llvm, and compared LLVM trunk against LLVM trunk with my patch. There is a code size improvement in many libraries; here are some results:

  • X86

libframe ~1.5%
librendering ~1.2%
liblayout_2 ~0.7%
libpaint_1 ~0.65%
libglslang ~0.5%
libediting_3 ~0.5%
libcore_generated ~0.5%
libanimation_1 ~0.5%

  • MIPS64

libglslang ~3.2%
libframe ~1.5%
liblayout_0 ~1.2%
libGLESv2 ~1%
librendering ~0.7%
liblayout_1 ~0.5%
libcss_9 ~0.5%

Those are just some of the libraries that show a code size improvement with this patch. I also compiled LLVM with LLVM (with my patch) for X86 and MIPS64, and there is a code size improvement on both architectures as well. For example, within the clang executable for MIPS64, ~350 unaligned loads are removed. So, what do you think about it?

This sounds like a valid improvement. Can we have this code committed?

wmi added a comment.Jan 22 2018, 10:44 PM

Thanks for the size evaluation. I regarded the change as a natural and limited extension to the current fine-grain bitfield access mode, so I feel ok with the change. Hal, what do you think?

petarj added a comment.Mar 8 2018, 8:16 AM

Is everyone OK with the patch now?

mgrang added a subscriber: mgrang.

Added @apazos and myself as reviewers since this is something we would be interested in getting to work.

With this patch we get a ~1800-byte improvement on one of our internal codebases. I also ran spec2000/spec2006 validations (for RISCV) and there were no regressions; there were also no code size improvements seen in spec.

asb added a subscriber: asb.Apr 19 2018, 4:38 AM
asb added a comment.Apr 19 2018, 6:34 AM

Hi @spetrovic - I think Hal Finkel's earlier request to update the comments of IsBetterAsSingleFieldRun remains unaddressed. It looks like at least the comment string immediately before auto IsBetterAsSingleFieldRun needs to be updated to reflect the changed behaviour.

The documentation for the -ffine-grained-bitfield-accesses option may also need to be updated. It currently reads "Use separate accesses for bitfields with legal widths and alignments.". I think "Use separate accesses for consecutive bitfield runs with legal widths and alignments" might better reflect the behaviour after this patch. Do you agree?

Here is a test case that improves with this patch (for the RISCV target). It is now able to use halfword load/store instructions for 16-bit bitfields.

struct st {
  int a:1;
  int b:8;
  int c:11;
  int d:12;
  int e:16;
  int f:16;
  int g:16;
} S;

void foo(int x) {
  S.e = x;
}

GCC:

00000000 <foo>:
   0:	000007b7          	lui	x15,0x0
   4:	00a79223          	sh	x10,4(x15) # 4 <foo+0x4>
   8:	00008067          	jalr	x0,0(x1)

LLVM without this patch:

00000000 <foo>:
   0:	000105b7          	lui	x11,0x10
   4:	fff58593          	addi	x11,x11,-1 # ffff <foo+0xffff>
   8:	00b57533          	and	x10,x10,x11
   c:	000005b7          	lui	x11,0x0
  10:	00058593          	addi	x11,x11,0 # 0 <foo>
  14:	0045a603          	lw	x12,4(x11)
  18:	ffff06b7          	lui	x13,0xffff0
  1c:	00d67633          	and	x12,x12,x13
  20:	00a66533          	or	x10,x12,x10
  24:	00a5a223          	sw	x10,4(x11)
  28:	00008067          	jalr	x0,0(x1)

LLVM with this patch:

00000000 <foo>:
   0:	000005b7          	lui	x11,0x0
   4:	00058593          	addi	x11,x11,0 # 0 <foo>
   8:	00a59223          	sh	x10,4(x11)
   c:	00008067          	jalr	x0,0(x1)
spetrovic updated this revision to Diff 143918.Apr 25 2018, 6:45 AM

Comments addressed

apazos added a comment.May 9 2018, 8:59 PM

Thanks for updating the patch, @spetrovic.
Can we have this committed?
This patch has shown to produce code size improvements for a number of targets (Mips, X86, ARM, RISC-V).

asb added a comment.May 9 2018, 9:06 PM

Just wanted to explicitly say that I'm happy the updated patch reflects the changes to docs and comments I requested. @hfinkel - are you happy for this to land now?

In D39053#1093977, @asb wrote:

Just wanted to explicitly say that I'm happy the updated patch reflects the changes to docs and comments I requested. @hfinkel - are you happy for this to land now?

Yes, go ahead.

Some of these benefits might still be capturable using the regular method and backend improvements, but with this under the option it should be easier to identify where those opportunities are.

This revision was not accepted when it landed; it landed in state Needs Review.May 10 2018, 5:35 AM
This revision was automatically updated to reflect the committed changes.