This is an archive of the discontinued LLVM Phabricator instance.

[DFAJumpThreading] Enable DFAJT with LTO
AcceptedPublic

Authored by martong on Jan 27 2023, 7:43 AM.

Details

Summary

This patch adds the DFAJumpThreading pass into the LTO default pipeline
after the peephole optimization passes. This increases the Coremark
score when LTO is used. I've measured an increase of the score with
roughly 11% on Linux-x86_64 (and we've experienced a similar growth in
case of different RISCV architectures at Codasip).

Here is the primitive Makefile I used with the latest Coremark:

SOURCES=core_list_join.c core_main.c core_matrix.c core_state.c core_util.c posix/core_portme.c
DEFINITIONS=-DHAS_PRINTF=1 -DHAS_STDIO=1 -DTIMES=1 -DITERATIONS=1

CC=/home/gabor.marton/WORK/llvm2/build/release/bin/clang
OBJS=$(SOURCES:%.c=build_x86_64/%.o)

CFLAGS=-I./posix -I. -DCOMPILER_FLAGS=\"N/A\" \
-O3\
-flto\

LDFLAGS=-O3\
-flto\
-Wl,--plugin-opt=-enable-dfa-jump-thread\

build_x86_64/coremark.xexe: $(OBJS)
	$(CC) -o $@ $(OBJS) \
	 $(LDFLAGS)

build_x86_64/%.o: %.c
	mkdir -p build_x86_64
	mkdir -p build_x86_64/posix
	$(CC) -c -o $@ $< $(CFLAGS) $(DEFINITIONS)

Co-author: Martin Ministr (Codasip.com)
Co-author: Pavel Snobl (Codasip.com)

Diff Detail

Event Timeline

martong created this revision.Jan 27 2023, 7:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2023, 7:43 AM
martong requested review of this revision.Jan 27 2023, 7:43 AM

Thanks for the patch.

Question: why LTO? I'm no longer with the company where I worked on this and my recollections are starting to get fuzzy but AFAIR this optimization doesn't need the whole program. If so, you'll want to place this opt into the regular pipeline, where compilation of different translation units is parallelized.

xbolva00 accepted this revision.Jan 27 2023, 8:24 AM

LTO is not needed, but this patch adds this pass to the list of LTO passes - reasonable (maybe you should update some tests?)

This revision is now accepted and ready to land.Jan 27 2023, 8:24 AM

Thanks for the patch.

Question: why LTO? I'm no longer with the company where I worked on this and my recollections are starting to get fuzzy but AFAIR this optimization doesn't need the whole program. If so, you'll want to place this opt into the regular pipeline, where compilation of different translation units is parallelized.

Good question. We experienced that some of our users turn on LTO as the first step and then they experiment (and measure) by adding new extra flags via the linker. I.e. they want whole program optimization from the beginning. For instance, first iteration:

CFLAGS=...
LDFLAGS=-O3 -flto

Second iteration:

CFLAGS=...
LDFLAGS=-O3 -flto -Wl,--plugin-opt=-inline-threshold=10000

Third iteration:

CFLAGS=...
LDFLAGS=-O3 -flto -Wl,--plugin-opt=-inline-threshold=10000 -Wl,--plugin-opt=-enable-dfa-jump-thread
nikic added a subscriber: nikic.Jan 27 2023, 8:39 AM

@martong Would the optimization also work if -Xclang -enable-dfa-jump-thread were specified in CFLAGS instead? If "yes", then I'd say this is firmly in the realm of user error and we should not make this change.

martong added a comment.EditedJan 27 2023, 9:08 AM

@martong Would the optimization also work if -Xclang -enable-dfa-jump-thread were specified in CFLAGS instead? If "yes", then I'd say this is firmly in the realm of user error and we should not make this change.

Yes, that would work with -mllvm -enable-dfa-jump-thread. And the Coremark score is roughly the same (within measurement error). I agree that it would be better if the user added the DFAJT optimization to CFLAGS rather than to the LDFLAGS, but do we really expect the users to have such a subtle knowledge about the opt passes? In particular, it might be surprising that this opt does not have any effect when it is used with LTO at the moment.

@martong Would the optimization also work if -Xclang -enable-dfa-jump-thread were specified in CFLAGS instead? If "yes", then I'd say this is firmly in the realm of user error and we should not make this change.

Yes, that would work with -mllvm -enable-dfa-jump-thread. And the Coremark score is roughly the same (within measurement error). I agree that it would be better if the user added the DFAJT optimization to CFLAGS rather than to the LDFLAGS, but do we really expect the users to have such a subtle knowledge about the opt passes? In particular, it might be surprising that this opt does not have any effect when it is used with LTO at the moment.

This is an internal LLVM option, it is not intended for end-user consumption. If somebody does want to use it, they had better know what they are doing.

The DFA jump threading pass is intended to be enabled by default in the future, at which point no option will be necessary, and this will just leave behind an (to our knowledge) unnecessary pass run in the LTO pipeline.

@martong Would the optimization also work if -Xclang -enable-dfa-jump-thread were specified in CFLAGS instead? If "yes", then I'd say this is firmly in the realm of user error and we should not make this change.

Yes, that would work with -mllvm -enable-dfa-jump-thread. And the Coremark score is roughly the same (within measurement error). I agree that it would be better if the user added the DFAJT optimization to CFLAGS rather than to the LDFLAGS, but do we really expect the users to have such a subtle knowledge about the opt passes? In particular, it might be surprising that this opt does not have any effect when it is used with LTO at the moment.

This is an internal LLVM option, it is not intended for end-user consumption. If somebody does want to use it, they had better know what they are doing.

The DFA jump threading pass is intended to be enabled by default in the future, at which point no option will be necessary, and this will just leave behind an (to our knowledge) unnecessary pass run in the LTO pipeline.

Okay, that makes sense to me. There is, however, something that still bothers me. There is the old JumpThreadingPass which is added both to the function simplification pipeline and to the LTO pipeline. So, adding DFAJumpThreading to the LTO pipeline would make it consistent with the way how the old JumpThreadingPass is added to these pipelines, isn't it?

xbolva00 added inline comments.Jan 27 2023, 11:27 AM
llvm/lib/Passes/PassBuilderPipelines.cpp
1736

And CE is here as well…

Hey guys, thank you for the review so far. But I am not sure if we have a full consensus yet.
@nikic Do you think that the arguments above are reasonable for this change?

nikic added a reviewer: fhahn.Feb 1 2023, 1:36 AM
nikic added a subscriber: fhahn.

Okay, that makes sense to me. There is, however, something that still bothers me. There is the old JumpThreadingPass which is added both to the function simplification pipeline and to the LTO pipeline. So, adding DFAJumpThreading to the LTO pipeline would make it consistent with the way how the old JumpThreadingPass is added to these pipelines, isn't it?

These passes aren't really comparable, even if they both have jump threading in their name. JumpThreading is a basic control-flow optimization pass. It's going to be able to do some useful work pretty much anytime additional inlining opportunities arise. DFAJumpThreading is a pass targeting a very specific pattern -- for LTO to be useful, part of the DFA switch structure must only become visible as a result of LTO. Not impossible, but probably also not particularly common?

That said, I do find the argument compelling in abstract. However, it's not really clear to me how we decide which passes get rerun during LTO and which don't. Maybe @fhahn has some insight there.

If we run the pass during LTO, I'm not sure the proposed position is the right one (the same applies to the position of the ConstraintElimination pass next to it). The FPM this is being added to is commented as "Run a few AA driver optimizations here and now to cleanup the code". It's an FPM that gets run before FuncAttrs and GlobalsAA. There is a MainFPM after it, where most of the passes are scheduled.

Generally, after looking at it just now, I find the way the fat LTO pipeline is structured extremely confusing. Normally we structure the pipeline into a simplification pipeline that gets run as part of a CGSCC pass manager with the inliner, and an optimization pipeline that runs afterwards. However, fat LTO appears to schedule the inliner without any other passes.

Ping.
@fhahn Could you please take a look, if you have some time? Would be great to hear your thoughts.