This is an archive of the discontinued LLVM Phabricator instance.

DAG: Allow targets to override stack temp alignment
Needs ReviewPublic

Authored by arsenm on Jan 19 2017, 2:29 PM.
This revision needs review, but all specified reviewers are disabled or inactive.

Details

Reviewers
tstellarAMD
Summary

AMDGPU doesn't want/need stack alignment > 4 ever. The
DataLayout's preferred alignment cannot be lower than the ABI
required alignment, which defaults to the type size. For
stack temporaries the ABI alignment constraints do not matter,
so request align 4 in all cases to save space.

This will mitigate regressing the stack usage of every program
in a future commit.

Diff Detail

Event Timeline

arsenm created this revision.Jan 19 2017, 2:29 PM

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

I thought about that, but it seems like a risky, IR breaking change. I assume most places using this are using this to get a bigger alignment than required and assuming it is a legal ABI alignment. In codegen there is more freedom to decide what the ABI really means. I can try changing this and see what breaks.

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

I thought about that, but it seems like a risky, IR breaking change. I assume most places using this are using this to get a bigger alignment than required and assuming it is a legal ABI alignment. In codegen there is more freedom to decide what the ABI really means. I can try changing this and see what breaks.

Another issue is this would also vary per address space to some degree. For AMDGPU we would like 8 byte alignment, but only in addrspace(3)

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

I thought about that, but it seems like a risky, IR breaking change. I assume most places using this are using this to get a bigger alignment than required and assuming it is a legal ABI alignment. In codegen there is more freedom to decide what the ABI really means. I can try changing this and see what breaks.

Yea; I figured that we'd end up replacing many uses of preferred alignment with some min-of-preferred-or-abi-alignment utility function if we did that.

Another issue is this would also vary per address space to some degree. For AMDGPU we would like 8 byte alignment, but only in addrspace(3)

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so wouldn't need to be seen the in the IR).

I'm not sure it's a frontend issue? The LangRef doesn't really define what the ABI alignment of a type really means. It seems of the few uses in clang of getPrefTypeAlignment (and all the ones I've looked at so far in llvm) are all used to create allocas and loads/stores from them.

The DataLayout has getPreferredAlignment(const Global Variable *GV), which appear to return the preferred type alignment only if it has no explicit alignment set. This is only used by a handful of places.

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so wouldn't need to be seen the in the IR).

Does your backend maintain multiple stacks (e.g. some kind of "global stack" and a "local stack")? Does this depend on some non-recursion analysis? [I'm really curious what you're thinking here - not only does it affect this patch, but it also potentially affects discussions I've been having about OpenMP accelerator semantics].

I'm not sure it's a frontend issue? The LangRef doesn't really define what the ABI alignment of a type really means. It seems of the few uses in clang of getPrefTypeAlignment (and all the ones I've looked at so far in llvm) are all used to create allocas and loads/stores from them.

The DataLayout has getPreferredAlignment(const Global Variable *GV), which appear to return the preferred type alignment only if it has no explicit alignment set. This is only used by a handful of places.

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so wouldn't need to be seen the in the IR).

Does your backend maintain multiple stacks (e.g. some kind of "global stack" and a "local stack")? Does this depend on some non-recursion analysis? [I'm really curious what you're thinking here - not only does it affect this patch, but it also potentially affects discussions I've been having about OpenMP accelerator semantics].

Not really. Local memory doesn't behave like a stack and is just a block of memory allocated for the workgroup for the entire program. Accessing private memory is much slower, so in some cases for spills and small stack objects we could potentially optimize by writing them there instead. We don't support calls currently, and OpenCL explicitly forbids recursion so I haven't spent much time worrying about that.

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so wouldn't need to be seen the in the IR).

Does your backend maintain multiple stacks (e.g. some kind of "global stack" and a "local stack")? Does this depend on some non-recursion analysis? [I'm really curious what you're thinking here - not only does it affect this patch, but it also potentially affects discussions I've been having about OpenMP accelerator semantics].

Not really. Local memory doesn't behave like a stack and is just a block of memory allocated for the workgroup for the entire program. Accessing private memory is much slower, so in some cases for spills and small stack objects we could potentially optimize by writing them there instead. We don't support calls currently, and OpenCL explicitly forbids recursion so I haven't spent much time worrying about that.

Makes sense ;)

To come back to your original question, I'd think that you'd want to extend DataLayout to support per-address-space alignments in that case, just like we have per-address-space pointer sizes. You'd want this at the IR-level too.

Makes sense ;)

To come back to your original question, I'd think that you'd want to extend DataLayout to support per-address-space alignments in that case, just like we have per-address-space pointer sizes. You'd want this at the IR-level too.

I've written the patches to allow a lower preferred alignment which wasn't difficult. I think the per-address space alignment is overkill particularly for the amount of benefit. Globals/allocass are assigned the correct ABI alignment, and it would always possible to write an optimization pass to reduce the alignment when valid to do so. The one advantage I think the hook here has is that it allows handling of perverse cases that won't be explicitly listed in the datalayout (e.g. a 513 element vector).