This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Fix memory explosion when converting global variable bodies in ModuleTranslation
ClosedPublic

Authored by tungld on Apr 16 2023, 7:32 PM.

Details

Summary

There is memory explosion when converting the body or initializer region of a large global variable, e.g. a constant array.

For example, when translating a constant array of 100000 strings:

llvm.mlir.global internal constant @cats_strings() {addr_space = 0 : i32, alignment = 16 : i64} : !llvm.array<100000 x ptr<i8>> {
    %0 = llvm.mlir.undef : !llvm.array<100000 x ptr<i8>>
    %1 = llvm.mlir.addressof @om_1 : !llvm.ptr<array<1 x i8>>
    %2 = llvm.getelementptr %1[0, 0] : (!llvm.ptr<array<1 x i8>>) -> !llvm.ptr<i8>
    %3 = llvm.insertvalue %2, %0[0] : !llvm.array<100000 x ptr<i8>>
    %4 = llvm.mlir.addressof @om_2 : !llvm.ptr<array<1 x i8>>
    %5 = llvm.getelementptr %4[0, 0] : (!llvm.ptr<array<1 x i8>>) -> !llvm.ptr<i8>
    %6 = llvm.insertvalue %5, %3[1] : !llvm.array<100000 x ptr<i8>>
    %7 = llvm.mlir.addressof @om_3 : !llvm.ptr<array<1 x i8>>
    %8 = llvm.getelementptr %7[0, 0] : (!llvm.ptr<array<1 x i8>>) -> !llvm.ptr<i8>
    %9 = llvm.insertvalue %8, %6[2] : !llvm.array<100000 x ptr<i8>>
    %10 = llvm.mlir.addressof @om_4 : !llvm.ptr<array<1 x i8>>
    %11 = llvm.getelementptr %10[0, 0] : (!llvm.ptr<array<1 x i8>>) -> !llvm.ptr<i8>
    %12 = llvm.insertvalue %11, %9[3] : !llvm.array<100000 x ptr<i8>>

    ... (ignore the remaining part)
}

where @om_1, @om_2, ... are string global constants.

Each time an operation is converted to LLVM, a new constant is created.
When it comes to llvm.insertvalue, a new constant array of 100000 elements is created and the old constant array (input) is not destroyed.
This causes memory explosion. We observed that, on a system with 128 GB memory, the translation of 100000 elements got killed due to using up all the memory.
On a system with 64 GB, 65536 elements was enough to cause the translation killed.

This patch fixes the issue by checking generated constants and destroyed them if there is no use.
By the fix, the translation of 100000 elements only takes about 1.6 GB memory, and finishes without any error.

Diff Detail

Event Timeline

tungld created this revision.Apr 16 2023, 7:32 PM
Herald added a reviewer: dcaballe. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
tungld requested review of this revision.Apr 16 2023, 7:32 PM
tungld edited the summary of this revision. (Show Details)Apr 16 2023, 7:35 PM

Flang breakages may be related, please take a look.

mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
751

Is this limited to array constants, or can we also do the same for other large constants?

tungld updated this revision to Diff 514836.Apr 18 2023, 11:47 PM

Try to handle other types of constant rather than array.

tungld added a comment.EditedApr 19 2023, 2:58 AM

Flang breakages may be related, please take a look.

@ftynse do you know how the code here could have affect on flang? I couldn't find any code dependency. Or, perhaps, check-flang and check-mlir run in parallel and they share the same ConstantContext, but I am not sure.

Flang uses translation from the LLVM dialect to LLVM IR to emit code, so it hits this code path. Here's an excerpt from the stack trace:

0  flang-new       0x00005608fa5e6ee7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39
1  flang-new       0x00005608fa5e4c0e llvm::sys::RunSignalHandlers() + 238
2  flang-new       0x00005608fa5e76cf
3  libpthread.so.0 0x00007f1f021d7140
4  libc.so.6       0x00007f1f01c9ece1 gsignal + 321
5  libc.so.6       0x00007f1f01c88537 abort + 291
6  flang-new       0x00005608fa5618df
7  flang-new       0x00005608fe1b3ccd llvm::Constant::destroyConstant() + 1325
8  flang-new       0x00005608fcc454a1 mlir::LLVM::ModuleTranslation::convertGlobals() + 2865
9  flang-new       0x00005608fcc4aefe mlir::translateModuleToLLVMIR(mlir::Operation*, llvm::LLVMContext&, llvm::StringRef) + 2174
10 flang-new       0x00005608fa614a88 Fortran::frontend::CodeGenAction::generateLLVMIR() + 1192
11 flang-new       0x00005608fa61771a Fortran::frontend::CodeGenAction::executeAction() + 698
12 flang-new       0x00005608fa60b4bc Fortran::frontend::FrontendAction::execute() + 12
13 flang-new       0x00005608fa5fc802 Fortran::frontend::CompilerInstance::executeAction(Fortran::frontend::FrontendAction&) + 242
14 flang-new       0x00005608fa60ea1b Fortran::frontend::executeCompilerInvocation(Fortran::frontend::CompilerInstance*) + 2603
15 flang-new       0x00005608f9208502 fc1_main(llvm::ArrayRef<char const*>, char const*) + 594
16 flang-new       0x00005608f92054b9 main + 1081
17 libc.so.6       0x00007f1f01c89d0a __libc_start_main + 234
18 flang-new       0x00005608f9204eea _start + 42

that shows it crashes from inside ModuleTranslation::convertGlobals. Flang may be translating multiple modules in parallel, in which case the context is shared between modules and the code here hits a race condition. Or something similar, try compiling it with ASAN and/or TSAN.

We can also ask @kiranchandramohan about Flang.

We may end up needing a different approach for constants in the LLVM dialect, instead of iteratively inserting them create some sort of special attribute that would be converted at once without polluting memory. Or an "interpreter" of this global regions that creates such a form on-the-fly, without creating and deleting constants. The original motivation for having a global region was that we didn't want to duplicate IR constructs for operations and for constants like LLVM does.

mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
759

The crash was observed for this line. Is it possible that the constant is being destroyed multiple times?

The smallest example I could find from the failing tests was the following. If we run the flang testing tool tco it crashes. mlir-translate seem to pass with this. I don't clearly know what is the difference.

module attributes {fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.target_triple = "aarch64-unknown-linux-gnu"} {
  llvm.mlir.global internal @globalx() {addr_space = 0 : i32} : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)> {
    %0 = llvm.mlir.constant(0 : index) : i64
    %1 = llvm.inttoptr %0 : i64 to !llvm.ptr<i32>
    %2 = llvm.mlir.null : !llvm.ptr<i32>
    %3 = llvm.getelementptr %2[1] : (!llvm.ptr<i32>) -> !llvm.ptr<i32>
    %4 = llvm.ptrtoint %3 : !llvm.ptr<i32> to i64
    %5 = llvm.mlir.constant(9 : i32) : i32
    %6 = llvm.mlir.undef : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %7 = llvm.insertvalue %4, %6[1] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %8 = llvm.mlir.constant(20180515 : i32) : i32
    %9 = llvm.insertvalue %8, %7[2] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %10 = llvm.mlir.constant(0 : i32) : i32
    %11 = llvm.trunc %10 : i32 to i8
    %12 = llvm.insertvalue %11, %9[3] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %13 = llvm.trunc %5 : i32 to i8
    %14 = llvm.insertvalue %13, %12[4] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %15 = llvm.mlir.constant(2 : i32) : i32
    %16 = llvm.trunc %15 : i32 to i8
    %17 = llvm.insertvalue %16, %14[5] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %18 = llvm.mlir.constant(0 : i32) : i32
    %19 = llvm.trunc %18 : i32 to i8
    %20 = llvm.insertvalue %19, %17[6] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    %21 = llvm.bitcast %1 : !llvm.ptr<i32> to !llvm.ptr<i32>
    %22 = llvm.insertvalue %21, %20[0] : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
    llvm.return %22 : !llvm.struct<(ptr<i32>, i64, i32, i8, i8, i8, i8)>
  } 
  llvm.metadata @__flang_tbaa {
    llvm.tbaa_root @root_0 {id = "Flang Type TBAA Root"}
    llvm.tbaa_type_desc @type_desc_1 {id = "any access", members = {<@root_0, 0>}}
    llvm.tbaa_type_desc @type_desc_2 {id = "any data access", members = {<@type_desc_1, 0>}}
    llvm.tbaa_type_desc @type_desc_3 {id = "descriptor member", members = {<@type_desc_1, 0>}}
  } 
}
tungld added inline comments.Apr 19 2023, 7:01 AM
mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
759

The crash was observed for this line. Is it possible that the constant is being destroyed multiple times?

If hasZeroLiveUses is correct, the answer would be No.

Thank you for the small test! I will find a x86 machine to test your example with tco soon. I am using an s390x machine that is not supported by flang.

mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
759

Sorry, this should fail with mlir-translate as well. When I tried, i did not build mlir tools since I always use the check-flang target.

So you don't have to build flang. You can just work with mlir-translate.

tungld updated this revision to Diff 515255.Apr 20 2023, 1:54 AM

Fix issues with flang example

I found that the example @kiranchandramohan gave failed because of trying to destroy the results of getelementptr and ptrtoint. They are pointers, so we should not destroy them.
The current code change worked for @kiranchandramohan's example. Waiting to see if check-flang passes or not ...

We may end up needing a different approach for constants in the LLVM dialect, instead of iteratively inserting them create some sort of special attribute that would be converted at once without polluting memory. Or an "interpreter" of this global regions that creates such a form on-the-fly, without creating and deleting constants. The original motivation for having a global region was that we didn't want to duplicate IR constructs for operations and for constants like LLVM does.

@ftynse agreed. The current approach is also slow I think. For a C++ program with the same number of elements, clang -S --emit-llvm converts it to LLVM very fast.

tungld updated this revision to Diff 515310.Apr 20 2023, 6:43 AM
  • Only handle InsertValueOp
tungld updated this revision to Diff 515314.Apr 20 2023, 6:57 AM
  • Edit comments

@ftynse @kiranchandramohan Tests are passed. I am not totally understand why flang failed, so I make this patch limited to InsertValueOp that produces a constant array. At least it can fix the memory explosion issue of very large arrays of strings that is used quite often in machine learning. Thanks for your comment and help!

Please add a test that shows the transformation in mlir/test/Target/.

tungld updated this revision to Diff 515344.Apr 20 2023, 8:20 AM
  • Add a lit test for translating a string array

Please add a test that shows the transformation in mlir/test/Target/.

Done. Thanks!

Looks OK to me. Please wait for an approval from @ftynse before submitting.

This revision is now accepted and ready to land.Apr 25 2023, 3:26 AM

@ftynse could you please take a look at this patch again? Thanks!

@ftynse could you please take a look at this patch again? Thanks!

ftynse accepted this revision.May 15 2023, 2:51 AM

Sorry for delay, LGTM.

@ftynse, @kiranchandramohan Thanks for your review!

I don't have permission to push a commit to LLVM, could you please help me to land this?

My info: "Tung D. Le <tung@jp.ibm.com>"

Thanks!

It looks like this change breaks 527.cam and 627.cam4. The failure looks flaky. Sometimes it fails like this:

error: loc("/cam_v6_0/obj/msise00.f90":2272:14): unemittable constant value
error: could not emit LLVM-IR

While deleting: double %
Use still stuck around after Def is destroyed:  <badref> = insertvalue [50 x double] Segmentation fault (core dumped)

Sometimes - like this:

llvm/lib/IR/Constants.cpp:1247: static llvm::Constant* llvm::ConstantArray::getImpl(llvm::ArrayType*, llvm::ArrayRef<llvm::Constant*>): Assertion `C->getType() == Ty->getElementType() && "Wrong type in array element initializer"' failed.
Aborted (core dumped)

Reverting the commit fixes the failure. Please let me know if you need a reproducer (i.e. you are not already looking at it).

@vzakhari, do you have an example in MLIR that we can reproduce the issue. I am on an s390x machine where I cannot build flang. Thanks!

@vzakhari, do you have an example in MLIR that we can reproduce the issue. I am on an s390x machine where I cannot build flang. Thanks!

I created https://github.com/llvm/llvm-project/issues/62802. Please take a look.

@tungld Shall I revert this for now?

@tungld Shall I revert this for now?

Yes, please. I am not sure why it caused the issue for now.