This is an archive of the discontinued LLVM Phabricator instance.

[dwarf] Emit a DIGlobalVariable for constant strings.
ClosedPublic

Authored by hctim on Apr 11 2022, 12:02 PM.

Details

Summary

An upcoming patch will extend llvm-symbolizer to provide the source line
information for global variables. The goal is to move AddressSanitizer
off of internal debug info for symbolization onto the DWARF standard
(and doing a clean-up in the process). Currently, ASan reports the line
information for constant strings if a memory safety bug happens around
them. We want to keep this behaviour, so we need to emit debuginfo for
these variables as well.

Diff Detail

Event Timeline

hctim created this revision.Apr 11 2022, 12:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 12:02 PM
hctim requested review of this revision.Apr 11 2022, 12:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 12:02 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

This seems like it would significantly introduce debug info size for at least some kinds of code - have you done any size measurements of this change?

What does the resulting DWARF look like?

hctim added a comment.Apr 15 2022, 2:33 PM

This seems like it would significantly introduce debug info size for at least some kinds of code - have you done any size measurements of this change?

With -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=On:

  • Before the patch: clang was 1247264656 bytes
  • After the patch: clang was 1270191696 bytes (or, a 1.84% increase)

What does the resulting DWARF look like?

Each string DI ends up looking like a regular global variable (i.e. a DW_TAG_variable entry in .debug_info):

$ llvm-dwarfdump --debug-info
0x00000032:   DW_TAG_variable
                DW_AT_name      (".str")
                DW_AT_type      (0x0000003e ".str")
                DW_AT_decl_file ("/tmp/file.c")
                DW_AT_decl_line (3)
                DW_AT_location  (DW_OP_addrx 0x1)
                DW_AT_linkage_name      (".str")

(and obviously an entry in the .debug_addr referenced by the DW_AT_location at index 0x1)

hctim added a comment.Apr 15 2022, 2:45 PM
section : increase in bytes for clang built with full debuginfo : %% of total binary size
.debug_loclists		317782		0.0250%
.debug_abbrev		88590		0.0070%
.debug_info		10070834	0.7929%
.debug_rnglists		143320		0.0113%
.debug_str_offsets	1090516		0.0859%
.debug_str		7789440		0.6132%
.debug_addr		3611552		0.2843%
.debug_line		-187964		-0.0148%
.debug_line_str		2043		0.0002%

This seems like it would significantly introduce debug info size for at least some kinds of code - have you done any size measurements of this change?

With -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=On:

  • Before the patch: clang was 1247264656 bytes
  • After the patch: clang was 1270191696 bytes (or, a 1.84% increase)

That does seem a bit unfortunate/enough to be worth more examination.

What does the resulting DWARF look like?

Each string DI ends up looking like a regular global variable (i.e. a DW_TAG_variable entry in .debug_info):

$ llvm-dwarfdump --debug-info
0x00000032:   DW_TAG_variable
                DW_AT_name      (".str")
                DW_AT_type      (0x0000003e ".str")
                DW_AT_decl_file ("/tmp/file.c")
                DW_AT_decl_line (3)
                DW_AT_location  (DW_OP_addrx 0x1)
                DW_AT_linkage_name      (".str")

(and obviously an entry in the .debug_addr referenced by the DW_AT_location at index 0x1)

What's the DW_AT_type referring to? (why is there a type called ".str"?)

For your particular purposes, it seems like the name/type/linkage name could all be omitted & that might be OK, since these entities aren't actually named in the source?

What's the DW_AT_type referring to? (why is there a type called ".str"?)

I'd expect it to point to a base type, in this case character string. A typeless variable with a location would be pretty weird.
But I agree the name/linkage_name would be unnecessary.

hctim added a comment.Apr 20 2022, 4:19 PM

What's the DW_AT_type referring to? (why is there a type called ".str"?)

For your particular purposes, it seems like the name/type/linkage name could all be omitted & that might be OK, since these entities aren't actually named in the source?

I'd expect it to point to a base type, in this case character string. A typeless variable with a location would be pretty weird.
But I agree the name/linkage_name would be unnecessary.

I've updated the patch to:

  1. Omit the linkage name,
  2. Omit the display name (including associated changes to allow this), and
  3. Used the char[SIZE] type.

#3 could be optimised for file size further by changing to a const char* type, but I:

  1. Can't see an obvious way to synthesize a const char* QualType to pass to createGlobalVariableExpression, and
  2. Think it might actually be useful to have the correct type here (as it now also describes the size of the variable).

Here's the new dwarfdump:

0x00000065:   DW_TAG_variable                                                                               
                DW_AT_type      (0x0000006f "char [11]")                                                    
                DW_AT_decl_file ("/tmp/file.c")                                                             
                DW_AT_decl_line (4)                                                                         
                DW_AT_location  (DW_OP_addrx 0x3)                                                           
                                                                                                            
0x0000006f:   DW_TAG_array_type                                                                             
                DW_AT_type      (0x00000048 "char")                                                         
                                                                                                            
0x00000074:     DW_TAG_subrange_type                                                                        
                  DW_AT_type    (0x0000004c "__ARRAY_SIZE_TYPE__")                                          
                  DW_AT_count   (0x0b)                                                                      
                                                                                                            
0x0000007a:     NULL

Apparently with -DCMAKE_BUILD_TYPE='RelWithDebInfo', bin/clang-15 went from 1373029152B to 1373030960B (~0%), which seems suspiciously low to me. I double checked the builds twice, and they seem to be right...

hctim updated this revision to Diff 424055.Apr 20 2022, 4:20 PM

Patch update.

Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2022, 4:20 PM
hctim added a comment.Apr 20 2022, 4:38 PM

Confirmed again with -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=On:

1510883864 (new) - 1510881800 (old) = 2064 bytes.

hctim added a comment.Apr 20 2022, 7:34 PM

Yeah, as I drove away from my desk I realized my mistake. Will get new accurate numbers.

1263559896 (new) - 1248280328 (old) = 15279568 B (1.22% increase). Seems more reasonable to me.

hctim added a comment.Apr 21 2022, 1:31 PM

I just did an experiment with the following patch that punts everything to const char* types to try and get better folding onto a single type:

The change didn't make a significant difference, with the clang binary size of 1263433512 (126384B reduction, or 1.224% -> 1.214%). I think it's better to keep the extra size diagnostic.

diff --git a/clang/lib/CodeGen/CGDebugInfo.cpp b/clang/lib/CodeGen/CGDebugInfo.cpp
index f82c0e357bc3..175a5fec2989 100644
--- a/clang/lib/CodeGen/CGDebugInfo.cpp
+++ b/clang/lib/CodeGen/CGDebugInfo.cpp
@@ -5454,11 +5454,23 @@ void CGDebugInfo::AddStringLiteralDebugInfo(llvm::GlobalVariable *GV,
   if (!PLoc.isValid())
     return;
 
+  // Official type of the StringLiteral is `char[N]`. This means for each
+  // compilation unit, we have multiple DW_*_type entries, one set for each size
+  // of constant string. This level of precision isn't necessary, so simply
+  // synthesize a `const char*` type so it can be reused across all constant
+  // strings.
+  const auto &Context = CGM.getContext();
+  QualType Type = Context.CharTy;
+  Type.addConst();
+  Type = Context.getPointerType(Type);
+
+  // Similar to above, elide the name variables (as there isn't really anything
+  // useful we can add) to save binary size.
   llvm::DIFile *File = getOrCreateFile(Loc);
   llvm::DIGlobalVariableExpression *Debug =
       DBuilder.createGlobalVariableExpression(
           nullptr, StringRef(), StringRef(), getOrCreateFile(Loc),
-          getLineNumber(Loc), getOrCreateType(S->getType(), File), true);
+          getLineNumber(Loc), getOrCreateType(Type, File), true);
   GV->addDebugInfo(Debug);
 }

Size still seems moderately concerning, but might be acceptable.

Got a summary of what the DWARF looks like now (without names)? (maybe there's something else we can strip/optimize)

& how many of these descriptions get added to the debug info?

Numbers for Split DWARF may be helpful too - given this'll add an extra address/relocation for every string literal, it might make object size (specifically unlinked object size where relocations are expensive/plentiful) significantly larger in problematic ways.

clang/test/CodeGen/debug-info-variables.c
2

Presumably named variables are already tested elsewhere - so I probably wouldn't test them here, and would give this file a more specific name related to string literals. Alternatively/perhaps the other thing to do would be to find some test that already covers global variable debug info, and test string literals there?

hctim added a comment.Apr 26 2022, 4:08 PM

summary of DWARF:
& how many of these descriptions get added to the debug info?

afaict, there is now:
1x .debug_addr entry for each string
1x. debug_info DW_TAG_variable for each string
1x. DW_TAG_array_type + DW_TAG_subrange_type for each unique sizeof(string)

i tried to measure if there's other bits laying around that could be optimised. i thought briefly about diffing the llvm-dwarfdump for the before/after for clang, but as the dumpfiles reached 20gb, rethought that decision. the dwarfdump for the clang/test/CodeGen/debug-info-variables.c dwo is below.

Numbers for Split DWARF may be helpful too - given this'll add an extra address/relocation for every string literal, it might make object size (specifically unlinked object size where relocations are expensive/plentiful) significantly larger in problematic ways.

sorry, i don't understand why split-dwarf means this requires an additional relocation (i'm not really sure what split-dwarf is outside of just putting the dwarf in a separate file, but don't see why that would change relocations). i made a quick dwarfdump diff on clang/test/CodeGen/debug-info-variables.c (with split-dwarf):

sections old:

[Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
[ 2] .debug_str.dwo    PROGBITS        0000000000000000 000040 0000eb 01 MSE  0   0  1
[ 3] .debug_str_offsets.dwo PROGBITS   0000000000000000 00012b 00002c 00   E  0   0  1
[ 4] .debug_info.dwo   PROGBITS        0000000000000000 000157 000077 00   E  0   0  1
[ 5] .debug_abbrev.dwo PROGBITS        0000000000000000 0001ce 000091 00   E  0   0  1

sections new:

[Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
[ 3] .debug_str.dwo    PROGBITS        0000000000000000 000078 0000ff 01 MSE  0   0  1
[ 2] .debug_str_offsets.dwo PROGBITS   0000000000000000 000040 000038 00   E  0   0  1
[ 4] .debug_info.dwo   PROGBITS        0000000000000000 000177 000092 00   E  0   0  1
[ 5] .debug_abbrev.dwo PROGBITS        0000000000000000 000209 0000aa 00   E  0   0  1

so .debug_string += 0x14, .debug_str_offsets += 0xc, .debug_info += 0x1b and .debug_abbrev += 0x19.

as before, the DW_TAG_array_type + DW_TAG_subrange_type would be amortised across strings with the same size.

unfortunately, i don't see any further places to optimise (except for the full const char* amortization, which as in a previous comment, didn't make much of an improvement for the entire clang binary)

diff --git a/tmp/dwo/dwo-dump b/dwo-dump
index e0fdd77..6415086 100644
--- a/tmp/dwo/dwo-dump
+++ b/dwo-dump
@@ -3,14 +3,13 @@ debug-info-variables.dwo:	file format elf64-x86-64
 .debug_abbrev.dwo contents:
 Abbrev table for offset: 0x00000000
 [1] DW_TAG_compile_unit	DW_CHILDREN_yes
-	DW_AT_producer	DW_FORM_GNU_str_index
+	DW_AT_producer	DW_FORM_strx1
 	DW_AT_language	DW_FORM_data2
-	DW_AT_name	DW_FORM_GNU_str_index
-	DW_AT_GNU_dwo_name	DW_FORM_GNU_str_index
-	DW_AT_GNU_dwo_id	DW_FORM_data8
+	DW_AT_name	DW_FORM_strx1
+	DW_AT_dwo_name	DW_FORM_strx1
 
 [2] DW_TAG_variable	DW_CHILDREN_no
-	DW_AT_name	DW_FORM_GNU_str_index
+	DW_AT_name	DW_FORM_strx1
 	DW_AT_type	DW_FORM_ref4
 	DW_AT_external	DW_FORM_flag_present
 	DW_AT_decl_file	DW_FORM_data1
@@ -18,155 +17,194 @@ Abbrev table for offset: 0x00000000
 	DW_AT_location	DW_FORM_exprloc
 
 [3] DW_TAG_base_type	DW_CHILDREN_no
-	DW_AT_name	DW_FORM_GNU_str_index
+	DW_AT_name	DW_FORM_strx1
 	DW_AT_encoding	DW_FORM_data1
 	DW_AT_byte_size	DW_FORM_data1
 
-[4] DW_TAG_subprogram	DW_CHILDREN_no
-	DW_AT_low_pc	DW_FORM_GNU_addr_index
+[4] DW_TAG_variable	DW_CHILDREN_no
+	DW_AT_type	DW_FORM_ref4
+	DW_AT_decl_file	DW_FORM_data1
+	DW_AT_decl_line	DW_FORM_data1
+	DW_AT_location	DW_FORM_exprloc
+
+[5] DW_TAG_array_type	DW_CHILDREN_yes
+	DW_AT_type	DW_FORM_ref4
+
+[6] DW_TAG_subrange_type	DW_CHILDREN_no
+	DW_AT_type	DW_FORM_ref4
+	DW_AT_count	DW_FORM_data1
+
+[7] DW_TAG_base_type	DW_CHILDREN_no
+	DW_AT_name	DW_FORM_strx1
+	DW_AT_byte_size	DW_FORM_data1
+	DW_AT_encoding	DW_FORM_data1
+
+[8] DW_TAG_subprogram	DW_CHILDREN_no
+	DW_AT_low_pc	DW_FORM_addrx
 	DW_AT_high_pc	DW_FORM_data4
 	DW_AT_frame_base	DW_FORM_exprloc
-	DW_AT_name	DW_FORM_GNU_str_index
+	DW_AT_name	DW_FORM_strx1
 	DW_AT_decl_file	DW_FORM_data1
 	DW_AT_decl_line	DW_FORM_data1
 	DW_AT_type	DW_FORM_ref4
 	DW_AT_external	DW_FORM_flag_present
 
-[5] DW_TAG_subprogram	DW_CHILDREN_yes
-	DW_AT_low_pc	DW_FORM_GNU_addr_index
+[9] DW_TAG_subprogram	DW_CHILDREN_yes
+	DW_AT_low_pc	DW_FORM_addrx
 	DW_AT_high_pc	DW_FORM_data4
 	DW_AT_frame_base	DW_FORM_exprloc
-	DW_AT_name	DW_FORM_GNU_str_index
+	DW_AT_name	DW_FORM_strx1
 	DW_AT_decl_file	DW_FORM_data1
 	DW_AT_decl_line	DW_FORM_data1
 	DW_AT_prototyped	DW_FORM_flag_present
 	DW_AT_type	DW_FORM_ref4
 	DW_AT_external	DW_FORM_flag_present
 
-[6] DW_TAG_formal_parameter	DW_CHILDREN_no
+[10] DW_TAG_formal_parameter	DW_CHILDREN_no
 	DW_AT_location	DW_FORM_exprloc
-	DW_AT_name	DW_FORM_GNU_str_index
+	DW_AT_name	DW_FORM_strx1
 	DW_AT_decl_file	DW_FORM_data1
 	DW_AT_decl_line	DW_FORM_data1
 	DW_AT_type	DW_FORM_ref4
 
-[7] DW_TAG_variable	DW_CHILDREN_no
+[11] DW_TAG_variable	DW_CHILDREN_no
 	DW_AT_location	DW_FORM_exprloc
-	DW_AT_name	DW_FORM_GNU_str_index
+	DW_AT_name	DW_FORM_strx1
 	DW_AT_decl_file	DW_FORM_data1
 	DW_AT_decl_line	DW_FORM_data1
 	DW_AT_type	DW_FORM_ref4
 
-[8] DW_TAG_pointer_type	DW_CHILDREN_no
+[12] DW_TAG_pointer_type	DW_CHILDREN_no
 	DW_AT_type	DW_FORM_ref4
 
-[9] DW_TAG_const_type	DW_CHILDREN_no
+[13] DW_TAG_const_type	DW_CHILDREN_no
 	DW_AT_type	DW_FORM_ref4
 
 
 .debug_info.dwo contents:
-0x00000000: Compile Unit: length = 0x00000073, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x00000077)
+0x00000000: Compile Unit: length = 0x0000008e, format = DWARF32, version = 0x0005, unit_type = DW_UT_split_compile, abbr_offset = 0x0000, addr_size = 0x08, DWO_id = 0xa7a0aceb112fa998 (next unit at 0x00000092)
 
-0x0000000b: DW_TAG_compile_unit
-              DW_AT_producer	("clang version 14.0.0 (https://github.com/llvm/llvm-project.git 5357a98c823a5262814e269db265b5d8e1f2c4f2)")
+0x00000014: DW_TAG_compile_unit
+              DW_AT_producer	("clang version 15.0.0 (https://github.com/llvm/llvm-project.git 4c7fff8247ec5485701d4e04ea45b9fe399e1c5a)")
               DW_AT_language	(DW_LANG_C99)
               DW_AT_name	("/usr/local/google/home/mitchp/llvm/clang/test/CodeGen/debug-info-variables.c")
-              DW_AT_GNU_dwo_name	("debug-info-variables.dwo")
-              DW_AT_GNU_dwo_id	(0x02255219cf8b78f3)
+              DW_AT_dwo_name	("debug-info-variables.dwo")
 
-0x00000019:   DW_TAG_variable
+0x0000001a:   DW_TAG_variable
                 DW_AT_name	("global")
-                DW_AT_type	(0x00000024 "int")
+                DW_AT_type	(0x00000025 "int")
                 DW_AT_external	(true)
                 DW_AT_decl_file	(0x01)
                 DW_AT_decl_line	(4)
-                DW_AT_location	(DW_OP_GNU_addr_index 0x0)
+                DW_AT_location	(DW_OP_addrx 0x0)
 
-0x00000024:   DW_TAG_base_type
+0x00000025:   DW_TAG_base_type
                 DW_AT_name	("int")
                 DW_AT_encoding	(DW_ATE_signed)
                 DW_AT_byte_size	(0x04)
 
-0x00000028:   DW_TAG_subprogram
-                DW_AT_low_pc	(indexed (00000001) address = <unresolved>)
+0x00000029:   DW_TAG_variable
+                DW_AT_type	(0x00000033 "char [11]")
+                DW_AT_decl_file	(0x01)
+                DW_AT_decl_line	(8)
+                DW_AT_location	(DW_OP_addrx 0x1)
+
+0x00000033:   DW_TAG_array_type
+                DW_AT_type	(0x0000003f "char")
+
+0x00000038:     DW_TAG_subrange_type
+                  DW_AT_type	(0x00000043 "__ARRAY_SIZE_TYPE__")
+                  DW_AT_count	(0x0b)
+
+0x0000003e:     NULL
+
+0x0000003f:   DW_TAG_base_type
+                DW_AT_name	("char")
+                DW_AT_encoding	(DW_ATE_signed_char)
+                DW_AT_byte_size	(0x01)
+
+0x00000043:   DW_TAG_base_type
+                DW_AT_name	("__ARRAY_SIZE_TYPE__")
+                DW_AT_byte_size	(0x08)
+                DW_AT_encoding	(DW_ATE_unsigned)
+
+0x00000047:   DW_TAG_subprogram
+                DW_AT_low_pc	(indexed (00000002) address = <unresolved>)
                 DW_AT_high_pc	(0x0000000d)
                 DW_AT_frame_base	(DW_OP_reg6 RBP)
                 DW_AT_name	("s")
                 DW_AT_decl_file	(0x01)
                 DW_AT_decl_line	(7)
-                DW_AT_type	(0x00000068 "const char *")
+                DW_AT_type	(0x00000087 "const char *")
                 DW_AT_external	(true)
 
-0x00000037:   DW_TAG_subprogram
-                DW_AT_low_pc	(indexed (00000002) address = <unresolved>)
+0x00000056:   DW_TAG_subprogram
+                DW_AT_low_pc	(indexed (00000003) address = <unresolved>)
                 DW_AT_high_pc	(0x00000018)
                 DW_AT_frame_base	(DW_OP_reg6 RBP)
                 DW_AT_name	("sum")
                 DW_AT_decl_file	(0x01)
                 DW_AT_decl_line	(14)
                 DW_AT_prototyped	(true)
-                DW_AT_type	(0x00000024 "int")
+                DW_AT_type	(0x00000025 "int")
                 DW_AT_external	(true)
 
-0x00000046:     DW_TAG_formal_parameter
+0x00000065:     DW_TAG_formal_parameter
                   DW_AT_location	(DW_OP_fbreg -4)
                   DW_AT_name	("p")
                   DW_AT_decl_file	(0x01)
                   DW_AT_decl_line	(14)
-                  DW_AT_type	(0x00000024 "int")
+                  DW_AT_type	(0x00000025 "int")
 
-0x00000051:     DW_TAG_formal_parameter
+0x00000070:     DW_TAG_formal_parameter
                   DW_AT_location	(DW_OP_fbreg -8)
                   DW_AT_name	("q")
                   DW_AT_decl_file	(0x01)
                   DW_AT_decl_line	(14)
-                  DW_AT_type	(0x00000024 "int")
+                  DW_AT_type	(0x00000025 "int")
 
-0x0000005c:     DW_TAG_variable
+0x0000007b:     DW_TAG_variable
                   DW_AT_location	(DW_OP_fbreg -12)
                   DW_AT_name	("r")
                   DW_AT_decl_file	(0x01)
                   DW_AT_decl_line	(15)
-                  DW_AT_type	(0x00000024 "int")
-
-0x00000067:     NULL
+                  DW_AT_type	(0x00000025 "int")
 
-0x00000068:   DW_TAG_pointer_type
-                DW_AT_type	(0x0000006d "const char")
+0x00000086:     NULL
 
-0x0000006d:   DW_TAG_const_type
-                DW_AT_type	(0x00000072 "char")
+0x00000087:   DW_TAG_pointer_type
+                DW_AT_type	(0x0000008c "const char")
 
-0x00000072:   DW_TAG_base_type
-                DW_AT_name	("char")
-                DW_AT_encoding	(DW_ATE_signed_char)
-                DW_AT_byte_size	(0x01)
+0x0000008c:   DW_TAG_const_type
+                DW_AT_type	(0x0000003f "char")
 
-0x00000076:   NULL
+0x00000091:   NULL
 
 .debug_str.dwo contents:
 0x00000000: "global"
 0x00000007: "int"
-0x0000000b: "s"
-0x0000000d: "char"
-0x00000012: "sum"
-0x00000016: "p"
-0x00000018: "q"
-0x0000001a: "r"
-0x0000001c: "clang version 14.0.0 (https://github.com/llvm/llvm-project.git 5357a98c823a5262814e269db265b5d8e1f2c4f2)"
-0x00000085: "/usr/local/google/home/mitchp/llvm/clang/test/CodeGen/debug-info-variables.c"
-0x000000d2: "debug-info-variables.dwo"
+0x0000000b: "char"
+0x00000010: "__ARRAY_SIZE_TYPE__"
+0x00000024: "s"
+0x00000026: "sum"
+0x0000002a: "p"
+0x0000002c: "q"
+0x0000002e: "r"
+0x00000030: "clang version 15.0.0 (https://github.com/llvm/llvm-project.git 4c7fff8247ec5485701d4e04ea45b9fe399e1c5a)"
+0x00000099: "/usr/local/google/home/mitchp/llvm/clang/test/CodeGen/debug-info-variables.c"
+0x000000e6: "debug-info-variables.dwo"
 
 .debug_str_offsets.dwo contents:
-0x00000000: Contribution size = 44, Format = DWARF32, Version = 4
-0x00000000: 00000000 "global"
-0x00000004: 00000007 "int"
-0x00000008: 0000000b "s"
-0x0000000c: 0000000d "char"
-0x00000010: 00000012 "sum"
-0x00000014: 00000016 "p"
-0x00000018: 00000018 "q"
-0x0000001c: 0000001a "r"
-0x00000020: 0000001c "clang version 14.0.0 (https://github.com/llvm/llvm-project.git 5357a98c823a5262814e269db265b5d8e1f2c4f2)"
-0x00000024: 00000085 "/usr/local/google/home/mitchp/llvm/clang/test/CodeGen/debug-info-variables.c"
-0x00000028: 000000d2 "debug-info-variables.dwo"
+0x00000000: Contribution size = 52, Format = DWARF32, Version = 5
+0x00000008: 00000000 "global"
+0x0000000c: 00000007 "int"
+0x00000010: 0000000b "char"
+0x00000014: 00000010 "__ARRAY_SIZE_TYPE__"
+0x00000018: 00000024 "s"
+0x0000001c: 00000026 "sum"
+0x00000020: 0000002a "p"
+0x00000024: 0000002c "q"
+0x00000028: 0000002e "r"
+0x0000002c: 00000030 "clang version 15.0.0 (https://github.com/llvm/llvm-project.git 4c7fff8247ec5485701d4e04ea45b9fe399e1c5a)"
+0x00000030: 00000099 "/usr/local/google/home/mitchp/llvm/clang/test/CodeGen/debug-info-variables.c"
+0x00000034: 000000e6 "debug-info-variables.dwo"
hctim marked an inline comment as done.Apr 26 2022, 4:10 PM
hctim added inline comments.
clang/test/CodeGen/debug-info-variables.c
2

I found debug-info-global-constant.c and debug-info-static.c, but neither of them test DILocalVariables or strings.

This test was cargo-culted from debug-info-gline-tables-only.c which does the inverse :).

I was hoping that this test would become the main "simple global variables" test :)

summary of DWARF:
& how many of these descriptions get added to the debug info?

afaict, there is now:
1x .debug_addr entry for each string
1x. debug_info DW_TAG_variable for each string
1x. DW_TAG_array_type + DW_TAG_subrange_type for each unique sizeof(string)

& you ran an experiment where the type was omitted, but it didn't save much space?

i tried to measure if there's other bits laying around that could be optimised. i thought briefly about diffing the llvm-dwarfdump for the before/after for clang, but as the dumpfiles reached 20gb, rethought that decision. the dwarfdump for the clang/test/CodeGen/debug-info-variables.c dwo is below.

Yeah, diffing DWARF isn't super practical with all the offsets, etc.

Numbers for Split DWARF may be helpful too - given this'll add an extra address/relocation for every string literal, it might make object size (specifically unlinked object size where relocations are expensive/plentiful) significantly larger in problematic ways.

sorry, i don't understand why split-dwarf means this requires an additional relocation (i'm not really sure what split-dwarf is outside of just putting the dwarf in a separate file, but don't see why that would change relocations).

Sorry, to expound more: Split DWARF is intended to reduce the size of object files processed by the linker, and the resulting linked binary. Doing things like adding more pure type information which only goes in the .dwo file has no impact on .o/executable size - but this change adds another relocation for every string literal in the binary, which does grow the debug info that remains in the .o file (the .debug_addr table in particular) and relocations in particular are quite expensive (since they aren't compressed by -gz and take 3x the bytes of the actual address entry)

We should probably do some measurements inside google on these things, since the object size metrics are quite important to linkability in bazel/blaze (go/binary-size-analysis internally has some info - happy to help with measurements, etc).

i made a quick dwarfdump diff on clang/test/CodeGen/debug-info-variables.c (with split-dwarf):

sections old:

[Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
[ 2] .debug_str.dwo    PROGBITS        0000000000000000 000040 0000eb 01 MSE  0   0  1
[ 3] .debug_str_offsets.dwo PROGBITS   0000000000000000 00012b 00002c 00   E  0   0  1
[ 4] .debug_info.dwo   PROGBITS        0000000000000000 000157 000077 00   E  0   0  1
[ 5] .debug_abbrev.dwo PROGBITS        0000000000000000 0001ce 000091 00   E  0   0  1

sections new:

[Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
[ 3] .debug_str.dwo    PROGBITS        0000000000000000 000078 0000ff 01 MSE  0   0  1
[ 2] .debug_str_offsets.dwo PROGBITS   0000000000000000 000040 000038 00   E  0   0  1
[ 4] .debug_info.dwo   PROGBITS        0000000000000000 000177 000092 00   E  0   0  1
[ 5] .debug_abbrev.dwo PROGBITS        0000000000000000 000209 0000aa 00   E  0   0  1

so .debug_string += 0x14, .debug_str_offsets += 0xc, .debug_info += 0x1b and .debug_abbrev += 0x19.

This data, but for Clang, might be informative - for this small sample it's hard to separate the constant overheads from the linear ones. (eg: the debug_str growth I'd expect to be constant (just the name of the char type and the name of the size type), but the debug_info growth is presumably not constant)

hctim updated this revision to Diff 426172.Apr 29 2022, 3:17 PM
hctim marked an inline comment as done.

Update test, rebase off of string debuginfo.

hctim updated this revision to Diff 426175.Apr 29 2022, 3:19 PM

(restoring after uploading the diff for D123538 here accidentally)

Tried this patch on some internal targets and metrics and it seems actually quite reasonable.

Tested with split DWARF, on Linux, with compressed debug info in .o/.dwo files, uncompressed in exe/dwp files. With -O0 and -O3, roughly speaking. Testing Clang as well as testing a large Google binary only at -O3:
Clang:
-O0:
.o: +0.44%
.dwo: +1.55%
exe: 0.65%
dwp: 1.42%
-O3:
.o: 0.90%
.dwo: 0.26%
exe: 0.33%
dwp: 0.41%

Large Google Binary -O3:
.o: 0.32%
.dwo: 0.16%
exe: 0.19%
dwp: 0.22%

These are % of total size, not % of debug info contribution - so it's growing the debug info by more than that, but the total impact doesn't seem to be too bad.

@rnk @aprantl @probinson: this looks plausible to me to take as a general change to the emitted DWARF, but I'm open to ideas/suggestions if folks find this to be too much growth for not enough benefit & that it should be behind a flag?

hctim added a comment.May 10 2022, 2:09 PM

@rnk @aprantl @probinson: this looks plausible to me to take as a general change to the emitted DWARF, but I'm open to ideas/suggestions if folks find this to be too much growth for not enough benefit & that it should be behind a flag?

Friendly ping @rnk @aprantl @probinson, do you folks have any feedback?

rnk added a comment.May 10 2022, 3:23 PM

This seems reasonable, but I worry about the consequences of creating lots of unnamed global variables. What will gdb do with so many unnamed globals? What will the PDB linker do with all these unnamed globals? I can't answer these questions, and you're welcome to try and find out, but I predict there will be problems, so just be aware of that possibility.

Looking at the S_GDATA32 record format, there is no place for the file & line info:
https://github.com/microsoft/microsoft-pdb/blob/master/include/cvinfo.h#L3629
It's just type info & name, really, so there's not much point in creating these when emitting codeview. It's probably best to filter these unnamed globals out in AsmPrinter/CodeViewDebug.cpp rather than changing the IR clang generates.

hctim added a comment.May 11 2022, 7:07 PM

This seems reasonable, but I worry about the consequences of creating lots of unnamed global variables. What will gdb do with so many unnamed globals? What will the PDB linker do with all these unnamed globals? I can't answer these questions, and you're welcome to try and find out, but I predict there will be problems, so just be aware of that possibility.

Not sure about PDB. I did run a quick test with gdb, and very unscientifically, didn't notice any difference in usability or errors between pre- and post-this patch on a clang invocation under gdb.

Looking at the S_GDATA32 record format, there is no place for the file & line info:
https://github.com/microsoft/microsoft-pdb/blob/master/include/cvinfo.h#L3629
It's just type info & name, really, so there's not much point in creating these when emitting codeview. It's probably best to filter these unnamed globals out in AsmPrinter/CodeViewDebug.cpp rather than changing the IR clang generates.

Sorry, I know pretty nothing about PDB. Are you thinking that these DIGlobalVariables should be filtered out by this patch when outputting PDB, or a subsequent patch? I don't even have a Windows machine, so I wouldn't be at all confident in writing such a patch.

rnk added a comment.May 12 2022, 11:10 AM

Not sure about PDB. I did run a quick test with gdb, and very unscientifically, didn't notice any difference in usability or errors between pre- and post-this patch on a clang invocation under gdb.

That's good enough for me. Maybe there will be problems down the road, but the best way to find out is probably to land the change.

Sorry, I know pretty nothing about PDB. Are you thinking that these DIGlobalVariables should be filtered out by this patch when outputting PDB, or a subsequent patch? I don't even have a Windows machine, so I wouldn't be at all confident in writing such a patch.

That's fine, a Windows machine shouldn't be necessary. This patch has to change the AsmPrinter for DWARF, so it makes sense that it should do the same for CodeView. I think it should be sufficient to continue this loop when a global variable display name is empty:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L3175

It will need a test, and the tests at llvm/test/DebugInfo/COFF/global* should be good templates. Basically, have a file with two globals, one named and unnamed, and use FileCheck to confirm that there is only one GDATA record.

hctim added a comment.May 12 2022, 7:55 PM

That's fine, a Windows machine shouldn't be necessary. This patch has to change the AsmPrinter for DWARF, so it makes sense that it should do the same for CodeView. I think it should be sufficient to continue this loop when a global variable display name is empty:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/AsmPrinter/CodeViewDebug.cpp#L3175

It will need a test, and the tests at llvm/test/DebugInfo/COFF/global* should be good templates. Basically, have a file with two globals, one named and unnamed, and use FileCheck to confirm that there is only one GDATA record.

Thanks for the pointer to the test. That was very helpful in figuring out how to invoke clang to produce CodeView/COFF. Funnily enough, it found a bug where I didn't realise I had to relax the invariant that a DIGlobalVariable needs a name in the LL parser.

I wonder whether it's feasible to re-use FunctionLineTable to encode line:col for data symbols. A hack for sure, and left as an exercise for the reader.

hctim updated this revision to Diff 429121.May 12 2022, 7:56 PM

update patch

hctim updated this revision to Diff 429276.May 13 2022, 9:41 AM

Remove test based on removed invariant (DIGlobalVariables no longer need a name).

aprantl added inline comments.May 13 2022, 10:50 AM
clang/lib/CodeGen/CGDebugInfo.h
536

I would appreciate a comment explaining the motivation behind doing this here.

clang/test/CodeGen/debug-info-variables.c
7

would be nice to also check the the type is char*, etc.

I think this mostly looks good. I left comment about the test inline.

rnk accepted this revision.May 16 2022, 9:16 AM

Looks good to me. I generally agree with Adrian's suggestions.

llvm/test/DebugInfo/COFF/global-no-strings.ll
1 ↗(On Diff #429276)

test lgtm, thanks!

This revision is now accepted and ready to land.May 16 2022, 9:16 AM
ormris removed a subscriber: ormris.May 16 2022, 10:48 AM
hctim marked 3 inline comments as done.May 16 2022, 11:33 AM
hctim updated this revision to Diff 429799.May 16 2022, 11:35 AM

Touch ups from comments.

aprantl accepted this revision.May 16 2022, 3:29 PM
dblaikie accepted this revision.May 16 2022, 4:02 PM

Seems good - if folks find the growth too much, or it creates weird perf issues otherwise, we can discuss that if/when it comes up.

This revision was landed with ongoing or failed builds.May 16 2022, 4:52 PM
This revision was automatically updated to reflect the committed changes.

Hi, we're seeing a breakage in Fuchsia's clang CI for x64 windows that I think is related to this patch.

https://luci-milo.appspot.com/ui/p/fuchsia/builders/toolchain.ci/clang-windows-x64/b8813962058917346337/overview

We see a test failure in Clang :: CodeGen/debug-info-variables.c, which this patch modifies.

If this will be hard to fix, would you mind reverting until a fix is ready?

hctim added a comment.May 16 2022, 7:10 PM

Done, I'll take a look and resubmit later.

Thanks for the quick response.

hctim added a comment.May 17 2022, 9:18 AM

Why, on Fuchsia, on WIndows, the DIs are in the order q -> p -> r is beyond me (instead of the file order and my order p -> q -> r. Looking into it.

hctim reopened this revision.May 18 2022, 1:58 PM
This revision is now accepted and ready to land.May 18 2022, 1:58 PM
hctim updated this revision to Diff 430494.May 18 2022, 1:58 PM

Fix the fuchsia windows bot by making the DILocalVariables in the test order-independent.

This revision was landed with ongoing or failed builds.May 18 2022, 1:58 PM
This revision was automatically updated to reflect the committed changes.

We noticed when building Fuchsia using ToT LLVM that CGo has trouble with the new changes to DWARF introduced in this patch. It appears as if CGo is confused because it found a DW_TAG_variable that doesn't have a name, which is allowed by the DWARF standard.

This patch seems like a good change to me and is well thought out and tested, but I think this may cause issues in the wider ecosystem as people interact with it. It may simply be a problem w/ Go, and a bug fix there is all that is necessary, but there is also the chance that other compilers and toolchains will have similar issues.

I want to be clear that I don't necessarily think a revert is in order, but we should take the opportunity to determine the right course of action.

We noticed when building Fuchsia using ToT LLVM that CGo has trouble with the new changes to DWARF introduced in this patch. It appears as if CGo is confused because it found a DW_TAG_variable that doesn't have a name, which is allowed by the DWARF standard.

This patch seems like a good change to me and is well thought out and tested, but I think this may cause issues in the wider ecosystem as people interact with it. It may simply be a problem w/ Go, and a bug fix there is all that is necessary, but there is also the chance that other compilers and toolchains will have similar issues.

I want to be clear that I don't necessarily think a revert is in order, but we should take the opportunity to determine the right course of action.

Thanks for the heads up. A possible thought is to send an announcement of "this happened" to llvm-dev, but I'm not sure that would reach the intended audience, as LLVM tooling is effectively "fixed up" inside this patch, and it's more the consumers of LLVM that may need changes.

Might be worth a note in the Release Notes given that it's at least caused issues with one DWARF consumer.

tra added a subscriber: tra.EditedJan 27 2023, 4:03 PM

This patch appears to break CUDA compilation: https://github.com/llvm/llvm-project/issues/58491

We apparently emit the symbol with a character (.) that can't be used on the target. Normally such characters get renamed/escaped, as you can see in the generated symbol name itself.
Search for .b64 __func__._Z10foo_kernelv in the dwarf output.
https://godbolt.org/z/8h4rGn1Ka