COFF: Parallelize ICF.
The LLD's ICF algorithm is highly parallelizable. This patch does that
ICF accounted for about one third of total execution time. Previously,
it took 324 ms when self-hosting. Now it takes only 62 ms.
Of course your mileage may vary. My machine is a beefy 24-core Xeon machine,
so you may not see this much speedup. But this optimization should be
effective even for 2-core machine, since I saw speedup (324 ms -> 189 ms)
when setting parallelism parameter to 2.