Recent drivers changed the way GPUs are enumerated, so we've been running
the tests on the wrong GPU, which took ~45 minutes.
Repository: rL LLVM
Event Timeline
Don't you want to pass in the GPU-deadbeef form from nvidia-smi -L, so that we don't have this problem?
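For reference, here is a minimal sketch of pulling those identifiers out of the `nvidia-smi -L` listing. The line shape in the regular expression and the `SAMPLE` text (device names and UUIDs) are assumptions made up for illustration, and `parse_gpu_list` is a hypothetical helper, not something from this patch.

```python
import re

# Assumed line shape: "GPU 0: Tesla K80 (UUID: GPU-<hex digits and dashes>)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-fA-F-]+)\)\s*$")


def parse_gpu_list(text):
    """Return (index, name, uuid) tuples parsed from `nvidia-smi -L` style text."""
    gpus = []
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


# Made-up sample output; real device names and UUIDs will differ.
SAMPLE = """\
GPU 0: Tesla K80 (UUID: GPU-deadbeef-0000-0000-0000-000000000000)
GPU 1: GeForce GTX 1080 (UUID: GPU-cafef00d-1111-1111-1111-111111111111)
"""
```

On a real machine the text could come from `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout`.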
I'm not confident that the GUIDs remain stable, either. I have no data either way, though.
CUDA currently allows specifying the enumeration order via CUDA_DEVICE_ORDER, either by PCI bus ID (PCI_BUS_ID) or fastest first (FASTEST_FIRST, the default).
If it becomes a problem, I'll just force enumeration by PCI bus ID, which should be somewhat more stable (though it may change on a BIOS update or if I add or remove other PCIe devices).
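Forcing PCI-order enumeration could look roughly like the sketch below. `CUDA_DEVICE_ORDER` and `CUDA_VISIBLE_DEVICES` are the CUDA runtime's environment variables; the helper name `pinned_cuda_env` and the test command in the usage note are hypothetical.

```python
import os


def pinned_cuda_env(device, base_env=None):
    """Return a copy of the environment that pins CUDA to one device.

    `device` is an index (interpreted in PCI bus order) or a GPU UUID string.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate by PCI bus ID instead of the default fastest-first order.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the chosen device to the test process.
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return env
```

The test runner would then be launched with something like `subprocess.run(["./run-tests"], env=pinned_cuda_env(1))`, where `./run-tests` stands in for whatever the bot actually invokes.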
Another option would be to not include PTX in the binaries, so we'll know right away if we attempt to run the tests on the wrong GPU.
> I'm not confident that the GUIDs remain stable, either. I have no data to tell one way or another, though.
That's a good point, but presumably if the GUIDs change, they're not going to *permute* and point to a different GPU. That is, the failure mode is noisy? Dunno if the same is true for PCI IDs, but it's certainly not true for the integer indices.
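To make the "noisy failure" idea concrete, here is a small sketch that looks up a GPU by its identifier and refuses to fall back to another device when the identifier is missing. The function name `index_for_uuid` and the `(index, uuid)` input shape are assumptions for illustration only.

```python
def index_for_uuid(wanted_uuid, gpus):
    """Return the current index of the GPU with `wanted_uuid`.

    `gpus` is an iterable of (index, uuid) pairs. Raises LookupError instead
    of silently picking some other device when the UUID is not present.
    """
    for index, uuid in gpus:
        if uuid == wanted_uuid:
            return index
    raise LookupError(
        f"GPU {wanted_uuid} not found; refusing to fall back to another device"
    )
```

With integer indices there is nothing comparable to check: index 0 always resolves to some device, even after the enumeration order changes.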