This change equips lit.py with two new options, --num-shards=M and
--run-shard=N (set by default from env vars LIT_NUM_SHARDS and LIT_RUN_SHARD).
The options must be used together, and N must be in 0..M-1.
Together these options affect only test selection: they partition the test suite
into M roughly equal-sized "shards", then select only the Nth shard. They can be used
in a cluster of test machines to achieve a very crude (static) form of
parallelism, with minimal configuration work.
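
For illustration, the selection described above could look roughly like the sketch below. This is not the code from the patch: the helper name select_contiguous_shard and the choice of contiguous slices are assumptions made only to show the idea of splitting an ordered test list into M parts and keeping part N.

```python
def select_contiguous_shard(tests, num_shards, run_shard):
    """Return the run_shard-th of num_shards contiguous slices of tests.

    Hypothetical helper, not part of lit. run_shard is 0-based (0..M-1).
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= run_shard < num_shards:
        raise ValueError("run_shard must be in 0..num_shards-1")
    n = len(tests)
    # Integer arithmetic keeps shard sizes within one test of each other
    # and guarantees that every test lands in exactly one shard.
    start = run_shard * n // num_shards
    end = (run_shard + 1) * n // num_shards
    return tests[start:end]


if __name__ == "__main__":
    tests = ["test%d" % i for i in range(10)]
    for shard in range(3):
        print(shard, select_contiguous_shard(tests, 3, shard))
```

Each machine in the cluster would run the same command with the same M and a different N, so no coordination between machines is needed.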
Would it be better to shard in round-robin fashion? Tests tend to be clumped by where they are defined, and where they are defined is (weakly) correlated with how long they take to run. Round-robin assignment would therefore spread long-running tests across machines, which should reduce the variation in total testing time between shards.
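
A minimal sketch of the round-robin idea, assuming the tests are held in an ordered list. The helper name select_round_robin_shard and the durations below are made up for illustration; they are not measurements and not code from the patch.

```python
def select_round_robin_shard(tests, num_shards, run_shard):
    """Return every num_shards-th test, starting at index run_shard.

    Hypothetical helper, not part of lit. run_shard is 0-based (0..M-1).
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= run_shard < num_shards:
        raise ValueError("run_shard must be in 0..num_shards-1")
    # An extended slice deals tests out like cards: 0, M, 2M, ... go to
    # shard 0; 1, M+1, 2M+1, ... go to shard 1; and so on.
    return tests[run_shard::num_shards]


if __name__ == "__main__":
    # Made-up durations in seconds: six fast tests followed by a clump
    # of three slow ones, as if the slow tests lived in one directory.
    durations = [1, 1, 1, 1, 1, 1, 10, 10, 10]
    contiguous = [sum(durations[i * 3:(i + 1) * 3]) for i in range(3)]
    round_robin = [sum(select_round_robin_shard(durations, 3, i))
                   for i in range(3)]
    print("contiguous totals: ", contiguous)
    print("round-robin totals:", round_robin)
```

With this invented input, contiguous slices put all three slow tests in the last shard, while round-robin assignment gives each shard one slow test. Real test suites will not be this tidy, so the benefit depends on how strongly location and run time are actually correlated.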