Many large codes are designed to run in parallel environments. Each program thread then needs its own private Random object, which must be uniquely seeded. In a message-passing environment the code would look like this:
```cpp
// Parallel code where each thread gets a unique seed
RandomLib::Random r(0);             // Thread local Random
std::vector<unsigned long> seed;    // Thread local seed
if (comm.Id() == 0) {               // On master node
  if (debug)
    seed.push_back(314159265UL);
  else
    seed = RandomLib::Random::SeedVector();
  logfile << "Master seed: "
          << RandomLib::Random::VectorToString(seed) << std::endl;
  comm.Send(slaveIds, seed);        // Send seed to slaves
} else                              // On slave nodes
  comm.Receive(seed);               // Receive seed from master
seed.push_back(comm.Id());          // Make seed unique to this process
r.Reseed(seed);                     // Reseed Random
...                                 // The work goes here
```
This approach has the disadvantage that the results of the code will depend on the number of processors (assuming that debug is true, so that the master seed is fixed). If your code divides the computation among the threads in a way that depends on how quickly each thread completes a unit of work, then the results will vary from run to run even with a fixed number of threads. This is a serious problem when developing a code: debugging a multi-threaded code is difficult enough; debugging a code whose results aren't repeatable just compounds the difficulty. The solution is to require that the results be independent of the number of processors. This allows the algorithms and basic code to be debugged with a non-parallel version of the code. When moving to the parallel version, we require that the results stay the same; any deviation signals a bug in the parallelization. This can be accomplished by having the seed depend on the data (e.g., a loop index) rather than on the processor. Here's how this might be done:
```cpp
// Parallel code where each loop iteration gets a unique seed
RandomLib::Random r(0);             // Thread local Random
std::vector<unsigned long> seed;    // Thread local seed
if (comm.Id() == 0) {               // On master node
  if (debug)
    seed.push_back(314159265UL);
  else
    seed = RandomLib::Random::SeedVector();
  logfile << "Master seed: "
          << RandomLib::Random::VectorToString(seed) << std::endl;
  comm.Send(slaveIds, seed);        // Send seed to slaves
} else
  comm.Receive(seed);               // Receive seed from master
seed.push_back(0);                  // Increase the seed size by 1
// num units of work are to be divided between nproc processes.
// Assume comm.Id() is in [0, nproc).
for (size_t i = 0; i < num; ++i) {
  if (i % nproc != comm.Id())       // Handle this index?
    continue;
  seed.back() = i;                  // Make seed unique to this index
  r.Reseed(seed);                   // Reseed Random
  ...                               // The work goes here
}
```
This code will satisfy the goal of producing results which are independent of the number of processors. However, if num is large (say 10^8) and each iteration through the loop consumes only a small number of random numbers (say, at most, 100), then we are abusing the SFMT19937 random number generator. There are two potential problems: the time spent reseeding may rival the time spent generating the random numbers (see the timings below), and drawing only a short sequence from each seed may compromise the statistical properties of the generator.
| Function                    | Relative time |
|-----------------------------|---------------|
| s = Random::SeedWord();     | 500000        |
| v = Random::SeedVector();   | 10000         |
| r.Reseed(v); r.SetCount(0); | 4000          |
| i = r();                    | 1             |
| r.StepCount(N);             | abs(N)/2      |
In order to address efficiency, we should use at least 10^4 random numbers for each seed. In order to preserve the statistical properties of SFMT19937, we should consume at least 10^5 random numbers for each seed. (Incidentally, this table also shows the high cost of Random::SeedWord(). This is mainly because of accessing /dev/urandom. Typically you should call Random::SeedWord() at most once per code run.)
A way to ameliorate this problem is to change the loop into a double loop and do the parallelization and reseeding at the level of the outer loop. Here we choose the block size so that maxrandoms * blocksize is at least 10^5, where maxrandoms is the maximum number of random numbers consumed per iteration (100 in this example) and blocksize is the number of iterations in the inner loop. It's easy to estimate maxrandoms with calls to Count() at the beginning and end of the body of the loop (see the sketch following the code). Thus:
```cpp
// Parallel code where each block gets a unique seed
...                                 // Get the master seed as above
// num units of work are to be divided between nproc processes in
// blocks of size blocksize.
const size_t blocksize = 1000;
for (size_t i = 0, j = 0; i < num; ++j) { // j is the block index
  if (j % nproc != comm.Id()) {     // Handle this block?
    i += blocksize;                 // No, skip over its iterations
    continue;
  }
  seed.back() = j;                  // Make seed unique to this block
  r.Reseed(seed);                   // Reseed Random
  do {
    ...                             // A block of work goes here
  } while ((++i) % blocksize && i < num);
}
```
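As mentioned above, maxrandoms can be estimated by bracketing the body of the loop with calls to Count(). A minimal sketch of that bookkeeping, applied to the do loop above (the variables c0 and maxrandoms here are local bookkeeping, not part of RandomLib):

```cpp
#include <algorithm>                // for std::max

long long maxrandoms = 0;           // Largest # of randoms per iteration
do {
  long long c0 = r.Count();         // Count at the start of the body
  ...                               // A block of work goes here
  maxrandoms = std::max(maxrandoms, r.Count() - c0);
} while ((++i) % blocksize && i < num);
```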
With this splitting of the code into blocks you lose some flexibility in parallelization. We can overcome this by interspersing the reseeding with the assignment to processors. To do this we need an estimate of the maximum number of random numbers consumed by one iteration of the inner loop. We then use SetCount(n) to position the random number sequence at a definite place at the beginning of each loop iteration. (We saw in the table above that SetCount() is relatively cheap.) In this example, we avoid unnecessary reseeding by having a given processor deal with a consecutive set of loop iterations.
```cpp
// Parallel code where each iteration gets a unique (seed, count)
...                                 // Get the master seed as above
// num units of work are to be divided between nproc processes.
// Assignment to processes and seeding are treated independently.
const size_t blocksize = 1000;
const long long maxrandoms = 100;   // Max # of randoms per iteration
long long checkmaxrandoms = 0;
size_t ibeg = (comm.Id() * num) / nproc,      // The index range
       iend = ((comm.Id() + 1) * num) / nproc;
for (size_t i = ibeg, b = num; i < iend; ++i) { // b = num is an
                                    // invalid initial block index
  size_t newb = i / blocksize;      // Which block?
  if (newb != b) {                  // In a new block?
    b = newb;
    seed.back() = b;
    r.Reseed(seed);                 // Then reseed
  }
  long long k = i % blocksize;      // Index within the block
  r.SetCount(k * maxrandoms);       // Step to the starting point
  ...                               // The work goes here
  checkmaxrandoms =                 // Track how many randoms were used
    std::max(checkmaxrandoms, r.Count() - k * maxrandoms);
}
if (checkmaxrandoms > maxrandoms) {
  ...  // Flag a warning that maxrandoms needs to be increased
}
```
If more than maxrandoms numbers are used in a loop iteration, some random numbers will have been reused and the results should be treated with suspicion. In this case, you should set maxrandoms to, say, double the value of checkmaxrandoms and try again. The number of randoms consumed will usually be approximately normally distributed, and it should be easy to pick a value of maxrandoms which is safe. Typically, you should adjust blocksize so that maxrandoms * blocksize lies between 10^5 and 10^7, resulting in a reasonably long sequence of random numbers per seed.
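As a sketch, the warning in the example above might be acted on like this (the message and the factor of 2 follow the suggestion just given; they are not prescribed by RandomLib):

```cpp
#include <iostream>                 // for std::cerr

if (checkmaxrandoms > maxrandoms) {
  // Iterations overlapped in the random stream; results are suspect.
  std::cerr << "maxrandoms too small; rerun with maxrandoms >= "
            << 2 * checkmaxrandoms << std::endl;
}
```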
In the (seed, count) example above, the call to SetCount() will typically step the Random by less than maxrandoms. On the first iteration of the loop for a particular process, SetCount() might step the Random by as much as maxrandoms * blocksize = 10^5; but this typically takes less than 1 ms on a 2 GHz machine. So the cost of this strategy is small.
There are two other applications where this technique of resetting the count can be useful.
In the method described above, the use of SetCount() to reset the position in the random number stream depends on knowing in advance a bound on the number of random numbers generated in each iteration of the loop. In the absence of this a priori knowledge, we can use leapfrogging. In this technique, one iteration through the loop might use the random numbers with indices 30, 40, 50, ..., while the next uses 31, 41, 51, ..., and so on. Consider a program with a compute-intensive loop which we wish to execute in parallel. Here's the starting code:
#include "RandomLib/Random.hpp" RandomLib::Random r; // Use r size_t n = 20; // The following loop has dependencies through r and cannot be // parallelized for (size_t i = 0; i < n; ++i) { // Use r some more. } // Use r again
We give 3 techniques for removing the dependency:
Random number streams with different seeds
#include "RandomLib/Random.hpp" #include <vector> RandomLib::Random r; // Use r size_t n = 20; { // Create independent random number streams. In this example, each stream // will use a unique seed. std::vector<RandomLib::Random> s(n, r); // The dependencies through r have been eliminated and this loop is now a // candidate for parallelization. for (size_t i = 0; i < n; ++i) { { // Base the seed for each new random object on the main seed. std::vector<RandomLib::seed_type> seed = s[i].Seed(); // Add a loop identifier (123 in this example) to the seed, in case // several loops need to be parallelizeed with this technique. seed.push_back(123); // Add a loop index to the seed to make is unique to this iteration seed.push_back(i); s[i].Reseed(seed); // Initialize stream with new seed } // Use s[i] instead of r } } // Use r again
Block skipping with SetCount()
#include "RandomLib/Random.hpp" #include <vector> RandomLib::Random r; // Use r size_t n = 20; { const size_t maxrandom = 1000; const long long base = r.Count(); // Create independent random number streams. In this example, each stream // will use a block of maxrandom consecutive random numbers in base + [i * // maxrandom, (i+1) * maxrandom). std::vector<RandomLib::Random> s(n, r) // The dependencies through r have been eliminated and this loop is now a // candidate for parallelization. for (size_t i = 0; i < n; ++i) { s[i].SetCount(base + i * maxrandom); // Use s[i] instead of r if (s[i].Count() > base + (i+1) * maxrandom) { // Issue a warning that maxrandom is too small } } if (n > 0) // s[n-1] is already stepped nearly to the end of the n blocks r = s[n-1]; // Step r past the n blocks. r.SetCount(base + n * maxrandom); } // Use r again
Leapfrogging with SetStride()
#include "RandomLib/Random.hpp" #include <vector> RandomLib::Random r; // Use r size_t n = 20; { // Create independent random number streams. In this example, each stream // will random numbers with indices, base + i + n * [0, 1, 2, 3, ...]. std::vector<RandomLib::Random> s(n, r); // The dependencies through r have been eliminated and this loop is now a // candidate for parallelization. for (size_t i = 0; i < n; ++i) { // Set up leapfrogging s[i].SetStride(n, i); // Use s[i] instead of r } // Advance r to the high-water mark of s[i]. if (n > 0) { r = s[0]; // Turn off leapfrogging r.SetStride(); } for (size_t i = 1; i < n; ++i) if (s[i].Count() > r.Count()) r.SetCount(s[i].Count()); } // Use r again
Leapfrogging slows down the generation of raw random numbers by a factor of approximately 1 + (stride - 1)/2. Thus, in normal use, the stride should be less than 100 or so. (However, this will of course depend on what other work is done in the loops.) If the loop count is large, we might be able to reduce the stride by partially unrolling the loop. Alternatively, we could replace the loop by a double loop; the stride is then the loop count of the inner loop, as sketched below. The choice of strategy will naturally depend on how the parallelization is being implemented.
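Here is a minimal sketch of the double-loop variant, patterned on the SetStride() example above; the loop bounds n0 and n1 are hypothetical:

```cpp
#include "RandomLib/Random.hpp"
#include <vector>

RandomLib::Random r;
size_t n0 = 100, n1 = 10;           // n0 * n1 iterations in all
for (size_t j = 0; j < n0; ++j) {
  // The stride is only n1 (not n0 * n1), keeping the slowdown small.
  std::vector<RandomLib::Random> s(n1, r);
  // The inner loop is now a candidate for parallelization.
  for (size_t i = 0; i < n1; ++i) {
    s[i].SetStride(n1, i);          // Leapfrog within this block
    // Use s[i] instead of r
  }
  // Advance r to the high-water mark of the s[i], as above.
  r = s[0];
  r.SetStride();                    // Turn off leapfrogging
  for (size_t i = 1; i < n1; ++i)
    if (s[i].Count() > r.Count())
      r.SetCount(s[i].Count());
}
```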