Many large codes are designed to run in parallel environments. Each program thread then needs its own private Random object, which must be uniquely seeded. In a message-passing environment the code would look like this:
```cpp
// Parallel code where each thread gets a unique seed
RandomLib::Random r(0);             // Thread local Random
std::vector<unsigned long> seed;    // Thread local seed
if (comm.Id() == 0) {               // On master node
  if (debug)
    seed.push_back(314159265UL);
  else
    seed = RandomLib::Random::SeedVector();
  logfile << "Master seed: "
          << RandomLib::Random::VectorToString(seed) << std::endl;
  comm.Send(slaveIds, seed);        // Send seed to slaves
} else                              // On slave nodes
  comm.Receive(seed);               // Receive seed from master
seed.push_back(comm.Id());          // Make seed unique to this process
r.Reseed(seed);                     // Reseed Random
...                                 // The work goes here
```
This approach has the disadvantage that the results of the code will depend on the number of processors (assuming that debug is true, so that the master seed is fixed). If your code divides the computation among the threads in a way that depends on how quickly each thread completes a unit of work, then the results will vary from run to run even with a fixed number of threads. This is a serious problem when developing a code: debugging a multi-threaded code is difficult enough; debugging a code whose results aren't repeatable just compounds the difficulty. The solution is to require that the results be independent of the number of processors. This allows the algorithms and basic code to be debugged with a non-parallel version of the code. When moving to the parallel version, we require that the results stay the same; any deviation signals a bug in the parallelization. This can be accomplished by having the seed depend on the data (e.g., a loop index) rather than on the processor. Here's how this might be done:
```cpp
// Parallel code where each loop iteration gets a unique seed
RandomLib::Random r(0);             // Thread local Random
std::vector<unsigned long> seed;    // Thread local seed
if (comm.Id() == 0) {               // On master node
  if (debug)
    seed.push_back(314159265UL);
  else
    seed = RandomLib::Random::SeedVector();
  logfile << "Master seed: "
          << RandomLib::Random::VectorToString(seed) << std::endl;
  comm.Send(slaveIds, seed);        // Send seed to slaves
} else
  comm.Receive(seed);               // Receive seed from master
seed.push_back(0);                  // Increase the seed size by 1
// num units of work are to be divided between nproc processes.
// Assume comm.Id() is in [0, nproc).
for (size_t i = 0; i < num; ++i) {
  if (i % nproc != comm.Id())       // Handle this index?
    continue;
  seed.back() = i;                  // Make seed unique to this index
  r.Reseed(seed);                   // Reseed Random
  ...                               // The work goes here
}
```
This code will satisfy the goal of producing results which are independent of the number of processors. However, if num is large (say 10^8) and each iteration through the loop consumes only a small number of random numbers (say, at most, 100), then we are abusing the SFMT19937 random number generator. There are two potential problems: the time spent reseeding may rival the time spent generating the random numbers (see the timings below), and drawing only a short sequence from each seed may compromise the statistical properties of the generator.
| Function                    | Relative time |
|-----------------------------|---------------|
| s = Random::SeedWord();     | 500000        |
| v = Random::SeedVector();   | 10000         |
| r.Reseed(v); r.SetCount(0); | 4000          |
| i = r();                    | 1             |
| r.StepCount(N);             | abs(N)/2      |
In order to address efficiency, we should use at least 10^4 random numbers for each seed. In order to preserve the statistical properties of SFMT19937, we should consume at least 10^5 random numbers for each seed. (Incidentally, this table also shows the high cost of Random::SeedWord(). This is mainly because of accessing /dev/urandom. Typically you should call Random::SeedWord() at most once per code run.)
A way to ameliorate this problem is to change the loop into a double loop and do the parallelization and reseeding at the level of the outer loop. Here we choose the block size so that maxrandoms * blocksize is at least 10^5, where maxrandoms is the maximum number of random numbers consumed per iteration (100 in this example) and blocksize is the number of iterations in the inner loop. It's easy to estimate maxrandoms with calls to Count() at the beginning and end of the body of the loop (see the sketch following the code). Thus:
```cpp
// Parallel code where each block gets a unique seed
...                                 // Get the master seed as above
// num units of work are to be divided between nproc processes in
// blocks of size blocksize.
const size_t blocksize = 1000;
for (size_t i = 0, j = 0; i < num; ++j) { // j is the block index
  if (j % nproc != comm.Id()) {     // Handle this block?
    i += blocksize;                 // No, skip over its iterations
    continue;
  }
  seed.back() = j;                  // Make seed unique to this block
  r.Reseed(seed);                   // Reseed Random
  do {
    ...                             // A block of work goes here
  } while ((++i) % blocksize && i < num);
}
```
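As mentioned above, maxrandoms can be estimated by bracketing the body of the loop with calls to Count(). A minimal sketch of that bookkeeping, applied to the do loop above (the variables c0 and maxrandoms here are local bookkeeping, not part of RandomLib):

```cpp
#include <algorithm>                // for std::max

long long maxrandoms = 0;           // Largest # of randoms per iteration
do {
  long long c0 = r.Count();         // Count at the start of the body
  ...                               // A block of work goes here
  maxrandoms = std::max(maxrandoms, r.Count() - c0);
} while ((++i) % blocksize && i < num);
```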
With this splitting of the code into blocks you lose some flexibility in parallelization. We can overcome this by interspersing the reseeding with the assignment to processors. To do this we need an estimate of the maximum number of random numbers consumed by one iteration of the inner loop. We then use SetCount(n) to position the random number sequence at a definite place at the beginning of each loop iteration. (We saw in the table above that SetCount() is relatively cheap.) In this example, we avoid unnecessary reseeding by having a given processor deal with a consecutive set of loop iterations.
```cpp
// Parallel code where each iteration gets a unique (seed, count)
...                                 // Get the master seed as above
// num units of work are to be divided between nproc processes.
// Assignment to processes and seeding are treated independently.
const size_t blocksize = 1000;
const long long maxrandoms = 100;   // Max # of randoms per iteration
long long checkmaxrandoms = 0;
size_t ibeg = (comm.Id() * num) / nproc,      // The index range
       iend = ((comm.Id() + 1) * num) / nproc;
for (size_t i = ibeg, b = num; i < iend; ++i) { // b = num is an
                                    // invalid initial block index
  size_t newb = i / blocksize;      // Which block?
  if (newb != b) {                  // In a new block?
    b = newb;
    seed.back() = b;
    r.Reseed(seed);                 // Then reseed
  }
  long long k = i % blocksize;      // Index within the block
  r.SetCount(k * maxrandoms);       // Step to the starting point
  ...                               // The work goes here
  checkmaxrandoms =                 // Track how many randoms were used
    std::max(checkmaxrandoms, r.Count() - k * maxrandoms);
}
if (checkmaxrandoms > maxrandoms) {
  ...  // Flag a warning that maxrandoms needs to be increased
}
```
If more than maxrandoms numbers are used in a loop iteration, some random numbers will have been reused and the results should be treated with suspicion. In this case, you should set maxrandoms to, say, double the value of checkmaxrandoms and try again. The number of randoms consumed will usually be approximately normally distributed, and it should be easy to pick a value of maxrandoms which is safe. Typically, you should adjust blocksize so that maxrandoms * blocksize lies between 10^5 and 10^7, resulting in a reasonably long sequence of random numbers per seed.
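As a sketch, the warning in the example above might be acted on like this (the message and the factor of 2 follow the suggestion just given; they are not prescribed by RandomLib):

```cpp
#include <iostream>                 // for std::cerr

if (checkmaxrandoms > maxrandoms) {
  // Iterations overlapped in the random stream; results are suspect.
  std::cerr << "maxrandoms too small; rerun with maxrandoms >= "
            << 2 * checkmaxrandoms << std::endl;
}
```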
In the (seed, count) example above, the call to SetCount() will typically step the Random by less than maxrandoms. On the first iteration of the loop for a particular process, SetCount() might step the Random by as much as maxrandoms * blocksize = 10^5; but this typically takes less than 1 ms on a 2 GHz machine. So the cost of this strategy is small.
There are two other applications where this technique of resetting the count can be useful.
In the method described above, the use of SetCount() to reset the position in the random number stream depends on knowing in advance a bound on the number of random numbers generated in each iteration of the loop. In the absence of this a priori knowledge, we can use leapfrogging. In this technique, one iteration through the loop might use the random numbers with indices 30, 40, 50, ..., while the next uses 31, 41, 51, ..., and so on. Consider a program with a compute-intensive loop which we wish to execute in parallel. Here's the starting code:
#include "RandomLib/Random.hpp" RandomLib::Random r; // Use r size_t n = 20; // The following loop has dependencies through r and cannot be // parallelized for (size_t i = 0; i < n; ++i) { // Use r some more. } // Use r again
We give 3 techniques for removing the dependency:
Random number streams with different seeds
#include "RandomLib/Random.hpp" #include <vector> RandomLib::Random r; // Use r size_t n = 20; { // Create independent random number streams. In this example, each stream // will use a unique seed. std::vector<RandomLib::Random> s(n, r); // The dependencies through r have been eliminated and this loop is now a // candidate for parallelization. for (size_t i = 0; i < n; ++i) { { // Base the seed for each new random object on the main seed. std::vector<RandomLib::seed_type> seed = s[i].Seed(); // Add a loop identifier (123 in this example) to the seed, in case // several loops need to be parallelizeed with this technique. seed.push_back(123); // Add a loop index to the seed to make is unique to this iteration seed.push_back(i); s[i].Reseed(seed); // Initialize stream with new seed } // Use s[i] instead of r } } // Use r again
Block skipping with SetCount()
#include "RandomLib/Random.hpp" #include <vector> RandomLib::Random r; // Use r size_t n = 20; { const size_t maxrandom = 1000; const long long base = r.Count(); // Create independent random number streams. In this example, each stream // will use a block of maxrandom consecutive random numbers in base + [i * // maxrandom, (i+1) * maxrandom). std::vector<RandomLib::Random> s(n, r) // The dependencies through r have been eliminated and this loop is now a // candidate for parallelization. for (size_t i = 0; i < n; ++i) { s[i].SetCount(base + i * maxrandom); // Use s[i] instead of r if (s[i].Count() > base + (i+1) * maxrandom) { // Issue a warning that maxrandom is too small } } if (n > 0) // s[n-1] is already stepped nearly to the end of the n blocks r = s[n-1]; // Step r past the n blocks. r.SetCount(base + n * maxrandom); } // Use r again
Leapfrogging with SetStride()
#include "RandomLib/Random.hpp" #include <vector> RandomLib::Random r; // Use r size_t n = 20; { // Create independent random number streams. In this example, each stream // will random numbers with indices, base + i + n * [0, 1, 2, 3, ...]. std::vector<RandomLib::Random> s(n, r); // The dependencies through r have been eliminated and this loop is now a // candidate for parallelization. for (size_t i = 0; i < n; ++i) { // Set up leapfrogging s[i].SetStride(n, i); // Use s[i] instead of r } // Advance r to the high-water mark of s[i]. if (n > 0) { r = s[0]; // Turn off leapfrogging r.SetStride(); } for (size_t i = 1; i < n; ++i) if (s[i].Count() > r.Count()) r.SetCount(s[i].Count()); } // Use r again
Leapfrogging slows down the generation of raw random numbers by a factor of approximately 1 + (stride - 1)/2. Thus, in normal use, the stride should be less than 100 or so. (However, this will of course depend on what other work is done in the loops.) If the loop count is large, we might be able to reduce the stride by partially unrolling the loop. Alternatively, we could replace the loop by a double loop; the stride is then the loop count of the inner loop, as sketched below. The choice of strategy will naturally depend on how the parallelization is being implemented.
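Here is a minimal sketch of the double-loop variant, patterned on the SetStride() example above; the loop bounds n0 and n1 are hypothetical:

```cpp
#include "RandomLib/Random.hpp"
#include <vector>

RandomLib::Random r;
size_t n0 = 100, n1 = 10;           // n0 * n1 iterations in all
for (size_t j = 0; j < n0; ++j) {
  // The stride is only n1 (not n0 * n1), keeping the slowdown small.
  std::vector<RandomLib::Random> s(n1, r);
  // The inner loop is now a candidate for parallelization.
  for (size_t i = 0; i < n1; ++i) {
    s[i].SetStride(n1, i);          // Leapfrog within this block
    // Use s[i] instead of r
  }
  // Advance r to the high-water mark of the s[i], as above.
  r = s[0];
  r.SetStride();                    // Turn off leapfrogging
  for (size_t i = 1; i < n1; ++i)
    if (s[i].Count() > r.Count())
      r.SetCount(s[i].Count());
}
```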