RandomLib  1.10
Parallelization
Back to Programming tips. Forward to Function index. Up to Contents.

Many large codes are designed to run in parallel environments. The parallelization may be across the cores of a multi-core computer or across a cluster of computers. Several more-or-less portable methods for parallelization are available, for example OpenMP, MPI, and explicit threading libraries.

I will use OpenMP to illustrate the techniques for parallelizing code that uses RandomLib, because both g++ and Visual Studio support it and because it entails the least modification of a serial code. However, the techniques are readily applied to the other parallelization methods listed above.

We can stipulate some goals for a parallel code which uses random numbers:

1. The result should be correct.
2. The program should make effective use of the computational resources.
3. Running the code twice with the same random number seed on the same hardware with the same number of processors should produce identical results.
4. Running the code twice with the same random number seed on the same hardware with a different number of processors should produce identical results.
5. Running the code twice with the same random number seed on the same hardware in a serial mode should produce identical results.

Goal 1 requires (a) that updates to shared objects are properly protected by locks and (b) that the random numbers used by different threads are independent. There are three ways that (b) can be achieved:

• Different threads use a single Random object, with access to it protected by a lock. The drawbacks to this approach are the potentially high cost of frequently locking the Random object and the impossibility of achieving goal 3 with such a scheme (since the order in which different threads acquire the lock on the Random object is indeterminate).
• Different threads use copies of a single Random object (or more generally they use distinct Random objects which have the same seed), but they access non-overlapping sequences out of the Random object.
• Different threads use distinct Random objects which have distinct seeds.

We shall use a combination of the last two ways.

Achieving goal 2 depends on the problem. However, many applications using random numbers can be decomposed into many independent (or weakly dependent) pieces which can be run as separate threads. There will potentially be stalls in the threads waiting for locks to be released. In some cases, the overall throughput will be limited by some other critical resource, e.g., bandwidth to the disk. However, it's often possible to ensure that all the cores on a multiprocessor machine are kept busy and to speed up applications using random sampling dramatically.

The remaining goals, 3–5, concern the issues of being able to repeat a calculation possibly in a simpler environment for the purposes of validation, debugging, etc. For some applications, this may not matter. However, a Monte Carlo simulation of neutron transport in a fission reactor is an example of a code where it's very important to ensure that the code is correct and where it's easy to diagnose problems.

An example: estimate the volume of an n-dimensional unit sphere. A simple way to do this is to sample points uniformly in the enclosing n-dimensional cube and count how many of these points lie inside the sphere. The following code is a slight improvement on this, where we integrate analytically in one of the dimensions. (These samples of code are taken from RandomThread.cpp.)

double dotrials(Random& r, int d, long long n) {
  // Require d > 0, n > 0
  double w = 0;
  for ( ; n--; ) {                 // Iterate n times
    double h = 0;
    for (int i = 1; i < d; ++i) {  // Iterate d-1 times
      double x = 2 * r.FixedS();   // x is in (-1, 1)
      h += x * x;                  // cumulative radius^2
      if (h >= 1) break;           // Point can't be in sphere; bail out
    }
    // If h < 1 then inside a (d-1) dimensional unit sphere at radius
    // sqrt(h), so extent of last dimension is +/- sqrt(1 - h)
    w += h < 1 ? sqrt(1 - h) : 0;
  }
  return w;
}
double result(int d, long long n, double w) {
  // Volume of (d-1) dimensional box = 2^(d-1).
  // Multiply by another 2 to account for +/- extent in last dimension.
  return double(1U << d) * w / double(n);
}
int main() {
  int d = 4;                       // Number of dimensions
  long long n = 100000000;         // Number of trials 10^8
  Random r; r.Reset();
  double weight = dotrials(r, d, n);
  double volume = result(d, n, weight);
  cout << volume << "\n";
}

I've divided the computation of the volume into two functions, dotrials and result, to aid in the exposition. Nearly all the time is spent in dotrials and, as it stands, the loops in this function cannot be executed in parallel because of the updates to the random number generator, r.

Convert this into code which is ready for parallelization by making multiple calls to dotrials as follows:

int main() {
  int d = 4;                       // Number of dimensions
  long long n = 100000000;         // Number of trials 10^8
  int k = 100;                     // Number of tasks
  vector<double> w(k);             // Vector for partial weights
  for (int i = 0; i < k; ++i) {    // the main loop over tasks
    Random r;                      // task specific Random
    // Initialize r in a task specific way ...
    // Do the work; last argument splits n exactly into k pieces
    w[i] = dotrials(r, d, (n * (i + 1))/k - (n * i)/k);
  }
  // Sum up the weights from the individual tasks
  double weight = accumulate(w.begin(), w.end(), 0.0);
  // Compute the result
  double volume = result(d, n, weight);
  cout << volume << "\n";
}

The loop here splits the n samples into k independent tasks. This loop is a candidate for parallelization. First we need to decide on a suitable choice for k and on how the random number generators should be initialized. These need to address two issues:

• Efficiency: We can see from the following table that reseeding the generator is roughly as expensive as consuming 4000 random numbers from the generator. There may also be overhead in dealing with a large number of tiny tasks.
• Statistical accuracy: A sequence of numbers produced by MT19937 or SFMT19937 with a given seed has some proven good statistical properties. Little is known about the properties of the sequence of numbers obtained when these generators are frequently reseeded.
Approximate relative times for basic seeding operations

    Function                       Relative time
    s = Random::SeedWord();        500000
    v = Random::SeedVector();      10000
    r.Reseed(v); r.SetCount(0);    4000
    i = r();                       1
    r.StepCount(N);                abs(N)/3

In order to address efficiency, we should pick k small enough that we use at least 10^4 to 10^5 random numbers for each task. In order to preserve the statistical properties of SFMT19937, we should consume as many random numbers as possible for each task. (Incidentally, this table also shows the high cost of Random::SeedWord(). This is mainly because of accessing /dev/urandom. Typically you should call Random::SeedWord() at most once per code run.) In addition, we should choose k large enough to spread the tasks out between the CPUs. It is good to choose k to be several times the number of CPUs. This ensures good utilization of resources if some of the tasks complete more quickly than others (because they entail less computation or because some CPUs are faster than others); it also allows you to move your code to a larger machine (or group of machines) without adjusting k.

An obvious way to initialize the random number generator for each task is to include the task number i in the seed. For example:

// Set master_seed to a "unique" vector
vector<unsigned long> master_seed(Random::SeedVector());
master_seed.push_back(0);          // Reserve an additional slot
for (int i = 0; i < k; ++i) {      // the main loop over tasks
  Random r;                        // task specific Random
  {
    vector<unsigned long> seed(master_seed); // task specific seed
    seed.back() = i;               // include task id in seed
    r.Reseed(seed);
  }
  w[i] = dotrials(r, d, (n * (i + 1))/k - (n * i)/k);
}

Note the ease with which the seed can be made to depend on the task id — merely by appending it to the seed vector. This is easily generalized in more complicated applications, e.g., if the tasks are indexed in two directions.

However, we can improve on this a little by using leapfrogging. This is illustrated by

vector<unsigned long> master_seed(Random::SeedVector());
for (int i = 0; i < 10; ++i) {     // the main loop over tasks
  Random r(master_seed);           // use the same seed for each task
  r.SetStride(10, i);              // turn on leapfrogging with an offset i
  ...
}

where I have taken k = 10. In this example, each of the 10 tasks uses the same seed. However, the calls to SetStride cause task 0 to use the random numbers with indices 0, 10, 20, 30, ..., task 1 to use those with indices 1, 11, 21, 31, ..., and so on. The interval between random indices (10 in this example) is called the stride.

There is an overhead to this approach, since the numbers skipped over still have to be computed. However, the cost of an unused number is only 1/3 of the cost of a number that is used. The relative cost might be even smaller if each task does an appreciable amount of additional computation for each random number it consumes. In the example described here, the overhead when the stride is 4 is about 25%. This would be less in a more realistic example. The advantage of leapfrogging is that we retain the statistical benefits of using fewer, longer sequences from the SFMT19937 generator.

We can combine the two approaches with

// Set master_seed to a "unique" vector
vector<unsigned long> master_seed(Random::SeedVector());
master_seed.push_back(0);          // Reserve an additional slot
int l = 4;                         // The leapfrogging stride
for (int i = 0; i < k; ++i) {      // the main loop over tasks
  Random r;                        // task specific Random
  {
    vector<unsigned long> seed(master_seed); // task specific seed
    seed.back() = i / l;           // include task id in seed
    r.Reseed(seed);
    // Turn on leapfrogging with an offset that depends on the task id
    r.SetStride(l, i % l);
  }
  // Do the work; last argument splits n exactly into k pieces
  w[i] = dotrials(r, d, (n * (i + 1))/k - (n * i)/k);
}

With the numerical values given here, the 100 tasks use 25 sequences (each with a distinct seed), and 4000000 samples are taken from each sequence.

The final step is to cause the iterations of the main task loop to be carried out in parallel. With OpenMP, this is easily achieved by inserting the pragma

#pragma omp parallel for
for (int i = 0; i < k; ++i) {      // the main loop over tasks
  ...
}

and compiling the code with -fopenmp for g++, or turning on OpenMP support in Visual Studio (C/C++ -> Language -> OpenMP Support). OpenMP is easy to configure using cmake (see the file examples/CMakeLists.txt).

RandomThread.cpp is a complete program that carries out this computation. Running

RandomThread -n 1e10

gives

Estimate volume of a 4-dimensional sphere;
samples = -n 10000000000; tasks = -k 100; leapfrog stride = -l 4;
using RandomEngine<SFMT19937<Random_u32>,MixerSFMT>
with master seed = -s [64121,1307135579,11192,562213576,2011].
Estimated volume = 4.93475199 

On an 8-core Intel Xeon x86_64 2.66 GHz machine with SSE2 instructions, this takes about 70 sec. The exact result for the volume of a 4-dimensional sphere is π²/2 = 4.9348022...

You can verify that goals 3–5 have been met by varying the number of threads allocated. This is accomplished by setting the environment variable OMP_NUM_THREADS at run time. In particular, setting this to 1 causes the code to be executed serially; thus

env OMP_NUM_THREADS=1 RandomThread -n 1e10 -s 64121,1307135579,11192,562213576,2011

returns the identical result, 4.93475199 (but takes about 8 times longer).
