RandomLib 1.10
Many large codes are designed to run in parallel environments. The parallelization may be across the cores of a multi-core computer or across a cluster of computers. Some of the more-or-less portable methods for parallelization are OpenMP, MPI, and explicit threading (e.g., POSIX or Windows threads).
I will use OpenMP to illustrate the techniques for parallelizing RandomLib because both g++ and Visual Studio support it and because it entails the least modification of a serial code. However, the techniques are readily applied to the other parallelization methods listed above.
We can stipulate some goals for a parallel code which uses random numbers:

1. the results should be correct;
2. the code should run faster than the equivalent serial code;
3. the results should be reproducible;
4. the results should be independent of the number of threads;
5. it should be possible to run the code serially and obtain the same results.
Goal 1 requires (a) that updates to shared objects are properly protected by locks and (b) that the random numbers used by different threads are independent. There are three ways that (b) can be achieved:

- use a different random number algorithm for each thread;
- use a different seed for each thread;
- divide the output of a single generator among the threads.
We shall use a combination of the last two ways.
Achieving goal 2 depends on the problem. However, many applications using random numbers can be decomposed into many independent (or weakly dependent) pieces which can be run as separate threads. There will potentially be stalls in threads waiting for locks to be released, and in some cases the overall throughput will be limited by some other critical resource, e.g., bandwidth to the disk. Nevertheless, it's often possible to ensure that all the cores on a multiprocessor machine are kept busy and thereby to speed up applications using random sampling dramatically.
The remaining goals, 3–5, concern the issues of being able to repeat a calculation, possibly in a simpler environment, for the purposes of validation, debugging, etc. For some applications, this may not matter. However, a Monte Carlo simulation of neutron transport in a fission reactor is an example of a code where it's very important to be able to verify that the code is correct and to diagnose problems easily.
An example: estimate the volume of an n-dimensional unit sphere. A simple way to do this is to sample points uniformly in the enclosing n-dimensional cube and count how many of these points lie inside the sphere. The following code is a slight improvement on this, where we integrate analytically in one of the dimensions. (These samples of code are taken from RandomThread.cpp.)
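Here is a minimal sketch of such a computation for a 4-dimensional sphere. The names dotrials and result follow the discussion below; r.Fixed(), assumed here to return a uniform deviate in [0,1), stands in for whatever RandomThread.cpp actually uses, so treat this as illustrative rather than a copy of that file.

```cpp
#include <cmath>
#include <RandomLib/Random.hpp>

// Accumulate, over num trials, the chord half-length of the unit
// 4-dimensional sphere above a random point (x, y, z) of the enclosing
// cube [-1,1]^3; the 4th dimension is integrated analytically.
double dotrials(RandomLib::Random& r, long long num) {
  double sum = 0;
  while (num--) {
    double x = 2 * r.Fixed() - 1,     // assume r.Fixed() is uniform in [0,1)
           y = 2 * r.Fixed() - 1,
           z = 2 * r.Fixed() - 1,
           s = 1 - (x * x + y * y + z * z);
    if (s > 0) sum += std::sqrt(s);   // zero contribution outside the sphere
  }
  return sum;
}

// Convert the accumulated sum into a volume estimate: the cube has volume
// 2^3 and each chord has length 2 * sqrt(s), giving an overall factor 16.
double result(double sum, long long n) {
  return 16 * sum / n;
}
```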
I've divided the computation of the volume into two functions, dotrials and result, to aid in the exposition. Nearly all the time is spent in dotrials and, as it stands, the loops in this function cannot be executed in parallel because of the updates to the random number generator, r.
Convert this to code which is ready for parallelization by making multiple calls to dotrials, as follows.
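A sketch of the restructured driver, continuing the example above (the default constructor is a placeholder; choosing k and seeding each r properly are the subject of the next paragraphs):

```cpp
long long n = 100000000;   // total number of samples
int k = 100;               // number of independent tasks
double sum = 0;
for (int i = 0; i < k; ++i) {
  RandomLib::Random r;     // a fresh generator per task; seeding decided below
  sum += dotrials(r, n / k);
}
double volume = result(sum, n);
```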
The loop here splits the n samples into k independent tasks and is a candidate for parallelization. First we need to decide on a suitable choice for k and on how the random number generators should be initialized. These need to address two issues: the efficiency of the computation and the statistical independence of the random sequences. The following table of approximate relative costs of the basic RandomLib operations bears on the efficiency question:
Function | Relative time
---|---
s = Random::SeedWord(); | 500000
v = Random::SeedVector(); | 10000
r.Reseed(v); r.SetCount(0); | 4000
i = r(); | 1
r.StepCount(N); | abs(N)/3
In order to address efficiency, we should pick k small enough that we use at least 10^4 to 10^5 random numbers for each task. In order to preserve the statistical properties of SFMT19937, we should consume as many random numbers as possible for each task. (Incidentally, this table also shows the high cost of Random::SeedWord(). This is mainly because of accessing /dev/urandom. Typically you should call Random::SeedWord() at most once per run of the code.) In addition, we should choose k large enough to spread the tasks out between the CPUs. It is good to choose k to be several times the number of CPUs. This ensures good utilization of resources if some of the tasks complete more quickly than others (because they entail less computation or because some CPUs are faster than others); it also allows you to move your code to a larger machine (or group of machines) without adjusting k.
An obvious way to initialize the random generator for each task is to include the task number i in the seed. For example
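(A sketch, continuing the driver above; I assume Random::SeedVector() returns a std::vector<unsigned long> and that Random accepts such a vector as its seed.)

```cpp
std::vector<unsigned long> v = RandomLib::Random::SeedVector(); // master seed
double sum = 0;
for (int i = 0; i < k; ++i) {
  v.push_back(i);             // include the task number in the seed
  RandomLib::Random r(v);
  sum += dotrials(r, n / k);
  v.pop_back();               // restore the master seed for the next task
}
```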
Note the ease with which the seed can be made to depend on the task id — merely by appending it to the seed vector. This is easily generalized in more complicated applications, e.g., if the tasks are indexed in two directions.
However, we can improve on this a little by using leapfrogging. This is illustrated by
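A sketch, assuming SetStride takes the stride and the task's offset within it, consistent with the description that follows:

```cpp
int k = 10;                   // number of tasks = leapfrog stride
std::vector<unsigned long> v = RandomLib::Random::SeedVector();
double sum = 0;
for (int i = 0; i < k; ++i) {
  RandomLib::Random r(v);     // every task starts from the same seed...
  r.SetStride(k, i);          // ...but task i uses indices i, i+k, i+2k, ...
  sum += dotrials(r, n / k);
}
```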
where I have taken k = 10. In this example, each of the 10 tasks uses the same seed. However, the calls to SetStride cause task 0 to use the random numbers with indices 0, 10, 20, 30, ..., task 1 to use those with indices 1, 11, 21, 31, ..., and so on. This interval between random indices (10 in this example) is called the stride.
There is an overhead to this approach, since the numbers skipped over still have to be computed. However, the cost of an unused number is only 1/3 of the cost of a number that is used. The relative cost might be even smaller if each task does an appreciable amount of additional computation for each random number it consumes. In the example described here, the overhead when the stride is 4 is about 25%. This would be less in a more realistic example. The advantage of leapfrogging is that we retain the statistical benefits of using fewer, longer sequences from the SFMT19937 generator.
We can combine the two approaches with
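Continuing the sketches above (v is the master seed vector; the grouping arithmetic is mine, chosen to match the numbers quoted below):

```cpp
int k = 100,                  // number of tasks
    l = 4;                    // leapfrog stride
double sum = 0;
for (int i = 0; i < k; ++i) {
  v.push_back(i / l);         // k / l = 25 distinct seeds, one per group of l tasks
  RandomLib::Random r(v);
  r.SetStride(l, i % l);      // leapfrog within the group sharing this seed
  sum += dotrials(r, n / k);
  v.pop_back();
}
```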
With the numerical values given here, the 100 tasks use 25 sequences (each with a distinct seed), and 4000000 samples are taken from each sequence.
The final step is to cause the iterations of the main task loop to be carried out in parallel. With OpenMP, this is easily achieved by inserting a pragma immediately before the loop.
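In outline (the reduction clause and the thread-private copy of the seed vector are my additions, to keep all updates to shared state out of the parallel region; RandomThread.cpp may organize this differently):

```cpp
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < k; ++i) {
  std::vector<unsigned long> w(v);  // private copy: no shared seed updates
  w.push_back(i / l);
  RandomLib::Random r(w);
  r.SetStride(l, i % l);
  sum += dotrials(r, n / k);
}
```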
The code is then compiled with -fopenmp for g++, or with OpenMP support turned on in Visual Studio (C/C++ -> Language -> OpenMP Support). OpenMP is also easy to configure using cmake (see the file examples/CMakeLists.txt).
RandomThread.cpp is a complete program that carries out this computation. Running
RandomThread -n 1e10
gives
Estimate volume of a 4-dimensional sphere; samples = -n 10000000000; tasks = -k 100; leapfrog stride = -l 4; using RandomEngine<SFMT19937<Random_u32>,MixerSFMT> with master seed = -s [64121,1307135579,11192,562213576,2011]. Estimated volume = 4.93475199
On an 8-core Intel Xeon x86_64 2.66 GHz machine with SSE2 instructions, this takes about 70 sec. The exact result for the volume of a 4-dimensional sphere is π²/2 = 4.9348022...
You can verify that goals 3–5 have been met by varying the number of threads allocated. This is accomplished by setting the environment variable OMP_NUM_THREADS at run time. In particular, setting this to 1 causes the code to be executed serially; thus
env OMP_NUM_THREADS=1 RandomThread -n 1e10 -s 64121,1307135579,11192,562213576,2011
returns the identical result, 4.93475199 (but takes about 8 times longer).