Introduction
One of the first exercises given to me as a mathematics student was to write a random number generator (RNG) – which turned out not to be so easy. Test sequences cycled quickly, or were too predictable, or were not evenly distributed. Typically, when we talk about RNG’s, we are describing pseudorandom number generators. Nowadays, there are many programs that will generate pseudorandom numbers.
Where are random numbers used? When I was first a developer they were rarely required. Recently, however we’ve seen them appear in more and more places – it seems they are everywhere!
In DevOps, I’ve used RNG’s for creating message payloads of arbitrary size, and for file or directory names, among other things. These values are often created using scripts written in bash.
This article will explore three simple RNG’s that can be run from bash, and some ways to test just how random they actually are.
The three RNG’s being evaluated here are:
- Bash
RANDOM
variable - Awk
rand()
function - /dev/urandom device
It’s not an exhaustive list; there are certainly others (such as jot). However, the three described here are likely to already be installed on your Linux box.
A word of caution: NONE of these tools are suitable for generating passwords or for cryptography.
Source and test data for this article can be found on github.
In future articles I will take a closer look at random values used in testing code and how they can be used for model simulation in statistics.
Testing Numeric Sequences for Randomness
I am going to evaluate the apparent randomness (or otherwise) of a short list of 1000 values generated by each RNG. To ease comparison the values will be scaled to the range 0 ≤ x < 1. They are rounded to two decimal places and the list of values is formatted as 0.nn
.
There are many tests that can be applied to sequences to check for randomness. In this article I will use the Bartels Rank Test. It’s limitations (and those of other tests) are described in this paper.
I’ve chosen this test as it is relatively easy to understand and interpret. Rather than comparing the magnitude of each observation with its preceding sample, the Bartels Rank Test ranks all the samples from the smallest to the largest. The rank is the corresponding sequential number in the list of possibilities. Under the null hypothesis of randomness, all rank arrangements from all possibilities should be equally likely to occur. The Bartels Rank Test is also suitable for small samples.
To get a feel of the test: consider two of the small data sets provided by the R package randtests.
Example 1: Annual data on total number of tourists to the United States, 1970-1982
Example 5.1 in Gibbons and Chakraborti (2003), p.98
What this low p-value tells us about the sample data is there is strong evidence against the null hypothesis of randomness. Instead, it favours the alternative hypothesis: that of a trend.
(For a simple guide on how to interpret the p-value, see this)
Example: Changes in Australian Share Prices, 1968-78
Changes in stock levels for 1968-1969 to 1977-1978 (in AUD million), deflated by the Australian gross domestic product (GDP) price index (base 1966-1967) – example from Bartels (1982)
Here, the sample data provides weak evidence against the null hypothesis of randomness (which does not fully support the alternative hypothesis of non-random data).
Random Number Generators in Bash scripts
bash RANDOM variable
How to use RANDOM
Bash provides the shell variable RANDOM. On interrogation it will return a pseudorandom signed 16-bit integer between 0 and 32767.
RANDOM
is easy to use in bash:
Here, RANDOM
is seeded with a value (314). Seeding a random variable with the same seed will return the same sequence of numbers thereafter. This is a common feature of RNG’s and is required for results to be reproducible.
To generate a random integer between START
and END
, where START
and END
are non-negative integers, and START
< END
, use this code:
For example, to simulate 10 rolls of a 6 sided dice:
Checking Sequences from RANDOM
The following code will generate a random sample using bash RANDOM. The sample will be scaled to 1. The generated sample will then be tested using Bartels Rank test, where the null hypothesis is that the sequence is random.
Prepare the test data:
Show first 10 values:
Plot the sequence versus value from RNG:
Run Bartels Rank test:
Result
With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.
awk rand()
The following code will generate a random sample using the rand
function from GNU Awk. This function generates random numbers between 0 and 1. . The generated sample will then be tested using Bartels Rank test, where the null hypothsis is that the sequence is random.
Example (as before rand()
is seeded).
If you don’t specify a seed in srand()
it will not return the same results.
You can also generate random integers in a range. For example, to simulate 10 rolls of a 6 sided dice:
Checking Sequences from awk rand()
Prepare the test data:
Show first 10 values:
Plot the sequence vs value from RNG:
Run Bartels Rank test:
Result
With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.
urandom device
The final tool is the /dev/urandom device. The device provides an interface to the linux kernel’s random number generator. This is a useful tool as it can generate a wide variety of data types.
For example, to print a list of unsigned decimal using od(1):
It can also be used to source random hexadecimal values:
Or it can generate a block of 64 random alphanumeric bytes using:
Checking Sequences from urandom
To create a test sample that is scaled to 1, I will sample two digits from urandom and use these as decimal values. The generated sample will then be tested using the Bartels Rank test, where the null hypothesis is that the sequence is random.
Prepare the test data:
Show first 10 values:
Plot the sequence vs value from RNG:
Run the Bartels Rank test:
Result
With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.
Some final thoughts
Of the tools explored, urandom
is the most versatile, so it has broader application. The downside is that its results are not easily reproducible and issues have been identified by a study by Gutterman, Zvi; Pinkas, Benny; Reinman, Tzachy (2006-03-06) for the Linux kernel version 2.6.10.
Personally, this has been a useful learning exercise. For one, it showed the limitations in generating and testing for (pseudo)random sequences. Indeed, Aaron Roth, has suggested:
As others have mentioned, a fixed sequence is a deterministic object, but you can still meaningfully talk about how “random” it is using Kolmogorov Complexity: (Kolmogorov complexity). Intuitively, a Kolmogorov random object is one that cannot be compressed. Sequences that are drawn from truly random sources are Kolmogorov random with extremely high probability.
Unfortunately, it is not possible to compute the Kolmogorov complexity of sequences in general (it is an undecidable property of strings). However, you can still estimate it simply by trying to compress the sequence. Run it through a Zip compression engine, or anything else. If the algorithm succeeds in achieving significant compression, then this certifies that the sequence is -not- Kolmogorov random, and hence very likely was not drawn from a random source. If the compression fails, of course, it doesn’t prove that the sequence has high Kolmogorov complexity (since you are just using a heuristic algorithm, not the optimal (undecidable) compression). But at least you can certify the answer in one direction.
In light of this knowledge, lets run the compression tests for the sequences above:
Compress using zip:
Compare this to non-random (trend) data:
Or just constant data:
So zipping is a good, if rough, proxy for a measure of randomness.
Stay tuned for part two to discover how random data can be used in testing code.