

Using Randomness to Test Code

In part 1 of this series, we explored pseudo-random values: values that are statistically random but are derived from a known starting point, so the sequence can be repeated. In this article we explore how random values can be used in testing. You may already be familiar with randomness in test invocation; for instance, JUnit 5 provides an annotation to randomise the order of test execution. Here, however, we are looking at a style of testing that uses randomly generated input values to test properties of your code. This is known as "Property-Based Testing".

You may be asking: “Why would you use random values in testing? Doesn’t that defeat the whole purpose of unit testing, where known inputs are checked against known responses?” Well, no. Often it is hard to think of suitable positive and negative test cases that exercise your code. Besides: what if many randomly selected tests could be run automatically? What if these tests covered many different input values? And what if failing cases were automatically recorded so they could be reported and replayed later? These are just some of the benefits of this approach to testing.

Property-based testing verifies your program code using a large range of relevant inputs. It does this by generating a random sample of valid input values. Perhaps an example may help. Given a utility method to convert a string field into uppercase text, a unit test would use some expected values. In pseudo-code this may look like:

Given "abc123" then expect "ABC123"
Given "ABC" then expect "ABC"
Given "123" then expect "123"

In comparison, property-based tests look at the behaviour of any field that matches the input type. In pseudo-code this looks like:

For any lowercase alphanumeric string then expect the same string but in uppercase.
For any uppercase alphanumeric string then expect the same string, unchanged.
For any non-alphabetic string then expect the same string, unchanged.

The "for any" is where the randomness comes in.

This article will review where these ideas came from, before outlining the core principles of property-based testing. It will introduce the concepts of Generators and Shrinkage and discuss approaches to reproducing test results.

History

These concepts aren’t new. Placeholder text such as Lorem Ipsum has been used since the 1960s to model text.

Kent Beck developed a unit testing framework for Smalltalk in 1989. Those tests had to be hand-crafted, but the framework introduced a number of key concepts that we now take for granted. It organised unit tests and provided recipes for writing them. For each test case, the test data was created and then thrown away at the end. Test cases were aggregated into a test suite, and the suite formed part of a framework that also produced a report — an example of what is known as literate programming.

In 1994, Richard Hamlet wrote about Random Testing. Hamlet posited that computers could efficiently test a "vast number" of random test points. Another benefit he identified was that random testing provided a "statistical prediction of significance in the observed results". This last point is somewhat technical. In essence it describes the ability to quantify the significance of a test that does not fail. In other words: is this just testing trivial cases?

A few years later, the influential paper by Claessen and Hughes, QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs (2000), provided a whole new way to run tests using randomised values. The paper was written for the functional programming language Haskell, and it inspired a suite of property-based testing tools for many other languages. A list of current implementations appears on the QuickCheck Wikipedia page.

Introducing Property-Based Testing

The core principle of property-based testing is that for a function or method, any valid input should yield a valid response, and any input outside that range should produce an appropriate failure. Compare this to how systematic tests are normally written: given a specific input, check the program’s return value. Therein lies the problem: you need to be sure you have chosen correct and sufficient input values to exercise your code. The tools we use to check test coverage do not check the adequacy of your tests, only that you have a test for a control-flow path. So the quality of a test depends on the quality of its inputs. Property-based testing provides tools to test using randomly generated values selected over the range of inputs. This changes the focus of our tests: we concentrate on the properties of the function under test, i.e. what the inputs are and what the outputs are expected to be. Testing the properties of a function or method over a large range of values can help find bugs that specific unit tests would otherwise miss. We have experienced this first hand: it is a true "aha" moment when these tests uncover a use case with input we hadn’t thought of.

In summary:

  • With unit testing we provide fixed inputs (e.g. 0,1,2,…) and get a fixed result (e.g. 1,2,4,…).

  • With property-based testing we provide a declaration of inputs (e.g. all non-negative ints) and a declaration of conditions that must hold (e.g. the result is an int).

At its core, property-based testing requires the production of randomised input test values. These test values are produced using generators.

Generators

Random values are produced using generators: functions that produce a random value of a given type. Common generators manufacture booleans, numeric types (e.g. floats, ranges of integers), characters and strings. Both QuickCheck and JUnit-QuickCheck provide many generators. Once you have primitive generators, you can compose them into more elaborate generators and structures such as lists and maps, or other bespoke structures.

Apart from custom values, you may also want a custom distribution. Random testing is most effective when the values being tested closely match the distribution of the actual data. As the provided generators know nothing of your data, they will typically produce a uniform distribution. To control the data distribution you will need to write your own generator. Luckily, this is not a difficult task. The test tools we have used (Haskell:QuickCheck, Java:JUnit-QuickCheck and Python:Hypothesis) have rich libraries of generators that can be easily extended.
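As a sketch of what a bespoke, non-uniform generator might look like in Hypothesis (the strategy name short_biased_words and the 2:1 weighting are our own invention for illustration):

from hypothesis import given
from hypothesis.strategies import composite, integers, lists, text

@composite
def short_biased_words(draw):
    # Draw the length first so we control its distribution: roughly two
    # draws in three favour short words. (Hypothesis does not guarantee
    # uniform sampling, so the weighting is indicative only.)
    if draw(integers(min_value=0, max_value=2)) < 2:
        size = draw(integers(min_value=1, max_value=3))
    else:
        size = draw(integers(min_value=4, max_value=12))
    return draw(text(min_size=size, max_size=size, alphabet="abc123"))

@given(lists(short_biased_words(), min_size=1))
def test_join_split_roundtrip(words):
    # Words contain no spaces, so joining and splitting must round-trip.
    assert " ".join(words).split(" ") == words

This also shows composition: the primitive integers and text strategies are combined into a custom word strategy, which is in turn composed into lists.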

Shrinkage

Generators can produce large test values. When a failure has been detected, it would be nice to find a smaller example that still fails. This process is known as shrinkage.

On failure, QuickCheck attempts to reduce the failing input to a minimal case. So, from a large set of test values, QuickCheck reports the smallest case it can find that still fails the test. In practice this tends to home in on the extremes of an input value; however, this behaviour can be modified by the generator.

Shrinkage is an important feature of property-based testing. Having an example of failure is good. Having a minimal example of failure is better. With a minimal example you are more likely to understand the reasons for the failure.
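A small Hypothesis sketch shows shrinking in action; the deliberately buggy buggy_max function is our own:

from hypothesis import given
from hypothesis.strategies import integers, lists

def buggy_max(values):
    # Deliberate bug: ignores the first element of longer lists.
    return max(values[1:]) if len(values) > 1 else values[0]

@given(lists(integers(), min_size=1))
def test_max_is_largest_element(values):
    result = buggy_max(values)
    assert result in values
    assert all(result >= v for v in values)

When this fails, Hypothesis does not just report the first large, messy list it stumbled on; it shrinks the input and typically reports a minimal counterexample such as [1, 0].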

Test Reproduction

Random testing is useful, but once a failure has been identified and fixed, we would like to repeat the failed tests to ensure they have been resolved. Tools such as Python’s Hypothesis record all failed tests. Future runs automatically include prior failed tests.

Other tools such as Java’s JUnit-QuickCheck allow the tests to be repeated by specifying a random seed. When a test fails, the random seed used to generate that test is reported, and can then be used to reproduce the same test inputs.
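In Hypothesis, for example, the source of randomness can be pinned with the @seed decorator (a minimal sketch; the property itself is arbitrary):

from hypothesis import given, seed
from hypothesis.strategies import integers

@seed(314)  # fixes the source of randomness, making runs repeatable
@given(integers())
def test_negation_roundtrip(x):
    assert -(-x) == x

JUnit-QuickCheck offers a corresponding facility via an annotation attribute on the property method.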

Code Examples

So, what does this look like in real code? Marlo uses Java for development of digital solutions, so the first examples are based on the Java JUnit-QuickCheck package.

The following generator will create an alphanumeric "word" with a length of between 1 and 12 characters.

import com.pholser.junit.quickcheck.generator.GenerationStatus;
import com.pholser.junit.quickcheck.generator.Generator;
import com.pholser.junit.quickcheck.random.SourceOfRandomness;
import java.util.stream.IntStream;

/** Generate alpha-numeric characters. */
public final class AlphaNumericGenerator extends Generator<String> {

  /** Alphanumeric characters: "0-9A-Za-z". */
  private static final String ALPHANUMERICS =
      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

  /** Maximum word length. */
  private static final int MAX_WORD_LENGTH = 12;

  /** Inherit from super class. */
  public AlphaNumericGenerator() {
    super(String.class);
  }

  /** Generate an alphanumeric word of length 1 to 12 characters. Do not create null words. */
  @Override
  public String generate(final SourceOfRandomness randomness, final GenerationStatus status) {
    final int stringSize = randomness.nextInt(MAX_WORD_LENGTH) + 1; // non-empty words
    final StringBuilder randomString = new StringBuilder(stringSize);
    IntStream.range(0, stringSize)
        .forEach(
            ignored -> {
              final int randomIndex = randomness.nextInt(ALPHANUMERICS.length());
              randomString.append(ALPHANUMERICS.charAt(randomIndex));
            });
    return randomString.toString();
  }
}


To use this generator in a unit test:

/**
  * Test that an alphanumeric word is counted the same for stream and scanner input, using the
  * AlphaNumericGenerator. Trials increased from the default of 100 to 1000.
  *
  * @param word a random alphanumeric word
  */
@Property(trials = 1000)
public void testAlphanumericWord(final @From(AlphaNumericGenerator.class) String word) {
  assertEquals(1, WordCountUtils.count(new Scanner(word)));
  assertEquals(1, WordCountUtils.count(Stream.of(word)));
}


Here we are using our custom generator, and have increased the trials to 1000 from the default of 100. The expected property of our word count utility is that, given this input string, its output should indicate that it counted one word.

The following code uses this generator to build a list of strings that are joined with a space delimiter. The code under test contains two word count methods accepting different input types. Using our custom generator we can compose test data for both input types, then test whether the word count methods agree:

/**
  * Test a "sentence" of alphanumeric words. A sentence is a list of words separated by a space.
  *
  * @param words build a sentence from a word stream
  */
@Property
public void testAlphanumericSentence(
    final List<@From(AlphaNumericGenerator.class) String> words) {
  final String sentence = String.join(" ", words);
  assertEquals(
      WordCountUtils.count(new Scanner(sentence)), WordCountUtils.count(Stream.of(sentence)));
}


At Marlo we also use Ansible for automation, with some custom modules written in Python. An excellent QuickCheck library for Python is Hypothesis. An equivalent generator to the Java example above is the text strategy. Used in a test, it looks like:

from string import ascii_letters, digits

from hypothesis import given
from hypothesis.strategies import text

@given(text(min_size=1, max_size=12, alphabet=ascii_letters + digits))
def test_alphanumeric(a_string):
    """
    Generate alphanumeric sized strings like:
        'LbkNCS4xl2Xl'
        'z3M4jc1J'
        'x'
    """
    assert a_string.isalnum()
    a_length = len(a_string)
    assert a_length >= 1 and a_length <= 12


While the above are trivial examples, they do demonstrate how this style of testing is a valuable complement to systematic tests. They enable a larger number of test cases to run against your code. The style of test is different, in that it focuses on the generalised behaviour of code rather than specific use cases. This makes them a powerful addition to a test suite.

Summary

In this article we took a brief look at the features of property-based testing, which uses random inputs to improve the quality and coverage of tests.

However, it is important to note that property-based tests don’t replace unit tests. Instead they should be used to augment your existing tests with values that you may not have thought about. Generating a large number of tests may, however, give a false sense of security if most of those test cases are trivial. So choosing the correct inputs, whether randomly generated or systematically selected, is important.

Property-based tests are easy to write and can help identify bugs that traditional testing approaches might miss. So, why not use randomness to your advantage?

Get in touch with Marlo today if you’d like to learn more about how we can help modernise your testing regime.

Links

There are many online resources available if you want to learn more.

Dice

Introduction

One of the first exercises given to me as a mathematics student was to write a random number generator (RNG) – which turned out not to be so easy. Test sequences cycled quickly, were too predictable, or were not evenly distributed. Typically, when we talk about RNGs, we are describing pseudorandom number generators. Nowadays, there are many programs that will generate pseudorandom numbers.

Where are random numbers used? When I was first a developer they were rarely required. Recently, however, we’ve seen them appear in more and more places – it seems they are everywhere!

In DevOps, I’ve used RNGs for creating message payloads of arbitrary size, and for file or directory names, among other things. These values are often created using scripts written in bash.

This article will explore three simple RNGs that can be run from bash, and some ways to test just how random they actually are.

The three RNGs being evaluated here are:

  • the bash shell’s RANDOM variable
  • GNU Awk’s rand() function
  • the /dev/urandom device

It’s not an exhaustive list; there are certainly others (such as jot). However, the three described here are likely to already be installed on your Linux box.

A word of caution: NONE of these tools are suitable for generating passwords or for cryptography.

Source and test data for this article can be found on GitHub.

In future articles I will take a closer look at random values used in testing code and how they can be used for model simulation in statistics.

Testing Numeric Sequences for Randomness

I am going to evaluate the apparent randomness (or otherwise) of a short list of 1000 values generated by each RNG. To ease comparison, the values will be scaled to the range 0 ≤ x < 1, rounded to two decimal places, and formatted as 0.nn.

There are many tests that can be applied to sequences to check for randomness. In this article I will use the Bartels Rank Test. Its limitations (and those of other tests) are described in this paper.

I’ve chosen this test as it is relatively easy to understand and interpret. Rather than comparing the magnitude of each observation with that of the preceding sample, the Bartels Rank Test ranks all the samples from smallest to largest; each observation’s rank is its position in that sorted order. Under the null hypothesis of randomness, all arrangements of the ranks should be equally likely to occur. The Bartels Rank Test is also suitable for small samples.
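For reference, the test statistic is the rank version of von Neumann's ratio, where $R_i$ is the rank of the $i$-th observation and $\bar{R}$ is the mean rank:

$$\mathrm{RVN} = \frac{\sum_{i=1}^{n-1} (R_i - R_{i+1})^2}{\sum_{i=1}^{n} (R_i - \bar{R})^2}$$

Under randomness RVN should be close to 2; values near 0 point to a trend, and values near 4 point to rapid oscillation.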

To get a feel for the test, consider two of the small data sets provided by the R package randtests.

Example 1: Annual data on total number of tourists to the United States, 1970-1982

Example 5.1 in Gibbons and Chakraborti (2003), p.98

library(lubridate)   # ymd()
library(ggplot2)     # qplot() / ggplot()
library(randtests)   # bartels.rank.test()

years <- seq(ymd('1970-01-01'), ymd('1982-01-01'), by = 'years')

tourists <- c(12362, 12739, 13057, 13955, 14123, 15698, 17523, 18610, 19842, 20310, 22500, 23080, 21916)

qplot(years, tourists, colour = I('purple')) +
   ggtitle('Tourists to United States (1970-1982)') +
   theme(plot.title = element_text(hjust = 0.5)) +
   scale_x_date()

[Graph: Tourists to the United States, 1970-1982]

bartels.rank.test(tourists, alternative = "left.sided", pvalue = "beta")

## 
## Bartels Ratio Test
## 
## data:  tourists
## statistic = -3.6453, n = 13, p-value = 1.21e-08
## alternative hypothesis: trend

What this low p-value tells us is that there is strong evidence in the sample data against the null hypothesis of randomness, in favour of the alternative hypothesis: that of a trend.

(For a simple guide on how to interpret the p-value, see this)

Example 2: Changes in Australian Stock Levels, 1968-78

Changes in stock levels for 1968-1969 to 1977-1978 (in AUD million), deflated by the Australian gross domestic product (GDP) price index (base 1966-1967) – example from Bartels (1982)

gdp <- c(528, 348, 264, -20, 167, 575, 410, 4, 430, 122)

df <- data.frame(period = paste(sep = '-', 1968:1977, 1969:1978), gdp = gdp, stringsAsFactors = TRUE)

ggplot(data = df, aes(period, gdp)) +
   geom_point(colour = I('purple')) +
   ggtitle('Australian Deflated Stock Levels') +
   theme(plot.title = element_text(hjust = 0.5)) +
   xlab('Financial Year') +
   theme(axis.text.x = element_text(angle = 90)) +
   ylab('GDP (AUD million)')
bartels.rank.test(gdp, pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  gdp
## statistic = 0.083357, n = 10, p-value = 0.9379
## alternative hypothesis: nonrandomness

Here, the sample data provides only weak evidence against the null hypothesis of randomness, so it does not support the alternative hypothesis of non-random data.

Random Number Generators in Bash scripts

bash RANDOM variable

How to use RANDOM

Bash provides the shell variable RANDOM. Each time it is referenced, it returns a pseudorandom integer between 0 and 32767 (the non-negative range of a signed 16-bit integer).

RANDOM is easy to use in bash:

RANDOM=314
echo $RANDOM
## 1750

Here, RANDOM is seeded with a value (314). Seeding with the same value will return the same sequence of numbers thereafter. This is a common feature of RNGs and is required for results to be reproducible.

To generate a random integer between START and END inclusive, where START and END are non-negative integers and START < END, use this code:

RANGE=$(( END - START + 1))
echo $(( (RANDOM % RANGE) + START ))

For example, to simulate 10 rolls of a six-sided die:

START=1
END=6
RANGE=$(( END - START + 1 ))
RANDOM=314
for i in $(seq 10); do
   echo -n $(( (RANDOM % RANGE) + START )) " "
done
## 5  4  6  3  2  1  1  2  2  4

Checking Sequences from RANDOM

The following code will generate a random sample using bash RANDOM, scaled to the range 0 ≤ x < 1. The generated sample will then be tested using the Bartels Rank Test, where the null hypothesis is that the sequence is random.

Prepare the test data:

RANDOM=314
# bc interferes with RANDOM
temprandom=$(mktemp temp.random.XXXXXX)
for i in $(seq 1000); do
   echo "scale=2;$RANDOM/32768"
done > $temprandom

cat $temprandom | bc | awk '{printf "%0.2f\n", $0}' > bash.random

rm $temprandom

bashText <- readLines(con <- file(paste0(getwd(), '/bash.random')))
close(con)
bashRandom <- as.numeric(bashText)

Show first 10 values:

head(bashRandom, n = 10)
##  [1] 0.05 0.59 0.02 0.84 0.17 0.69 0.41 0.51 0.94 0.55

Plot the sequence versus value from RNG:

bashDF <- data.frame(sequence = seq(length(bashRandom)), RANDOM = bashRandom)
ggplot(bashDF, aes(x = sequence, y = RANDOM)) +
   geom_point(size = 0.5, colour = I('purple')) +
   ggtitle('bash RANDOM') +
   theme(plot.title = element_text(hjust = 0.5))
[Graph: bash RANDOM sample]

Run Bartels Rank test:

bartels.rank.test(bashRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  bashRandom
## statistic = -1.116, n = 1000, p-value = 0.2646
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

awk rand()

The following code will generate a random sample using the rand function from GNU Awk. This function generates random numbers between 0 and 1. The generated sample will then be tested using the Bartels Rank Test, where the null hypothesis is that the sequence is random.

Example (as before, rand() is seeded):

echo | awk -e 'BEGIN {srand(314)} {print rand();}'
## 0.669965

If you call srand() without a seed, it seeds from the time of day, so repeated runs will not return the same results.

You can also generate random integers in a range. For example, to simulate 10 rolls of a six-sided die:

echo | awk 'BEGIN {srand(314)} {for (i=0; i<10; i++) printf("%d ", int(rand() * 6 + 1));}'
## 5 2 2 5 3 6 1 6 5 5

Checking Sequences from awk rand()

Prepare the test data:

seq 1000 | awk -e 'BEGIN {srand(314)} {printf("%0.2f\n",rand());}' > awk.random

awkText <- readLines(con <- file(paste0(getwd(), '/awk.random')))

close(con)

awkRandom <- as.numeric(awkText)

Show first 10 values:

head(awkRandom, n = 10)
##  [1] 0.67 0.33 0.22 0.78 0.40 0.84 0.11 0.94 0.70 0.68

Plot the sequence vs value from RNG:

awkDF <- data.frame(sequence = seq(length(awkRandom)), rand = awkRandom)

ggplot(awkDF, aes(x = sequence, y = rand)) +
   geom_point(size = 0.5, colour = I('purple')) +
   ggtitle('awk rand()') +
   theme(plot.title = element_text(hjust = 0.5))

[Graph: awk rand() sample]

Run Bartels Rank test:

bartels.rank.test(awkRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  awkRandom
## statistic = 0.29451, n = 1000, p-value = 0.7685
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

urandom device

The final tool is the /dev/urandom device, which provides an interface to the Linux kernel’s random number generator. It is a useful tool as it can generate a wide variety of data types.

For example, to print a list of unsigned decimal integers using od(1):

seq 5 | xargs -I -- od -vAn -N4 -tu4 /dev/urandom
##  2431351570
##  2713494048
##  2149248736
##  2371965899
##  4265714405

It can also be used to source random hexadecimal values:

seq 5 | xargs -I -- od -vAn -N4 -tx4 /dev/urandom
##  5dd6daf1
##  819cf41e
##  c1c9fddf
##  5ecf12b0
##  cae33012

Or it can generate blocks of 64 random alphanumeric characters:

cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 64 | head -n 5
## Rui7CImJglwzzoB4XNKljLPe0WPX7WZLDy1PGl6nxkwcttTNrPDwly3d5tXTQfpp
## VhdEVxX6Il8DOTY4rKISrW5fyqF7AeZmbTwKxwd8Ae2SCEwINiRkeQVtzeXY2N8j
## pZFVpBHEEZawu6OQ52uNzRVaI5qqbDeoXKbR0R8jeGKTRcqPlKEpSFv5UaFJgo5w
## SGN6dh03OJMF8cVDPrOTcDhL1esDpK2FJl2qnIW67A9hKedIPukVHJp0ySBlvTWY
## zLH9thMYXYK6qi6IRF8vw5iXaiWut1ZT9BazJfzyCGYTxPMxmOCHIqUt2yyTrAQD

Checking Sequences from urandom

To create a test sample that is scaled to 1, I will sample two digits from urandom and use these as decimal values. The generated sample will then be tested using the Bartels Rank test, where the null hypothesis is that the sequence is random.

Prepare the test data:

cat /dev/urandom | tr -dc '0-9' | fold -w2 | \
    awk '{printf("0.%02d\n",$1)}' | head -1000 > urandom.random

urText <- readLines(con <- file(paste0(getwd(), '/urandom.random')))

close(con)

urRandom <- as.numeric(urText)

Show first 10 values:

head(urRandom, n = 10)
##  [1] 0.78 0.62 0.03 0.03 0.66 0.36 0.75 0.34 0.91 0.81

Plot the sequence vs value from RNG:

urandomDF <- data.frame(sequence = seq(length(urRandom)), urandom = urRandom)
ggplot(urandomDF, aes(x = sequence, y = urandom)) +
   geom_point(size = 0.5, colour = I('purple')) +
   ggtitle('/dev/urandom') +
   theme(plot.title = element_text(hjust = 0.5))
[Graph: /dev/urandom sample]

Run the Bartels Rank test:

bartels.rank.test(urRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  urRandom
## statistic = -0.668, n = 1000, p-value = 0.5044
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

Some final thoughts

Of the tools explored, urandom is the most versatile, so it has the broadest application. The downside is that its results are not easily reproducible, and a 2006 study by Gutterman, Pinkas and Reinman identified issues with the generator in Linux kernel version 2.6.10.

Personally, this has been a useful learning exercise. For one, it showed the limitations of generating and testing for (pseudo)random sequences. Indeed, Aaron Roth has suggested:

As others have mentioned, a fixed sequence is a deterministic object, but you can still meaningfully talk about how “random” it is using Kolmogorov complexity. Intuitively, a Kolmogorov random object is one that cannot be compressed. Sequences that are drawn from truly random sources are Kolmogorov random with extremely high probability.

Unfortunately, it is not possible to compute the Kolmogorov complexity of sequences in general (it is an undecidable property of strings). However, you can still estimate it simply by trying to compress the sequence. Run it through a Zip compression engine, or anything else. If the algorithm succeeds in achieving significant compression, then this certifies that the sequence is -not- Kolmogorov random, and hence very likely was not drawn from a random source. If the compression fails, of course, it doesn’t prove that the sequence has high Kolmogorov complexity (since you are just using a heuristic algorithm, not the optimal (undecidable) compression). But at least you can certify the answer in one direction.

In light of this knowledge, let’s run the compression tests for the sequences above:

ls -l *.random
## -rw-r--r-- 1 frank frank 5000 2019-02-12 09:51 awk.random
## -rw-r--r-- 1 frank frank 5000 2019-02-12 09:51 bash.random
## -rw-r--r-- 1 frank frank 5000 2019-02-12 09:51 urandom.random

Compress using zip:

for z in *.random; do zip ${z%%.random} $z; done
##   adding: awk.random (deflated 71%)
##   adding: bash.random (deflated 72%)
##   adding: urandom.random (deflated 72%)

Compare this to non-random (trend) data:

for i in $(seq 1000); do printf "0.%02d\n" $(( i % 100 )) ; done > test.trend
zip trend test.trend
##   adding: test.trend (deflated 96%)

Or just constant data:

for i in $(seq 1000); do echo 0.00 ; done > test.constant
zip constant test.constant
##   adding: test.constant (deflated 99%)

So zipping is a good, if rough, proxy for a measure of randomness.

Stay tuned for part two to discover how random data can be used in testing code.