
Omni-channel, Cloud, Open Source, Microservices, Security, Scalability, Agility – these are just some of the concerns facing technology teams as they work to quickly deliver customer focused digital solutions.

At Marlo, we have seen organisations spin their wheels while designing and building the infrastructure and delivery capability to operate in a digital environment. In response, we have tapped into our combined experience to produce the Marlo Digital Enablement Platform [MDEP]. MDEP is an opinionated and extensible platform that has been designed around the following principles:

  • Combine the best open-source, SaaS and cloud-native services
  • Containerised workloads are the unit of deployment
  • Managed Kubernetes is the runtime environment
  • APIs/messaging are the standard model of external interaction
  • The platform is cloud agnostic
  • Security is designed in from the ground up
  • Delivery pipelines are fully automated
  • Platform provisioning and upgrades are zero-outage

That’s nice, but what do I do with it?

Much as it's fun to kick off a CI/CD pipeline and see a new production-ready cloud platform spring into life in less than an hour, we knew that we had to show how this platform can reduce the workload on teams including developers, testers, and DevOps.

To do this, we have set about building two technology demonstrators covering business domains in which we are heavily involved. Even if you don't work in banking or government, they still show how the platform accelerates delivery.

Our demonstration applications

The Open Banking demonstration provides both web and mobile interfaces, allowing users to log on and interact with typical banking features including account and transaction lookups, changing personal details and making payments. Core system functionality comes from a mix of a mock banking system and live calls to public Open Banking APIs.

The Victorian Government Planning demonstration simulates providing access to VicPlan information for a citizen wishing to find details of a property including the local government area and planning scheme overlays. This demonstration retrieves details from public APIs on the Internet.

Each application showcases technology features that are critical to providing modern real-world applications:

Microservices managed as a mesh. A microservice is a small, business-oriented software component that takes exclusive responsibility for an individual domain. This architecture helps teams manage scale and the need for rapid change. The platform automatically deploys microservices into the open source Istio service mesh, which abstracts API traffic management concerns such as discovery and security away from developers, as well as providing common resilience patterns including retries and circuit breakers.
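For illustration only (the accounts service name below is hypothetical, not part of MDEP), retry behaviour can be declared on a route in an Istio VirtualService rather than coded into each consumer:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: accounts
spec:
  hosts:
    - accounts
  http:
    - route:
        - destination:
            host: accounts
      retries:
        attempts: 3        # retry a failed call up to three times
        perTryTimeout: 2s  # give each attempt two seconds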

APIs and Integration. Microservice logic as well as core systems and external interfaces are abstracted behind well structured REST and RPC APIs. This provides quick adoption by multiple user channels such as the web and mobile interfaces implemented in the demonstrations.

Containerised deployment onto the Cloud. By packaging into containers and deploying onto public cloud infrastructure, MDEP leverages the enormous scalability and resilience that can be provided by the major cloud providers. Deployable units are Docker images which allows them to be distributed across Kubernetes clusters.
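As a minimal sketch only (the service name and image are hypothetical, not MDEP's actual manifests), a containerised workload is described to Kubernetes with a Deployment along these lines:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: accounts
spec:
  replicas: 3                     # run three copies for resilience
  selector:
    matchLabels:
      app: accounts
  template:
    metadata:
      labels:
        app: accounts
    spec:
      containers:
        - name: accounts
          image: registry.example.com/accounts:1.0.0   # Docker image built by the pipeline
          ports:
            - containerPort: 8080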

On demand provisioning of supporting components. The build pipelines have been designed to readily provision extension components such as databases and caching in support of the business logic.

Security. MDEP has been designed to be secure from its inception. Many security features, including secure inter-service communication, network zoning, and policy enforcement via an API gateway and service mesh, are provisioned by default by the CI/CD pipelines that both build platform instances and deploy applications. The Open Banking application demonstrates the integration of an external identity provider to provide OAuth 2.0 and multi-factor authentication.

DevOps pipeline automation. The MDEP platform and agile development practices are aligned with modern DevOps practices. Changes to platforms are only permitted via the CI/CD pipelines, ensuring that all infrastructure and code is managed under Source Control Management and CI/CD processes.

What’s a Digital Enablement Platform?

Digital delivery requires speed and a focus on customer experience rather than technology. To enable this, a digital platform needs to remove as many technology concerns as possible. Marlo’s platform provides an opinionated and automated default configuration for the entire end-to-end lifecycle of digital development. To achieve this it leverages what we believe to be current best-practice tools and services including:

  • Deployment onto any of the major cloud providers
  • Use of cloud-native and open source components so that the cost of unused components can scale to zero
  • Full automation via CI/CD pipelines using a combination of GitLab, Red Hat Ansible, and Hashicorp Terraform
  • Docker, Kubernetes and Istio for workload management


What do build teams get from the platform?

Product Owners avoid a lengthy planning, architecture and procurement ramp-up period by using an opinionated platform based on our experience and best practice.

Architects avoid licence-driven architectures and product lock-in by using cloud-native, SaaS, and open source components.

Designers and Developers focus on business logic while using development standards including SCM, naming standards, monitoring & logging, automated code defect scanning, and API documentation.

Testers benefit from the Karate test automation framework embedded into the CI/CD pipelines; tests are written using Behaviour Driven Development (BDD) syntax. The Selenium framework provides UI testing. Together they cover the main testing types, including functional, UI and performance.

DevOps teams are provided with automated and zero-outage deployments, the ability to quickly provision new platform instances, source and artefact management, and a simple mechanism to provide supporting components such as databases.

Support teams can readily visualise the state of both the platform instances and the microservices running on them. The open source Kiali service mesh management console and cloud platform services such as AWS CloudWatch are used to ensure each platform is easy to operate.

Can I see this for myself?

If you are starting your digital journey, or if your current technology practices are delivering too slowly, then Marlo would be happy to demonstrate and discuss how MDEP can address your specific needs. Using automation, we can show a new secure and scalable platform instance being created in real time during our discussions.

    

Introduction

In our earlier article on Git pipelines, we mentioned that GitHub had released a beta of Actions, their latest CI/CD workflow automation tool. Let’s take a quick look at some of its features.

For simplicity, we’ll use the same example as in the previous article – that of rendering this article into HTML – which is more than enough to demonstrate the basic features.

To recap, the workflow for Git pipelines was:

  1. Get the latest commit in the repository
  2. Install GNU Make
  3. Install pandoc which is used to render Markdown into HTML
  4. Render the HTML document from Markdown
  5. Archive HTML document

The Actions based workflow is similar, but quite a bit simpler. It performs the following tasks:

  1. Get latest commit in the repository
  2. Render the HTML document from Markdown
  3. Publish the rendered HTML document to GitHub pages

It’s simpler, because we don’t need to install the dependent software – we can use pre-prepared Docker Hub images instead.

What are GitHub Actions?

Actions introduce integrated pipelines called workflows into a GitHub repository. That means we can access workflows directly from GitHub’s dashboard via the Actions tab. (Note that when we were preparing this article, the job history in the Actions tab did not show until after we had published to the master branch.)

From the Actions tab we can view job history as well as view, edit or add workflows:

What are GitHub Workflows?

Workflows define the automation steps of a pipeline. Workflows are stored in the .github/workflows directory at the root of your project. A workflow has one or more jobs, each containing a sequence of tasks called steps. As an example, let's work through this project's workflow, which is defined in the YAML file below:
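(The sketch below is a representative reconstruction based on the description that follows, not the project's exact file; the Docker image and the pages-publishing action are placeholders.)

name: render-readme                        # (1) workflow name shown on the dashboard

on:                                        # (2) trigger
  push:
    paths:
      - README.md

jobs:                                      # (3) jobs
  build:                                   # (4) job id
    name: Build and publish                # (5) job name
    runs-on: ubuntu-latest                 # (6) virtual environment

    steps:                                 # (7) steps
      - name: shallow checkout             # (8) checkout
        uses: actions/checkout@v2          # (9) a pinned standard action
        with:                              # (10) parameters for the action
          ref: master
          fetch-depth: 1

      - name: render document              # (11) custom Docker container as an action
        uses: docker://your-org/pandoc:latest      # placeholder image name
        with:
          entrypoint: make
          args: README.html

      - name: publish to pages             # (12) publish static HTML to GitHub Pages
        if: success()                      # (13) conditional execution
        uses: your-org/publish-gh-pages@v1         # placeholder for the chosen publishing action
        with:
          publish_dir: .
        env:
          GH_PAGES_TOKEN: ${{ secrets.GH_PAGES_TOKEN }}    # (14) secret access token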



There are three core sections to a workflow (1) – (3):

(1) name

A workflow has a name. This name will appear as a title on the dashboard.

(2) on

This describes how this workflow gets triggered. There are multiple ways that a workflow can be triggered:

  1. on push or pull request on branch or tag
  2. on push or pull request on a path
  3. on a schedule

Here, we are experimenting with triggering the workflow on a push that changes README.md.

(3) jobs

Jobs contain steps for execution. The bulk of a job's workflow appears under section (3). These are explained in sections (4) to (14) below.

(4) id

Jobs are given a unique id. Here, we have labelled it build.

(5) name

Jobs have a name which will appear on GitHub.

(6) runs-on

Jobs are run on GitHub hosted virtual machine images. The current choices offer these three virtual environment types:

  1. Ubuntu
  2. Windows Server
  3. macOS

Apart from latest, there is a choice of versions for each virtual environment. The limitation here is that you must use one of these images. If you are invoking Docker-based Actions, then you must use a Linux image. These Docker images must also run as root, which could be problematic. For example, Haskell Stack will complain when installing dependencies as a user with different privileges.

(7) steps

The remainder of the job is composed of Steps. Steps are the workhorse of workflows. Steps can run set-up tasks, run commands or run actions. Our workflow performs three named tasks:

  1. shallow checkout
  2. render document
  3. publish to pages

(8) checkout

Previously with Azure pipelines we only needed to specify how the pipeline was triggered – it was assumed that the code was already checked out. With Actions this step is explicit: that is, we need to invoke an action to checkout from GitHub. The benefit is that you can finely tune how and what to checkout. In the example action, (8), we are performing a shallow checkout (depth of 1 commit) from the master branch.

(9) uses

To perform the checkout we are using the standard checkout action. We would recommend that you specify a specific version instead of a generic tag like @latest.

When we were reviewing actions, it was helpful and instructive to view the source code to check whether the action provided the required features. For instance, we were able to trial three different actions to publish content, before settling on the current solution.

(10) with

Some actions require parameters. These are provided using the with clause. In this case, (10), we are supplying specific checkout options.

Each Action can define its own values or defaults so it pays to read the source to determine the available choices for the specific version being used.

In other examples (11), we are overriding the default entry point of the Docker container, or specifying the directory location to publish, (12).

(11) using custom Docker

Custom Docker containers can be called as Actions. In this example we are calling a prepared image with all the tools used for rendering this project from markdown to HTML.

(12) publish pages

In our previous article we rendered markdown to HTML and provided it as an archive to download. A better solution is to publish static content to GitHub Pages. This required the creation of an access token which is nicely described here. This token is added to the project as a Settings > Secret named GH_PAGES_TOKEN. This token is passed to the action so it is able to publish the rendered static HTML page to the gh_pages branch.

(13) if

The if clause can conditionally execute a step. The conditional expression can be a Boolean expression or a GitHub context. If the condition is true, the step will execute. In our example it uses a context to check the status of the previous step.

(14) secrets

Secrets are encrypted environment variables. They are restricted for use in Actions. Here, we store the token required to publish to GitHub Pages. See (12).

Putting It All Together

We now have all the pieces in place to execute our workflow that will:

  1. Invoke an action to perform a shallow checkout of our repository from the master branch
  2. Render the markdown using a custom Action from our own pandoc Docker container
  3. Use a public Action to publish the static HTML to GitHub pages

Workflows are integrated into GitHub, unlike the previous Azure pipelines. A big relief!

Some Extras

Workflow Logs

Job runs are recorded. You can review a job by following the link from Workflow runs. This will show a run history like:

Each job step has logs that can be viewed and/or downloaded.

Editing a Workflow

GitHub provides an online editor for your workflow:

However, this editor does not currently validate the workflow. So why is it provided at all, when it offers nothing that normal online editing doesn't?

 

First Impressions

Our first impression of GitHub Actions is that they are a significant improvement over the former Azure pipelines. Features we particularly like are:

  • Actions are well integrated into GitHub
  • There’s an active marketplace for Actions and Apps. See a comparison between Actions and Apps here.
  • Documentation is good
  • The ability to use custom Docker images
  • Fast workflows

However, there are also some drawbacks:

  • There is no cache between jobs. The current recommended practice is to archive the required data and then restore the archive in the job that needs it; to do this you need to execute additional Actions (see the sketch after this list). Having a local cache is really important for projects like Java that have many dependencies. No cache means downloading them on each and every build!
  • The recommended practice is to write actions in JavaScript since these actions are performed on the GitHub host, and do not need to be pulled from external sources. Really? JavaScript? It seems like a bizarre choice – JavaScript is not the first language DevOps would turn to when building workflow pipelines. Will GitHub Actions support other languages in the future?
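A rough sketch of the archive-and-restore workaround mentioned above, assuming the standard upload and download artifact actions and an illustrative artifact name:

  # in the job that produces the dependencies
  - name: archive dependencies
    uses: actions/upload-artifact@v1
    with:
      name: dependency-cache          # illustrative artifact name
      path: .m2/repository

  # in a later job that needs them
  - name: restore dependencies
    uses: actions/download-artifact@v1
    with:
      name: dependency-cache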

We also found that the Docker Actions available on the marketplace are of variable quality. We spent time experimenting with different variations until we found those that matched our requirements. As the source code is available, it was easy to evaluate an Action's implementation. Or, you could simply write your own following these instructions. We also found that we could use our existing Docker images without modification.

GitHub Actions has some good features, and they are easily composed. While JavaScript is not the first tool we would consider as a workflow language, Docker is a very workable compromise, even with the small performance hit.


Introduction

Git has become the de facto standard for version control, but until recently you needed external tools such as Jenkins or GoCD to manage Continuous Integration / Continuous Delivery (CI/CD) pipelines.

Now, though, we’re seeing vendors like Gitlab and others providing pipeline features with extensible suites of tools to build, test and deploy code. These integrated CI/CD features greatly streamline solution delivery and have given rise to whole new ways of doing things like GitOps.

In this article we examine and compare some of the current pipeline features from three popular Git hosting sites: GitLab, Bitbucket and GitHub, and ask the question: "Is it time to switch from your current CI/CD toolset?"

Example Pipeline

Let’s use pipelines to render the Git Markdown version of this article into an HTML document.

The pipeline features we are using:

  • using Docker images to execute build tasks
  • customising the build environment
  • pipeline stages
  • archiving generated artefacts – in this case a document, but in real life you might be archiving a built Docker image

The pipeline workflow is:

  1. install GNU Make
  2. install pandoc – we are using this to render Markdown to HTML
  3. render the HTML document from Markdown
  4. archive rendered document

The code for this project can be viewed from these Git repositories:

GitLab

GitLab’s Community Edition pipelines are a well-integrated tool, and are our current pipeline of choice.

Example Pipeline

The CI/CD pipelines are easily accessed from the sidebar:

Viewing jobs gives you a pipeline history:

The YAML configuration file .gitlab-ci.yml for this pipeline is:

image: conoria/alpine-pandoc

variables:
  TARGET: README.html

stages:
  - build

before_script:
  - apk update
  - apk add make

render:
  stage: build
  script:
    - make $TARGET
  artifacts:
    paths:
      - $TARGET

Where:

  • image – specifies a custom Docker image from Docker Hub (can be custom per job)
  • variables – define a variable to be used in all jobs
  • stages – declares the jobs to run
  • before_script – commands to run before all jobs
  • render – name of job associated with a stage. Jobs in the same stage are run in parallel
  • stage – associates a job with a stage
  • script – commands to run for this job
  • artifacts – paths of objects to archive; these can be downloaded if the job completes successfully

What this pipeline configuration does is:

  • load an Alpine Docker image for pandoc
  • invoke the build stage which
    • initialises with alpine package update and install
    • runs the render job which generates the given target HTML
    • on successful completion, the target HTML is archived for download

Features and Limitations

There are many other features, including scheduling pipelines and the ability to configure jobs by branch.
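Branch restrictions, for example, are a small addition to a job definition. A minimal sketch using the only keyword (newer GitLab releases also offer rules):

render:
  stage: build
  only:
    - master          # run this job only for commits on master
  script:
    - make $TARGET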

One useful feature for Java / Maven projects is caching of the .m2 directory. This speeds up the build as you don’t have a completely new environment for each build, but can leverage previous cached artefacts instead. GitLab also provides a clear cache button on the pipeline page.
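A minimal sketch of that caching configuration in .gitlab-ci.yml, assuming a Maven project that keeps its local repository inside the build directory:

variables:
  MAVEN_OPTS: "-Dmaven.repo.local=.m2/repository"

cache:
  key: "$CI_COMMIT_REF_SLUG"        # one cache per branch
  paths:
    - .m2/repository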

GitLab also supports hosting of static pages. This is simple to set-up and use, requiring only an additional pages job in the deployment stage to move static content into a directory called public. This makes it very easy to host a project’s generated documentation and test results.
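For this project, a pages job along the following lines would publish the rendered document (it assumes a deploy stage has been added to the stages list):

pages:
  stage: deploy
  script:
    - mkdir -p public
    - cp README.html public/
  artifacts:
    paths:
      - public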

Finally, GitLab provides additional services that can be integrated with your project. For example: JIRA tracking, Kubernetes, and monitoring using Prometheus.

Summary

Overall, GitLab is easy to configure and easy to navigate, and provides Marlo with our current preferred Git pipeline solution.

Bitbucket

Atlassian’s Bitbucket pipeline functionality and configuration is similar to GitLab.

Example Pipeline

Again, pipelines and settings are easily navigated to using the sidebar.

But there are some important differences. Below is the configuration file bitbucket-pipelines.yml:

pipelines:
  branches:
    master:
      - step:
          name: render
          image: conoria/alpine-pandoc
          trigger: automatic
          script:
            - apk update && apk add make curl
            - export TARGET=README.html
            - make -B ${TARGET}
            - curl -X POST --user "${BB_AUTH_STRING}"
                "https://api.bitbucket.org/2.0/repositories/${BITBUCKET_REPO_OWNER}/${BITBUCKET_REPO_SLUG}/downloads"
                --form files=@"${TARGET}"

Here the pipeline will be triggered automatically (trigger: automatic) when you commit to the master branch.

You can define a Docker image (image: conoria/alpine-pandoc) to provision at the level of the pipeline step.

Variables (${BB_AUTH_STRING}, ${BITBUCKET_REPO_OWNER} and ${BITBUCKET_REPO_SLUG}) can be defined and read from the Bitbucket settings page. This is useful for recording secrets that you don't want to have exposed in your source code.

Internal script variables are set via the script language, which here is Bash. Finally, in order for the build artefacts to be preserved after the pipeline completes, you can publish to a downloads location. This requires that a secure variable be configured, as described here. If you don’t, the pipeline workspace is purged on completion.

Having to configure repository settings externally and manually has some benefits. The consequence, though, is that some settings are not recorded alongside your project's source.

Pipeline build performance is very good: this entire step took only around 11 seconds to complete.

Features and Limitations

One limitation is that the free account allows only 50 build minutes per month and 1GB of storage.

Because the Docker image can be customised at the step level, your build and test steps can use different images. This is great if you want to trial your application on a production-like image.

GitHub

GitHub was recently acquired by Microsoft.

When you create a GitHub repository, there is an option to include Azure Pipelines. However, this is not integrated into GitHub directly, but is configured under Azure DevOps.

Broadly, the steps to set-up a pipeline are:

  • sign up to Azure pipelines
  • create a project
  • add GitHub repository to project
  • configure pipeline job

Builds are managed from the Azure DevOps dashboard. There appears to be no way to manually trigger a build directly from the GitHub repository. Though, if you commit, it will happily trigger a build for you. But, again, you need to be on the Azure DevOps dashboard to monitor the pipeline jobs.

Example Pipeline

The following YAML configuration uses an Ubuntu 16.04 image provided by Azure. There are a limited number of images, but they are well maintained, with packages kept up to date. They come with many pre-installed packages.

Below is the Azure pipeline configuration azure-pipelines.yml:

trigger:
  - master

pool:
  vmImage: 'Ubuntu-16.04'

steps:

  - script: |
      sudo apt-get install pandoc
    displayName: 'install_pandoc'

  - script: |
      make -B README.html
    displayName: 'render'

  - powershell: |
      gci env:* |
      sort-object name |
      Format-Table -AutoSize |
      Out-File $env:BUILD_ARTIFACTSTAGINGDIRECTORY/environment-variables.txt

  - task: PublishBuildArtifacts@1
    inputs:
      pathtoPublish: '$(System.DefaultWorkingDirectory)/README.html'
      artifactName: README

If the package you need is not installed, then you can install it if available from the Ubuntu package repositories. The default user profile is not root, so installation requires the use of sudo.

To create an archive of artefacts for download, you need to invoke a specific PublishBuildArtifacts task.

Azure is fast as it uses images that Microsoft manages and hosts. The above job to install pandoc and render this page as HTML takes only 1 minute.

Features and Limitations

The biggest negative to Azure Pipelines is its limited integration to the GitHub dashboard. Instead, you are strongly encouraged to manage pipelines using the Azure DevOps dashboard.

Update

Since the first draft of this article, GitHub announced the support of pipeline automation called GitHub Actions. Marlo is engaged in the beta program and we will have some new information to post here shortly.

Summary

In Marlo's DevOps practice we are constantly looking at ways to increase our productivity and effectiveness in solution delivery. Of the three Git pipelines looked at here, we found GitLab the easiest to adopt and use. Its YAML-based syntax is simple, but its functionality is broad. Our developers have quickly picked up and implemented pipeline concepts.

Git pipelines will not be suitable in every circumstance – for example Ansible infrastructure provisioning projects. However, there are clear advantages to using a hosted pipeline that ensures that your project builds somewhere other than on your machine. It also removes the cost of building and maintaining your own infrastructure. This could be of great benefit to projects where time constraints limit one's ability to prepare an environment.

The pipeline configuration augments your project's documentation for build, test and deployment. It is an independent, executable description of your project that explicitly lists its dependencies.

Since the first draft of this article was written, there has been increasing competition and continuous innovation amongst Git repository vendors.

So, yes: it is a great time to switch to a Git pipeline toolset!

Tech Lead Vishal Raizada recently conducted a very informative Tech Forum at the Marlo Office. He presented on Istio: Architecture, Application and Ease of Implementation.

Our tech forum presentation is downloadable here and showcases an example of Istio’s implementation, application and benefits.

Istio is now a key part of the Marlo Digital Enablement Platform – our open source, cloud-native platform which provides a complete on-demand environment for digital delivery.

The enterprise application landscape has changed a lot in the last decade: from managing on-premises servers to using infrastructure as a service; from monolithic applications to building microservices.

The new world offers many benefits but it also introduces new challenges. With the distributed nature of the application landscape, service discovery and general application composition becomes extremely complex. Controls, such as traffic management, security and observability, which could previously be managed in one place now become a scattered problem.

Enter Istio, a service mesh framework, which wraps around a cloud native architecture and adds a layer of abstraction to manage these complexities. It enables a truly automated delivery process, where a development team can purely focus on code, and Istio handles the rest, including service discovery, security, circuit breaking and much more. In addition, it is programmable, hence it can be incorporated as part of the DevOps & DevSecOps process with ease. A service mesh gives control back to the enterprise application world without taking away any of the benefits.
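As a small illustration of that programmability (the reviews service name is hypothetical, and field names vary slightly between Istio versions), a circuit breaker is declared in a DestinationRule rather than coded into the application:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100    # limit queued requests
    outlierDetection:
      consecutiveErrors: 5              # eject an instance after 5 consecutive errors
      interval: 30s
      baseEjectionTime: 60s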

Read Vish’s full presentation here.

Cutting Environment Costs In The Digital Age

If you’re a CIO, or an infrastructure manager, then you’ve probably got a mandate from the CFO or the CEO to cut costs. And you’re running a complex set of applications, across multiple environments – at least 3 (production, test and dev). Depending on how mature your infrastructure team is, you might already be running 5 or 6 environments, or even more.

But how many environments do you really need?

Multiple dev and test environments are needed to deal with different projects and agile teams delivering at different cadences, all wanting their own separate environments. You're probably operating in multiple data centres and have to worry about multiple cloud providers and SaaS vendors.

If money was no object, you’d be scaling to 20 or 30 environments because that’s what your delivery teams are telling you that they need. Costs aren’t going down in line with your cost-cutting mandate, they’re going up.

So, here’s a radical thought: the number of environments that you actually need to look after is… 1. (And if you’re good, it might be none).

What Do You Actually Want, Anyway?

You want to do the things you need to be able to do and do them well. So, if you’re working for a brewing company, that means you need to ensure your company is good at making, selling and delivering beer.

But as the CIO, you’re in charge of the apps that enable all that good stuff. You want software that works, running in production, on kit that doesn’t fall over, at a reasonable cost. That’s about it.

If you didn’t have to worry about managing multiple non-production environments across the data centre and the cloud, and all the cost and complexity that comes with them, then we bet that, frankly, you’d give it all up tomorrow.

Getting to One

To see why you only need that one environment, and why you can get rid of all the rest, let's think about three key technologies that have grown up over the last 10 years: Cloud, DevOps, and APIs and microservices.

Cloud

The grand promise of Cloud is that infrastructure is available on demand. You can have any number of servers, at any scale, whenever you want them. As much as you like. Somewhere in Sydney, Tokyo, Stockholm, London, São Paulo or Mumbai is a data centre the size of a football field, and it's yours for the taking. If you want a dozen 128-CPU boxes with over 3TB of RAM, several petabytes of storage and 25-gigabit networking, they're all yours (as long as your credit card is working!) You can have this, literally in minutes, any time of day or night.

DevOps

We can go one step further than that: DevOps says not only is infrastructure available on demand, but that it is code. You can automate the provisioning of infrastructure, and on top of that, automate the deployment of all your applications.

You can have software on demand, not just infrastructure. By extension you can construct an entire environment whenever you need it, wherever you need it – and again by extension, you can throw it away whenever you don’t need it.

APIs and Microservices

But that's not going quite far enough. The API Gateway means you can securely compartmentalise your environments – by insisting that every interaction between systems is mediated through an API gateway, you build a standard interface mechanism that is network-agnostic – so it matters less which network your APIs (and the (micro)services they provide façades for) live on. Coupled with the ability – in non-production environments at least – to mock and stub API services, this vastly reduces the need to be managing and running monolithic environments that contain all your services at once.

If your infrastructure is available on demand, and infrastructure is code, and environments are compartmentalised by API Gateways, then anyone can bring a dev or test environment – you don’t need to care where it is. It doesn’t need to be in your data centre, and it doesn’t really need to be in your VPC either.

Which Environments Do You Actually Need?

Production, maybe, and then only because you've still got legacy applications that you haven't yet hidden behind APIs. But give as much of that away as you can, as soon as you can, using the SaaS model as your template.

Wherever possible, you should outsource the problem of running dev environments to your vendors who do build and test. They should be doing it on their kit at their cost.

They’ll be super-efficient: there will be no dev environment running if they’re not actually doing dev right this minute, unless they enjoy the smell of burning (their own) money. There’s no point in you running dev environments any more. Platforms like Marlo’s Digital Enablement Platform [MDEP] provide for very rapid start environments where dev teams can be up and running, building business code, in a few hours, not days or weeks.

Furthermore, you should be making vendors run your testing environments for the applications that they’re delivering, and for the same reasons as dev. You still have to manage test data (and most organisations still have to solve for privacy, but they seem to manage that just fine when they implement Salesforce). And you’ll need to ensure that they make their environments available whenever end-to-end testing is going on.

What You’re Still Going To Have To Solve

  • Security provisioning and network access to any environments that you’re still running
  • Making sure that legacy applications have their own APIs (and API Gateways) in front of them, so they can be accessed safely by external developers
  • Vendor contracts that encourage the right behaviour when vendors run dev and test environments
  • Access to code (escrow arrangements)
  • Standards and guidelines for vendors delivering applications and services to you
  • Providing platforms like the Marlo Digital Enablement Platform [MDEP] to standardise and govern the way that your applications are built and deployed – mostly for non-functionals like security, monitoring, logging and auditing
  • Dependency management on a grand scale (but you already have this problem, and well-designed APIs help)

Conclusion

  • Make your vendors bring their own environments for digital delivery; embed requirements for how they should behave in contracts
  • Implement standards and guidelines for delivery – solve problems like containerisation, security, reliability, scalability, monitoring and logging etc in standard, cloud-native ways
  • Provide standardised platforms for hosting in production like MDEP, so that delivery can concentrate on business value
  • Engage with organisations like Marlo who truly understand the challenges of – and how to succeed in – today’s complex digital environments

Introduction

One of the first exercises given to me as a mathematics student was to write a random number generator (RNG) – which turned out not to be so easy. Test sequences cycled quickly, or were too predictable, or were not evenly distributed. Typically, when we talk about RNG’s, we are describing pseudorandom number generators. Nowadays, there are many programs that will generate pseudorandom numbers.

Where are random numbers used? When I was first a developer they were rarely required. Recently, however we’ve seen them appear in more and more places – it seems they are everywhere!

In DevOps, I’ve used RNG’s for creating message payloads of arbitrary size, and for file or directory names, among other things. These values are often created using scripts written in bash.

This article will explore three simple RNG’s that can be run from bash, and some ways to test just how random they actually are.

The three RNG's being evaluated here are:

  • the bash RANDOM shell variable
  • GNU Awk's rand() function
  • the /dev/urandom device

It’s not an exhaustive list; there are certainly others (such as jot). However, the three described here are likely to already be installed on your Linux box.

A word of caution: NONE of these tools are suitable for generating passwords or for cryptography.

Source and test data for this article can be found on GitHub.

In future articles I will take a closer look at random values used in testing code and how they can be used for model simulation in statistics.

Testing Numeric Sequences for Randomness

I am going to evaluate the apparent randomness (or otherwise) of a short list of 1000 values generated by each RNG. To ease comparison the values will be scaled to the range 0 ≤ x < 1. They are rounded to two decimal places and the list of values is formatted as 0.nn.

There are many tests that can be applied to sequences to check for randomness. In this article I will use the Bartels Rank Test. Its limitations (and those of other tests) are described in this paper.

I’ve chosen this test as it is relatively easy to understand and interpret. Rather than comparing the magnitude of each observation with its preceding sample, the Bartels Rank Test ranks all the samples from the smallest to the largest. The rank is the corresponding sequential number in the list of possibilities. Under the null hypothesis of randomness, all rank arrangements from all possibilities should be equally likely to occur. The Bartels Rank Test is also suitable for small samples.

To get a feel for the test, consider two of the small data sets provided by the R package randtests.

Example 1: Annual data on total number of tourists to the United States, 1970-1982

Example 5.1 in Gibbons and Chakraborti (2003), p.98

years <- seq(ymd('1970-01-01'), ymd('1982-01-01'), by = 'years')

tourists <- c(12362, 12739, 13057, 13955, 14123, 15698, 17523, 18610, 19842, 20310, 22500, 23080, 21916)

qplot(years, tourists, colour = I('purple')) + ggtitle('Tourists to United States (1970-1982)') + theme(plot.title = element_text(hjust = 0.5)) + scale_x_date()
Graph of US tourist numbers 1970-1982
bartels.rank.test(tourists, alternative = "left.sided", pvalue = "beta")

## 
## Bartels Ratio Test
## 
## data:  tourists
## statistic = -3.6453, n = 13, p-value = 1.21e-08
## alternative hypothesis: trend

What this low p-value tells us about the sample data is there is strong evidence against the null hypothesis of randomness. Instead, it favours the alternative hypothesis: that of a trend.

(For a simple guide on how to interpret the p-value, see this)

Example 2: Changes in Australian Stock Levels, 1968-78

Changes in stock levels for 1968-1969 to 1977-1978 (in AUD million), deflated by the Australian gross domestic product (GDP) price index (base 1966-1967) – example from Bartels (1982)

gdp <- c(528, 348, 264, -20, 167, 575, 410, 4, 430, 122)

df <- data.frame(period = paste(sep = '-', 1968:1977, 1969:1978), gdp = gdp, stringsAsFactors = TRUE)

ggplot(data = df, aes(period, gdp)) +
   geom_point(colour = I('purple')) +
   ggtitle('Australian Deflated Stock Levels') +
   theme(plot.title = element_text(hjust = 0.5)) +
   xlab('Financial Year') +
   theme(axis.text.x = element_text(angle = 90)) +
   ylab('GDP (AUD million)')
bartels.rank.test(gdp, pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  gdp
## statistic = 0.083357, n = 10, p-value = 0.9379
## alternative hypothesis: nonrandomness

Here, the sample data provides weak evidence against the null hypothesis of randomness (which does not fully support the alternative hypothesis of non-random data).

Random Number Generators in Bash scripts

bash RANDOM variable

How to use RANDOM

Bash provides the shell variable RANDOM. On interrogation it will return a pseudorandom signed 16-bit integer between 0 and 32767.

RANDOM is easy to use in bash:

RANDOM=314
echo $RANDOM
## 1750

Here, RANDOM is seeded with a value (314). Seeding a random variable with the same seed will return the same sequence of numbers thereafter. This is a common feature of RNG’s and is required for results to be reproducible.

To generate a random integer between START and END, where START and END are non-negative integers, and START < END, use this code:

RANGE=$(( END - START + 1))
echo $(( (RANDOM % RANGE) + START ))

For example, to simulate 10 rolls of a 6 sided dice:

START=1
END=6
RANGE=$(( END - START + 1 ))
RANDOM=314
for i in $(seq 10); do
   echo -n $(( (RANDOM % RANGE) + START )) " "
done
## 5  4  6  3  2  1  1  2  2  4

Checking Sequences from RANDOM

The following code will generate a random sample using bash RANDOM, scaled to the range 0 to 1. The generated sample will then be tested using the Bartels Rank test, where the null hypothesis is that the sequence is random.

Prepare the test data:

RANDOM=314
# bc interferes with RANDOM
temprandom=$(mktemp temp.random.XXXXXX)
for i in $(seq 1000); do
   echo "scale=2;$RANDOM/32768"
done > $temprandom

cat $temprandom | bc | awk '{printf "%0.2f\n", $0}' > bash.random

rm $temprandom
bashText <- readLines(con <- file(paste0(getwd(), '/bash.random')))
close(con)
bashRandom <- as.numeric(bashText)

Show first 10 values:

head(bashRandom, n = 10)
##  [1] 0.05 0.59 0.02 0.84 0.17 0.69 0.41 0.51 0.94 0.55

Plot the sequence versus value from RNG:

bashDF <- data.frame(sequence = seq(length(bashRandom)), RANDOM = bashRandom)
ggplot(bashDF, aes(x = sequence, y = RANDOM)) +
   geom_point(size = 0.5, colour = I('purple')) +
   ggtitle('bash RANDOM') +
   theme(plot.title = element_text(hjust = 0.5))
Random numbers graph

Run Bartels Rank test:

bartels.rank.test(bashRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  bashRandom
## statistic = -1.116, n = 1000, p-value = 0.2646
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

awk rand()

The following code will generate a random sample using the rand function from GNU Awk, which generates random numbers between 0 and 1. The generated sample will then be tested using the Bartels Rank test, where the null hypothesis is that the sequence is random.

Example (as before, rand() is seeded):

echo | awk -e 'BEGIN {srand(314)} {print rand();}'
## 0.669965

If you don't specify a seed, srand() uses the current time of day, so repeated runs will not return the same results.

You can also generate random integers in a range. For example, to simulate 10 rolls of a 6 sided dice:

echo | awk 'BEGIN {srand(314)} {for (i=0; i<10; i++) printf("%d ", int(rand() * 6 + 1));}'
## 5 2 2 5 3 6 1 6 5 5

Checking Sequences from awk rand()

Prepare the test data:

seq 1000 | awk -e 'BEGIN {srand(314)} {printf("%0.2f\n",rand());}' > awk.random

awkText <- readLines(con <- file(paste0(getwd(), '/awk.random')))

close(con)

awkRandom <- as.numeric(awkText)

Show first 10 values:

head(awkRandom, n = 10)
##  [1] 0.67 0.33 0.22 0.78 0.40 0.84 0.11 0.94 0.70 0.68

Plot the sequence vs value from RNG:

awkDF <- data.frame(sequence = seq(length(awkRandom)), rand = awkRandom)

ggplot(awkDF, aes(x = sequence, y = rand)) +
   geom_point(size = 0.5, colour = I('purple')) +
   ggtitle('awk rand()') +
   theme(plot.title = element_text(hjust = 0.5))

awk rand() output graph

Run Bartels Rank test:

bartels.rank.test(awkRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  awkRandom
## statistic = 0.29451, n = 1000, p-value = 0.7685
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

urandom device

The final tool is the /dev/urandom device. The device provides an interface to the Linux kernel's random number generator. This is a useful tool as it can generate a wide variety of data types.

For example, to print a list of unsigned decimal integers using od(1):

seq 5 | xargs -I -- od -vAn -N4 -tu4 /dev/urandom
##  2431351570
##  2713494048
##  2149248736
##  2371965899
##  4265714405

It can also be used to source random hexadecimal values:

seq 5 | xargs -I -- od -vAn -N4 -tx4 /dev/urandom
##  5dd6daf1
##  819cf41e
##  c1c9fddf
##  5ecf12b0
##  cae33012

Or it can generate blocks of 64 random alphanumeric characters:

cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 64 | head -n 5
## Rui7CImJglwzzoB4XNKljLPe0WPX7WZLDy1PGl6nxkwcttTNrPDwly3d5tXTQfpp
## VhdEVxX6Il8DOTY4rKISrW5fyqF7AeZmbTwKxwd8Ae2SCEwINiRkeQVtzeXY2N8j
## pZFVpBHEEZawu6OQ52uNzRVaI5qqbDeoXKbR0R8jeGKTRcqPlKEpSFv5UaFJgo5w
## SGN6dh03OJMF8cVDPrOTcDhL1esDpK2FJl2qnIW67A9hKedIPukVHJp0ySBlvTWY
## zLH9thMYXYK6qi6IRF8vw5iXaiWut1ZT9BazJfzyCGYTxPMxmOCHIqUt2yyTrAQD

Checking Sequences from urandom

To create a test sample that is scaled to 1, I will sample two digits from urandom and use these as decimal values. The generated sample will then be tested using the Bartels Rank test, where the null hypothesis is that the sequence is random.

Prepare the test data:

cat /dev/urandom | tr -dc '0-9' | fold -w2 | \
    awk '{printf("0.%02d\n",$1)}' | head -1000 > urandom.random
urText <- readLines(con <- file(paste0(getwd(), '/urandom.random')))

close(con)

urRandom <- as.numeric(urText)

Show first 10 values:

head(urRandom, n = 10)
##  [1] 0.78 0.62 0.03 0.03 0.66 0.36 0.75 0.34 0.91 0.81

Plot the sequence vs value from RNG:

urandomDF <- data.frame(sequence = seq(length(urRandom)), urandom = urRandom)
ggplot(urandomDF, aes(x = sequence, y = urandom)) +
   geom_point(size = 0.5, colour = I('purple')) +
   ggtitle('/dev/urandom') +
   theme(plot.title = element_text(hjust = 0.5))
/dev/urandom output graph

Run the Bartels Rank test:

bartels.rank.test(urRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  urRandom
## statistic = -0.668, n = 1000, p-value = 0.5044
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

Some final thoughts

Of the tools explored, urandom is the most versatile, so it has the broadest application. The downside is that its results are not easily reproducible, and issues were identified for Linux kernel version 2.6.10 in a 2006 study by Gutterman, Pinkas and Reinman.

Personally, this has been a useful learning exercise. For one, it showed the limitations of generating and testing for (pseudo)random sequences. Indeed, Aaron Roth has suggested:

As others have mentioned, a fixed sequence is a deterministic object, but you can still meaningfully talk about how "random" it is using Kolmogorov complexity. Intuitively, a Kolmogorov random object is one that cannot be compressed. Sequences that are drawn from truly random sources are Kolmogorov random with extremely high probability.

Unfortunately, it is not possible to compute the Kolmogorov complexity of sequences in general (it is an undecidable property of strings). However, you can still estimate it simply by trying to compress the sequence. Run it through a Zip compression engine, or anything else. If the algorithm succeeds in achieving significant compression, then this certifies that the sequence is -not- Kolmogorov random, and hence very likely was not drawn from a random source. If the compression fails, of course, it doesn’t prove that the sequence has high Kolmogorov complexity (since you are just using a heuristic algorithm, not the optimal (undecidable) compression). But at least you can certify the answer in one direction.

In light of this knowledge, let's run the compression tests for the sequences above:

ls -l *.random
## -rw-r--r-- 1 frank frank 5000 2019-02-12 09:51 awk.random
## -rw-r--r-- 1 frank frank 5000 2019-02-12 09:51 bash.random
## -rw-r--r-- 1 frank frank 5000 2019-02-12 09:51 urandom.random

Compress using zip:

for z in *.random; do zip ${z%%.random} $z; done
##   adding: awk.random (deflated 71%)
##   adding: bash.random (deflated 72%)
##   adding: urandom.random (deflated 72%)

Compare this to non-random (trend) data:

for i in $(seq 1000); do printf "0.%02d\n" $(( i % 100 )) ; done > test.trend
zip trend test.trend
##   adding: test.trend (deflated 96%)

Or just constant data:

for i in $(seq 1000); do echo 0.00 ; done > test.constant
zip constant test.constant
##   adding: test.constant (deflated 99%)

So zipping is a good, if rough, proxy for a measure of randomness.

Stay tuned for part two to discover how random data can be used in testing code.