The goal of this project was to develop a platform for simulating the performance of train control software. Because the software ran on the QNX RTOS, container-based solutions were not an option. Instead, the platform used an on-premises compute cluster and a pool of QNX worker VMs.
The platform was constructed from a set of shared services:
- nginx for load balancing and reverse proxying
- NFS shares for file sharing across services and machines
- MongoDB for providing database records across services and machines
The platform also included a number of custom software modules, most of which were written in Node.js:
- canary-frontend: provides the CRUD API to request simulation activities
- canary-manager: provides the allocation of simulation tasks to the VM worker pool and the endpoints for the VM worker pool to return simulation results to the platform; also maintains the state of the VMs in the pool and applies software updates as needed
- canary-testrunner: evaluates simulation results in three ways:
  - Finds train handling rule violations by funneling the timeseries in the simulation results through custom Chai.JS modules
  - Calculates aggregate results (such as fuel consumption and trip time) from results
  - Calculates a 'fingerprint' (hash) of the results so that they can be trivially compared for exact matches with other results, supporting the software engineering use case of ensuring zero regressions
- canary-jenkins: provides the endpoint for Jenkins to upload software builds for simulation and verification testing as well as to provide the testing results to Jenkins's webhooks
- canary-profile: provides authentication services to the API by interfacing with an Atlassian Crowd instance
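The three evaluations performed by canary-testrunner can be illustrated with a minimal Python sketch. The original module was written in Node.js with Chai-based checks, so this is not the project's code. The sample format, the field names, the overspeed rule, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical result format: one sample per simulation time step.
samples = [
    {"t": 0.0, "speed_mph": 38.0, "limit_mph": 40.0, "fuel_gal": 0.0},
    {"t": 1.0, "speed_mph": 41.5, "limit_mph": 40.0, "fuel_gal": 0.2},
    {"t": 2.0, "speed_mph": 39.0, "limit_mph": 40.0, "fuel_gal": 0.4},
]


def find_overspeed(samples, tolerance_mph=0.0):
    """Return the timestamps at which speed exceeded the limit plus tolerance."""
    return [
        s["t"] for s in samples
        if s["speed_mph"] > s["limit_mph"] + tolerance_mph
    ]


def aggregate(samples):
    """Compute trip-level totals from the first and last samples."""
    return {
        "trip_time_s": samples[-1]["t"] - samples[0]["t"],
        "fuel_gal": samples[-1]["fuel_gal"] - samples[0]["fuel_gal"],
    }


def fingerprint(samples):
    """Hash a canonical serialization so identical results give identical digests."""
    canonical = json.dumps(samples, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs that produce exactly the same timeseries yield the same digest, so an exact-match regression check reduces to a string comparison. Any change in any sample, however small, changes the digest.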
There was also a single custom Python module:
- canary-optimize: links simulation results to a NumPy-based optimizer for automatic tuning of control system parameters
The system architecture of these components is shown below:
The system exposed a single API to both the main web-based frontend and other clients, such as MATLAB and Python scripts. The web frontend was implemented as a Single Page Application (SPA) using AngularJS.
The next two images show the SPA providing monitoring of the VM pool and details about a single VM.
The following images show creation of a single simulation scenario. A scenario consists of the train makeup, the route that it should travel, the speed limits in place on the route, initial conditions of the train (such as location along the route, speed, throttle notches), and a condition for ending the simulation (such as a location along the route).
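As a rough illustration of the scenario structure described above, the fields could be modeled as follows. This is a sketch only: the class names, field names, units, and validation rules are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeedLimit:
    start_mile: float
    end_mile: float
    limit_mph: float


@dataclass
class InitialConditions:
    location_mile: float
    speed_mph: float
    throttle_notch: int


@dataclass
class Scenario:
    name: str
    train_makeup: List[str]        # ordered vehicle identifiers (hypothetical)
    route_id: str
    speed_limits: List[SpeedLimit]
    initial: InitialConditions
    end_location_mile: float       # end condition: stop at this route location

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems = []
        if not self.train_makeup:
            problems.append("train makeup is empty")
        # Assumes mileposts increase in the direction of travel.
        if self.end_location_mile <= self.initial.location_mile:
            problems.append("end location must be beyond the initial location")
        for sl in self.speed_limits:
            if sl.end_mile <= sl.start_mile:
                problems.append(
                    f"speed limit segment {sl.start_mile}-{sl.end_mile} is empty"
                )
        return problems
```

Validating a scenario before it is queued would catch malformed definitions without spending worker VM time on them.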
Once simulation scenarios are defined, simulations can be executed against those scenarios in one of three ways.
First, a single build of the train control software can be executed against a collection of scenarios and the results aggregated. This provides a means of comparing one software build against another in aggregate. For example, if a software change was not expected to affect the simulation results, the aggregated results can be used to demonstrate that there are no differences. Similarly, if changes were made to the train control software with the expectation that they would have a particular effect, the aggregated results can be used to determine whether the change caused any deleterious side effects. If so, engineering judgement is needed to decide whether the improvements are worth the negative side effects.
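The build-to-build comparison can be sketched as follows. The metric names, the summing of per-scenario metrics, and the relative-tolerance check are assumptions chosen for illustration; the platform's actual aggregation logic is not shown in this write-up.

```python
def aggregate_build(results):
    """Sum per-scenario metrics into one record for the whole build."""
    totals = {}
    for metrics in results.values():
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
    return totals


def compare_builds(baseline, candidate, rel_tol=0.0):
    """Return {metric: (baseline, candidate)} for metrics differing beyond rel_tol."""
    diffs = {}
    for name in sorted(set(baseline) | set(candidate)):
        b = baseline.get(name)
        c = candidate.get(name)
        if b is None or c is None:
            diffs[name] = (b, c)          # metric missing from one build
        elif abs(c - b) > rel_tol * abs(b):
            diffs[name] = (b, c)
    return diffs


# Hypothetical per-scenario results for two builds.
build_a = {
    "scenario-1": {"fuel_gal": 100.0, "trip_time_s": 3600.0},
    "scenario-2": {"fuel_gal": 50.0, "trip_time_s": 1800.0},
}
build_b = {
    "scenario-1": {"fuel_gal": 100.0, "trip_time_s": 3600.0},
    "scenario-2": {"fuel_gal": 48.0, "trip_time_s": 1800.0},
}

diffs = compare_builds(aggregate_build(build_a), aggregate_build(build_b))
```

With `rel_tol=0.0` the comparison demands exact equality, which matches the zero-regression use case. A nonzero tolerance suits the case where a change is expected to move results and only large side effects are of interest.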
This modality can be used along with Jenkins to provide automated regression testing as part of the CI/CD pipeline.
The second modality that the system supports for simulating control system performance is to repeatedly simulate a single scenario while varying certain simulated train parameters. When the number of iterations is very high, this modality is known as a Monte Carlo simulation and provides an expected range of outcomes from the control software when there exists model uncertainty. This range of outcomes can be depicted as a one-sigma (or other probability band) performance envelope. Such a depiction is commonly used in the aerospace industry and is very useful for setting expectations and communicating likelihoods with customers.
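A minimal sketch of the idea, reduced to a single scalar outcome: the `simulate_trip_time` function below is a toy stand-in for a full simulation run, and the choice of train mass as the uncertain parameter, its distribution, and every constant are assumptions. The real platform produced full timeseries rather than one number per run.

```python
import random
import statistics


def simulate_trip_time(train_mass_tons, route_miles=100.0):
    """Toy stand-in for a simulation run: heavier trains average lower speed."""
    average_speed_mph = 50.0 - 0.002 * (train_mass_tons - 5000.0)
    return route_miles / average_speed_mph  # trip time in hours


def monte_carlo(n_runs, mass_mean=5000.0, mass_sigma=250.0, seed=0):
    """Sample the uncertain parameter, run the model, return a one-sigma envelope."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    outcomes = [
        simulate_trip_time(rng.gauss(mass_mean, mass_sigma))
        for _ in range(n_runs)
    ]
    mean = statistics.fmean(outcomes)
    sigma = statistics.stdev(outcomes)
    return {"mean": mean, "lower": mean - sigma, "upper": mean + sigma}


envelope = monte_carlo(1000)
```

Applying the same calculation at each time step of the simulated timeseries, instead of to one scalar, would give the band-shaped envelope described above.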
The third modality provided by the system is to iteratively simulate a fixed set of scenarios for a single software build while modifying parameters of the train control software to optimize the aggregated outcome for a user-defined objective. For example, when circumstances change such that the trains need to travel as close as possible to the speed limit to maximize network velocity, the objective function can be defined to reward high train speed without exceeding the speed limit. Conversely, when the control software should save fuel while maintaining a minimum network speed, the objective function can be tuned to achieve this.
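The effect of changing the objective can be shown with a small sketch. Everything here is hypothetical: the `simulate` function is a toy model rather than the train simulator, the single tuned parameter and the weights are invented, and a plain grid search stands in for the NumPy-based optimizer used by canary-optimize.

```python
def simulate(target_speed_fraction, limit_mph=60.0, route_miles=120.0):
    """Toy stand-in for the simulator, driven by one control parameter."""
    speed_mph = target_speed_fraction * limit_mph
    trip_time_h = route_miles / speed_mph
    fuel_gal = 0.05 * speed_mph ** 2 * trip_time_h  # burn rate grows with speed squared
    return {
        "speed_mph": speed_mph,
        "limit_mph": limit_mph,
        "trip_time_h": trip_time_h,
        "fuel_gal": fuel_gal,
    }


def objective(outcome, fuel_weight, time_weight, overspeed_penalty=1e6):
    """Weighted cost; exceeding the speed limit is penalized heavily."""
    cost = fuel_weight * outcome["fuel_gal"] + time_weight * outcome["trip_time_h"]
    if outcome["speed_mph"] > outcome["limit_mph"]:
        cost += overspeed_penalty
    return cost


def grid_search(fuel_weight, time_weight, candidates):
    """Return the candidate parameter value with the lowest cost."""
    return min(
        candidates,
        key=lambda p: objective(simulate(p), fuel_weight, time_weight),
    )


# Candidate parameter values from 0.50 to 1.10 in steps of 0.05.
candidates = [round(0.5 + 0.05 * i, 2) for i in range(13)]

fastest = grid_search(fuel_weight=0.0, time_weight=1.0, candidates=candidates)
balanced = grid_search(fuel_weight=1.0, time_weight=100.0, candidates=candidates)
```

Weighting only trip time pushes the parameter up to the speed limit, while adding a fuel weight pulls the optimum to a lower speed. In this sketch the lowest candidate value plays the role of the minimum network speed.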
Unfortunately, the system was very difficult to troubleshoot and maintain, mostly because it relied on so much custom code. In retrospect, the system ought to have used Apache Kafka for allocating tasks to the VM pool. Similarly, the system ought to have used Ansible to manage the VM worker pool and the Elastic Stack for monitoring the system and providing notification of system malfunctions.
While these enhancements were planned, they were never put into practice. The enhancements would have greatly improved the system stability and maintainability.