When developing software, we typically apply some form of a Develop… Test… Build… Deploy cycle: a structured progression of work steps from developing code, to testing that code, to building a release version of that code, which is then deployed in a production environment. Although these steps are typically executed in succession, steps can be executed multiple times before a cycle completes. For instance, if we find errors in the behavior of our code while testing it, we go back to the Develop step to make fixes, after which we test again. Sometimes results found at one step may kick us back several steps.
Testing code can happen on several scales and several levels of integration. So-called ‘unit tests’ test smaller units of code to see if those units behave properly under a variety of circumstances. ‘Integration tests’ on the other hand test if the larger code complex in which the tested units are integrated works correctly. Of course, this separation of units and complexes is somewhat arbitrary and can be hierarchically layered in good system-theoretical fashion. Whereas several units of code can be integrated into a larger complex, that complex itself can be regarded as a unit in a yet larger complex, in the same way that an element of a system can be a (sub)system itself if we choose to further decompose it.
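To make the distinction concrete, here is a minimal sketch in Python. The functions `parse_grade_range`, `describe_grades`, and `label_for` are hypothetical illustrations and not part of the TE code base. The first two are small units, each with its own unit tests; the third integrates them and is covered by an integration test.

```python
def parse_grade_range(text):
    """Unit 1 (hypothetical): turn a string such as '3-5' into [3, 4, 5]."""
    low, high = (int(part) for part in text.split("-"))
    if low > high:
        raise ValueError("lower grade must not exceed upper grade")
    return list(range(low, high + 1))


def describe_grades(grades):
    """Unit 2 (hypothetical): turn a list of grades into a readable label."""
    if len(grades) == 1:
        return f"Grade {grades[0]}"
    return f"Grades {grades[0]} to {grades[-1]}"


def label_for(text):
    """The 'complex': integrates both units into one behavior."""
    return describe_grades(parse_grade_range(text))


# Unit tests: each exercises one unit in isolation.
def test_parse_normal_range():
    assert parse_grade_range("3-5") == [3, 4, 5]


def test_parse_rejects_reversed_range():
    try:
        parse_grade_range("5-3")
    except ValueError:
        return
    raise AssertionError("expected a ValueError for a reversed range")


def test_describe_single_grade():
    assert describe_grades([7]) == "Grade 7"


# Integration test: exercises the units working together.
def test_label_for_range():
    assert label_for("3-5") == "Grades 3 to 5"
```

Note that `label_for` could itself be treated as a unit of a yet larger complex, which is the layering described above.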
In order to test the complex, which is composed of units, we must first Build it. With that we mean that we must indeed integrate the units into a consistent and (hopefully) working whole. In software engineering the term ‘build’ is reserved for this integration of code into such a working whole. This integration step works differently for different types of programming languages, which does not have to concern us for this discussion. For now, it suffices to think of the build step as the integration of all the units of code, all functions and methods that the system programmers have written, into a cohesive and hopefully working whole.
Regardless of how much we test, however, the likelihood that we have tested all possible code execution paths is small; hence, the likelihood that some untested code path will at some point be executed and cause a problem is considerable. Such problems (better known as ‘bugs’) may necessitate code fixes and hence a return to the Develop step. However, once code has been put in production (also known as being ‘deployed’ or ‘released’ and sometimes as ‘shipped’), that code can typically not be taken out of production in order to be repaired. In such cases a parallel develop-test-deploy cycle must be executed and a whole or partial code update must be released.
As we will see in the remainder of this chapter, both TE 1.0 and TE 2.0 applied a structured and carefully followed develop-test-build-deploy process, but TE 2.0’s implementation of this process was brought in line with the newer, more modern ways of doing it.
But First: Version Control
Both TE 1.0 and 2.0 rely on a central repository of code shared and agreed upon by all developers. This repository is kept and maintained in a so-called code-repository or source-control or version-control system. These systems ─examples are Git, Subversion, CVS, TFS, etc.─ manage all changes made to code, allow multiple developers to jointly work on code without overwriting each other’s work, and support so-called ‘branching;’ i.e., the forking of a complex of code into a new complex of code.
In practice it is very difficult to develop and maintain a body of code without using one of these version-control systems. Using them, developers can revert to older versions of code, track which changes were made by whom and when, and compare different versions of the same code base, line by line and character by character.
These systems also provide protection against multiple developers working on the same code base overwriting each other’s work. How can that happen? Easy! Suppose that a certain file contains the code for several methods (functions) and that one developer must work on the code for one of those methods and another developer must work on another method from that same file. It would be quite inefficient if one of these developers would have to wait for the other developer to be done with the file before being able to make code changes. Yet if both developers each work on a copy of the file, there is a very real danger that one merges the modifications back into the code base at time t whereas the other merges his or her code at t+x, thereby overwriting the work of the first developer. Version-control systems manage this process by keeping track of who checks in what code at which times. When the system sees a potential cross-developer code override, it flags this as a so-called ‘conflict’ and gives the developer triggering the conflict several options to resolve the conflict.
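The scenario above can be sketched as a so-called three-way merge: two edited copies are compared against their common ancestor. The sketch below is a deliberate simplification. It treats a file as a mapping from method name to source text, whereas real systems such as Git compare text line by line. The function name `three_way_merge` and the sample data are illustrative assumptions.

```python
def three_way_merge(base, mine, theirs):
    """Merge two edited copies of a file against their common ancestor.

    Each argument maps a method name to its source text. Returns a tuple
    (merged, conflicts), where conflicts lists the methods that both sides
    changed in different ways.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(name), mine.get(name), theirs.get(name)
        if m == t:        # both sides agree (or neither changed anything)
            result = m
        elif m == b:      # only 'theirs' changed this method
            result = t
        elif t == b:      # only 'mine' changed this method
            result = m
        else:             # both changed it differently: flag a conflict
            conflicts.append(name)
            result = b    # keep the ancestor version until resolved
        if result is not None:
            merged[name] = result
    return merged, conflicts


base = {"login": "v1", "search": "v1"}
alice = {"login": "v2", "search": "v1"}   # first developer edits login
bob = {"login": "v1", "search": "v2"}     # second developer edits search
carol = {"login": "v3", "search": "v1"}   # third developer also edits login

clean_merge = three_way_merge(base, alice, bob)
conflicting_merge = three_way_merge(base, alice, carol)
```

In the first merge the two developers touched different methods, so both changes survive and no conflict is reported. In the second, both changed `login`, so the method is flagged rather than silently overwritten.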
Version-control systems also help out a lot with code integration; that ‘build’ step we mentioned earlier. Suppose, for instance, that a developer writes a new segment of code and that after carefully checking and testing it (s)he checks the code into the version-control system. Between the time the developer started working on this code and the time (s)he checks it in, other developers made changes to existing code and added code of their own. Hence, it is possible that the new code checked in last ‘breaks the build;’ i.e., that it is not functionally compatible with the rest of the code. Version-control systems provide at least one means of avoiding and one means of mitigating this problem. Before checking in, developers can update their local copy with the changes others have made in the meantime and test their additions against that updated code to see if they cause any problems (avoidance). On the integration/build side of the process we can revert to the last version of the code that built correctly and report to the developer who broke the build that (s)he must modify the new code so that it no longer breaks the build (mitigation).
TE 1.0 used the CVS version-control system early on, but migrated to Subversion a few years later. TE 2.0 uses Git.
TE 1.0 Develop-Test-Deploy
The TE 1.0 develop-test-build-deploy process was effective and simple, although perhaps not maximally flexible. The process consisted of three steps:
- Step 1. Sandbox or development site coding and testing. New code, code adjustments and code extensions are developed on an internal system. For TE 1.0 this was a web site which, although visible and accessible to the world, was anonymous in that no links on the web pointed to it. Such an internal system or site is typically called a sandbox (developers are free to ‘play’ in it). In TE 1.0 we called our sandbox our ‘new’ site.
Depending on how a software development group sets up its sandboxing, individual developers can have their own, individual sandbox, or, as in the TE 1.0 case, they can share a common sandbox. Obviously, individual sandboxes provide more opportunities to work on code without impacting other developers. The TE 1.0 team, however, was small enough that a single common sandbox, in combination with a version-control system, worked just fine.
Code developed in the sandbox would typically be reviewed by TE 1.0 project members for functional adequacy and robustness. Once approved, the code would be moved to the next step, namely the ‘test’ site.
- Step 2. Integration testing. Beside the (shared) sandbox, TE 1.0 maintained a release test site. This site (software, database content and document repository) was synchronized with the production/release site, but was used for testing all new and modified code against the complete system. Hence, once sandbox code was approved for release, it was deployed on the test site for integration testing. The time it would take to conduct this integration testing varied from just a few minutes for a simple user interface change, e.g., a color change or a typo fix, to a day or longer for testing new or updated periodic back-end processes.
Only once the software was verified to operate correctly on the ‘test’ site, would it be released in production. In case errors were found, the process returned to the sandbox stage.
- Step 3. Production/Deployment. Releasing sandbox-approved and test site-verified code was quite easy because it consisted of deploying the test site-verified code from the updated version-control system to the production site.
All is Flux, Nothing Stays… TE 1.0 Continuous System Monitoring
One good way to experience Heraclitus’s famous “All is Flux, Nothing Stays” or to experience the universal phenomenon of system entropy, is to release code into production and sit back and wait for it to stop functioning. Although the stepped process of Sandbox → Test site → Production site is relatively safe in that it limits the risk of releasing faulty or dysfunctional code, it is always possible, and indeed likely, that at a later stage (sometimes months later) a problem emerges. This can happen for a variety of reasons. Perhaps a developer relied on a specific file system layout which later on became invalid. Or perhaps code relies on pulling data from an external service which for some reason or other suddenly stops or seems to stop working (Did we not pay our annual license fee? Did we run out of our free allocation of search queries? Is the service still running? Did the service change its API without us making the necessary adjustments?).
Experienced software developers have great appreciation and awareness of the principles of permanent flux and system entropy and hence, will make sure that they build and deploy facilities which continuously monitor the functioning of their systems. Of course, these monitoring facilities need some monitoring themselves as well. Although at least in theory this leads to an infinite regress, monitoring the monitoring processes can mostly be accomplished through simple and often manual procedures which can be integrated into a team member’s job responsibilities. The table below contains a list of TE 1.0’s (automated) system monitoring processes.
|Systems up test||Once per minute||A simple test to see if our Website and database are up and running.|
|Regression tests||Every 12 hours||Tests for most new features and all bug fixes are run in sequence.|
|Link diagnostics||Every 12 hours||Test all Web links on the TE pages and report failing links (the link, the source of the link, the contributor of the source, the error code associated with the failed link)|
|HTML diagnostics||Once a month||Run an HTML checker on a random sample of Web (static and dynamic) web pages.|
|Metadata harvesting checker||Once a month||A process which queries sites which harvest our content, making sure that the sites continue to harvest our content.|

The following is the summary of the last TE 1.0 regression test run on April 28, 2016:
Start Time: Thu Apr 28 04:00:04 2016
Total Run Time: 357.458 seconds
Test Cases Run: 139
Test Cases Passed: 137
Test Cases Failed: 2
Verifications Passed: 268
Verifications Failed: 2
Average Response Time: 2.544 seconds
Max Response Time: 85.099 seconds
Min Response Time: 0.002 seconds
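As an illustration of what a link diagnostics process might look like, here is a minimal Python sketch. It is not the TE 1.0 implementation. It collects the links on a page with the standard library's `html.parser` and reports each failing link together with its source page and error code. The function name `failing_links` and the caller-supplied `fetch_status` function are assumptions made for this example; looking up the contributor of the source page is left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def failing_links(page_url, html, fetch_status):
    """Return (link, source page, status code) for every failing link.

    fetch_status is a caller-supplied function that maps a URL to an HTTP
    status code. Passing it in keeps this sketch independent of any
    particular HTTP library and lets it run without network access.
    """
    collector = LinkCollector()
    collector.feed(html)
    report = []
    for link in collector.links:
        status = fetch_status(link)
        if status >= 400:  # 4xx and 5xx codes indicate a failed request
            report.append((link, page_url, status))
    return report


# Example with canned status codes instead of real HTTP requests.
sample_html = (
    '<p><a href="https://example.org/ok">fine</a> '
    '<a href="https://example.org/gone">broken</a></p>'
)
canned = {"https://example.org/ok": 200, "https://example.org/gone": 404}
report = failing_links("https://example.org/page", sample_html, canned.get)
```

A production version would replace `canned.get` with a function that issues real requests and would run on a schedule, such as the 12-hour interval listed in the table.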
TE 2.0 Develop-Test-Deploy
The development, testing, and deployment process in TE 2.0 is similar to that of TE 1.0. The differences are in the details.
- Step 4. Staging. Once a set of code changes is ready to be released, it is merged from the QA branch onto a branch called Release. This triggers another build process that results in the code being deployed to a staging environment which is an exact duplicate of the production environment and uses the production database. Additionally, this build process ends with a series of ‘Smoke Tests’ which perform automated browser-based testing using Selenium, a tool for web browser scripting. These tests exercise key functionality of the site to ensure that nothing is broken; i.e., on fire.
- Step 5. Release. Once the staging site is verified to be working correctly, it is swapped with the production site. That is, all production traffic is redirected to the staging site, which becomes the new production site. In the event a problem is encountered after release that necessitates a roll back, it is easy to redirect traffic back to the prior version of the site.
In TE 2.0, each of these steps is entirely automated and can be initiated by executing one or two command line statements. Automation is key to having quick, repeatable, and error-free releases. This automation allows updates to TE 2.0 to be released frequently, as often as once a day or more. Releasing software updates more frequently results in smaller, less-risky updates. Frequently integrating and releasing code is known as ‘Continuous Integration’ and ‘Continuous Deployment.’ Prior to the widespread adoption of these two practices, integration and releases would happen much less frequently, often as infrequently as once a quarter. This resulted in increased risk and longer feedback cycles.
Behind the Magic Curtain
There is a lot going on for each of the develop-test-deploy steps described in the previous section. Code is retrieved from source control, compiled, tested, and deployed. The process is highly automated, and thus can seem somewhat magical at first glance. Not too long ago, setting up an automated build and deployment process like this required setting up, configuring, and maintaining a build server such as Jenkins or TeamCity as well as a server for running the chosen source control system. Similarly, hosting development, testing, and production instances of an application would typically involve buying, configuring, and maintaining multiple servers.
With the emergence of software-as-a-service and the ‘cloud,’ it is no longer necessary to configure and maintain the basic infrastructure; i.e., hardware and software needed to develop, deploy, and host applications. For example, TE 2.0 utilizes Visual Studio Team Services (VSTS) to host its Git repository and perform the build and deployment process. As a cloud-hosted solution, VSTS saves the TE team from having to maintain source control and continuous integration servers.
Similarly, the development, beta, staging, and production environments are hosted in Azure, Microsoft’s cloud hosting platform. TE 2.0 uses Azure’s Platform as a Service (PaaS) offering known as Azure App Service. With PaaS, the cloud provider takes care of maintaining and updating the server and operating system that runs the application. In effect, everything below the application layer is abstracted away and managed by the cloud hosting provider. This is especially beneficial for a small team such as the TE team. Instead of worrying about operating system updates and hardware maintenance, our limited resources can be focused on activities that make TE a better product.
Azure App Service also provides a number of other value-added capabilities. For example, if there is a sudden surge of traffic to the TE site, Azure App Service will automatically add more server capacity. When traffic levels subsequently drop to a level that does not require additional capacity, the extra capacity is withdrawn. Server capacity is billed by the minute and you only pay for capacity when you are using it. This is one of the key benefits of hosting applications in the cloud. In a traditional hosting model, one would have to pay up front for the server capacity needed for peak load, even if it is unused a vast majority of the time. With Platform as a Service, capacity can be added and removed as demand warrants.
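The idea of adding and withdrawing capacity on demand can be sketched as a simple threshold rule. This is an illustration only and not Azure's actual scaling algorithm or API: the function name, the CPU thresholds, and the instance limits are all hypothetical.

```python
def desired_instances(current, cpu_percent, minimum=1, maximum=10,
                      scale_out_above=70.0, scale_in_below=30.0):
    """Return how many server instances should run, given average CPU load.

    All thresholds and limits are hypothetical. The gap between the two
    thresholds prevents capacity from flapping up and down when the load
    hovers around a single cut-off value.
    """
    if cpu_percent > scale_out_above:
        target = current + 1      # surge in traffic: add capacity
    elif cpu_percent < scale_in_below:
        target = current - 1      # traffic has dropped: withdraw capacity
    else:
        target = current          # load is in the comfortable band
    return max(minimum, min(maximum, target))
```

For example, `desired_instances(2, 85.0)` asks for a third instance, while `desired_instances(2, 10.0)` scales back to one.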
The ability to deploy code to a staging site and swap it with the production site as described in steps 4 and 5 of the previous section is also a feature of Azure App Service. This can be done with just a few clicks or with a single command-line statement.
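Conceptually, the staging-to-production swap amounts to exchanging two pointers. The following Python sketch models it under simplifying assumptions. The class `SlotRouter` and the function `release` are hypothetical names and do not correspond to any Azure API; the smoke test is reduced to a caller-supplied function.

```python
class SlotRouter:
    """Track which deployed version serves production and which is staged."""

    def __init__(self, production, staging):
        self.production = production
        self.staging = staging

    def swap(self):
        """Exchange the two slots; calling it again rolls the release back."""
        self.production, self.staging = self.staging, self.production


def release(router, smoke_test):
    """Swap staging into production only if the staged version passes.

    smoke_test is a caller-supplied function that takes the staged version
    and returns True when key functionality works.
    """
    if not smoke_test(router.staging):
        return False
    router.swap()
    return True


router = SlotRouter(production="v1", staging="v2")
released = release(router, smoke_test=lambda version: version == "v2")
```

After the call, `router.production` is `"v2"` and the prior version sits in the staging slot, so a rollback is simply another `router.swap()`.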
TE 2.0 Continuous System Monitoring
Azure App Service also provides a number of capabilities for monitoring application health. For TE 2.0, the site is configured to send an alert via e-mail if certain adverse events happen. These include heavy CPU load, heavy memory usage, excessive 500 errors, and slow site response.
In addition, TE 2.0 uses Azure Application Insights, an Application Performance Management tool. Application Insights captures detailed data about application performance, errors, and user activity. This data is fed into a web-based dashboard. It also uses machine learning to detect events such as slow performance for users in specific geographic locations or a rise in the number of times a specific error happens. Application Insights has also been configured to access TE from five different geographic locations every five minutes. An email alert is generated if 3 or more of the locations are unable to access the site.
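The alert rule for the geographic availability test is easy to express in code. The sketch below is an illustration of that rule, not Application Insights configuration. The function name `should_alert` and the location names are hypothetical.

```python
def should_alert(results, threshold=3):
    """Decide whether to send an availability alert.

    results maps a test location to True (site reachable) or False.
    An alert is due when at least `threshold` locations cannot reach the site.
    """
    failures = [location for location, ok in results.items() if not ok]
    return len(failures) >= threshold


# Hypothetical results from five test locations.
two_down = {"loc-a": True, "loc-b": False, "loc-c": True,
            "loc-d": False, "loc-e": True}
three_down = dict(two_down, **{"loc-e": False})
```

Requiring several locations to fail before alerting filters out problems that are local to a single test location rather than to the site itself.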
TE Meta Monitoring
Besides pure functional aspects of system performance, there are other, higher-level (or ‘meta’) aspects which need regular reporting and checking. In today’s web- and internet-based world, one of these aspects is whether the third parties which drive traffic to our system, i.e., search engines such as Google, know about our content and assess it as an attractive target. This information obviously must come from the parties owning the search engines; it is not information internal to our system. Fortunately, search engines such as Google often make their diagnostics tools available so that as content providers we can know how the search engines assess our content. In Google’s case one of those tools is Google Search Console. This tool provides lots of information on how Google harvests our content. Needless to say, periodic monitoring of Google Search Console, either manually or by using its API, provides valuable information on how well our site is viewed by the world’s most popular search engine.
Another valuable meta monitoring service is Google Analytics (GA). GA is a service to which one can report requests coming into one’s website. GA keeps a record of those requests and reports them back on demand using any number of user-chosen facets. For example, one may ask for a timeline of requests, for any time frame and pretty much any time step. One can also ask for a breakdown by technology, browser, operating system, location, page, etc. Obviously, a lot of useful information is hiding in these data. Both TE 1.0 and TE 2.0 use(d) GA.
A third type of meta monitoring simply collects and reports aggregate information about a system. For TeachEngineering this means being able to tell how many items the library has at any moment for any grade or combination of grades in the K-12 grade range or how many different institutions have contributed curriculum and how much they have contributed.
- In computer science, the application of so-called ‘formal methods’ is aimed at mathematically and logically analyzing and specifying the execution paths of code. This is in contrast to the more common way of simply empirically testing code paths by submitting the code to a variety of specified use cases. Proponents of formal methods propose that the proper application of such methods significantly reduces the likelihood that faulty code execution paths remain undetected prior to code deployment.
- Code obfuscation refers to the practice of purposefully rendering source code difficult to read for humans, typically in order to make it more difficult for ill-willed individuals to search for weaknesses and security exploits.
- At the time of TE 1.0, this service was known as Google Webmaster Tools.