We used a feature toggle to gradually re-build parts of the business logic of our app. We were then able to switch to the new logic, and seamlessly roll back to the old logic in case of unforeseen problems. We had to come up with two interesting hacks to be able to test both code paths asynchronously.
In this post, I will tell the story of a large code migration behind a feature toggle and show the steps necessary to implement it. Along the way, I cover the process dictionary and how SQL sandboxing works to make asynchronous end-to-end tests possible.
Our Elixir application needed to change drastically – database schema, queries, decision logic. The necessary code changes would require several weeks of development time, during which we still had to do minor tweaks and fixes to the live application. Developing the new logic in a separate branch over such a long time seemed like a dangerous option. Also, we did not want to deploy the new code late at night, incurring a longer downtime, only to find an unforeseen problem with the new code that required us to roll back…
Instead, we decided to develop the new logic step by step, in parallel with maintenance work of the live application. We would continually and frequently deploy all of the code – including the dormant new logic. Using a feature toggle, we would be able to dynamically switch the running application to using the new code paths, once they were ready. In case of an unforeseen problem, we could flick the switch back to the old code, fix and redeploy, then try again. Our tests should exercise both the old and the new business logic. After the final switch to the new-world code, we would remove the old-world tests, then the toggle and finally all code which had then become unused.
Feature toggles come in many forms. If you haven’t already, take the time to read what Martin Fowler has to say about them. Below, I will use some of his vocabulary. As Fowler points out, toggles differ drastically in longevity and dynamism. Ours was intended to introduce a single large code change at runtime, so it was meant to be short-lived, and dynamic for the entire application at once (as opposed to per-request dynamic).
It is worthwhile to point out that the toggle served two purposes: seamless parallel development and testing, and the ability to revert to the old logic. For the latter, the toggle needs to be application-wide state. As you will see later, full-stack asynchronous tests forced us to implement the switch per-request as well. In this post, I’ll ignore this at first.
In the next section, I sketch the decision code behind the toggle. Enough is going on there that we need to extract this logic, rather than duplicating it at every feature branching point. Fowler uses the term toggle router: a device to “decouple a toggling decision point from the logic behind that decision.” We already had a “switchboard”, centralizing environment-specific decisions in our application, a natural place for the implementation of the feature toggle.
A simple toggle router, and storage for the toggle
Our toggle shall be a simple function in a module:
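A minimal sketch of that API, with a usage example at a feature branching point (module and function names here are placeholders of my own; the lookup behind the function is a separate concern):

```elixir
defmodule MyApp.Switchboard do
  # The toggle router: one boolean question, answered centrally.
  # (How the value is cached and persisted is dealt with below.)
  def use_new_logic? do
    Application.get_env(:my_app, :use_new_logic?, false)
  end
end

# At a feature branching point, e.g. in a controller action:
if MyApp.Switchboard.use_new_logic?() do
  NewWorld.process_order(params)
else
  OldWorld.process_order(params)
end
```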
This is already the final API for the toggle router. For the controllers involved, things couldn’t be simpler!
The state of the toggle is application-wide state that must survive deployments. We need to persist it so that a new app instance booting up can pick it up. We used Redis to store the value, but anything that the operations environment allows for and that can be polled is fine, really. We do not want to make a request to the backing store every time we need to check the toggle value, so we do some caching. Let’s implement that:
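A sketch of such a router with caching. `MyApp.BackingStore` is a hypothetical thin wrapper around Redis (or whatever your operations environment provides), and the key name is my own choice:

```elixir
defmodule MyApp.Switchboard do
  @toggle_key :use_new_logic?

  # Answer from the cache if possible; otherwise ask the backing store
  # once and cache the result in the application environment.
  def use_new_logic? do
    case Application.fetch_env(:my_app, @toggle_key) do
      {:ok, value} ->
        value

      :error ->
        value = MyApp.BackingStore.get(@toggle_key)
        Application.put_env(:my_app, @toggle_key, value)
        value
    end
  end

  # Change the toggle: persist the new value first, then refresh the cache.
  def use_new_logic! do
    MyApp.BackingStore.put(@toggle_key, true)
    Application.put_env(:my_app, @toggle_key, true)
  end

  def use_old_logic! do
    MyApp.BackingStore.put(@toggle_key, false)
    Application.put_env(:my_app, @toggle_key, false)
  end
end
```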
I’m leaving out the details of the backing store. When storing shared data in memory on the Erlang VM, ETS tables are usually the way to go. Above, we (mis-)use the application environment as the cache. It is backed by ETS anyway, has a much simpler interface, minimal performance overhead, and it saves us from setting up an ETS table in the first place.
Above, I’m also providing the code for changing the toggle value (there’s a race condition which I did not need to care about – can you spot it?). This code works perfectly as long as only one instance of the application runs at a time. With horizontal scaling, our application develops split-brain syndrome: the instance where `use_new_logic!` was called will use the new value, but the other instances will continue to use their cached value until they are cycled out. To fix that, we would need to extract the fetch-and-cache step into a process wrapping it in an endless loop, sleeping for a couple of seconds during each round. Our app was still only under mild load and had multiple instances running only during deployments, so we gladly skipped this step.
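Had we needed it, such a poller could have been sketched as a small GenServer (again assuming the hypothetical `MyApp.BackingStore` and key name from above):

```elixir
defmodule MyApp.TogglePoller do
  use GenServer

  @interval :timer.seconds(5)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :no_state)

  @impl true
  def init(state) do
    send(self(), :poll)
    {:ok, state}
  end

  @impl true
  def handle_info(:poll, state) do
    # Periodically re-read the persisted value and refresh the cache,
    # so every instance converges on the current toggle state.
    value = MyApp.BackingStore.get(:use_new_logic?)
    Application.put_env(:my_app, :use_new_logic?, value)
    Process.send_after(self(), :poll, @interval)
    {:noreply, state}
  end
end
```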
Testing with a feature toggle
Recall that one purpose of our feature toggle was to be able to test both new-world and old-world code in the same codebase. Asynchronous tests pose a problem with an application-wide feature toggle: Both the backing store and the cache are shared state, accessed concurrently by different tests – but we need those tests to set and read different values. Let’s dissect this problem a little.
- Unit tests should not be a problem – the code (units) tested should operate below the level which is querying the toggle (so the decision has already been made when the units’ code is entered). This depends a bit on your application architecture and what you call a unit, but it probably holds true in a web application.
- Request and end-to-end tests might execute code that would run through the toggle router. In their case, we must replace the use of the cache and the backing store with a different mechanism.
So, we need to find mechanisms for passing the toggle state from these tests down to the implementation. The next sections deal with that in the case of request tests and end-to-end tests separately. The mechanisms should be easy to remove after making the switch, if possible.
Per-request test toggle values
A request test (in Phoenix parlance, an endpoint test using `ConnTest`) builds a special `Plug.Conn` struct for each request, sends it into the Phoenix application stack (the application endpoint), receives it back and asserts against it. You can think of `ConnTest` as a lightweight utility that short-circuits the web server which normally prepares and passes down the `Conn`.
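For illustration, a bare-bones request test (controller and page content are made up):

```elixir
defmodule MyAppWeb.PageControllerTest do
  use MyAppWeb.ConnCase, async: true

  test "GET / renders the landing page", %{conn: conn} do
    # The conn is built, sent through the endpoint and returned,
    # all within this very test process.
    conn = get(conn, "/")
    assert html_response(conn, 200) =~ "Welcome"
  end
end
```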
We could simply pass the desired toggle state from the test piggyback on the `conn`, using `put_private/3` for example. The trick we used for end-to-end tests, discussed in the next section, would continue to work with obvious, minor modifications. The setup block shown in this section would then have to modify the `conn`, which is passed through in the test metadata. This approach would be in line with Elixir’s general attitude toward explicitness. The explicitness permeating most of the ecosystem is very pleasant to work with – it makes data flow more obvious, code more searchable and readable, and easier to understand and debug. In our situation, however, very few people needed to understand the code for switching to the new logic, and that code would have a very limited lifetime. Together with the fact that our request tests were too inconsistent to easily pass the toggle state via the `conn`, we chose to take a shortcut.
In a request test, both the test code and the controller code execute in the same Erlang process. You can verify this by putting `IO.inspect(self())` into both and seeing the same PID twice in the output. This allows us to use a less-well-known feature of Erlang to pass the per-test toggle value down from the test to the implementation: the process dictionary.
The process dictionary is a key-value store that implements hidden state within a process. It is an oddball in the architecture, and passing data through it is frowned upon because the data flow is no longer explicit. On rare occasions, though, the process dictionary can be really helpful.
Here’s how we can use it for our purposes:
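A sketch of both sides, assuming the toggle router consults the process dictionary before falling back to the regular lookup (the `:use_new_code?` key name is my own choice):

```elixir
# In the request test – test and controller share one process:
setup do
  Process.put(:use_new_code?, true)
  :ok
end

# In the toggle router:
def use_new_logic? do
  case Process.get(:use_new_code?) do
    # no per-test value set: regular lookup via cache and backing store
    nil -> cached_or_persisted_value()
    value -> value
  end
end
```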
There are quite a number of tests that need this kind of setup, so I want to make the test setup simpler (and simpler to remove). We can use ExUnit’s tagging mechanism for this. Tests can be tagged with metadata individually or in bulk like this:
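For example (module and test names invented):

```elixir
defmodule MyAppWeb.OrderControllerTest do
  use MyAppWeb.ConnCase, async: true

  # In bulk, for every test in this module:
  @moduletag use_new_code?: true

  # Or individually, overriding the module tag for a single test:
  @tag use_new_code?: false
  test "still accepts orders with the old logic", %{conn: conn} do
    # ...
  end
end
```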
These metadata tags are then available to all setup blocks. To avoid repeating the setup block as well, we extract it like this:
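One way to do this is a named setup function in a shared helper module (module name invented):

```elixir
defmodule MyAppWeb.FeatureToggleHelper do
  # Reads the :use_new_code? tag from the test metadata and passes it
  # down to the implementation via the process dictionary.
  def set_feature_toggle(%{use_new_code?: value}) do
    Process.put(:use_new_code?, value)
    :ok
  end
end
```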
and use this shared setup in our request tests like this:
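Usage could then look like this (helper and controller names are placeholders):

```elixir
defmodule MyAppWeb.CheckoutControllerTest do
  use MyAppWeb.ConnCase, async: true
  import MyAppWeb.FeatureToggleHelper

  @moduletag use_new_code?: true
  setup :set_feature_toggle

  # ... tests exercising the new-world code paths ...
end
```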
If we are not careful and miss this step for a request test, the implementation code still runs through the persistence and cache code in the feature toggle. We can simply raise an exception up there to find and then properly tag all such tests. This trick relies on the fact that each test runs in its own process, so no cleanup of the process dictionary is necessary.
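Such a guard could look like this in the toggle router, assuming the test-only code path is enabled by a `:sql_sandbox` config flag (as in the endpoint setup later on):

```elixir
def use_new_logic? do
  if Application.get_env(:my_app, :sql_sandbox) do
    # In tests, a missing per-process value means the test was not tagged.
    case Process.get(:use_new_code?) do
      nil -> raise "feature toggle value not set – did you forget to tag this test?"
      value -> value
    end
  else
    # In production: regular lookup via cache and backing store.
    cached_or_persisted_value()
  end
end
```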
Toggle values in full-stack tests through a browser
Keeping the toggle value in a process’ state won’t help us much when writing tests against a browser interacting with our site. A request initiated in a test will first hit the browser, running in a different operating system process, which then issues a web request to our application, where it is then handled in a different Erlang process. If only there were a mechanism to link back the request to the test process driving the browser!
Ecto and Phoenix allow us to run end-to-end tests in parallel, to the effect that the rendered page content reflects the state of the database as set up by a test. This window into the database content is state shared between the test and the controller servicing the request – across process boundaries!
Indeed, the Phoenix/Ecto stack has already solved a problem similar to ours. I give a brief overview of the stack and the data flow involved:
- each test process checks out a connection from the SQL Sandbox and claims ownership. All database interaction through this connection happens inside a transaction, and all database effects are invisible outside of it.
- the test configures the framework responsible for controlling the browser session (Hound or Wallaby) with metadata – containing the test’s PID
- when the web request is processed, this metadata is used to grant the process handling the request allowance to the connection owned by the test process
- any queries in the web request will subsequently use the same database connection, and act inside the same transaction as the test code.
For the curious: the cross-process mechanism works by adding a payload to the user-agent header, which is parsed back out on the application side by `Phoenix.Ecto.SQL.Sandbox`.
Although `Phoenix.Ecto.SQL.Sandbox` has SQL in its name, we can use it for our purposes as well.
Step 1: Adding metadata in the test case template
There is a test case template for feature tests (these exercise the application code end-to-end), in a file named something like test/support/feature_case.ex, that roughly looks like this:
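A rough sketch of such a template for a Wallaby setup (details vary per project; treat this as an approximation of the standard boilerplate):

```elixir
defmodule MyAppWeb.FeatureCase do
  use ExUnit.CaseTemplate

  using do
    quote do
      use Wallaby.DSL
      alias MyApp.Repo
    end
  end

  setup tags do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(MyApp.Repo)

    unless tags[:async] do
      Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})
    end

    # Encode the owning test process into metadata for the browser session,
    # so requests triggered by the browser can join our sandbox connection.
    metadata = Phoenix.Ecto.SQL.Sandbox.metadata_for(MyApp.Repo, self())
    {:ok, session} = Wallaby.start_session(metadata: metadata)
    {:ok, session: session}
  end
end
```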
The last paragraph of this code computes the necessary metadata for the Ecto SQL sandbox mechanism and passes it on to the end-to-end testing framework (Wallaby in our case). We add one line to amend the test framework metadata with information from the test metadata:
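With that amendment, the metadata computation could read (key name as in the request tests):

```elixir
# Piggyback our toggle value on the sandbox metadata map.
metadata =
  MyApp.Repo
  |> Phoenix.Ecto.SQL.Sandbox.metadata_for(self())
  |> Map.put(:use_new_code?, tags.use_new_code?)
```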
Step 2: Mark tests to use old or new code
The setup in step 1 takes exactly the same test tags as we used above for request tests. We tag all end-to-end tests that require our mechanism in the same way as we did for request tests.
If we forget to tag an end-to-end test, we get an immediate failure, because the above setup code is executed and `tags.use_new_code?` requires the `:use_new_code?` key to be present in the metadata map.
Step 3: Extract the metadata and pass the flag on to the toggle router
As part of the standard setup for asynchronous end-to-end tests, a plug in the application endpoint is used to extract the metadata and pass it on to Ecto’s SQL sandbox. We do a similar thing right next to it:
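Sketched in the endpoint, next to the standard sandbox plug. The user-agent parsing mimics what `Phoenix.Ecto.SQL.Sandbox` does internally; treat the splitting details as an assumption to verify against your phoenix_ecto version:

```elixir
# In lib/my_app_web/endpoint.ex – only wired up in the test environment:
if Application.get_env(:my_app, :sql_sandbox) do
  plug Phoenix.Ecto.SQL.Sandbox
  plug :extract_feature_toggle
end

defp extract_feature_toggle(conn, _opts) do
  # The sandbox metadata is appended to the user-agent header;
  # decode it and look for our piggybacked toggle value.
  metadata =
    conn
    |> Plug.Conn.get_req_header("user-agent")
    |> List.first("")
    |> String.split("/")
    |> List.last()
    |> Phoenix.Ecto.SQL.Sandbox.decode_metadata()

  case metadata do
    %{use_new_code?: value} -> Process.put(:use_new_code?, value)
    _ -> :ok
  end

  conn
end
```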
In the setup for end-to-end tests, we instructed the browser testing framework to add the value for our feature toggle as metadata to all requests. The `extract_feature_toggle` function plug tries to extract this value. If present, it writes it to the process dictionary. We have already written our toggle function to accept the toggle value from there, because our request tests use that mechanism.
PLEASE NOTE that the `if Application.get_env(:my_app, :sql_sandbox)` conditional around our function plug is REALLY important here! We must never use `Phoenix.Ecto.SQL.Sandbox` in production code, since it eventually calls `:erlang.binary_to_term()` to deserialize the payload. An attacker could craft requests with a prepared user-agent header to make this call generate Erlang atoms, which are never garbage collected, until resources are exhausted and our app crashes.
Conclusions and final thoughts
Having both old-world and new-world code side by side affected the application code in various places. Obviously, we needed a database schema that could service both worlds. The same held true for our database factory. A good amount of careful planning was strictly required for our approach.
We are glad we took this route, however. The inevitable happened when we changed the toggle value in production: things were not exactly right, and we needed to go back. But that meant no downtime, no stress, and only a minimal delay. In the end, it took us a few hours to decide that everything was fine, followed by a few days of cleaning up and removing columns and tables from the database that were no longer needed. All the testing-specific modifications were deliberately minimal and easy to find, hence easy to remove. Of course, the first things we removed were the tests covering the old-world logic.
Our approach required going deeper into the frameworks than might seem necessary. `Phoenix.Ecto.SQL.Sandbox` has been written, as the name indicates, with one particular use case in mind, and I was surprised that it exposes `decode_metadata` publicly but not `extract_metadata`, which we had to replicate. All the mechanisms we needed were present, albeit not conveniently exposed. We had to dig in to understand how asynchronous tests with shared database connections are accomplished. This turned out not to be hard, because the framework code involved in our problem was clearly written, low in volume and easy to navigate.
With several years of experience in building and maintaining Elixir applications, we can help you build applications that can change as your business does. Get in touch!