Track-switching in a large Elixir web application

We used a feature toggle to gradually re-build parts of the business logic of our app. We were then able to switch to the new logic, and seamlessly roll back to the old logic in case of unforeseen problems. We had to come up with two interesting hacks to be able to test both code paths asynchronously.

In this post, I will tell the story of a large code migration behind a feature toggle and show the steps needed to implement it. Along the way, it covers the process dictionary and how SQL sandboxing works to make asynchronous end-to-end tests possible.

We have a large Elixir application running in production that we needed to change drastically. Parts of the decision logic, parts of the database schema, and several queries needed to be modified or even rewritten. It would take us several weeks of development time, during which we still had to do minor tweaks and fixes to the live application. Developing the new logic in a separate branch over such a long time seemed like a dangerous option. Also, we did not want to deploy the new code late at night, incurring a longer downtime, only to find an unforeseen problem with the new code that required us to roll back…

Instead, we decided to develop the new logic step by step, in parallel with maintenance work of the live application. We would continually and frequently deploy all of the code – including the dormant new logic. Using a feature toggle, we would be able to dynamically switch the running application to using the new code paths, once they were ready. In case of an unforeseen problem, we could flick the switch back to the old code, fix and redeploy, then try again. Our tests should exercise both the old and the new business logic.

The purpose of our feature toggle

If feature toggles are a new concept for you, take the time to read what Martin Fowler has to say about them. In what follows, I will borrow some of his vocabulary.

Feature toggles differ drastically in longevity and dynamism. Ours was intended to introduce a single large code change at runtime, so it was meant to be short-lived, and dynamic for the entire application at once – as opposed to dynamic per request. In other words, the feature toggle required no input from requests.

It is worthwhile to point out that the toggle served two purposes: besides being able to switch between the old and the new logic at runtime, we needed to be able to test both the old and the new code paths in parallel. Further on, we will see that full-stack asynchronous tests forced us to handle the toggling per request. With some understanding of the platform and the testing environment, we were able to keep the interface of the feature toggle unchanged.

A simple toggle router, and storage for the toggle

In this section, I sketch the decision code behind the toggle. There is already enough going on that it makes sense to extract this logic, rather than duplicating it at every feature branching point. Fowler uses the term toggle router: a device to “decouple a toggling decision point from the logic behind that decision.”

We already had a “switchboard” module in place, where we centralized all environment-specific decisions for our application. It was a natural place to hold the logic. Here I start with an empty module for the sake of clarity.

Our toggle shall be a simple function in a module:

defmodule MyApplication.Switchboard do
  def use_new_logic? do
    false
  end
end

We would typically call use_new_logic? from a controller. No input from the request is required, which keeps this as simple as possible.
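
As an illustration, a branching point in a controller might look roughly like the following sketch. The action name and the NewLogic/OldLogic modules are made-up placeholders, not part of our actual code:

def show(conn, params) do
  if MyApplication.Switchboard.use_new_logic?() do
    # new-world implementation (hypothetical module name)
    MyApplication.NewLogic.render_show(conn, params)
  else
    # old-world implementation (hypothetical module name)
    MyApplication.OldLogic.render_show(conn, params)
  end
end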

The state of our toggle is application-wide and must survive deployments. For a newly booting app instance to pick it up, we need to persist the value of the toggle. We used Redis to store the value, but anything is fine really as long as it can be polled. We do not want to make a request to the backing store every time we need to check the toggle value, so we need some caching.

Let’s implement that, without going into details about the backing store:

defmodule MyApplication.Switchboard do
  def use_new_logic? do
    case Application.get_env(:my_application, :use_new_logic?) do
      nil ->
        value = get_value_from_redis() || false
        Application.put_env(:my_application, :use_new_logic?, value)

        value

      value -> value
    end
  end

  def use_new_logic!(value) do
    put_value_in_redis(value)
    Application.put_env(:my_application, :use_new_logic?, value)
  end
end

When storing shared data in memory on the Erlang VM, ETS tables are usually the way to go. Above, we (mis-)use the application environment for caching. It is backed by ETS anyway, has a much simpler interface, minimal performance overhead, and it saves us from setting up an ETS table in the first place.

The code for changing the toggle value (use_new_logic!/1) is also included above.

This implementation works fine assuming we have only one instance running at any time. In the case of horizontal scaling, our application develops split-brain syndrome: The instance where use_new_logic! was called will use the new value, but other instances will continue to use their cached value until they are cycled out. To fix that, we need to extract

value = get_value_from_redis()
Application.put_env(:my_application, :use_new_logic?, value)

into a process that wraps these lines in an endless loop, sleeping for a couple of seconds on each round. Our app was still only under mild load and had multiple instances running only during deployments, so we gladly skipped this step.
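
Had we needed it, such a refresher could have looked roughly like the sketch below. It assumes the Redis helper is exposed as MyApplication.Switchboard.get_value_from_redis/0; the module name and the five-second interval are likewise assumptions:

defmodule MyApplication.Switchboard.Refresher do
  use GenServer

  @interval :timer.seconds(5)

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  end

  @impl true
  def init(state) do
    send(self(), :refresh)
    {:ok, state}
  end

  @impl true
  def handle_info(:refresh, state) do
    # Re-read the shared value and overwrite the locally cached one.
    value = MyApplication.Switchboard.get_value_from_redis() || false
    Application.put_env(:my_application, :use_new_logic?, value)

    Process.send_after(self(), :refresh, @interval)
    {:noreply, state}
  end
end

Started under the application's supervision tree, every instance would then converge on the persisted value within a few seconds of a change.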

Testing with a feature toggle

We want to be able to test both new-world and old-world code in the same codebase. Some of the code exercised by our tests would have to call the feature toggle. This leads to an interesting problem: The implementation given above relies on shared state, held by the cache and in the backing store – but tests requiring different values for the toggle should be able to run in parallel. Let’s investigate this a bit:

  • Code exercised from a unit test should not need to call the toggle function. Such code is intended to be called from higher levels that would use the toggle, if needed. (This depends a bit on your application architecture and how you break up your units, but it probably holds true in a web application, where the feature toggle can be expected to be called only from a controller.)
  • Request and end-to-end tests might execute code that would run through the toggle router. In their case, we either need to give up test parallelism, or change the toggle code so that different tests can have different opinions on the toggle value.

So, we need to find mechanisms for passing the desired toggle value from some tests down to the implementation.

Passing the toggle value down in a request test

A request test (in Phoenix parlance, an endpoint test using ConnTest) builds a special Plug.Conn struct for each request, sends it into the Phoenix application stack (the application endpoint), receives it back and asserts against it. You can think of ConnTest as a lightweight utility that short-circuits the web server which would normally prepare and pass down the Conn.

We could simply pass the desired toggle state from the test piggy-back with the conn, using put_private/3 (a sketch of that approach follows the list below). This would be in line with the general attitude toward explicitness in the Elixir world. Explicitness is very pleasant to work with – it makes data flow more obvious, code more searchable and readable, and easier to understand and debug. We nevertheless decided to take a short-cut, for these reasons:

  • both the feature toggle and the required testing modifications had a limited lifetime; they were meant to be thrown away after switching to the new code for good.
  • very few people needed to understand the code for switching between the old and the new logic.
  • our request test setup was inconsistent, so the approach of modifying the conn would not be that simple.
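
For completeness, the explicit approach we skipped could have looked something like this sketch (the :use_new_logic? private key, the path, and the controller snippet are made up for illustration):

# In a request test, piggy-back the desired toggle state on the conn:
conn =
  build_conn()
  |> Plug.Conn.put_private(:use_new_logic?, true)
  |> get("/some/path")

# In the controller, prefer the piggy-backed value over the switchboard:
use_new_logic? =
  Map.get(conn.private, :use_new_logic?, MyApplication.Switchboard.use_new_logic?())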

In a request test, both the test code and the controller code execute in the same Erlang process. You can verify this by putting IO.inspect(self()) into both and seeing the same PID twice in the output. This allows us to use a less-well-known feature of Erlang to pass the per-test toggle value on: the process dictionary.

The process dictionary works as a key-value store and implements hidden state within a process. It is a bit of an oddity in the architecture, and passing data through it is usually frowned upon. Using data side-loaded via the process dictionary is the opposite of handling it explicitly. On rare occasions, though, it can be a really helpful technique.

Here’s how we can use it for our purposes:

# In the test setup:
Process.put(:test_use_new_logic?, true)

# New switchboard code:
defmodule MyApplication.Switchboard do
  def use_new_logic? do
    case Process.get(:test_use_new_logic?) do
      nil ->
        # old switchboard code here (persistence and cache)

      test_value -> test_value
    end
  end

  ...
end

There are quite a number of tests that need this kind of setup, so I want to make the test setup simpler (and simpler to remove). We can use ExUnit’s tagging mechanism for this. Tests can be tagged with metadata individually or in bulk like this:

defmodule SomeTest do
  use ExUnit.Case

  @moduletag tag1: value1 # for all tests in the module

  @tag tag2: value2 # only for this test
  test "something" do
    ...
  end
end

The idea is to tag tests that require setup for stubbing the toggle value, and implement the stubbing inside a setup block. The tags of a test are available to each setup block.

To avoid repeating the setup block as well, we extract it like this:

defmodule FeatureToggleTest do
  defmacro __using__(_) do
    quote do
      setup tags do
        Process.put(:test_use_new_logic?, tags.use_new_code?)
        :ok
      end
    end
  end
end

and use this shared setup in our request tests like this:

defmodule SomeRequestTest do
  use MyApplication.ConnTest
  use FeatureToggleTest

  @tag use_new_code?: true
  test "something" do
    ...
  end
end

If we forget to set up the stubbing for a request test, the implementation will still run through the persistence and cache code in the feature toggle. We can temporarily raise an exception up there to find, and then properly tag, all such tests.
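
As a quick sketch of that temporary hack (not something to keep around), the fallback branch of the toggle can be made to fail loudly while we run the suite locally:

def use_new_logic? do
  case Process.get(:test_use_new_logic?) do
    nil ->
      # Temporary, while hunting for untagged request tests:
      raise "use_new_logic?/0 called without a :use_new_code? test tag"

    test_value ->
      test_value
  end
end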

Note that each test runs in its own process, so no cleanup of the process dictionary is necessary.

Passing the toggle value from a full-stack test

Keeping the toggle value in a process’ state won’t help us much when writing full-stack tests using a browser to interact with our site.

Requests to our web application are typically initiated by an action inside the browser, like clicking a link, as instructed by the test. So an integration test passes information to the browser, running in a different operating system process, which then issues a web request. The request is then handled in an Erlang process different from the one running the test.

We need a mechanism for communicating from the test process to the process handling the request.

The SQL Sandbox already does this!

Ecto and Phoenix allow us to run end-to-end tests in parallel, such that the rendered page content reflects the state of the database as set up by each test. This window into the database content is state shared between the test and the controller servicing the request – across process boundaries!

Indeed, the Phoenix/Ecto stack has already solved a problem similar to ours. I give a brief overview of the stack and the data flow involved:

  • each test process checks out a connection from the SQL Sandbox and claims ownership. All database interaction through this connection happens inside a transaction, and all database effects are invisible outside of it.
  • the test configures the framework responsible for controlling the browser session (Hound or Wallaby) with metadata containing the test's PID.
  • when the web request is processed, this metadata is used to grant the process handling the request an allowance to the connection owned by the test process.
  • any queries in the web request will subsequently use the same database connection, and act inside the same transaction as the test code.

For the curious, the cross-process mechanism works by adding a payload to the user-agent header, which is then parsed again on the server side by Phoenix.Ecto.SQL.Sandbox.
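
Roughly, the round trip looks like this (the PID and the user-agent string are made up, and the exact encoding is an implementation detail of phoenix_ecto):

# In the test process: describe who owns the sandbox connection.
metadata = Phoenix.Ecto.SQL.Sandbox.metadata_for(MyApplication.Repo, self())

# Wallaby/Hound append an encoded form of this to the browser's user agent ...
user_agent = "Mozilla/5.0 ... " <> Phoenix.Ecto.SQL.Sandbox.encode_metadata(metadata)

# ... and on the server side, the sandbox plug decodes it from the header again:
Phoenix.Ecto.SQL.Sandbox.decode_metadata(user_agent)
# => %{repo: MyApplication.Repo, owner: #PID<0.223.0>}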

Although Phoenix.Ecto.SQL.Sandbox has SQL in its name, we can use it for our purposes as well.

Step 1: Adding metadata in the Feature Case

There is a test case template for feature tests (these exercise the application code end-to-end), in a file named something like test/support/feature_case.ex, that roughly looks like this:

defmodule MyApplication.FeatureCase do
  use ExUnit.CaseTemplate

  using do
    quote do
      ... # aliases and imports
    end
  end

  setup tags do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(MyApplication.Repo)

    unless tags[:async] do
      Ecto.Adapters.SQL.Sandbox.mode(MyApplication.Repo, {:shared, self()})
    end

    metadata = Phoenix.Ecto.SQL.Sandbox.metadata_for(MyApplication.Repo, self())
    # Wallaby specific, but looks almost the same when using Hound
    {:ok, session} = Wallaby.start_session(metadata: metadata)
    {:ok, session: session}
  end
end

The last few lines of this code compute the necessary metadata for the Ecto SQL sandbox mechanism and pass it on to the end-to-end testing framework (Wallaby in our case). We add one line to amend the framework metadata with information from the test tags:

metadata = Map.put(metadata, :use_new_code?, tags.use_new_code?)

Step 2: Mark tests to use old or new code

The setup in step 1 reads exactly the same test tags as we used above for request tests, so we tag all end-to-end tests that require our mechanism in the same way.

If we forget to tag an end-to-end test, we get an immediate failure, because the above setup code is always executed and tags.use_new_code? requires the :use_new_code? key to be present in the tags map.
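
In other words, a tagged end-to-end test looks just like its request test counterpart (the module and test names below are made up):

defmodule SomeFeatureTest do
  use MyApplication.FeatureCase, async: true

  @tag use_new_code?: true
  test "something end-to-end", %{session: session} do
    ...
  end
end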

Step 3: Extract the metadata and pass the flag on to the toggle router

As part of the standard setup for asynchronous end-to-end tests, a plug in the application endpoint is used to extract the metadata and pass it on to Ecto’s SQL sandbox. We do a similar thing right next to it:

defmodule MyApp.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  if Application.get_env(:my_app, :sql_sandbox) do
    plug Phoenix.Ecto.SQL.Sandbox
    plug :extract_feature_toggle # <-- ours!
  end

  def extract_feature_toggle(conn, _) do
    conn
    |> get_req_header("user-agent")
    |> List.first
    |> Phoenix.Ecto.SQL.Sandbox.decode_metadata
    |> case do
      %{use_new_code?: flag} ->
        Process.put(:test_use_new_logic?, flag)

      _ ->
        # No metadata was passed. Happens when hit by request test,
        # not end-to-end test. Do nothing.
        :ok
    end

    conn
  end

  ...
end

In the setup for end-to-end tests, we instructed the browser testing framework to add the value for our feature toggle as metadata to all requests. The extract_feature_toggle function plug tries to extract this value.

If present, it writes it to the process dictionary. We have already written our toggle function to accept the toggle value from there because our request tests use that mechanism.

PLEASE NOTE that the if Application.get_env(:my_app, :sql_sandbox) conditional around our function plug is REALLY important here! We must never use Phoenix.Ecto.SQL.Sandbox in production code, since it eventually calls :erlang.binary_to_term() to deserialize the payload. An attacker could craft requests with a prepared user-agent header to make this call generate Erlang atoms, which are never garbage collected, until resources are exhausted and our app crashes.

Conclusions and final thoughts

Having both old-world and new-world code side by side during the transition affected the application in various places. Obviously, we needed a database schema that could serve both worlds, and the same holds for our database factory. A few more places were affected as well, and our approach required a good amount of careful planning.

We are glad we took this route, however. When we changed the feature toggle to using the new code in production, we quickly realized a mistake and went back. This meant no downtime, no stress for us, and only a minimal delay needed for fixing the issue and re-deploying.

A few hours later we decided that the new code worked as desired. What followed was a couple of days of removing the old implementation and everything that had become obsolete, starting with the old-world tests, and eventually dropping columns in the database. All the testing-specific modifications shown above were deliberately minimal and easy to find, hence easy to remove.

It looks like we used the frameworks in a way they were not really intended for. While the mechanism for passing metadata to an in-browser test run is documented, the work required for getting it back out is not immediately obvious. Phoenix.Ecto.SQL.Sandbox exposes decode_metadata publicly, but not extract_metadata, which we had to replicate.

It speaks to the ecosystem and community that the necessary steps quickly became clear when looking at the code and trying a few things out. My general impression is that the abstractions around the popular frameworks written in Elixir are mostly paper-thin, and the result is low volume of implementation code that is easy to understand.


With several years of experience in building and maintaining Elixir applications, we can help you build applications that can change as your business does. Get in touch!