Volatile downloads with Elixir and Phoenix

At some point, almost every product served by a web application needs reporting functionality for its admins. This may be a page showing a graph and a table, or a spreadsheet download. Admins can usually select a date range, and possibly provide more parameters for filtering, sorting, and aggregation.

Admins get better insight and can pass the data on to other layers of the business. Everybody is happy!

Until … the page rendering or download starts becoming slow and occasionally fails to complete. By now, the feature is integrated into business workflows to a degree that failing to produce data for longer than a weekend becomes a major problem. Admins start learning about 504 Gateway Timeout and the project’s infrastructure, and begin heckling developers with good ideas:

“Can’t we just upload yesterday’s report onto S3?”

We ran into this problem while we were working heavily on a different part of the application and we needed a quick and cheap fix.

This blog post demonstrates a solution which is easy to set up. It exclusively uses standard language tools of Elixir. At the end, I will compare it to implementations in traditional languages.

Building blocks of a solution

We need jobs that handle the background calculation for a report, including storing the data. These jobs need to be discoverable, so we can notify an admin that their download of the sales report for the previous two days is still in the making. When the calculation is finished, the results need to be stored. Stored results must be discoverable as well, so a click on the report button will eventually serve the results. To avoid a high bill or running into resource limits, we also want a limited retention time for results.

Usual approaches to the problem will also include job persistence and restart. This means that if a job is interrupted by a reboot or a redeploy, it will be restarted and not get lost.

Life without persistence (not as sad as it sounds)

We decided that job persistence and restarts were not needed in our situation. Also, reports are short-lived, so we can live with them vanishing during a deploy.

With these requirements out of the way, it becomes possible to implement the whole feature without any persistence. Reports would be short-lived in RAM and served directly from there.

Elixir, or the Erlang platform in general, is significantly different from what most developers are used to. Data is never shared between execution threads. Instead, separate execution units, called processes, operate on data in isolation. So we need to set up these processes first. The platform encourages building a “supervision tree”, where dedicated supervisor processes take care of child processes when they die unexpectedly.

Building a process tree

These are the processes we need:

a top-level supervisor for the subsystem
a dynamic supervisor, supervising all the job processes
the jobs themselves

Why do we need two supervisors instead of just one for supervising the jobs? Later on, we will introduce another process which is not a job. It will be supervised by the top-level supervisor of the subsystem. Therefore, two supervisors are required.

Here’s the complete code for starting the supervision tree and launching jobs under it, with a few placeholders to be filled later on:

defmodule MyApp.Jobs do
  # Placeholder for code which appears in the next iteration, see below.
  @job_supervisor __MODULE__.JobSupervisor

  def child_spec(_) do
    children = [
      # Placeholder
      {DynamicSupervisor, strategy: :one_for_one, name: @job_supervisor}
    ]

    %{
      id: __MODULE__.Supervisor,
      start: {Supervisor, :start_link, [children, [strategy: :one_for_all, name: __MODULE__.Sup]]}
    }
  end

  def spawn_job(realm, payload, fun) do
    # Placeholder
    #
    #
    #
    #
    #
    spawn_new_job(realm, payload, fun)
    # Placeholder
  end

  defp spawn_new_job(realm, payload, calculation) do
    #

    job_fun = fn ->
      # Placeholder

      # (in the next iteration, see below, this line will be replaced:)
      calculation.(payload)
      # Placeholder
      #
      #

      # Placeholder
    end

    {:ok, _} = DynamicSupervisor.start_child(@job_supervisor, {__MODULE__.Job, job_fun})

    :started
  end

  # Placeholder
  #
  #

  # Placeholder

  defmodule Job do
    def child_spec(fun) do
      %{id: __MODULE__, start: {__MODULE__, :start_link, [fun]}, restart: :transient}
    end

    def start_link(fun) do
      {:ok, spawn_link(fun)}
    end
  end
end

Now, we are ready to side-load work in a way which outlasts a single web request:

MyApp.Jobs.spawn_job(
  :sales_report,
  [date_range: ...],
  &calculate_sales_report_for_options/1
)
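For the subsystem to be running at all, it has to be started somewhere, typically as a child of the application's supervisor. Listing the bare module name makes Supervisor call its child_spec/1 with an empty list. The following is a minimal sketch: Demo.Jobs is a hypothetical, trimmed-down stand-in for MyApp.Jobs, and the surrounding supervisor stands in for whatever your application already starts. The type: :supervisor entry is an addition in this sketch.

```elixir
defmodule Demo.Jobs do
  # Hypothetical stand-in for MyApp.Jobs, reduced to the supervision setup.
  @job_supervisor __MODULE__.JobSupervisor

  def child_spec(_) do
    children = [
      {DynamicSupervisor, strategy: :one_for_one, name: @job_supervisor}
    ]

    %{
      id: __MODULE__.Supervisor,
      # Added for this sketch: marks the child as a supervisor.
      type: :supervisor,
      start: {Supervisor, :start_link, [children, [strategy: :one_for_all, name: __MODULE__.Sup]]}
    }
  end
end

# Passing the module name makes Supervisor call Demo.Jobs.child_spec([]).
{:ok, app_sup} = Supervisor.start_link([Demo.Jobs], strategy: :one_for_one)

# Both the subsystem supervisor and the job supervisor are registered by name.
IO.inspect(Supervisor.which_children(app_sup))
IO.inspect(Process.whereis(Demo.Jobs.Sup))
IO.inspect(Process.whereis(Demo.Jobs.JobSupervisor))
```

In a Phoenix project, the equivalent would presumably be one extra entry in the children list of the application module.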

Adding the missing features

In this section, we update the code above to do everything else we need. All those “placeholder” lines will now be filled.

We need to know which jobs are being run in a process. This is a perfect use-case for Elixir’s built-in Registry: it keeps track of processes registered under an arbitrary key and it allows for storage of additional metadata per process.

Our top-level supervisor will get a Registry child process for this purpose. This registry maps a descriptive key to the corresponding job process and in addition, we will use it to store each job’s result, making it easy to retrieve the results.

However, it is important to know that registry entries are evicted once a registered process exits! We side-step this issue by making an important decision: A job process will not only calculate the data, it manages the lifetime of that data as well. By adding a final :timer.sleep(..) to the job process after storing the result as metadata in the registry, the lifetime of the data is tied to that of the job. Even better, the registry process will automatically clean up afterwards.
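The eviction behaviour is easy to observe in isolation. The sketch below uses a hypothetical registry name (Demo.Registry) and key; it registers a short-lived process, stops it, and then checks that the entry has disappeared. Cleanup happens in the background, so the sketch polls briefly rather than assuming it is instantaneous.

```elixir
{:ok, _} = Registry.start_link(keys: :unique, name: Demo.Registry)

key = {:sales_report, :demo}
parent = self()

# A short-lived process registers itself, reports back, and waits for :stop.
pid =
  spawn(fn ->
    {:ok, _} = Registry.register(Demo.Registry, key, :running)
    send(parent, :registered)

    receive do
      :stop -> :ok
    end
  end)

receive do
  :registered -> :ok
end

# While the process is alive, its entry is visible.
[{^pid, :running}] = Registry.lookup(Demo.Registry, key)

# Stop the process and wait until it is really gone.
ref = Process.monitor(pid)
send(pid, :stop)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

# Poll until the entry is no longer returned.
wait_until_gone = fn wait_until_gone, attempts ->
  case Registry.lookup(Demo.Registry, key) do
    [] ->
      :evicted

    _ when attempts > 0 ->
      Process.sleep(10)
      wait_until_gone.(wait_until_gone, attempts - 1)

    _ ->
      :still_registered
  end
end

outcome = wait_until_gone.(wait_until_gone, 50)
IO.inspect(outcome)
```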

Here are the changes needed for starting the registry:

# Code replacing lines 1..2

defmodule MyApp.Jobs do
  @registry __MODULE__.Registry

# Code replacing lines 6..9

    children = [
      Registry.child_spec(keys: :unique, name: @registry),
      {DynamicSupervisor, strategy: :one_for_one, name: @job_supervisor}
    ]

Here is a screenshot from Erlang’s observer tool, showing the relevant part of the supervision tree. Note that the process names all start with “Elixir.”, because this is what Elixir module names look like from the Erlang side.

Process registration stores a process’s PID together with a key and a value. An obvious choice for a job key would be a tuple like {realm, job_options}. The stored value, which can be changed later on, should give us information about the state of the job and eventually hold the data.
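As a small illustration of these three pieces (PID, key, value), here is a sketch that registers the calling process, updates its value, and looks it up again. The registry name, the key contents, and the result data are made up for the example.

```elixir
{:ok, _} = Registry.start_link(keys: :unique, name: Demo.JobRegistry)

# Hypothetical key in the {realm, job_options} shape.
key = {:sales_report, [date_range: :last_two_days]}

# Register the calling process under the key, with an initial state value.
{:ok, _owner} = Registry.register(Demo.JobRegistry, key, :running)
[{pid, :running}] = Registry.lookup(Demo.JobRegistry, key)
true = pid == self()

# Only the process that registered the key can update its value.
# update_value/3 returns {new_value, old_value}.
{new_value, old_value} =
  Registry.update_value(Demo.JobRegistry, key, fn _ -> %{result: [total: 42]} end)

IO.inspect({old_value, new_value})
IO.inspect(Registry.lookup(Demo.JobRegistry, key))
```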

Before spawning a new job, we check the registry. In case a job has been found, we return its value instead of :started.

# Code replacing lines 17..26

  def spawn_job(realm, payload, fun) do
    Registry.lookup(@registry, registry_key(realm, payload))
    |> case do
      [{_pid, value}] ->
        value

      [] ->
        spawn_new_job(realm, payload, fun)
    end
  end

Remember the anonymous function job_fun which is run inside the job process. The first thing it now does is register the current process, with the value :running. Another immediate call to spawn_job would return this value. Once the job has calculated its result, it uses Registry.update_value to store it. A subsequent call to spawn_job would then return the map %{result: result}. Finally, the job_fun function waits for the (hard-coded) retention time.

# Code replacing lines 28..41

  defp spawn_new_job(realm, payload, calculation) do
    registry_key = registry_key(realm, payload)

    job_fun = fn ->
      Registry.register(@registry, registry_key, :running)

      run_calculation_and_pass_value(calculation, payload, fn result ->
        registry_value = %{result: result}

        {_, _} = Registry.update_value(@registry, registry_key, fn _ -> registry_value end)
      end)

      :timer.sleep(:timer.minutes(10))
    end

The following function helps to ensure that the job result value does not leak into the job_fun function above. Otherwise, it might not be considered for garbage collection (however, the compiler might already deal with this properly).

# Code replacing lines 48..50

  def run_calculation_and_pass_value(calculation, payload, pass) do
    payload |> calculation.() |> pass.()
  end

We have also extracted a function for calculating the registry key.

# Code replacing line 52

  def registry_key(realm, payload), do: {realm, payload}
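With all pieces in place, spawn_job/3 can return three kinds of values: :started, :running, or %{result: result}. A caller, for example a controller action, has to turn these into a response. A minimal sketch of that mapping follows; the module name, status codes, and messages are assumptions for illustration and not part of our application.

```elixir
defmodule Demo.ReportResponse do
  # Hypothetical helper: maps the return value of spawn_job/3 to a
  # {status_code, body} pair. Codes and messages are made up for this sketch.
  def describe(:started), do: {202, "Report calculation started, please check back soon."}
  def describe(:running), do: {202, "Report is still being calculated."}
  def describe(%{result: result}), do: {200, result}
end

IO.inspect(Demo.ReportResponse.describe(:started))
IO.inspect(Demo.ReportResponse.describe(:running))
IO.inspect(Demo.ReportResponse.describe(%{result: "date,total\n2024-01-01,42\n"}))
```

Because the job keeps its registry entry for the retention time, repeated clicks on the report button within that window would hit the last clause and serve the stored data.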

Analysis of our solution

These are the components of our solution:

Process tree setup
Retrieving a job result or spawning of job process
Registration and storage of results inside a job process
Calculating the data
Sleeping for the duration

I do not think that our approach can be substantially simplified.

The Erlang architecture forces us to push the side-loaded work into separate processes, which we need a supervisor for.
We could keep track of job processes inside what’s called an ETS table, and monitor processes for termination in order to clean up after them. Or use Registry, which does all of that for us.
(In-memory) persistence of job results and data retention are just two lines of code.
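To get a feeling for what Registry saves us, here is a rough sketch of the hand-rolled alternative mentioned above: a GenServer that owns an ETS table, monitors tracked processes, and deletes their entries when they terminate. All names (Demo.Tracker, :demo_jobs) are hypothetical, and the sketch leaves out result storage.

```elixir
defmodule Demo.Tracker do
  use GenServer

  @table :demo_jobs

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def track(key, pid), do: GenServer.call(__MODULE__, {:track, key, pid})
  def lookup(key), do: :ets.lookup(@table, key)

  @impl true
  def init(_) do
    # Other processes may read the table, only the tracker writes to it.
    :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, %{}}
  end

  @impl true
  def handle_call({:track, key, pid}, _from, refs) do
    if :ets.insert_new(@table, {key, pid}) do
      # Remember which key belongs to which monitor reference.
      ref = Process.monitor(pid)
      {:reply, :ok, Map.put(refs, ref, key)}
    else
      {:reply, {:error, :already_tracked}, refs}
    end
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, refs) do
    # Clean up after a terminated process.
    {key, refs} = Map.pop(refs, ref)
    :ets.delete(@table, key)
    {:noreply, refs}
  end
end

{:ok, _} = Demo.Tracker.start_link([])

job =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

:ok = Demo.Tracker.track(:sales_report, job)
[{:sales_report, ^job}] = Demo.Tracker.lookup(:sales_report)
duplicate = Demo.Tracker.track(:sales_report, job)

send(job, :stop)

# Cleanup is asynchronous, so poll briefly.
cleaned_up =
  Enum.any?(1..50, fn _ ->
    Process.sleep(10)
    Demo.Tracker.lookup(:sales_report) == []
  end)

IO.inspect(cleaned_up)
```

That is roughly forty lines of bookkeeping which the built-in Registry makes unnecessary.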

Comparison

At the top of this post, I suggested that we might be able to estimate and compare the complexity of this solution against a classical approach. For the scope of this post I’d like to compare with a language that offers explicitly-managed threads and uses mutable data structures not protected by type system mechanisms.

Comparing merely lines of code is not interesting. Developers can get used to common boilerplate in a language. Much more interesting, albeit not measurable and highly subjective, is the mental weight a developer needs to lift in order to build and maintain a solution. Particularly the kind of mental burden that is not visible inside the code itself.

Confidence in our Elixir solution

Did you notice there is a race condition when two admins request the same report in the same second? Some time can elapse between checking for a running job process and spawning one. If a second process for the same job is spawned, it tries to register under the same key and will crash. The unlucky admin will see an error page. The other job will simply carry on. Yet, I am not bothered in the slightest by this rare possibility.
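What the losing process observes can be reproduced in a few lines. The sketch below assumes a unique registry as configured above, with a hypothetical name and key. Note that Registry.register/3 reports the conflict as a return value, so whether the second job crashes right at this call depends on whether it matches on {:ok, _}.

```elixir
{:ok, _} = Registry.start_link(keys: :unique, name: Demo.RaceRegistry)

key = {:sales_report, :same_options}
parent = self()

# The first job wins the race and registers the key.
first =
  spawn(fn ->
    {:ok, _} = Registry.register(Demo.RaceRegistry, key, :running)
    send(parent, :first_registered)
    Process.sleep(:infinity)
  end)

receive do
  :first_registered -> :ok
end

# A second process (here: the script itself) tries the same key and is refused.
duplicate = Registry.register(Demo.RaceRegistry, key, :running)
IO.inspect(duplicate)

# The first job and its registry entry are untouched.
[{^first, :running}] = Registry.lookup(Demo.RaceRegistry, key)
```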

While the solution above is not entirely bug-free, I expect it to be resilient and free of lock-ups and memory leaks. Should a process or even our whole subtree of processes crash, users might not even notice that, and there will never be a degradation of other services. The reports feature itself recovers immediately. There cannot be any memory leak (provided Elixir’s Registry does not leak memory).

Mental weight of the Elixir solution

On the Erlang platform, everything is so different! You cannot get away with not learning the concepts beyond some point. That means we carry some mental weight around.

Reading and understanding a process tree setup is not easy. Spawning a child under a dynamic supervisor is not something people would do every day. Understanding how arguments are passed from the supervisor call to the child_spec function to start_link is difficult, and the idiosyncratic tuples and maps involved are off-putting. The registry is not just a simple key-value store, so its parameters and response tuples must be understood.

Most importantly, one must understand why this solution works. This blog post hopefully helps with that.

On the other hand, did you notice there is no error handling or defensive coding in the Elixir solution at all? The Erlang platform has us covered: errors are confined to the processes they occur in, and required processes are restarted automatically. Memory is automatically reclaimed and cannot leak because it was never shared in the first place.
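The confinement of errors can be demonstrated with a deliberately failing job. In this sketch, Demo.Job mirrors the shape of the Job module above, but the child is started as :temporary (instead of :transient) so that the failed job is not restarted and the output stays simple. The supervisor may log a report about the terminated child.

```elixir
defmodule Demo.Job do
  # Same shape as the Job module above: start_link/1 wraps spawn_link/1.
  def start_link(fun), do: {:ok, spawn_link(fun)}
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# A job that fails as soon as it is told to.
failing_job = fn ->
  receive do
    :fail -> exit(:calculation_failed)
  end
end

spec = %{id: Demo.Job, start: {Demo.Job, :start_link, [failing_job]}, restart: :temporary}
{:ok, job} = DynamicSupervisor.start_child(sup, spec)

ref = Process.monitor(job)
send(job, :fail)

reason =
  receive do
    {:DOWN, ^ref, :process, ^job, reason} -> reason
  end

# The job is gone, but the supervisor and the calling process are unaffected.
IO.inspect(reason)
IO.inspect(Process.alive?(sup))

# New jobs can still be started under the same supervisor.
ok_spec = %{spec | start: {Demo.Job, :start_link, [fn -> :ok end]}}
{:ok, _next_job} = DynamicSupervisor.start_child(sup, ok_spec)
```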

Confidence and mental weight of a classical solution

Assume somebody implements a very naive solution, spawning threads which register in a global data structure. It will “just work”, until something bad happens. Let’s have a look at what could go wrong:

When a job calculation fails, this will bring down the thread. This probably leaks memory. Even worse, it might not be possible to restart the job because the system still considers it running. We could use exception handling to correct for this, or continuously monitor all registered threads. This handling would need to clean up all registration data of the failing thread, as well as data that cannot be associated with a running thread.

If the registration data structure is not atomic (or access to it not wrapped accordingly), all kinds of things can happen, from duplicate job threads to corruption, taking the entire application down. For the most part, interaction with this data must be serialized by using locking.

We are dealing with invisible necessities of the programming model here. It takes a sufficiently experienced programmer to recognize them and to concentrate on doing the right thing in certain places. Otherwise, a naive solution would have bugs that are subtle, notoriously hard to reproduce, and hard to locate.

Conclusion

Elixir is rather different, but we could solve our problem with relatively little effort. The paradigm is really solid, and many problems have already been solved in the stack underneath our code. We were able to quickly introduce a technique that is relatively simple and well understood.
