penguincoder-org/content/homework/testdouble.md

14 KiB

+++ title = "TestDouble Infra Homework" description = "A treatise of a distributed job system" date = "2023-02-03" +++

Introduction

This is an imaginary description of a complex software project that is designed to take a request from an end user and perform a long-running calculation. This calculation cannot be cached (as that would nullify the outcome of this project) and cannot be performed in a typical amount of time that a browser typically uses for the request-response cycle.

Problem description as presented by TestDouble.

Choices

For this experiment, I will be using Google Cloud Platform to host my solution. As such, products and services chosen represent what is available from GCP. Similar technologies exist for both AWS and Azure and a similar pattern could be applied to this solution.

Complicated Outline


C4Context
  title TestDouble Distributed Pi Calculator

  Person(consumer, "API Consumer")
  Enterprise_Boundary(b0, "Public API") {
    System(app, "Cloud Run", "A serverless app that responds to the public facing API.")

    Boundary(b1, "Data Storage") {
      SystemDb(redis, "MemoryStore", "A managed Redis instance.")
      SystemDb(objects, "Cloud Storage", "Managed object storage for job results.")
    }

    Boundary(b2, "Backend Workers") {
      SystemQueue(pubsub, "Pub/Sub", "Sends events notifications to a worker.")
      System(worker, "Cloud Function", "A serverless function to compute PI.")
    }
  }

  Rel(consumer, app, "HTTP")
  Rel(app, objects, "Get Job Results")
  BiRel(app, redis, "Read Job Status")
  Rel(app, pubsub, "Queue Job")
  Rel(pubsub, worker, "Start Job")
  Rel(worker, objects, "Job Finished")
  Rel(worker, redis, "Write Job Status")

I'm not convinced that C4Context is best for this project...

Part A


flowchart TD
  consumers[API consumers] --> app[Cloud Run]
  app --> redis[Redis MemoryStore]
  app --> pubsub[Pub/Sub]
  pubsub --> worker[Cloud Function]
  worker --> object[Cloud Storage]
  worker --> redis
  app --> object

This is kind of a lot to grok at a glance, but let's take a look at each component and dive into the challenges and features of each decision.

API Consumers

This represents the already-existing application that will be consuming our pi digits. This is pretty hand-wavey, but also not relevant to this exercise.

The Primary Application

This will be a Cloud Run app that will be run each time the API is utilized. Nearly any programming language can be used to create this app, but priority should be given to runtimes that can start quickly and use very little memory. Go or Rust would be a great fit, while a full Ruby on Rails would not. Plain Ruby would work fine, though!

Cloud Run is designed to scale out to as much traffic as is required, and the target throughput of 1000 r/s will be easily met with a serverless solution.

The process running here will be responsible for the following client-side HTTP API.

Method Path Purpose
POST /job Creates a new job
GET /job/<job_id> Gets the current status of the job
GET /job/download/<job_id> Downloads the results from the job

Job Status

I will be storing the job status inside of MemoryStore, which is a managed Redis product from Google. The simple choice here is that if or when the volume of traffic changes, we can quickly upgrade to a better-performing tier with little to no effort on our part.

Why Redis?

Redis is an optimal solution for this problem because of the simplicity of the data and the high-throughput nature of the service. There are no complex transactions or modifications to the data, so utilizing a system that has transactions is likely more iron than we need. Redis can typically repond in less than 2ms for simple data retrieval of the scale that this solution will be using. This system will be using Redis for job status and the location of the job results, only.

Pub/Sub

To trigger the on-demand job, I will be using a PubSub messaging service. This is also a managed service offered by GCP that can scale out to as much or as little traffic as needed. For initial development, we can always use PubSub Lite to keep initial costs down (and reduce costs of non-production deployments). Each time a message is published to our PubSub topic, GCP will spawn a Cloud Function to trigger our job.

Long Running Job

The long computations will be performed in a Cloud Function, which is a serverless platform that can be used to respond to events. Nearly any language can be used to create this component with preference given to low memory usage and ease of maintenance. For particularly large digit computations, the results can be streamed into Cloud Storage as they are calculated (or in a fixed buffer size) to prevent runaway memory growth and manage costs better.

When the function starts, it will update Redis with the new state of the job. When it completes the work, it will store the results in an object in Cloud Storage and update Redis with the object name. Subsequent calls to download the job results can either be downloaded by proxy or you can offer a redirect to a signed URL that can be used to download the job results once.

Part B

Since I have chosen a lot of Google-managed services, the networking aspect of this product is less complex than a more self-hosted one. I am also going to assume that all of the deployment details will be handled with a IaC tool and that users will not be able to just go and grant whatever permission they want because they are all sharing root credentials. More than likely they are, but that's another problem for another day.

Network

All of the described components will live inside of the same VPC. More than likely this is going to be a VPC that is shared among the entire environment (i.e. development or staging). All Google-managed services will have a "magic" endpoint that will be the network location of the service. It will be routable from anywhere in the VPC so we will use IAM roles to control access to each service.

API Consumers

The external users of this service will be the only role, user, or serviceaccount that will have the run.routes.invoke permission on the Cloud Run app. This will also have the storage.objects.get on the Cloud Storage bucket to download results.

Cloud Run

This portion will need to be able to storage.objects.get on the Cloud Storage bucket for the results. Also, there will need to be a Serverless VPC Access connector to allow access from the Cloud Run and Cloud Functions environments. It will also need a roles/pubsub.publisher role applied to it so that it can publish messages to the agreed-upon topic for job creation.

Pub/Sub

The messaging component doesn't have any specific customizations or permissions needed for this component to function correctly.

Cloud Functions

This will need storage.buckets.create on the Cloud Storage bucket to write/stream the results. It will also need roles/pubsub.subscriber on the job creation topic. The previously mentioned connector will apply to this product, too.

Cloud Storage

We'll need to ensure that all of the objects are encrypted every time. Also, the access policy should be private to prevent malicious actors from guessing (or knowing) where the objects are in the bucket. Otherwise, there are not many settings to change for this product.

MemoryStore

This feature doesn't have a lot of the same IAM controls as the rest of the Google suite offers. It has the built-in Redis authentication features and the ACL appears to be disabled. Because of that, "anyone" that has the network address can connect to Redis. While not ideal, it's solvable by using a Secret Manager secret to contain the user password to Redis and then loading that as an environment variable (or similar) in the Cloud Run / Cloud Function products. The information in this example is not sensitive, so leaking that data is not great but also not a soul-crushing loss to the business. For any more complicated systems, there absolutely could be sensitive parameters, so using SSL on every connection and closely guarding the password is mandatory. If you build the process to rotate the credentials when you deploy MemoryStore, then changing the password won't be a challenge. If this proves to be an untenable situation, you can always use a Cloud Storage object instead of MemoryStore+Pub/Sub.

Appendix 1: Advantages of scale

This system should be quite performant, assuming the serverless components can start quickly and use a manageable amount of memory. Using managed services is more expensive than running the services yourself, but it also has the advantage of working right now and generally performs very reliably. You can easily think of the Cloud Run and Cloud Function layers as "limitless" so long as your budget allows. Redis has some practical limits, if nothing else, then the simple speed-of-light response time. However, for the target throughput I am confident that you can provision a reasonable cost-efficient Redis instance quickly at the project onset. Cloud Storage is another area that should not have any practical limits to storage quantity so long as the invoices keep getting paid. PubSub will very easily process 1000 r/s so there should be no concern with that product running out of capacity.

Individually scalable

By having each component as a separate tool that is purpose-built, we can scale each section as necessary. Another not-immediately-obvious benefit of using these managed services is the geographic spread you can achieve by using them. If there is a need to geo-locate your PI digits as close to the user as possible, you can do that with the big name GCP tools. You can also implement organizational rules to limit where your data will be spread. Specifically speaking, if you are dealing with patient data for patients in the USA, you'll need to keep that data entirely on servers hosted in the USA. The same goes for several other countries, but for all data (Russia, China, etc.).

Capacity

There will be ample capacity for 1000 r/s.

Appendix 2: Cost Saving Alternatives

There are several options for reducing costs, if desired. I'm not sure this section is necessary, but these all ran through my head at the same time and I think it's worth bringing up.

Self-host redis

If necessary, you can provision a machine in Compute and run 32-64 different Redis processes on the same machine. Each one consumes about 1Mb of RAM at idle, and during development it will be trivial to provision. Each Redis library should consistently hash between all of the processes, so as long as everyone uses the same configuration environment variables, there will be no cache misses. It's incredibly simple and generally quite robust, but it comes at a cost of additional Terraform and cognitive brain power to manage.

Self-host app

If Serverless becomes too expensive, you can always either self-host using a Compute VM or go to something like Google Kubernetes Engine and have a greater hand in the baseline performance and capacity of the system.

Using more Cloud Storage

This could have easily have been architected to use a Cloud Storage trigger to start the job. Basically instead of using Redis and Pub/Sub, you would write the new job to Cloud Storage and that would trigger the job to run. From there, you would update the object with state changes and results as necessary.

At this time, I don't have conclusive data that this would be less expensive than my current solution, but it is nevertheless an option to pursue, particularly if there are other organizational restrictions that might hinder the initial plan.

Using an NFS share

You could concievably use FileStore to provide a shared persistent disk between workers and app processes. This would have the benefit of being a more consistent cost, but this cost will be for the maximum possible storage size you will need. With Cloud Storage, you can store 100Gb and only pay for 100Gb, but with FileStore, you have to provision something larger than your expected data set. If you provision a 1Tb FileStore and only use 100Gb, you'll still pay for the full 1Tb.

Appendix 3: Happy Path


sequenceDiagram
  Consumer->>App: POST /job
  App->>Redis: Insert initial job state + UUID
  App->>PubSub: Notify worker
  PubSub->>Worker: Spawn function
  Worker->>Cloud Storage: Store job results
  Worker->>Redis: Change state to finished + object name
  Consumer->>App: GET /job/download/UUID
  Consumer->>Cloud Storage: GET s3://my-bucket-name/object/path

Appendix 4: Job Schema


classDiagram
  class Job {
    JobState state
    uuid id
    string results_object_name
  }

  class JobState {
    STARTED
    RUNNING
    FINISHED
    ERRORED
  }

  Job --> JobState

Appendix 5: Job States


stateDiagram-v2
  [*] --> Created
  Created --> Queued
  Queued --> Running
  Running --> Finished
  Running --> Errored

  Finished --> [*]
  Errored --> [*]