+++
title = "TestDouble Infra Homework"
description = "A treatise of a distributed job system"
date = "2023-02-03"
+++
# Introduction
This is an ~~imaginary~~ description of a complex software project that takes a request from an end user and performs a long-running calculation. The calculation cannot be cached (as that would nullify the outcome of this project) and cannot be completed within the time a browser typically allows for a request-response cycle.
[Problem description](https://gitlab.com/testdouble-infrastructure/api-infrastructure-take-home) as presented by TestDouble.
# Choices
For this experiment, I will be using [Google Cloud Platform](https://cloud.google.com/) to host my solution. As such, the products and services chosen represent what is available from GCP. Similar technologies exist for both AWS and Azure, so the same pattern could be applied there.
# Complicated Outline
<pre class="mermaid">
C4Context
title TestDouble Distributed Pi Calculator
Person(consumer, "API Consumer")
Enterprise_Boundary(b0, "Public API") {
System(app, "Cloud Run", "A serverless app that responds to the public facing API.")
Boundary(b1, "Data Storage") {
SystemDb(redis, "MemoryStore", "A managed Redis instance.")
SystemDb(objects, "Cloud Storage", "Managed object storage for job results.")
}
Boundary(b2, "Backend Workers") {
SystemQueue(pubsub, "Pub/Sub", "Sends event notifications to a worker.")
System(worker, "Cloud Function", "A serverless function to compute PI.")
}
}
Rel(consumer, app, "HTTP")
Rel(app, objects, "Get Job Results")
BiRel(app, redis, "Read Job Status")
Rel(app, pubsub, "Queue Job")
Rel(pubsub, worker, "Start Job")
Rel(worker, objects, "Job Finished")
Rel(worker, redis, "Write Job Status")
</pre>
I'm not convinced that [C4Context](https://mermaid.js.org/syntax/c4c.html) is best for this project...
# Part A
<pre class="mermaid">
flowchart TD
consumers[API consumers] --> app[Cloud Run]
app --> redis[Redis MemoryStore]
app --> pubsub[Pub/Sub]
pubsub --> worker[Cloud Function]
worker --> object[Cloud Storage]
worker --> redis
app --> object
</pre>
This is kind of a lot to grok at a glance, but let's take a look at each component and dive into the challenges and features of each decision.
## API Consumers
This represents the already-existing application that will be consuming our _pi_ digits. This is pretty hand-wavey, but also not relevant to this exercise.
## The Primary Application
This will be a [Cloud Run](https://cloud.google.com/run) app that will be run each time the API is utilized. Nearly any programming language can be used to create this app, but priority should be given to runtimes that can start quickly and use very little memory. Go or Rust would be a great fit, while a full Ruby on Rails app would not. Plain Ruby would work fine, though!
Cloud Run is designed to scale out to as much traffic as is required, and the target throughput of `1000 r/s` will be easily met with a serverless solution.
The process running here will be responsible for the following client-side HTTP API.
|Method|Path|Purpose|
|------|----|-------|
|`POST`|`/job`|Creates a new job|
|`GET`|`/job/<job_id>`|Gets the current status of the job|
|`GET`|`/job/download/<job_id>`|Downloads the results from the job|
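
The three endpoints above can be sketched as a small route table. This is a minimal illustration in Python using only the standard library, not the actual app; the handler names (`create_job`, `get_status`, `download_results`) are hypothetical labels that do not appear in the design.

```python
import re
from typing import Optional, Tuple

# Route table mirroring the three endpoints in the table above.
# The handler names are hypothetical labels, not part of the original design.
ROUTES = [
    ("POST", re.compile(r"^/job$"), "create_job"),
    ("GET", re.compile(r"^/job/download/(?P<job_id>[^/]+)$"), "download_results"),
    ("GET", re.compile(r"^/job/(?P<job_id>[^/]+)$"), "get_status"),
]


def match_route(method: str, path: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (handler_name, job_id) for a request, or None if nothing matches."""
    for route_method, pattern, handler in ROUTES:
        if method != route_method:
            continue
        found = pattern.match(path)
        if found:
            # Routes without a job_id group (POST /job) yield None here.
            return handler, found.groupdict().get("job_id")
    return None
```

One thing the sketch makes visible: `GET /job/download` with no ID would be read as a status request for a job called `download`, so job IDs should be validated (for example as UUIDs) before they are looked up.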
## Job Status
I will be storing the job status inside of [MemoryStore](https://cloud.google.com/memorystore), which is a managed [Redis](https://redis.io) product from Google. The appeal of this choice is that _if_ or _when_ the volume of traffic changes, we can quickly upgrade to a better-performing tier with little to no effort on our part.
### Why Redis?
Redis is an optimal solution for this problem because of the simplicity of the data and the high-throughput nature of the service. There are no complex transactions or modifications to the data, so utilizing a system that has transactions is likely more iron than we need. Redis can typically respond in less than 2ms for simple data retrieval at the scale this solution will be using. This system will use Redis only for the job status and the location of the job results.
## Pub/Sub
To trigger the on-demand job, I will be using a [PubSub](https://cloud.google.com/pubsub) messaging service. This is also a managed service offered by GCP that can scale out to as much or as little traffic as needed. For initial development, we can always use [PubSub Lite](https://cloud.google.com/pubsub/docs/choosing-pubsub-or-lite) to keep initial costs down (and reduce costs of non-`production` deployments). Each time a message is published to our PubSub topic, GCP will spawn a Cloud Function to [trigger](https://cloud.google.com/functions/docs/calling/pubsub) our job.
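
As a rough sketch of what travels over the topic: Pub/Sub message bodies are bytes, and event-triggered functions receive them base64-encoded in a `data` field. The payload shape below (`job_id` and `digits`) is an assumption for illustration; the design above does not define the message contents.

```python
import base64
import json


def encode_job_message(job_id: str, digits: int) -> bytes:
    """Serialize a job request into the bytes a publisher would send.

    The `job_id` and `digits` fields are hypothetical; they are not
    specified anywhere in the design.
    """
    return json.dumps({"job_id": job_id, "digits": digits}).encode("utf-8")


def decode_event(event: dict) -> dict:
    """Decode the base64 `data` field of a Pub/Sub-style event envelope."""
    raw = base64.b64decode(event["data"])
    return json.loads(raw.decode("utf-8"))
```

Keeping the message down to an ID and the job parameters means the worker reads everything else from Redis, which keeps the topic contract small.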
## Long Running Job
The long computations will be performed in a [Cloud Function](https://cloud.google.com/functions), which is a serverless platform that can be used to respond to events. Nearly any language can be used to create this component with preference given to low memory usage and ease of maintenance. For particularly large digit computations, the results can be [streamed](https://cloud.google.com/storage/docs/samples/storage-stream-file-upload) into Cloud Storage as they are calculated (or in a fixed buffer size) to prevent runaway memory growth and manage costs better.
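
The fixed-buffer idea can be sketched like this. The digit generator is Gibbons' unbounded spigot, swapped in here purely for illustration (the design does not pick an algorithm), and `sink` stands in for whatever writable stream the storage client provides. Note that only the output buffer is bounded; the spigot's internal integers still grow as more digits are produced.

```python
import io
import itertools
from typing import BinaryIO, Iterator


def pi_digits() -> Iterator[int]:
    """Yield decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (
                10 * q, 10 * (r - n * t), t, k,
                (10 * (3 * q + r)) // t - 10 * n, l,
            )
        else:
            q, r, t, k, n, l = (
                q * k, (2 * q + r) * l, t * l, k + 1,
                (q * (7 * k + 2) + r * l) // (t * l), l + 2,
            )


def stream_digits(count: int, sink: BinaryIO, buffer_size: int = 4096) -> int:
    """Write `count` digits to `sink` in chunks of at most `buffer_size` bytes.

    Returns the number of write calls that were made.
    """
    buffer = bytearray()
    writes = 0
    for digit in itertools.islice(pi_digits(), count):
        buffer.append(ord("0") + digit)
        if len(buffer) >= buffer_size:
            sink.write(bytes(buffer))
            buffer.clear()
            writes += 1
    if buffer:  # flush whatever is left over
        sink.write(bytes(buffer))
        writes += 1
    return writes
```

For a local check, `stream_digits(10, io.BytesIO(), buffer_size=4)` flushes three times: two full buffers and one partial one.
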
When the function starts, it will update Redis with the new state of the job. When it completes the work, it will store the results in an object in [Cloud Storage](https://cloud.google.com/storage) and update Redis with the object name. Subsequent requests to download the job results can either be proxied through the app or redirected to a signed URL that can be used to download the results once.
# Part B
Since I have chosen a lot of Google-managed services, the networking aspect of this product is less complex than a more self-hosted one. I am also going to assume that all of the deployment details will be handled with an IaC tool and that users will not be able to just go and grant whatever permission they want because they are all sharing root credentials. More than likely they are, but that's another problem for another day.
## Network
All of the described components will live inside of the same VPC. More than likely this is going to be a VPC that is shared among the entire environment (i.e. `development` or `staging`). All Google-managed services will have a "magic" endpoint that will be the network location of the service. It will be routable from anywhere in the VPC so we will use IAM roles to control access to each service.
## API Consumers
The external users of this service will be the only role, user, or service account granted the `run.routes.invoke` permission on the Cloud Run app. They will also be granted `storage.objects.get` on the Cloud Storage bucket so they can download results.
## Cloud Run
This component will need `storage.objects.get` on the Cloud Storage bucket to read the results. It will also need the `roles/pubsub.publisher` role so that it can publish messages to the agreed-upon topic for job creation. Finally, a `Serverless VPC Access connector` is required so that the Cloud Run and Cloud Functions environments can reach resources inside the VPC, such as MemoryStore.
## Pub/Sub
The messaging component doesn't need any specific customizations or permissions of its own to function correctly.
## Cloud Functions
This will need `storage.objects.create` on the Cloud Storage bucket to write/stream the results (`storage.buckets.create` would only allow creating new buckets, not writing objects). It will also need `roles/pubsub.subscriber` on the job creation topic. The previously mentioned connector will apply to this product, too.
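
Pulling the grants from the preceding sections together, a least-privilege check could be sketched as below. The component keys are hypothetical labels, and the table assumes that writing results requires the object-level `storage.objects.create` permission. In practice these grants would live in the IaC tool rather than in application code.

```python
# Grants per component, as described in the sections above.
# Assumption: writing results needs `storage.objects.create` (an object-level
# permission). The component keys are hypothetical labels.
GRANTS = {
    "api-consumer": {"run.routes.invoke", "storage.objects.get"},
    "cloud-run": {"storage.objects.get", "roles/pubsub.publisher"},
    "cloud-function": {"storage.objects.create", "roles/pubsub.subscriber"},
}


def is_allowed(component: str, grant: str) -> bool:
    """Return True if the component has been given the named grant."""
    return grant in GRANTS.get(component, set())


def over_privileged(component: str, needed: set) -> set:
    """Return grants the component holds but does not need."""
    return GRANTS.get(component, set()) - needed
```

A check like `is_allowed("cloud-run", "storage.objects.create")` returning `False` is the property worth testing: only the worker should be able to write results.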
## Cloud Storage
We'll need to ensure that all of the objects are encrypted every time. Also, the access policy should be `private` to prevent malicious actors from guessing (or knowing) where the objects are in the bucket. Otherwise, there are not many settings to change for this product.
## MemoryStore
This feature doesn't have a lot of the same IAM controls as the rest of the Google suite offers. It has the built-in Redis authentication features and the ACL appears to be disabled. Because of that, "anyone" that has the network address can connect to Redis. While not ideal, it's solvable by using a Secret Manager secret to contain the user password to Redis and then loading that as an environment variable (or similar) in the Cloud Run / Cloud Function products. The information in this example is not sensitive, so leaking that data is not great but also not a soul-crushing loss to the business. For any more complicated systems, there absolutely could be sensitive parameters, so using SSL on every connection and closely guarding the password is mandatory. If you build the process to rotate the credentials when you deploy MemoryStore, then changing the password won't be a challenge. If this proves to be an untenable situation, you can always use a [Cloud Storage object](#using-more-cloud-storage) instead of MemoryStore+Pub/Sub.
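
A minimal sketch of the environment-variable approach, assuming the Secret Manager secret has already been exposed to the process. The variable names `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` are hypothetical, and the default port is simply the stock Redis port; a TLS-enabled instance may listen elsewhere.

```python
import os
from urllib.parse import quote


def redis_url_from_env(env=os.environ) -> str:
    """Build a TLS Redis URL from environment variables.

    REDIS_HOST, REDIS_PORT and REDIS_PASSWORD are hypothetical variable names;
    the password is expected to be injected from a Secret Manager secret.
    """
    host = env["REDIS_HOST"]
    port = env.get("REDIS_PORT", "6379")  # stock Redis port, an assumption
    password = env.get("REDIS_PASSWORD")
    if not password:
        raise RuntimeError("REDIS_PASSWORD is not set; refusing to connect unauthenticated")
    # `rediss://` (double s) is the conventional URL scheme for Redis over TLS.
    # The password is percent-encoded so characters like `@` or `/` are safe.
    return f"rediss://:{quote(password, safe='')}@{host}:{port}/0"
```

Failing fast when the password is missing avoids silently falling back to an unauthenticated connection, and reading the value at startup means a rotated credential is picked up by the next deployment.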
# Appendix 1: Advantages of scale
This system should be quite performant, assuming the serverless components can start quickly and use a manageable amount of memory. Using managed services is more expensive than running the services yourself, but it also has the advantage of working **right now** and generally performs very reliably. You can easily think of the Cloud Run and Cloud Function layers as "limitless" so long as your budget allows. Redis has some practical limits, if nothing else the speed-of-light floor on network response time. However, for the target throughput I am confident that you can provision a reasonably cost-efficient Redis instance quickly at the project onset. Cloud Storage is another area that should not have any practical limits to storage quantity so long as the invoices keep getting paid. PubSub will very easily process `1000 r/s` so there should be no concern with that product running out of capacity.
## Individually scalable
By having each component as a separate tool that is purpose-built, we can scale each section as necessary. Another not-immediately-obvious benefit of using these managed services is the geographic spread you can achieve by using them. If there is a need to geo-locate your PI digits as close to the user as possible, you can do that with the big name GCP tools. You can also implement organizational rules to limit where your data will be spread. Specifically speaking, if you are dealing with patient data for patients in the USA, you'll need to keep that data entirely on servers hosted in the USA. The same goes for several other countries, but for _all data_ (Russia, China, etc.).
## Capacity
There will be ample capacity for `1000 r/s`.
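
One way to sanity-check that claim against instance or concurrency quotas is Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The durations in the usage note are hypothetical; the design does not state how long a job takes.

```python
import math


def required_concurrency(requests_per_second: float, seconds_per_request: float) -> int:
    """Little's law: average in-flight work = arrival rate x time in system."""
    return math.ceil(requests_per_second * seconds_per_request)
```

For example, at `1000 r/s` an API call that takes half a second implies about 500 requests in flight, while a worker job that takes 30 seconds implies about 30,000 concurrent function executions if every request creates a job. The second number is the one to compare against the platform's quotas.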
# Appendix 2: Cost Saving Alternatives
There are several options for reducing costs, if desired. I'm not sure this section is necessary, but these all ran through my head at the same time and I think it's worth bringing up.
## Self-host Redis
If necessary, you can provision a machine in Compute and run 32-64 different Redis processes on the same machine. Each one consumes about 1 MB of RAM at idle, and during development it will be trivial to provision. Each Redis client library should hash keys consistently across all of the processes, so as long as every client uses the same configuration environment variables, there will be no cache misses. It's incredibly simple and generally quite robust, but it comes at the cost of additional Terraform and the cognitive load of managing it.
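
To illustrate the client-side sharding idea, here is a small consistent-hash ring. It is a sketch of the general technique, not what any particular Redis client library does internally; the class name, the replica count, and the use of SHA-256 are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """A small consistent-hash ring mapping keys to Redis process addresses."""

    def __init__(self, nodes, replicas: int = 64):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring `replicas` times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the node owning `key`: the first ring point at or after its hash."""
        index = bisect.bisect_left(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[index][1]
```

Two clients built from the same node list, in any order, agree on where every key lives, which is the "same configuration" requirement above. Removing a process only remaps the keys that process owned.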
## Self-host app
If Serverless becomes too expensive, you can always either self-host using a Compute VM or go to something like [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine) and have a greater hand in the baseline performance and capacity of the system.
## Using more Cloud Storage
This could easily have been architected to use a Cloud Storage trigger to start the job. Basically, instead of using Redis and Pub/Sub, you would write the new job to Cloud Storage and that would trigger the job to run. From there, you would update the object with state changes and results as necessary.
At this time, I don't have conclusive data that this would be less expensive than my current solution, but it is nevertheless an option to pursue, particularly if there are other organizational restrictions that might hinder the initial plan.
## Using an NFS share
You could conceivably use [FileStore](https://cloud.google.com/filestore/) to provide a shared persistent disk between workers and app processes. This would have the benefit of being a more consistent cost, but this cost will be for the maximum possible storage size you will need. With Cloud Storage, you can store 100 GB and only pay for 100 GB, but with FileStore, you have to provision something larger than your expected data set. If you provision a 1 TB FileStore and only use 100 GB, you'll still pay for the full 1 TB.
# Appendix 3: Happy Path
<pre class="mermaid">
sequenceDiagram
Consumer->>App: POST /job
App->>Redis: Insert initial job state + UUID
App->>PubSub: Notify worker
PubSub->>Worker: Spawn function
Worker->>Cloud Storage: Store job results
Worker->>Redis: Change state to finished + object name
Consumer->>App: GET /job/download/UUID
Consumer->>Cloud Storage: GET gs://my-bucket-name/object/path
</pre>
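
The sequence above can be walked through with in-memory stand-ins for each managed service. This is only a sketch for reasoning about the flow: the dictionaries and list replace MemoryStore, Pub/Sub, and Cloud Storage, the function names are hypothetical, and the stored bytes are a placeholder rather than a real computation.

```python
import uuid

status_store = {}  # stand-in for MemoryStore: job_id -> status record
queue = []         # stand-in for the Pub/Sub topic
bucket = {}        # stand-in for the Cloud Storage bucket: object name -> bytes


def post_job() -> str:
    """App: record the initial state, then notify the worker."""
    job_id = str(uuid.uuid4())
    status_store[job_id] = {"state": "STARTED", "results_object_name": None}
    queue.append(job_id)
    return job_id


def run_worker() -> None:
    """Worker: consume one message, store results, mark the job finished."""
    job_id = queue.pop(0)
    status_store[job_id]["state"] = "RUNNING"
    object_name = f"results/{job_id}.txt"
    bucket[object_name] = b"3.14159"  # placeholder for the real computation
    status_store[job_id].update(state="FINISHED", results_object_name=object_name)


def download(job_id: str):
    """App: return the results if the job has finished, otherwise None."""
    record = status_store.get(job_id)
    if record is None or record["state"] != "FINISHED":
        return None
    return bucket[record["results_object_name"]]
```

Calling `post_job()`, then `run_worker()`, then `download(job_id)` follows the happy path; calling `download` before the worker has run returns `None`, which is where the real app would report the job as still in progress.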
# Appendix 4: Job Schema
<pre class="mermaid">
classDiagram
class Job {
JobState state
uuid id
string results_object_name
}
class JobState {
STARTED
RUNNING
FINISHED
ERRORED
}
Job --> JobState
</pre>
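
The schema above could be expressed as a dataclass that round-trips through JSON, so a whole job fits in a single Redis value. Treating `results_object_name` as optional until the job finishes is an assumption, as are the `to_json` and `from_json` helper names.

```python
import json
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(str, Enum):
    STARTED = "STARTED"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"


@dataclass
class Job:
    state: JobState
    id: uuid.UUID
    # Assumption: unset until the worker has written the results object.
    results_object_name: Optional[str] = None

    def to_json(self) -> str:
        """Serialize to a string suitable for a single Redis value."""
        return json.dumps({
            "state": self.state.value,
            "id": str(self.id),
            "results_object_name": self.results_object_name,
        })

    @classmethod
    def from_json(cls, raw: str) -> "Job":
        """Rebuild a Job from the string produced by `to_json`."""
        data = json.loads(raw)
        return cls(
            state=JobState(data["state"]),
            id=uuid.UUID(data["id"]),
            results_object_name=data["results_object_name"],
        )
```

Parsing the state through `JobState(...)` means an unknown value in Redis raises a `ValueError` instead of propagating silently.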
# Appendix 5: Job States
<pre class="mermaid">
stateDiagram-v2
[*] --> Created
Created --> Queued
Queued --> Running
Running --> Finished
Running --> Errored
Finished --> [*]
Errored --> [*]
</pre>
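
The diagram above translates directly into a transition table. The sketch below uses the state names from the diagram; the `advance` and `is_terminal` helpers are hypothetical names introduced for illustration.

```python
# Allowed transitions, taken from the state diagram above.
TRANSITIONS = {
    "Created": {"Queued"},
    "Queued": {"Running"},
    "Running": {"Finished", "Errored"},
    "Finished": set(),
    "Errored": set(),
}


def advance(current: str, target: str) -> str:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


def is_terminal(state: str) -> bool:
    """A state is terminal when it has no outgoing transitions."""
    return not TRANSITIONS[state]
```

Guarding every status write with a check like `advance` would stop a late or duplicated worker message from moving a finished job back to running.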