add testdouble homework

main
Andrew Coleman 2023-02-03 16:49:12 -05:00
parent 300c469fc1
commit 53ddad7fa4
1 changed files with 190 additions and 0 deletions

@ -0,0 +1,190 @@
+++
title = "TestDouble Infra Homework"
description = "A treatise of a distributed job system"
date = "2023-02-03"
+++
# Introduction
This is an ~~imaginary~~ description of a complex software project designed to take a request from an end user and perform a long-running calculation. The calculation cannot be cached (as that would nullify the outcome of this project) and cannot be completed within the time a browser typically allows for a request-response cycle.
[Problem description](https://gitlab.com/testdouble-infrastructure/api-infrastructure-take-home) as presented by TestDouble.
# Choice
For this experiment, I will be using [Google Cloud Platform](https://cloud.google.com/) to host my solution. As such, the products and services chosen represent what is available from GCP. Similar technologies exist on both AWS and Azure, and the same pattern could be applied there.
# Complicated Outline
<pre class="mermaid">
C4Context
title TestDouble Distributed Pi Calculator
Person(consumer, "API Consumer")
Enterprise_Boundary(b0, "Public API") {
System(app, "Cloud Run", "A serverless app that responds to the public facing API.")
Boundary(b1, "Data Storage") {
SystemDb(redis, "MemoryStore", "A managed Redis instance.")
SystemDb(objects, "Cloud Storage", "Managed object storage for job results.")
}
Boundary(b2, "Backend Workers") {
SystemQueue(pubsub, "Pub/Sub", "Sends event notifications to a worker.")
System(worker, "Cloud Function", "A serverless function to compute PI.")
}
}
Rel(consumer, app, "HTTP")
Rel(app, objects, "Get Job Results")
BiRel(app, redis, "Read Job Status")
Rel(app, pubsub, "Queue Job")
Rel(pubsub, worker, "Start Job")
Rel(worker, objects, "Job Finished")
Rel(worker, redis, "Write Job Status")
</pre>
I'm not convinced that [C4Context](https://mermaid.js.org/syntax/c4c.html) is best for this project...
# Basic Outline
<pre class="mermaid">
flowchart TD
consumers[API consumers] --> app[Cloud Run]
app --> redis[Redis MemoryStore]
app --> pubsub[Pub/Sub]
pubsub --> worker[Cloud Function]
worker --> object[Cloud Storage]
worker --> redis
app --> object
</pre>
This is kind of a lot to grok at a glance, but let's take a look at each component and dive into the challenges and features of each decision.
## API Consumers
This represents the already-existing application that will be consuming our _pi_ digits. This is pretty hand-wavey, but also not relevant to this exercise.
## The Primary Application
This will be a [Cloud Run](https://cloud.google.com/run) app that will be run each time the API is utilized. Nearly any programming language can be used to create this app, but priority should be given to runtimes that can start quickly and use very little memory. Go or Rust would be a great fit, while a full Ruby on Rails app would not. Plain Ruby would work fine, though!
Cloud Run is designed to scale out to as much traffic as is required, and the target throughput of `1000 r/s` will be easily met with a serverless solution.
The process running here will be responsible for the following client-facing HTTP API:
|Method|Path|Purpose|
|------|----|-------|
|`POST`|`/job`|Creates a new job|
|`GET`|`/job/<job_id>`|Gets the current status of the job|
|`GET`|`/job/download/<job_id>`|Downloads the results from the job|
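
A minimal sketch of how the three routes in the table might be dispatched, using only the standard library. The handler names (`create_job`, `job_status`, `download_job`) and the assumption that job IDs are UUIDs are illustrative and not part of the design above.

```python
import re

# Assumption: job IDs are lowercase UUIDs.
_JOB_ID = r"(?P<job_id>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"

# (method, compiled path pattern, hypothetical handler name)
ROUTES = [
    ("POST", re.compile(r"^/job$"), "create_job"),
    ("GET", re.compile(rf"^/job/download/{_JOB_ID}$"), "download_job"),
    ("GET", re.compile(rf"^/job/{_JOB_ID}$"), "job_status"),
]


def route(method: str, path: str):
    """Return (handler_name, job_id) for a request, or None if nothing matches."""
    for route_method, pattern, handler in ROUTES:
        if method != route_method:
            continue
        match = pattern.match(path)
        if match:
            return handler, match.groupdict().get("job_id")
    return None
```

Anything that does not match a row in the table falls through to `None`, which the app would presumably turn into a `404`.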
## Job Status
I will be storing the job status in [MemoryStore](https://cloud.google.com/memorystore), which is a managed [Redis](https://redis.io) product from Google. The appeal of a managed service is that _if_ or _when_ the volume of traffic changes, we can quickly upgrade to a better-performing tier with little to no effort on our part.
### Why Redis?
Redis is an optimal solution for this problem because of the simplicity of the data and the high-throughput nature of the service. There are no complex transactions or modifications to the data, so a system built around transactions is likely more iron than we need. Redis can typically respond in less than 2ms for simple data retrieval at the scale this solution will be using. This system will use Redis only for the job status and the location of the job results.
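
One possible layout, sketched under the assumption of one Redis hash per job. The `job:` key prefix and the field names are hypothetical, and a plain dictionary stands in for Redis so the example runs without a server.

```python
import uuid

KEY_PREFIX = "job:"  # hypothetical key namespace


def job_key(job_id: str) -> str:
    """Build the Redis key under which one job's fields would live."""
    return f"{KEY_PREFIX}{job_id}"


def new_job_fields() -> dict:
    """Fields for a freshly created job; Redis hash values are flat strings."""
    return {"state": "STARTED", "results_object_name": ""}


# In-memory stand-in for Redis, so the sketch runs without a server.
fake_redis: dict = {}

job_id = str(uuid.uuid4())
fake_redis[job_key(job_id)] = new_job_fields()    # roughly: HSET job:<id> state STARTED ...
fake_redis[job_key(job_id)]["state"] = "RUNNING"  # roughly: HSET job:<id> state RUNNING
status = fake_redis[job_key(job_id)]              # roughly: HGETALL job:<id>
```

Every read and write touches exactly one key, which is why nothing here needs a transaction.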
## Pub/Sub
To trigger the on-demand job, I will be using [Pub/Sub](https://cloud.google.com/pubsub), GCP's messaging service. This is also a managed service that can scale out to as much or as little traffic as needed. For initial development, we can always use [Pub/Sub Lite](https://cloud.google.com/pubsub/docs/choosing-pubsub-or-lite) to keep initial costs down (and reduce costs of non-`production` deployments). Each time a message is published to our Pub/Sub topic, GCP will spawn a Cloud Function to [trigger](https://cloud.google.com/functions/docs/calling/pubsub) our job.
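
A rough sketch of the message round trip, assuming a JSON payload with hypothetical `job_id` and `digits` fields. Pub/Sub-triggered functions receive the message data base64-encoded; the exact envelope around it depends on the function generation, so the `event` dictionary below is a simplified stand-in.

```python
import base64
import json


def encode_job_message(job_id: str, digits: int) -> bytes:
    """Build the bytes the app would publish to the topic."""
    return json.dumps({"job_id": job_id, "digits": digits}).encode("utf-8")


def decode_job_event(event: dict) -> dict:
    """Decode a Pub/Sub-style event whose `data` field is base64-encoded."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


# Simulate what the worker would receive for a published message.
payload = encode_job_message("123e4567-e89b-12d3-a456-426614174000", 1000)
event = {"data": base64.b64encode(payload).decode("ascii")}
job = decode_job_event(event)
```

Keeping the message this small means the worker reads everything else it needs from Redis by `job_id`.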
## Long Running Job
The long computations will be performed in a [Cloud Function](https://cloud.google.com/functions), which is a serverless platform that can be used to respond to events. Nearly any language can be used to create this component with preference given to low memory usage and ease of maintenance. For particularly large digit computations, the results can be [streamed](https://cloud.google.com/storage/docs/samples/storage-stream-file-upload) into Cloud Storage as they are calculated (or in a fixed buffer size) to prevent runaway memory growth and manage costs better.
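
To illustrate the fixed-buffer idea, here is a sketch that generates digits with a well-known unbounded spigot algorithm and flushes them in chunks. The choice of algorithm, the `stream_pi` name, and the chunk size are assumptions for illustration; `sink` can be any object with a `write` method, and an in-memory buffer stands in for a Cloud Storage upload stream.

```python
import io


def pi_digits():
    """Yield the decimal digits of pi one at a time (3, 1, 4, 1, 5, ...)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (
                10 * q, 10 * (r - n * t), t, k,
                (10 * (3 * q + r)) // t - 10 * n, l,
            )
        else:
            q, r, t, k, n, l = (
                q * k, (2 * q + r) * l, t * l, k + 1,
                (q * (7 * k + 2) + r * l) // (t * l), l + 2,
            )


def stream_pi(sink, total_digits: int, chunk_size: int = 4096) -> int:
    """Write `total_digits` digits to `sink` in chunks; return the chunk count."""
    chunk = bytearray()
    chunks_written = 0
    digits = pi_digits()
    for _ in range(total_digits):
        chunk += str(next(digits)).encode("ascii")
        if len(chunk) >= chunk_size:
            sink.write(bytes(chunk))
            chunk.clear()
            chunks_written += 1
    if chunk:  # flush the final partial chunk
        sink.write(bytes(chunk))
        chunks_written += 1
    return chunks_written


buffer = io.BytesIO()  # stand-in for an object storage upload stream
stream_pi(buffer, 10, chunk_size=4)
```

The output buffer never holds more than `chunk_size` bytes, although the spigot's internal integers still grow as more digits are produced.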
When the function starts, it will update Redis with the new state of the job. When it completes the work, it will store the results in an object in [Cloud Storage](https://cloud.google.com/storage) and update Redis with the object name. Subsequent requests to download the job results can either be served by proxy through the app, or the app can redirect to a short-lived signed URL that grants temporary access to the object.
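
The worker lifecycle described here could look roughly like the following. The two dictionaries are in-memory stand-ins for Redis and Cloud Storage, and the `results/<job_id>.txt` object naming scheme is hypothetical.

```python
def run_job(job_id: str, compute, status_store: dict, object_store: dict) -> None:
    """Run one job, recording each state change as it happens."""
    status_store[job_id] = {"state": "RUNNING", "results_object_name": ""}
    try:
        result = compute()
    except Exception:
        # Record the failure so the status endpoint can report it.
        status_store[job_id]["state"] = "ERRORED"
        return
    object_name = f"results/{job_id}.txt"  # hypothetical object naming scheme
    object_store[object_name] = result     # store the results first ...
    # ... then publish the object name, so FINISHED always implies a readable object.
    status_store[job_id] = {"state": "FINISHED", "results_object_name": object_name}


statuses, objects = {}, {}
run_job("job-a", lambda: b"3141592653", statuses, objects)
```

Writing the object before updating the status means a client polling the status endpoint should never see `FINISHED` without results to download.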
# Advantages of scale
This system should be quite performant, assuming the serverless components can start quickly and use a manageable amount of memory. Using managed services is more expensive than running the services yourself, but it also has the advantage of working **right now** and generally performs very reliably. You can easily think of the Cloud Run and Cloud Function layers as "limitless" so long as your budget allows. Redis has some practical limits; if nothing else, its response time is bounded by the speed of light. However, for the target throughput I am confident that you can provision a reasonable, cost-efficient Redis instance quickly at the project onset. Cloud Storage is another area that should not have any practical limits to storage quantity so long as the invoices keep getting paid. Pub/Sub will very easily process `1000 r/s`, so there should be no concern with that product running out of capacity.
## Individually scalable
By having each component as a separate tool that is purpose-built, we can scale each section as necessary. Another not-immediately-obvious benefit of using these managed services is the geographic spread you can achieve by using them. If there is a need to geo-locate your PI digits as close to the user as possible, you can do that with the big name GCP tools. You can also implement organizational rules to limit where your data will be spread. Specifically speaking, if you are dealing with patient data for patients in the USA, you'll need to keep that data entirely on servers hosted in the USA. The same goes for several other countries, but for _all data_ (Russia, China, etc.).
## Capacity
There will be ample capacity for `1000 r/s`.
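
As a back-of-the-envelope check, Little's law (requests in flight = arrival rate × latency) gives a feel for how few app instances this traffic needs. The latency figures and the per-instance concurrency of 80 are assumed numbers for illustration, not measurements.

```python
import math


def instances_needed(requests_per_second: int, latency_ms: int, concurrency: int) -> int:
    """Estimate instances via Little's law: in-flight = arrival rate x latency."""
    in_flight = requests_per_second * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency))


# Assumed numbers: 50 ms per request and 80 concurrent requests per instance.
fast = instances_needed(1000, 50, 80)
# A slower handler (200 ms) needs more instances for the same traffic.
slow = instances_needed(1000, 200, 80)
```

Under these assumptions the app layer needs only a handful of instances, because it queues work and reads status rather than computing digits itself.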
# Cost Saving Alternatives
There are several options for reducing costs, if desired. I'm not sure this section is necessary, but these all ran through my head at the same time and I think it's worth bringing up.
## Self-host redis
If necessary, you can provision a machine in Compute and run 32-64 different Redis processes on the same machine. Each one consumes about 1 MB of RAM at idle, and during development it will be trivial to provision. Each Redis client library should hash keys consistently across all of the processes, so as long as every client uses the same configuration environment variables, a key will always be looked up on the process that stored it. It's incredibly simple and generally quite robust, but it comes at the cost of additional Terraform and the brain power to manage it.
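
A sketch of client-side sharding with a stable hash and a modulo, which is simpler than a true consistent-hash ring. The port layout and shard count are hypothetical. Note that Python's built-in `hash()` is randomized per process for strings, so a digest such as SHA-256 is used instead.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical layout: 32 Redis processes on one host, on consecutive ports.
BASE_PORT = 6379
SHARDS = 32


def port_for(key: str) -> int:
    """Pick the port of the Redis process responsible for `key`."""
    return BASE_PORT + shard_for(key, SHARDS)
```

With modulo sharding, changing the shard count remaps most keys. That is acceptable for short-lived job status, but a hash ring would be the better choice if the process count is expected to change.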
## Self-host app
If Serverless becomes too expensive, you can always either self-host using a Compute VM or go to something like [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine) and have a greater hand in the baseline performance and capacity of the system.
## Using more Cloud Storage
This could easily have been architected to use a Cloud Storage trigger to start the job. Basically, instead of using Redis and Pub/Sub, you would write the new job to Cloud Storage and that would trigger the job to run. From there, you would update the object with state changes and results as necessary.
At this time, I don't have conclusive data that this would be less expensive than my current solution, but it is nevertheless an option to pursue, particularly if there are other organizational restrictions that might hinder the initial plan.
## Using an NFS share
You could conceivably use [FileStore](https://cloud.google.com/filestore/) to provide a shared persistent disk between workers and app processes. This would have the benefit of being a more consistent cost, but this cost will be for the maximum possible storage size you will need. With Cloud Storage, you can store 100 GB and only pay for 100 GB, but with FileStore, you have to provision something larger than your expected data set. If you provision a 1 TB FileStore and only use 100 GB, you'll still pay for the full 1 TB.
# Happy Path
<pre class="mermaid">
sequenceDiagram
Consumer->>App: POST /job
App->>Redis: Insert initial job state + UUID
App->>PubSub: Notify worker
PubSub->>Worker: Spawn function
Worker->>Cloud Storage: Store job results
Worker->>Redis: Change state to finished + object name
Consumer->>App: GET /job/download/UUID
Consumer->>Cloud Storage: GET gs://my-bucket-name/object/path
</pre>
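
The last two steps of the diagram imply that the app hands the consumer a location to fetch from. A minimal sketch of that decision follows; the status codes for missing and unfinished jobs are assumptions, and `sign_url` is a hypothetical callable standing in for whatever produces the signed URL.

```python
from typing import Callable, Optional, Tuple


def download_response(
    job: Optional[dict], sign_url: Callable[[str], str]
) -> Tuple[int, dict]:
    """Return (status_code, headers) for GET /job/download/<job_id>."""
    if job is None:
        return 404, {}
    if job["state"] != "FINISHED":
        # Results are not available yet, or the job failed.
        return 409, {}
    return 302, {"Location": sign_url(job["results_object_name"])}


def fake_sign(object_name: str) -> str:
    """Hypothetical signer used only to make the sketch runnable."""
    return f"https://example.invalid/{object_name}?sig=abc"


response = download_response(
    {"state": "FINISHED", "results_object_name": "results/a.txt"}, fake_sign
)
```

Redirecting keeps large result files off the app's request path, so the app instance only has to answer with a header.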
# Job Schema
<pre class="mermaid">
classDiagram
class Job {
JobState state
uuid id
string results_object_name
}
class JobState {
STARTED
RUNNING
FINISHED
ERRORED
}
Job --> JobState
</pre>
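
The class diagram could translate into code along these lines. The default values and the `to_fields` helper are assumptions added for illustration.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    STARTED = "STARTED"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"


@dataclass
class Job:
    state: JobState = JobState.STARTED
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    results_object_name: str = ""

    def to_fields(self) -> dict:
        """Flatten to strings, e.g. for storage in a key-value store."""
        return {
            "id": str(self.id),
            "state": self.state.value,
            "results_object_name": self.results_object_name,
        }


job = Job()
fields = job.to_fields()
```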
# Job States
<pre class="mermaid">
stateDiagram-v2
[*] --> Created
Created --> Queued
Queued --> Running
Running --> Finished
Running --> Errored
Finished --> [*]
Errored --> [*]
</pre>
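
The state diagram can be enforced with a small transition table. This is a sketch; the `advance` helper and the choice to raise `ValueError` on an illegal move are assumptions.

```python
# Allowed transitions, copied from the state diagram.
TRANSITIONS = {
    "Created": {"Queued"},
    "Queued": {"Running"},
    "Running": {"Finished", "Errored"},
    "Finished": set(),
    "Errored": set(),
}


def advance(current: str, new: str) -> str:
    """Return `new` if the diagram allows the move, else raise ValueError."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


state = "Created"
for step in ("Queued", "Running", "Finished"):
    state = advance(state, step)
```

Checking transitions in one place guards against a late or duplicate worker message moving a finished job back to running.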