add part b

main
Andrew Coleman 2023-02-05 13:43:59 -05:00
parent 53ddad7fa4
commit 82f62890f7
1 changed file with 39 additions and 7 deletions


@ -10,7 +10,7 @@ This is an ~~imaginary~~ description of a complex software project that is desig
[Problem description](https://gitlab.com/testdouble-infrastructure/api-infrastructure-take-home) as presented by TestDouble.
# Choices
For this experiment, I will be using [Google Cloud Platform](https://cloud.google.com/) to host my solution. As such, products and services chosen represent what is available from GCP. Similar technologies exist for both AWS and Azure and a similar pattern could be applied to this solution.
@ -48,7 +48,7 @@ C4Context
I'm not convinced that [C4Context](https://mermaid.js.org/syntax/c4c.html) is best for this project...
# Part A
<pre class="mermaid">
@ -101,7 +101,39 @@ The long computations will be performed in a [Cloud Function](https://cloud.goog
When the function starts, it will update Redis with the new state of the job. When it completes the work, it will store the results in an object in [Cloud Storage](https://cloud.google.com/storage) and update Redis with the object name. Subsequent requests to download the job results can either be served by proxy or redirected to a short-lived signed URL that lets the caller fetch the results directly.
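To make the signed URL option concrete, here's a rough sketch of what the download endpoint could do. None of the names below come from the actual design: the Redis key layout, environment variables, and bucket are assumptions, and signing URLs from Cloud Run may need a bit of extra credential setup.

```python
import datetime
import os

import redis
from google.cloud import storage

# Illustrative environment variables; real values would come from the IaC tool.
redis_client = redis.Redis(
    host=os.environ["REDIS_HOST"],
    password=os.environ["REDIS_PASSWORD"],
    ssl=True,
)
storage_client = storage.Client()


def result_download_url(job_id: str):
    """Return a short-lived signed URL for a finished job, or None if not ready."""
    object_name = redis_client.hget(f"job:{job_id}", "result_object")
    if object_name is None:
        return None  # still running, or unknown job id
    blob = storage_client.bucket(os.environ["RESULTS_BUCKET"]).blob(object_name.decode())
    # V4 signed URLs are time-limited (max 7 days); a few minutes is plenty here.
    return blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=10),
        method="GET",
    )
```

The proxy variant would instead stream the object back through the Cloud Run response rather than handing out a URL.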
# Part B
Since I have chosen a lot of Google-managed services, the networking aspect of this product is less complex than a more self-hosted one. I am also going to assume that all of the deployment details will be handled with an IaC tool and that users cannot simply grant themselves whatever permissions they want because everyone is sharing root credentials. More than likely they are, but that's another problem for another day.
## Network
All of the described components will live inside of the same VPC. More than likely this is going to be a VPC that is shared across the entire environment (e.g. `development` or `staging`). All Google-managed services will have a "magic" endpoint that serves as the network location of the service. It will be routable from anywhere in the VPC, so we will use IAM roles to control access to each service.
## API Consumers
The external users of this service will be the only role, user, or service account with the `run.routes.invoke` permission on the Cloud Run app. This principal will also need `storage.objects.get` on the Cloud Storage bucket to download results.
## Cloud Run
The Cloud Run service will need `storage.objects.get` on the Cloud Storage bucket that holds the results. There will also need to be a `Serverless VPC Access connector` so that the Cloud Run and Cloud Functions environments can reach resources inside the VPC. Finally, it will need the `roles/pubsub.publisher` role so that it can publish messages to the agreed-upon job creation topic.
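As a purely illustrative example, the publishing side could be as small as this; the topic name, payload shape, and environment variable are assumptions that just need to match whatever the worker subscribes to:

```python
import json
import os
import uuid

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Topic name is an assumption; it only has to match the worker's subscription.
topic_path = publisher.topic_path(os.environ["GOOGLE_CLOUD_PROJECT"], "job-created")


def create_job(params: dict) -> str:
    """Publish a job-creation message and hand back the new job id."""
    job_id = str(uuid.uuid4())
    payload = json.dumps({"job_id": job_id, "params": params}).encode("utf-8")
    # publish() returns a future; result() blocks until Pub/Sub accepts the message.
    publisher.publish(topic_path, payload).result()
    return job_id
```

The HTTP handler wrapping this would also record the job's initial state in Redis so that status polling works before the worker picks the message up.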
## Pub/Sub
The messaging component doesn't need any specific customization or extra permissions to function correctly.
## Cloud Functions
This will need `storage.objects.create` on the Cloud Storage bucket to write/stream the results. It will also need `roles/pubsub.subscriber` on the job creation topic. The previously mentioned connector will apply to this product, too.
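Here's a hedged sketch of that worker as a Pub/Sub-triggered (2nd gen) Cloud Function. The Redis key layout, object naming, and environment variables are all made up for illustration:

```python
import base64
import json
import os

import functions_framework
import redis
from google.cloud import storage

redis_client = redis.Redis(
    host=os.environ["REDIS_HOST"],
    password=os.environ["REDIS_PASSWORD"],
    ssl=True,
)
storage_client = storage.Client()


def run_long_computation(params):
    """Placeholder for the real long-running work."""
    return {"echo": params}


@functions_framework.cloud_event
def handle_job(cloud_event):
    # Pub/Sub wraps the payload in a base64-encoded envelope.
    message = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))
    job_id = message["job_id"]

    # Mark the job as running before starting the expensive part.
    redis_client.hset(f"job:{job_id}", "state", "running")

    result = run_long_computation(message["params"])

    # Store the results as an object, then point Redis at it.
    object_name = f"results/{job_id}.json"
    storage_client.bucket(os.environ["RESULTS_BUCKET"]).blob(object_name).upload_from_string(
        json.dumps(result), content_type="application/json"
    )
    redis_client.hset(
        f"job:{job_id}", mapping={"state": "complete", "result_object": object_name}
    )
```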
## Cloud Storage
We'll need to ensure that every object is encrypted at rest. Also, the access policy should be `private` to prevent malicious actors from guessing (or knowing) where the objects are in the bucket. Otherwise, there are not many settings to change for this product.
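Those settings really belong in the IaC tool, but a rough Python-client equivalent shows which knobs are involved (the bucket name and KMS key are placeholders):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-job-results")  # placeholder name

# Govern object access purely through IAM and block any public grants.
bucket.iam_configuration.uniform_bucket_level_access_enabled = True
bucket.iam_configuration.public_access_prevention = "enforced"

# Objects are already encrypted at rest with Google-managed keys; uncomment to
# pin the bucket to a customer-managed Cloud KMS key instead.
# bucket.default_kms_key_name = (
#     "projects/example/locations/us/keyRings/jobs/cryptoKeys/results"
# )

bucket.patch()
```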
## MemoryStore
This product doesn't have a lot of the IAM controls that the rest of the Google suite offers. It has the built-in Redis authentication features, and the ACL support appears to be disabled. Because of that, "anyone" who has the network address can connect to Redis. While not ideal, it's solvable by using a Secret Manager secret to hold the Redis password and then loading that as an environment variable (or similar) in the Cloud Run / Cloud Functions products. The information in this example is not sensitive, so leaking that data is not great but also not a soul-crushing loss to the business. Any more complicated system absolutely could carry sensitive parameters, so using SSL on every connection and closely guarding the password is mandatory. If you build the process to rotate the credentials when you deploy MemoryStore, then changing the password won't be a challenge. If this proves to be an untenable situation, you can always use a [Cloud Storage object](#using-more-cloud-storage) instead of MemoryStore+Pub/Sub.
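For completeness, the connection code under that scheme stays small; the secret name, environment variables, and fallback behavior here are assumptions rather than requirements:

```python
import os

import redis
from google.cloud import secretmanager


def redis_password() -> str:
    # Prefer the value injected at deploy time; fall back to reading the
    # secret directly if the variable is missing.
    if "REDIS_PASSWORD" in os.environ:
        return os.environ["REDIS_PASSWORD"]
    client = secretmanager.SecretManagerServiceClient()
    name = (
        f"projects/{os.environ['GOOGLE_CLOUD_PROJECT']}"
        "/secrets/redis-auth/versions/latest"
    )
    return client.access_secret_version(name=name).payload.data.decode("utf-8")


redis_client = redis.Redis(
    host=os.environ["REDIS_HOST"],
    port=6379,
    password=redis_password(),
    ssl=True,  # in-transit encryption, per the note above
)
```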
# Appendix 1: Advantages of scale
This system should be quite performant, assuming the serverless components can start quickly and use a manageable amount of memory. Using managed services is more expensive than running the services yourself, but it also has the advantage of working **right now** and generally performs very reliably. You can easily think of the Cloud Run and Cloud Functions layers as "limitless" so long as your budget allows. Redis has some practical limits, if nothing else the simple speed-of-light response time. However, for the target throughput I am confident that you can provision a reasonably cost-efficient Redis instance quickly at the project onset. Cloud Storage is another area that should not have any practical limits to storage quantity so long as the invoices keep getting paid. Pub/Sub will very easily process `1000 r/s` so there should be no concern with that product running out of capacity.
@ -113,7 +145,7 @@ By having each component as a separate tool that is purpose-built, we can scale
There will be ample capacity for `1000 r/s`.
# Appendix 2: Cost Saving Alternatives
There are several options for reducing costs, if desired. I'm not sure this section is necessary, but these all ran through my head at the same time and I think it's worth bringing up.
@ -135,7 +167,7 @@ At this time, I don't have conclusive data that this would be less expensive tha
You could conceivably use [FileStore](https://cloud.google.com/filestore/) to provide a shared persistent disk between workers and app processes. This would have the benefit of being a more consistent cost, but this cost will be for the maximum possible storage size you will need. With Cloud Storage, you can store 100 GB and only pay for 100 GB, but with FileStore, you have to provision something larger than your expected data set. If you provision a 1 TB FileStore and only use 100 GB, you'll still pay for the full 1 TB.
# Appendix 3: Happy Path
<pre class="mermaid">
@ -151,7 +183,7 @@ sequenceDiagram
</pre>
# Appendix 4: Job Schema
<pre class="mermaid">
@ -173,7 +205,7 @@ classDiagram
</pre>
# Appendix 5: Job States
<pre class="mermaid">