The Cost of Serverless

In this blog post, I share my personal perspective on the cost implications of going serverless. Building a project from the ground up on Google Cloud Platform gave me insights worth sharing. This article focuses on Dataflow as a streaming engine, Bigtable as a scalable database, and Cloud Functions. Join me as I explain some of the trickier tradeoffs, exploring not only the financial cost but also whether the technology delivers on the promise of increased efficiency.

Why Serverless?

My project was a streaming service processing customer data in real time, characterised by intense I/O operations. There was also a batch mode to replay historical events. Since we were a small team building a project from scratch on GCP, we decided to tap into Google’s native stack of technologies.

Benefits of Using Dataflow

For streaming and batch processing we decided to use Dataflow, a fully managed data service with automated provisioning and resource management. Developers write batch and streaming pipelines using Apache Beam, a framework providing SDKs in Java, Python and Go. We used Java.

Advantages we were hoping to achieve:

Automatic provisioning. We didn’t want to worry about managing infrastructure.
Seamless integration with other GCP services, such as Bigtable, BigQuery or Pub/Sub.
Leveraging prebuilt connectors, sources and sinks to GCP storage, as well as external data sources.

Dataflow provides scalability, monitoring and visualised metrics out of the box.

Dataflow Downsides

Early into our experiments we encountered a few significant challenges.

Learning Curve

The Apache Beam programming model is complex and requires a lot of mental power to grasp. Factor in the lack of useful documentation for this rather unintuitive framework. StackOverflow is your best friend, but even there the number of unanswered questions is quite staggering.

Another problem is that custom connectors are hard to write: the work is complex, time-consuming and error-prone. Why would you want to do it? Well, the prebuilt I/O connectors show some terrible coding anti-patterns and in some cases lack configuration options.

Sensitive To Failures

Unhandled exceptions may lead to significant performance issues in your streaming pipeline.

If processing of an element within a bundle fails, the entire bundle fails.
Apache Beam: Runtime Model

A single failed element means that all elements in the input bundle will be reprocessed. In other words, the runner throws away the entire output of the bundle and the results are recomputed from scratch!
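
To make the bundle semantics concrete, here is a minimal sketch in plain Python. It does not use the Beam SDK; it only mimics the all-or-nothing behaviour described above. The names process_bundle and make_flaky_transform, as well as the max_attempts limit, are hypothetical and chosen for illustration.

```python
def process_bundle(bundle, transform, max_attempts=3):
    """Process a bundle atomically: either every element succeeds, or the
    partial output is discarded and the whole bundle is retried."""
    calls = 0
    for _attempt in range(max_attempts):
        output = []
        try:
            for element in bundle:
                calls += 1
                output.append(transform(element))
            return output, calls
        except Exception:
            # One bad element throws away everything computed so far.
            continue
    raise RuntimeError(f"bundle failed after {max_attempts} attempts")


def make_flaky_transform(bad_element, failures):
    """Return a transform that fails `failures` times on `bad_element`."""
    remaining = {"n": failures}

    def transform(element):
        if element == bad_element and remaining["n"] > 0:
            remaining["n"] -= 1
            raise ValueError(f"cannot process {element}")
        return element * 2

    return transform


output, calls = process_bundle([1, 2, 3, 4], make_flaky_transform(3, failures=1))
print(output, calls)
```

A single transient failure on the third element means the transform runs seven times for a bundle of four elements, because the first two elements are processed twice.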

Recommended Setup Won’t Always Work

For example, trying to make streaming as efficient as possible, I enabled all the good stuff: the shiny new optimised backend service, as well as the new generation of the worker runner. The result was terrible latency that rendered the whole thing unusable.

Significant latencies as an outcome of best-effort optimisations. Not exactly what I was hoping for.

Limited Support For Stateful Processing

Maintaining state is costly in general and doing so per processed element is super hard. Don’t get me wrong, maintaining a simple state works just fine. For example, it is easy to assign a unique numeric index to each element within the processing window. However, storing a collection or a complex object is a bad idea.

Source: Stateful Processing – Apache Beam lets you add a unique index to each element in the distributed collection.
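
The kind of simple state that works well can be sketched in plain Python. This is not the Beam state API; it only illustrates that the state kept per key is a single integer. The function name assign_indexes is hypothetical, and windows are ignored for brevity.

```python
from collections import defaultdict


def assign_indexes(keyed_elements):
    """Assign a running index to each element, tracked separately per key."""
    next_index = defaultdict(int)  # the only state: one integer per key
    indexed = []
    for key, value in keyed_elements:
        indexed.append((key, next_index[key], value))
        next_index[key] += 1
    return indexed


events = [("alice", "login"), ("bob", "login"), ("alice", "click")]
indexed_events = assign_indexes(events)
print(indexed_events)
```

Replacing the integer with a growing collection per key is where the trouble starts, since that state has to be stored and read back for every processed element.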

Concurrency Control Impacts Performance

Partitioning per key and window can become a huge performance bottleneck. Imagine you need to update a user profile based on information scattered across different worker nodes. Grouping related elements by user id (GroupByKey) within a window of streamed events brings the user’s data to a single worker node, where they can be deduplicated and processed.

This comes at the cost of additional overhead, namely:

Increased network traffic due to data shuffling, since the runner needs to bring related elements onto the same worker node.
Serialization overhead. Complex objects might take time to (de-)serialize.
Synchronisation overhead. Apache Beam shields the developer from concurrency issues. The drawback is that the runner’s built-in concurrency control comes at the expense of decreased parallelism. Your pipeline will not scale as much as it normally would.
Expensive per-element operations. Database lookups and other I/O per element are possible, but far from ideal. Also, do your best to avoid expensive creation of new objects, searching the classpath and other resource-intensive operations.

Hot Keys and Large Windows

There are two fundamental risks associated with GroupByKey operations: hot keys and large windows. See this blog post by Google for details.

Hot Keys

In some situations certain nodes of the cluster are overwhelmed, while the rest sit idle. This happens when most of the streamed elements (think many millions) share a handful of keys (think four or five). Imagine clickstream events grouped by browser. If most of the events come from five popular browsers, all processing will be done on at most five worker nodes. This presents a performance bottleneck.
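
The skew is easy to reproduce with a small simulation. The sketch below is plain Python and assumes keys are routed to workers by a stable hash, which is a simplification of what a real runner does. The worker count, the key names and the helper functions are all hypothetical.

```python
from collections import Counter
import zlib


def worker_for(key, num_workers):
    """Route a key to a worker using a stable hash (a simplification)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def load_per_worker(keys, num_workers):
    """Count how many elements land on each worker."""
    load = Counter({worker: 0 for worker in range(num_workers)})
    for key in keys:
        load[worker_for(key, num_workers)] += 1
    return load


def busy_workers(load):
    """Number of workers that received at least one element."""
    return sum(1 for count in load.values() if count > 0)


num_workers = 20
browser_events = [f"browser-{i % 5}" for i in range(100_000)]  # 5 distinct keys
user_events = [f"user-{i}" for i in range(100_000)]            # 100,000 distinct keys

busy_browser = busy_workers(load_per_worker(browser_events, num_workers))
busy_user = busy_workers(load_per_worker(user_events, num_workers))
print(f"by browser: {busy_browser} busy workers, by user: {busy_user} busy workers")
```

With only five distinct keys, no more than five of the twenty workers can ever receive data, no matter how many events arrive.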

Large Windows

Dataflow buffers elements on the receiver side waiting for each window to close. If your window spans a long period of time (24 hours or longer), the buffer of unprocessed events will grow large and workers can run out of memory.
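
A back-of-the-envelope estimate shows how quickly the buffer grows with the window length. The sketch below is plain Python with made-up traffic numbers (1,000 events per second, 1 KiB per event). It assumes every element is held until the window closes and ignores serialization overhead, so treat it as a rough upper bound rather than a measurement.

```python
GIB = 1024 ** 3


def buffered_bytes(events_per_second, avg_event_bytes, window_seconds):
    """Bytes held if every element waits for the window to close."""
    return events_per_second * avg_event_bytes * window_seconds


one_minute = buffered_bytes(1_000, 1_024, 60)
one_day = buffered_bytes(1_000, 1_024, 24 * 60 * 60)
print(f"1-minute window: {one_minute / GIB:.2f} GiB")
print(f"24-hour window:  {one_day / GIB:.2f} GiB")
```

The same stream that needs a few dozen megabytes for a one-minute window needs tens of gigabytes for a 24-hour window.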

Takeaways

Choose your keys wisely. User id is probably ok, but grouping by gender would be a disaster.
Decide on window size with caution. Always think how complex your processing is and prefer windows that won’t accumulate a huge backlog of staged elements.

Bigtable as a Scalable Database

Bigtable is a fully managed NoSQL database provided by Google Cloud Platform. This scalable, high-performance and low-latency database is designed to handle massive amounts of structured and semi-structured data.

Bigtable is extremely fast under pressure: in our case, reads took less than a hundred milliseconds and writes a couple of hundred milliseconds under intense load.

Capacity Planning

Consider the storage type. Workloads that are read-heavy benefit from switching to SSD. It makes read operations up to ten times faster. Changing to SSD in production made a huge difference in our project.

Terraform definition of a Bigtable cluster with storage type as SSD.

Another factor to consider is CPU load. For latency-sensitive workloads your cluster should use less than 50 % of CPU capacity.

Setting a CPU target helps manage load on your Bigtable cluster.

Last but not least, consider storage utilization. For latency-sensitive workloads stored data should use at most 60 % of available storage per cluster node. Add more nodes as your dataset grows.
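
The two headroom rules can be combined into a rough sizing helper. The sketch below is plain Python. Only the 50 % CPU and 60 % storage targets come from the text; the function name, the way CPU demand is expressed and the 5 TB per-node capacity are assumptions, so check the current Bigtable limits for your storage type.

```python
import math


def nodes_needed(peak_cpu_nodes, stored_tb, tb_per_node,
                 cpu_target=0.50, storage_target=0.60):
    """Return a node count that satisfies both headroom targets.

    peak_cpu_nodes: CPU demand at peak, expressed in fully busy nodes.
    tb_per_node: storage capacity of a single node (assumed value).
    """
    for_cpu = math.ceil(peak_cpu_nodes / cpu_target)
    for_storage = math.ceil(stored_tb / (tb_per_node * storage_target))
    return max(for_cpu, for_storage, 1)


cpu_bound = nodes_needed(peak_cpu_nodes=4.5, stored_tb=12, tb_per_node=5)
storage_bound = nodes_needed(peak_cpu_nodes=1, stored_tb=30, tb_per_node=5)
print(cpu_bound, storage_bound)
```

Whichever constraint asks for more nodes wins, which is why a small but busy dataset and a large but quiet one can end up costing the same.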

Risk of Degraded Performance

Your choices boil down to favouring low latency over high throughput, or vice versa. Latency is the amount of time it takes to complete an operation, i.e. a processing delay measured in time units (milliseconds etc.). Throughput is the rate at which data is processed and transferred, where the transfer rate is measured in bits per second (bps, Mbps, Gbps).

Source: How throughput and latency affect speed

Table scans and sequential reads lead to increased latency. Consider the example below.

Query 1: Throughput optimized. Good for batch jobs, but it results in higher latency.

Query 2: Latency optimized. Good for streaming, but it increases the amount of read operations leading to a lower throughput.

Hot Keys Visualization

Bigtable lets you inspect the key distribution as a heat map. This intuitive tool allows you to quickly discover hot keys leading to sequential reads and degraded performance.

For instance, the white stripes in the example below signify row keys that are accessed with an extreme frequency.

Hot keys stand out in Google’s key visualizer. The paler the colour the hotter the key.

Cloud Functions and a Memory Leak

I encountered a situation where a bug in our code led to out-of-memory exceptions, which in turn caused our lambdas to fail. Cloud Run kept spinning up new instances that were almost immediately discarded. This led to a quick and extensive degradation of performance.

It was in fact impressive to see the quick turnaround! However, it had no chance of recovery due to the programmer’s mistake resulting in OOM.

Cloud Run did its best to spin up new instances until the point of exhaustion.

What Is the Cost of Serverless?

In my experience, using serverless technology comes at a cost, and you should think twice before buying into the promise of increased efficiency.

Decreased Productivity

Developer productivity suffers. All of the below can eat into your productive time!

First of all, the learning curve was surprisingly steep. Learning Apache Beam is challenging and takes time. I found myself having to work around limitations, and that led to even more wasted time.

Next, whatever runs in the cloud is hard to emulate locally, and deployments of Dataflow jobs take forever! The feedback loop is too slow.

Slow deployments: Time to deploy + ramp-up time + time until metrics start to show

Another point is that build and deployment automation is complex and limited. It is hard to verify a serverless architecture with a standard CI/CD system, which means more bugs make it to production!

Writing Ugly Code

As a developer, you are forced to rethink some fundamental patterns.

Dependency Injection is super hard! For example, it is extremely challenging to create a single instance of a database client per worker node. Lambdas, or workers in the case of Dataflow, come and go, and that is beyond your control. Each lambda (worker) is isolated and does not share state with others. There is some wishy-washy support for DI in Apache Beam, but it is not fully reliable.

End result? You will write a ton of ugly code that is hard to test!
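
As an illustration of the single-client-per-worker problem, here is a minimal Python sketch of one common workaround: a lazily created, process-wide client guarded by a lock. It is not Beam's own mechanism and not necessarily what we shipped. FakeDatabaseClient, get_client and process are hypothetical names standing in for a real client and a real per-element function.

```python
import threading


class FakeDatabaseClient:
    """Hypothetical stand-in for a client that is expensive to create."""

    instances_created = 0

    def __init__(self):
        FakeDatabaseClient.instances_created += 1

    def lookup(self, key):
        return f"value-for-{key}"


_client = None
_client_lock = threading.Lock()


def get_client():
    """Create the client on first use, then reuse it within this process."""
    global _client
    if _client is None:
        with _client_lock:
            if _client is None:  # re-check after acquiring the lock
                _client = FakeDatabaseClient()
    return _client


def process(element):
    """Per-element work that reuses the shared client."""
    return get_client().lookup(element)


results = [process(element) for element in ["a", "b", "c"]]
print(results, FakeDatabaseClient.instances_created)
```

Note the global mutable state: it keeps the client count at one per process, but it is exactly the kind of hidden dependency that makes the code awkward to test.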

Cost of the Infrastructure

Zero maintenance is not free. Make sure you understand the cost model. In our case the main cost factors were:

Storage: Bigtable
Data processing: Dataflow
Distributed cache: Redis

Bigtable

Server node: $0.65 per node per hour
Daily cost of the cluster: 10-11 nodes x 24 hours x $0.65 = $156 to $172, cca $159 in our case
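
The daily figure can be checked with a few lines of Python. Only the $0.65 rate and the range of 10 to 11 nodes come from the text; the function name and the 30-day projection are added for illustration.

```python
NODE_RATE_USD_PER_HOUR = 0.65  # rate quoted above


def bigtable_daily_cost(nodes, hours=24, rate=NODE_RATE_USD_PER_HOUR):
    """Cost of running `nodes` Bigtable nodes for `hours` hours."""
    return nodes * hours * rate


low = bigtable_daily_cost(10)
high = bigtable_daily_cost(11)
print(f"${low:.2f} to ${high:.2f} per day")
print(f"${low * 30:.0f} to ${high * 30:.0f} per 30 days")
```

Each additional node adds $15.60 per day, so an extra node left running for a month costs roughly $470.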

Dataflow

See pricing for the current rates. The rates we dealt with were as follows.

CPU: $0.072 per vCPU per hour
Memory: $0.004172 per GB per hour
Data shuffle: $0.018 per GB
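
To see how these rates add up, here is a rough estimator in Python. The three rates are the ones listed above; the worker shape (5 workers, 4 vCPUs and 15 GB each) and the 500 GB of shuffled data are made-up inputs, and other billable items such as persistent disks are deliberately left out.

```python
CPU_USD_PER_VCPU_HOUR = 0.072
MEMORY_USD_PER_GB_HOUR = 0.004172
SHUFFLE_USD_PER_GB = 0.018


def dataflow_cost(workers, vcpus_per_worker, memory_gb_per_worker,
                  hours, shuffled_gb):
    """Estimate Dataflow cost from CPU, memory and shuffle only."""
    cpu = workers * vcpus_per_worker * hours * CPU_USD_PER_VCPU_HOUR
    memory = workers * memory_gb_per_worker * hours * MEMORY_USD_PER_GB_HOUR
    shuffle = shuffled_gb * SHUFFLE_USD_PER_GB
    return {"cpu": cpu, "memory": memory, "shuffle": shuffle,
            "total": cpu + memory + shuffle}


estimate = dataflow_cost(workers=5, vcpus_per_worker=4,
                         memory_gb_per_worker=15, hours=24, shuffled_gb=500)
for item, usd in estimate.items():
    print(f"{item}: ${usd:.2f}")
```

In this made-up scenario CPU dominates the bill, while shuffle costs more than memory, which is one more reason to keep an eye on GroupByKey.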

Redis

Fixed cost per instance type (M3): $0.026 per GB per hour

Hidden Cost of Serverless?

Watch out for expenses that are not too obvious. They might surprise you if you are not careful.

Logging

Free allotment: 50 GiB per project per month
$0.50/GiB for logs ingested (includes 30 days of free storage)
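
The free allotment makes logging look cheap until volume picks up. A small Python sketch using the two numbers above (the function name and the sample volumes are hypothetical):

```python
FREE_GIB_PER_MONTH = 50
USD_PER_GIB = 0.50


def logging_monthly_cost(ingested_gib):
    """Monthly ingestion cost for one project after the free allotment."""
    billable = max(0, ingested_gib - FREE_GIB_PER_MONTH)
    return billable * USD_PER_GIB


quiet_month = logging_monthly_cost(40)
noisy_month = logging_monthly_cost(1_000)
print(quiet_month, noisy_month)
```

A chatty streaming pipeline that logs per element can move a project from the first case to the second without anyone noticing.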

Network

Moving data within Google Cloud comes at a cost. This includes shuffling data between virtual machines, to and from Bigtable, from Pub/Sub to BigQuery etc.

Summary

In this post, I have expanded on the complexities and trade-offs associated with using serverless technologies within Google Cloud Platform, namely the benefits and downsides of using Dataflow for streaming and batch processing. While Dataflow offers scalability and monitoring, the tradeoff is a steep learning curve, a lack of comprehensive documentation and sensitivity to failures.

I have also discussed the use of Bigtable as a scalable database and the importance of capacity planning and storage type selection for optimal performance.

Bear in mind that choosing the discussed technologies has an impact on developer productivity due to the unique constraints of a serverless environment.

Lastly, I have provided insights into the financial aspects, detailing costs associated with services like Bigtable, Dataflow, and Redis, as well as hidden expenses such as logging and network charges.

Overall, I hope the article helps you navigate the complexities and costs that organizations should consider when adopting Google Cloud technologies.

Thanks for reading and all the best with choosing technology that makes you happy.

Published by Tomas Zezula on August 27, 2023
