Part 2: Designing a Data Collection Service for Scale
In Part 1 of this series I discussed the basic structure of a framework for rotating content on a site, which at that point was mainly in the interest of building software to run A/B and multivariate (MVT) testing scenarios. This post covers collecting not only the results of those tests, but virtually any external data that can be gleaned from a user’s session, going far beyond the testing use case and into a world where data can be a means to any end.
There are a number of priorities to juggle when considering the daunting task of handling anything the world wide web could throw at your collection services. First and foremost is speed and scale. Of course, a truly dedicated DDoS attack could cripple even Oracle’s infrastructure, but there are tricks and traps that can stop all but the most heinous of malefactors. In the end, however, we must acknowledge that no matter how lightning-quick your collection might be, your pipeline is only as fast as its slowest processing step.
Speed
We touched on the paramount importance of this aspect in Part 1: especially in the realm of site content rotation, it is key to be able to process calls to/from your service within a couple of milliseconds. In the general case of data collection, as we’re now discussing, it is no less crucial to make this the primary focus of your efforts. Depending on who your clients will be, it can be normal to expect to handle tens of thousands of requests per second, and that figure will only grow as more of our devices generate data about us. The corollary of that number is that you’ll need to ensure those tens of thousands of requests are through your pipeline by the time the next second comes, or you’re already behind!
There are a few considerations that will affect how fast your data can go from the user’s webpage into your pipeline, and many of them will be particular to your environment and its limitations. However, here are some ideas for optimization that should be considered in every case:
- Ensure your JavaScript collection tag sends tracking calls as an async operation, to prevent flicker or slowed loading on the host site
- On the backend, choose an HTTP service library with the highest throughput benchmarks
- Even at the cost of maintainability, ease of adoption, and robustness (within reason, of course)
- Include timing metrics from the start; with a library like Dropwizard Metrics you can surround each bit of code with a timer (a sketch follows this list)
- As a solid strategy, put a timer on each method in your intake logic and make a point to time sub-methods even if their parent is already timed
- A collection server should be lightweight enough that an event makes it from HTTP to pipeline in only a handful of methods, so this shouldn’t be too arduous (and a good metrics library introduces very little latency of its own)
- This will be your primary tool for tracking down bottlenecks, and there WILL be bottlenecks
- If there is ANY amount of side processing that needs to be done (e.g. checking whether the tracking event is also asking us to send back additional content), ensure that it is spun off to its own thread (a sketch of this hand-off also follows this list)
- The tracking logic should be a Steel Thread that is only concerned with getting from A to B with hopefully little more than a bit of serialization along the way
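As a rough illustration of the timing advice above, here is a minimal sketch using Dropwizard Metrics. The handleEvent and enqueue method names are stand-ins for your own intake logic, not anything prescribed by the library; the point is that the parent method and its sub-method each get their own timer.

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

// Sketch: per-method timers around a hypothetical intake path.
public class CollectionTimings {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Timer handleTimer = registry.timer("collector.handleEvent");
    private static final Timer enqueueTimer = registry.timer("collector.enqueue");

    public void handleEvent(String rawPayload) {
        try (Timer.Context ignored = handleTimer.time()) {
            // ... validation, light parsing, etc. ...
            enqueue(rawPayload);
        }
    }

    private void enqueue(String rawPayload) {
        // Time the sub-method even though its parent is already timed.
        try (Timer.Context ignored = enqueueTimer.time()) {
            // ... hand the event off to the outbound pipeline ...
        }
    }

    public static void startReporting() {
        // Dump timings to stdout every 10 seconds while testing.
        ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build()
                .start(10, TimeUnit.SECONDS);
    }
}
```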
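Similarly, here is a sketch of pushing side processing off the hot path with a small dedicated thread pool. The content-lookup work is a hypothetical example of the kind of extra task that should never block intake.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: keep the hot path to "receive and enqueue"; everything else goes to a side pool.
public class SideWorkOffload {
    // A small dedicated pool so side work never blocks the intake thread.
    private final ExecutorService sideWorkPool = Executors.newFixedThreadPool(4);

    public void handleEvent(String rawPayload, boolean wantsContent) {
        // Hot path: get the event into the pipeline as fast as possible.
        enqueue(rawPayload);

        // Hypothetical side work (e.g. looking up extra content to return) runs on its own thread.
        if (wantsContent) {
            sideWorkPool.submit(() -> lookUpAdditionalContent(rawPayload));
        }
    }

    private void enqueue(String rawPayload) { /* hand off to the pipeline */ }

    private void lookUpAdditionalContent(String rawPayload) { /* hypothetical side work */ }
}
```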
The Three Steps of Bottleneck Busting
- Unit Performance Tests: You’ll be iterating MANY times on your collection logic, so unit tests, being the fastest to build and run compared to integration tests, are the clear choice to start with. This can be as simple as sending a number of hits (start low, around 10,000, and work up; if things are VERY slow it could otherwise take too long to get usable results) and then printing out timings from your metrics (a sketch of such a harness appears after this list)
- This step is focused on the collection logic itself, since the entire test stays in-code and we get to ignore confounding environmental factors. An approach that has worked well is to find the smallest method that is taking the longest duration. It’s pretty typical that there is one operation with a much larger execution time than anything else, and once that method has been sped up, it’s likely another one will take the lead as #1 Bottleneck
- There are a number of tricks to speed up code. Off the top of my head: skip (de)serializing event data and pass it on as a raw string (or pull out the one field you need without parsing it all, as sketched after this list); rework loops to terminate early; use more time-efficient data structures; avoid locks, syncs, and other thread-safety overhead wherever you can (which means mutable global state is mostly out); multi-thread wherever possible when doing many short operations, or conversely use batching for a few long, batch-appropriate processes
- Once you’ve reached the point where the longest-running methods have been sliced down as much as possible, and none of the others have a significant impact, you’re ready to move on to the next optimizations, strong in the knowledge that at least your main logic is performant
- Local Integration Tests: After we know the code is in a good place, it’s time to consider the entry and exit costs of your collection service. Since Step (1) mostly optimized the code itself, here we should be looking for things we can tweak in our configuration to get more out of the technologies we’re using. You’ll want to get your application running locally, either directly on your machine or in a local container; this pays high dividends over going out to a pod environment, as there will be significant iteration in this step as well. Once it is running (with its HTTP endpoints, and a local streaming service like Kafka to receive the events on the way out), you’ll want to create an integration framework that you can run locally to pummel your service. Here are some things to consider when creating and running these performance tests:
- Be sure to support multiple varieties of input. In earlier stages it might have made sense to send the same event over and over, but now we’re trying to find more exotic bottlenecks. Send nothing but huge payloads, send empty payloads or ones that are supposed to be discarded, and send events spread across multiple client accounts, in addition to the classic duplicated payload. It’s very likely that different inputs take different amounts of time, and those differences will provide hints that could help all cases
- Look at your pools first and try scaling them up and down to find the sweet spot (too many threads can yield diminishing returns due to the overhead of managing them). Commonly you’d look at: the pool of connection handlers your HTTP server uses (features like keep-alives can speed things up a lot, but keep in mind they will skew performance results, since the one connection generating your test data stays alive and is therefore much faster); the thread pool used by your code infrastructure (e.g. Akka has its own thread pools for handling messages across your application); and the thread pool and settings used by your output streaming service (e.g. Kafka has tons of options to tweak for both producers and consumers; a producer-side sketch appears after this list)
- In this step and the next we can begin to evaluate the rate of ingestion at which we start to fall behind. The sad truth is that no server (or array of servers) you eventually run your collector on can handle infinite throughput; there will be a limit to how much data you can process before you begin to fall behind. The key here is to set a goal of what you want to achieve (e.g. 15,000 events/sec on a single local server, which would be quite impressive) and make sure we’re handling things gracefully when we breach that limit. It is common to find that certain types of input, or doing things like sending input across multiple accounts or from multiple send threads, have a large impact on what throughput is possible
- Testing this is pretty straightforward: simply slam the server with as many hits as you can (each with a timestamp) and see whether the last event comes through collection at about the same second that your performance testing framework finishes. If it doesn’t, you are behind and can lower your throughput until you find what is achievable; Ops will appreciate this data point so they know how to scale
- Pod/Pre-Production “Real-World” Testing: At this point we should be reasonably assured that we’ve hammered our business logic into the sleekest Steel Thread possible and tweaked our pools to get everything we can out of them. Now we’ll bring our little collector out into a production-like pod to put it through its paces. A word of warning: many virtual servers out there are quite small, with 1-2 vCPUs, and even when hit locally will likely show much worse processing speed than you saw in local testing
- Throughout this portion, it will be highly important to think about the WAN (i.e. literally WHERE [as in, on Earth] does the collection server live, WHERE is your performance test hitting it from, and WHERE is the streaming service [e.g. Kafka] receiving the output). It’s a good policy to run all your tests from the same cluster of servers, and be sure not to run the testing code on the servers that are hosting your collector. It would be prudent as well to check the time of a simple ping from each testing server to its target collection servers to see the raw delay of communication between these nodes (that amount can be subtracted from your timings if desired)
- In a real-world application, this is one of the reasons that, of all your servers, it is most important to ensure your collection servers are distributed around the world: if your code is truly fast, then the delay for a Germany-based event to reach a USA-based collection server will be MUCH longer than the processing time of the event itself.
- If testing with sender/receiver nodes in different geographic locations reveals performance issues, consider limiting the number of sends. An example fix would be to batch up the events that are sent from the collection servers to your streaming pipeline (again, see the producer sketch after this list)
- At this point we will start looking at durability of performance as well. This means trying to detect memory leaks and other drop-offs over an extended time. Many memory leaks (and connection leaks that show up as “file handle limits”) won’t appear immediately while pummeling the server. To catch these, write a slow-burn script (it doesn’t have to be complex) that sends a consistent trickle of data, then let it run overnight (a sketch of such a sender appears after this list). For extra credit, attach a profiler or JMX-based monitoring tool that shows memory usage over time. With those tools you’ll be able to detect slowly creeping memory of a certain type (or you’ll get an Out of Memory Error or “too many open files,” each of which also indicates a leak). A profiler will hopefully also give you an additional hint about what kind of objects are being kept around
- Lastly, you’ll need to determine what kind of servers, and how many, are needed to meet your throughput goals. There have been occasions where a single node with 4 cores (Fewer and Stronger) is able to outperform two nodes with 2 cores each (More and Weaker), and vice versa. Another useful data point in this portion of the test is how well your Fewer/More servers handle hits coming from a single testing server vs. multiple; tend towards the option that handles multiple better, as that is more realistic. Of course, many problems in CS can be solved with additional servers, but you want to know just how many will be needed
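To make Step (1) a bit more concrete, here is a minimal sketch of such a unit-level harness. The handleEvent stub and the payload shape are placeholders standing in for your real intake logic; the point is just to hammer the handler in a loop and print the Dropwizard timer snapshot at the end.

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

// Sketch: a unit-level performance harness. handleEvent is a stub for real intake logic.
public class HandleEventPerfTest {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Timer handleTimer = registry.timer("collector.handleEvent");

    static void handleEvent(String rawPayload) {
        try (Timer.Context ignored = handleTimer.time()) {
            // ... real collection logic goes here ...
        }
    }

    public static void main(String[] args) {
        int hits = 10_000; // start low and work up
        long start = System.nanoTime();
        for (int i = 0; i < hits; i++) {
            // Vary the payload slightly so multiple accounts are exercised.
            handleEvent("{\"account\":\"acct-" + (i % 5) + "\",\"ts\":" + System.currentTimeMillis() + "}");
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        Snapshot timings = handleTimer.getSnapshot();
        System.out.printf("%d hits in %d ms (~%.0f hits/sec)%n",
                hits, elapsedMs, hits / Math.max(elapsedMs / 1000.0, 0.001));
        System.out.printf("handleEvent p50=%.3f ms, p99=%.3f ms%n",
                timings.getMedian() / 1_000_000.0,
                timings.get99thPercentile() / 1_000_000.0);
    }
}
```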
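As an example of the “don’t deserialize what you don’t need” trick from Step (1), here is a sketch that pulls a single field out of a raw JSON string with plain string scanning instead of a full parse. The field name and payload shape are assumptions for illustration; this is deliberately naive and only safe if you control the payload format.

```java
// Sketch: pull one field out of a raw JSON payload without deserializing it.
// Assumes a flat, well-formed payload like {"account":"acct-1","ts":1234567890}.
public class RawFieldExtractor {

    public static String extractStringField(String rawPayload, String fieldName) {
        String marker = "\"" + fieldName + "\":\"";
        int start = rawPayload.indexOf(marker);
        if (start < 0) {
            return null; // field not present; caller decides what to do
        }
        start += marker.length();
        int end = rawPayload.indexOf('"', start);
        return end < 0 ? null : rawPayload.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = "{\"account\":\"acct-1\",\"ts\":1234567890}";
        System.out.println(extractStringField(raw, "account")); // prints acct-1
    }
}
```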
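For the Kafka producer knobs mentioned in Steps (2) and (3), the settings that most directly trade a little latency for throughput are the batching and compression ones. The broker address, topic name, and specific values below are placeholders to tune against your own tests, not recommendations.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Sketch: a Kafka producer configured to batch sends from the collector to the pipeline.
public class BatchingEventProducer {

    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Batch events instead of sending one request per event.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);   // bytes per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);            // wait up to 5 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // cheaper bytes over the WAN
        props.put(ProducerConfig.ACKS_CONFIG, "1");               // leader ack only; tune for durability

        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaProducer<String, String> producer = create()) {
            // Topic name and payload are placeholders.
            producer.send(new ProducerRecord<>("collected-events", "acct-1", "{\"example\":true}"));
        }
    }
}
```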
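And for the overnight slow-burn, the script really can be this simple. This sketch assumes the collector exposes a plain HTTP POST endpoint; the URL, payload, and send rate are all placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: a steady trickle of events to surface memory/connection leaks over hours.
public class SlowBurnSender {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        while (true) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://collector.example.com/track")) // placeholder endpoint
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            "{\"account\":\"acct-1\",\"ts\":" + System.currentTimeMillis() + "}"))
                    .build();

            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() >= 400) {
                System.err.println("collector returned " + response.statusCode());
            }

            Thread.sleep(200); // ~5 events/sec, left to run overnight
        }
    }
}
```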
As you can see there is a LOT to consider when building for speed, and undoubtedly that should be the main concern on the mind of anyone hoping to ingest whatever the net can throw at them. However, if you go through this iterative process of smoothing over your bottlenecks, and invest in the infrastructure to distribute collection servers across all the global markets feeding them, then you’ll be eating data with the best of them!