At GameChanger (GC), we recently decided to use Kafka as a core piece of our new data pipeline project. We chose Kafka for its consistency and availability, its ability to provide ordered message logs, and its impressive throughput. This is the first in a series of two blog posts describing how we found the best Kafka configuration for GC.
In this post I will discuss how I learned as much about Kafka as fast as possible, using a set of targeted experiments based on the scientific method. I ran a wide variety of experiments on throughput, box scaling, and what specific configuration values meant. I will discuss three experiments in particular:
- What happened when I killed a Kafka box while writing messages to it?
- Experiments around configuration of Kafka’s topics
- Experiments around Kafka’s read and write throughput
I will go over the setup for each experiment, why I ran it, its results, and how those results are affecting and guiding GC moving forward.
At GC we have put a lot of systems and technologies into production. We have found there are often new failure scenarios that present themselves once a system is in production under heavy load. I wanted to expose configuration and failure scenarios for Kafka before they could adversely affect our production environment and our customers. The question I was then faced with: How can I learn as much as possible as fast as possible?
I decided to use the scientific method. The basic structure of the scientific method is:
- Form a hypothesis about an observation
- Figure out and perform an experiment to test the hypothesis
- Analyze the data and draw conclusions
This method is used in the hard sciences by thousands of people each day to learn new things, but how is it useful when learning a new technology?
When learning a new technology, the standard avenue for learning is reading the docs. But language is dynamic and, as I have to explain to my father sometimes, words can mean many things. What I interpret from a written description of a configuration variable might not be what the author meant for me to get out of it. The other most common ways to learn about a system are help forums, like Stack Overflow, and blog posts. Both of these sources give information on how a system or technology worked for someone else in specific scenarios. But just because it worked one way for them, that does not mean it will work that way for you. The only way to be 100% sure about how things work is to do experiments and test your information and assumptions. That is why the scientific method is the perfect tool here.
The scientific method also has some other nice consequences. First, by confirming or disproving information, the scientific method allows people to begin to build a firm foundation of knowledge to move forward with. Second, even if the documentation is written perfectly, experiments still have great value. They serve to reinforce documentation with specific examples. These examples can then more easily be drawn on later when faced with an unexpected problem in production.
So far I have given ample reasoning for why I chose experiments to learn about Kafka, but I also want to dive into some of the actual experiments I ran. I ran a ton of experiments: everything from attempting to delete a topic when delete is enabled, to changing whether or not topic rebalancing happens. Each new unknown was approached with the scientific method in mind. Although I ran many experiments, I want to discuss three in particular. (For reference, I will be using terms and language from the Kafka documentation.)
Experiment 1: What happened when I killed a Kafka box while writing messages to it?
Why is it important?
The first and most basic experiment I want to talk about is testing what happened if a box was killed while writing messages to it. There are a few reasons I chose this experiment. First, the experiment is very simple and tests a fundamental part of the system. This makes it easier to apply the information to future scenarios. Second, boxes will fail so we must know how the system will behave when that happens.
Experiment setup
- 3 Kafka Boxes
- 1 Kafka Topic
- Kafka Topic had 1 partition with replication factor 3
- This means 1 box was the leader, or owner, of the data written
- Box A wrote to the Kafka cluster
- Box B read from the Kafka cluster
- As Box A was writing to the Kafka cluster, I killed the box that owned the topic’s only partition between writes (a sketch of the producer side follows below)
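To make the setup concrete, here is a minimal sketch of the kind of producer loop Box A ran, written against the modern Kafka Java client. The broker hostnames, message count, and blocking-send style are illustrative assumptions, not our exact test harness:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KillTestProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical addresses for the 3 Kafka boxes.
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                // Block on each send so every acknowledgment (or error, such as
                // NotLeaderForPartitionException) is observed in write order.
                producer.send(new ProducerRecord<>("KafkaEvent", Integer.toString(i))).get();
                System.out.println("acked: " + i);
            }
        }
    }
}
```

Blocking on each `send()` makes it easy to see exactly which writes were acknowledged before the leader died and which failed while a new leader was being elected.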
Hypothesis
I expected that Kafka would handle this killing gracefully and no data would be lost. That means that data that was accepted as written by Box A, would successfully be read by Box B.
Results
The results were a little surprising, but generally confirmed the hypothesis. Initially, Kafka did not accept any writes to the cluster; it returned a `NotLeaderForPartitionException` error to Box A. Quickly thereafter, Kafka started accepting writes again. Throughout the process, Box B read all of the writes acknowledged to Box A and did not read any unacknowledged writes.
How we use this knowledge moving forward
After this experiment, I had more confidence in Kafka, had learned a new failure case, and had gained a better foundation to approach more complicated experiments.
Experiment 2: Testing Kafka’s Topic characteristics
Why is it important?
When reading over Kafka’s documentation, I stumbled upon the config variables `min.insync.replicas` and `request.required.acks`. These variables let you specify the number of boxes that must receive and acknowledge a write before that write to the cluster succeeds. I was hopeful that this would provide a lot of power and flexibility: for topics that needed stronger durability/consistency guarantees, I could require a higher number of acknowledgments, and for topics where I wanted to favor availability over consistency, I could configure them that way. The problem was that I did not have a great feel for how this would work in practice, so I set up some experiments.
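As a concrete illustration (mine, not from the original experiments): the topic-level `min.insync.replicas` setting is applied at topic creation, and the old producer setting `request.required.acks` corresponds to `acks` in the modern Java client. In the sketch below, the broker address, topic name, and config values are placeholders:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092"); // hypothetical broker

        try (Admin admin = Admin.create(props)) {
            // 1 partition, replication factor 3, matching the experiment setup.
            NewTopic topic = new NewTopic("KafkaEvent", 1, (short) 3)
                // Writes from a producer using acks=all fail unless at least
                // this many replicas are in sync.
                .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```

Note that `min.insync.replicas` is only enforced for producers that set `acks=all`; with `acks=1`, the leader alone acknowledges the write.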
Experiment setup
- 3 Kafka Boxes
- 1 Kafka Topic
- Kafka Topic had 1 partition with replication factor 3
- Box A wrote to the Kafka cluster
- Box B read from the Kafka cluster
- I did a series of tests where I tried different write acknowledgment values for the partition
- As Box A was writing to the Kafka cluster, I killed one of the Kafka boxes
Hypotheses and Results
A scaled down version of the tests I performed and their outcomes can be seen in the table below.
| Box Acknowledgments | Hypothesis | Results |
| --- | --- | --- |
| 1 | Writes should continue to work | Writes continued to work |
| 2 | Writes should continue to work | Writes continued to work |
| 3 | Writes should continue to work | Writes did not work. When attempting to write, Box A received an error message: `[KafkaEvent,0] failed due to Number of insync replicas for partition [KafkaEvent,0] is [2], below required minimum [3]` |
How we use this knowledge moving forward
Using the results of this experiment, I learned that both too many and too few required acknowledgments make the system more brittle (a single box dying could cause data loss or availability loss). This led us to configure all of our current topics to require a majority of boxes to acknowledge writes before accepting them (`acknowledgments = (number of boxes / 2) + 1`; written another way, a cluster of `N = 2F + 1` boxes can tolerate `F` failures).
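As a quick sanity check on the arithmetic, here is a trivial helper (mine, not part of the experiments) that makes the majority rule explicit:

```java
public class MajorityAcks {
    // Majority acknowledgments for an N-box cluster: floor(N / 2) + 1.
    static int majorityAcks(int numBoxes) {
        return numBoxes / 2 + 1; // integer division
    }

    public static void main(String[] args) {
        // For our 3-box cluster: 3 / 2 + 1 == 2 acknowledgments,
        // which tolerates F = 1 box failure (N = 2F + 1 with F = 1).
        System.out.println(majorityAcks(3)); // prints 2
    }
}
```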
Experiment 3: Determining our Kafka cluster’s read and write throughput
Why is it important?
Other blog posts have reported that Kafka can handle 50 MB of data per second under heavy load. I wanted to see if I could reproduce these results on our Kafka setup, and to determine whether our boxes had optimal network, disk, and memory configurations.
Experiment setup
- 3 Kafka Boxes
- 1 Kafka Topic
- Kafka Topic had 1 partition with replication factor 3
- Topic had write acknowledgment of 2, as discussed in experiment 2
- Built 2 new Docker images that used the producing and consuming load test scripts that came with Kafka
- On a new box, I used the Docker images to run the scripts against our Kafka cluster (a simplified version of the measurement is sketched below)
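The actual load tests used the perf-test scripts that ship with Kafka, but the measurement they perform boils down to something like the following stand-in. The broker addresses, message size, and message count are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WriteThroughput {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092");
        props.put("acks", "all"); // pairs with min.insync.replicas=2 from experiment 2
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");

        byte[] payload = new byte[1024]; // 1 KB messages
        int count = 100_000;             // ~100 MB total
        long start = System.nanoTime();
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < count; i++) {
                producer.send(new ProducerRecord<>("KafkaEvent", payload));
            }
            producer.flush(); // wait for all acknowledgments before stopping the clock
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbPerSec = (count * payload.length) / (1024.0 * 1024.0) / seconds;
        System.out.printf("wrote %.1f MB/s%n", mbPerSec);
    }
}
```

Dividing the total bytes acknowledged by the elapsed time gives the MB/s figure to compare against the 50 MB/s reported elsewhere.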
Hypothesis
I assumed that our read and write throughput would be around the 50 MB/s reported in the other blog posts.
Results
Our cluster was able to handle about 5 MB/s, which was much less than I had expected.
How we use this knowledge moving forward
The current setup was not optimal and needed to be tweaked for better performance. This information was critical to making sure our Kafka cluster was ready for our traffic peaks. If I had just relied on the data from blog posts and other outside sources, it could have come back to bite us under heavy load. The results also pointed to the need for a new series of experiments to optimize the Kafka cluster’s throughput.
The three experiments discussed above gave us invaluable information about how to configure our Kafka cluster. Using the scientific method helped me learn about Kafka as quickly as possible: it exposed what I did not know and let me learn it before the system failed in production. This approach helped me ramp up quickly, and I highly recommend it to anyone attempting to learn a new system, technology, or language.