Monday, Aug 11, 2014

UX for multisport scorekeeping

The GameChanger mobile app currently provides scorekeeping for baseball, softball, and basketball, with more sports coming in the future. This presents an interesting challenge in terms of user experience. Aspects like pace and complexity vary so much from sport to sport that the challenges a scorekeeper faces are completely different. Taking these differences into account is essential when designing the app’s user interface. After years of interacting with coaches and scorekeepers and gathering their feedback, GameChanger has worked to create a great product that serves all the supported sports well. This post briefly describes what GameChanger has done so far and some of the main challenges ahead for improving basketball scorekeeping.

Baseball and softball are very complex sports. The rules and the way the game is defined allow for a very large number of combinations of events that can happen in a single play. Furthermore, coaches are interested in keeping very detailed and complex stats. This means that if a player hits a single, the app has to be able to record whether the player hit a ground ball or a line drive, the pitch type, whether there was a defensive error, whether the ball went to left, right, or center field, and which player fielded it. That’s a lot to ask for each batter.
 
Fortunately, this complexity is mitigated by the pace of the game. Baseball and softball are not very fast sports, which means there’s enough time between plays for scorekeepers to record all this information without falling behind in the game. For this reason, the focus when designing baseball and softball scorekeeping for the app was to make every possible flow clear and accessible to the user, even if that meant going through several screens to score each play.
 

Baseball scorekeeping

Basketball is almost the complete opposite of baseball and softball in terms of complexity and pace. When a player scores, the only information that needs to be recorded is where the shot was made and by whom. Some coaches might also be interested in recording assists. Compared to baseball, the flow a scorekeeper has to go through is much shorter, and yet users often find it difficult to score a game without falling behind.

The reason is that basketball is a very fast sport. Many things can happen in a span of seconds, and all of them need to be recorded in order to obtain accurate stats. With this in mind, the GameChanger app tries to make scoring a basketball play as quick as possible. Buttons are bigger and options are limited. Furthermore, the app allows the user to review and correct previous plays, so users know they can keep paying attention to the game and, when they get a chance, go back and assign a shot or a foul that was recorded earlier. This helps reduce the stress caused by failing to record some of the plays.

Basketball scorekeeping 

Despite these efforts, basketball scorekeeping is still too hard for some users. Scoring plays needs to be even faster. This can be achieved by predicting how users behave when using the app and optimizing the user interface for that behavior. There are tools that can help with this, such as CogTool, which lets you design different storyboards and predicts how skilled users would perform various tasks on each of them.

Every change made to the current interface involves some risk, since it’s hard to anticipate whether users will understand the new interaction models. That is why a rapid prototyping methodology becomes really important, as it provides early feedback from users. It involves iterating quickly over UI concepts and testing them in the office with actual users. After the tests are conducted and analyzed, the team can decide what works and what doesn’t, and develop the next prototype for a new round of user testing. As an engineer, this has been a great opportunity for me to be in direct contact with users and understand how they perceive and use the app, which I think is essential for building a successful product that actually provides great value to the people who use it.

Understanding how the differences between sports affect what users need from the app is essential to providing a successful experience. This will become harder and even more important in the future as new sports are introduced, because each one will bring new challenges that need to be addressed. To do so successfully, it is very important to maintain a close relationship with coaches and scorekeepers, as GameChanger has done so far, since that is the foundation for understanding what they really need from the product and what the best user experience should be.

Thursday, May 29, 2014

Landing in NYC

I arrived in New York on a Tuesday at 4:00 am, on a flight that had departed five hours earlier from Bogota, Colombia. The reason for my trip: taking part in an eighteen-month training program at GameChanger. After a few weeks in the city I have found an apartment to live in, bought furniture for said apartment, and started the training program.

During this time, I have found that some things work very similarly in New York and in Bogota, while others definitely require some learning and adjustment. I want to write about two of them that I’ve found particularly interesting.

The first one is public transportation. New York and Bogota are both big cities and, as a consequence, efficient transportation is a huge challenge for both of them. I was aware that Bogota’s public transportation system couldn’t be described as ‘good’, but after using New York’s for a couple of weeks I can say that the difference is significant. While looking for an apartment I spent whole days going from one part of the city to another, and it amazed me how fast and easy it was to move around.

In Bogota, using public transportation can be a very unpleasant experience. There is no subway, so the city relies on buses to transport millions of people every day. These buses tend to be overcrowded and slow, and a five-mile trip can often take over an hour. Furthermore, there are several companies running the transportation system, which means that a lot of people need to pay more than one fare to get from their homes to their destinations.

The second one is choosing your healthcare plan when you start working at a new company.

In Colombia, the basic medical coverage of all healthcare plans is set by the government, and every healthcare provider is required to offer at least that minimum plan. This means that when a person starts a new job, the decision comes down to which network of medical institutions they prefer. That’s it. Furthermore, if you change jobs you can transfer your existing plan to the new company, since every company is obligated to pay the health provider chosen by the employee.

On the other hand, choosing my health plan in New York felt like something that required a great deal of knowledge and understanding of the system, which I obviously didn’t have. I had to decide whether opening an HSA or an FSA suited me best, how large a deductible to choose, whether I needed vision and dental plans, and which of the several available plans best fit my needs. It was an overwhelming experience that required a few hours of reading and researching, plus the valuable help of a teammate, to finally understand the options and make a decision about my healthcare plans.

These are just two of the several things I have had to adjust to since arriving in New York. The list could also include things like renting an apartment in Manhattan, understanding taxes, and driving. And I’m quite sure I will keep finding more things to add to the list in the future, which is great. It has been an enriching and eye-opening experience that I have fully enjoyed so far, and hopefully I’ll continue learning how things work in the USA.


Monday, Feb 10, 2014

Making Downtime Less Painful

Downtime! It's probably every sysadmin's least favorite word. But sometimes it's necessary, and when we're lucky, we can plan for it in advance, during off-hours, to do some much-needed maintenance.

Whenever we need to do maintenance or an upgrade on our Mongo database, for example, we put the site into "scheduled downtime" mode. The end result for users is that they see a page saying "we're down for scheduled maintenance, please come back later" with a link to our status page. If we didn't do this, users would instead see a blank page, or get lots of 500 errors, or other undesirable behavior.

To accomplish this, we created a tool called Downtime Abbey. 

Abbey works by changing our Amazon load balancers to send traffic to a different port. This port is set up to respond with a 503 and the maintenance page. The 503 response (instead of the 500 errors that would otherwise result) tells search engines that the outage is temporary, so they won't remove our pages from their indexes. The tool uses boto to send all traffic to this downtime port at the beginning of our maintenance windows, and restores the original settings at the end.
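
To make the mechanics concrete, here is a minimal sketch of that listener switch using boto's ELB API. The region, load balancer name, and port numbers are illustrative assumptions rather than our actual configuration, and the real tool also records the original settings so it can restore them afterwards.

    # Sketch of the downtime listener swap (boto 2.x ELB API).
    # The region, load balancer name, and ports below are placeholders.
    import boto.ec2.elb

    conn = boto.ec2.elb.connect_to_region('us-east-1')

    def enter_downtime(lb_name, downtime_port=8503):
        # Point port 80 at the port that serves the 503 maintenance page.
        conn.delete_load_balancer_listeners(lb_name, [80])
        conn.create_load_balancer_listeners(lb_name, [(80, downtime_port, 'HTTP')])

    def exit_downtime(lb_name, app_port=8080):
        # Restore the normal listener so traffic reaches the application again.
        conn.delete_load_balancer_listeners(lb_name, [80])
        conn.create_load_balancer_listeners(lb_name, [(80, app_port, 'HTTP')])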

In addition, it reconfigures the load balancer health checks. Amazon ELBs have configurable health checks that make sure each node behind the load balancer is healthy; unhealthy nodes are removed so traffic doesn't get sent to them. When we're returning a 503, the health checks fail, which would remove every server from the load balancer and replace the maintenance page users see with nothing at all. To work around this, Abbey changes the health checks to point at a different target (one that returns a 200 OK during maintenance) and changes them back at the end of maintenance, so the load balancers stay happy the whole time.
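
The health check swap can be sketched the same way. The check targets and thresholds here are made-up examples, not the values Abbey actually uses.

    # Sketch of the health check swap (boto 2.x). Targets are illustrative.
    import boto.ec2.elb
    from boto.ec2.elb import HealthCheck

    conn = boto.ec2.elb.connect_to_region('us-east-1')

    def set_health_check(lb_name, target):
        lb = conn.get_all_load_balancers(load_balancer_names=[lb_name])[0]
        lb.configure_health_check(HealthCheck(
            target=target,            # e.g. 'HTTP:8503/maintenance-ok' during downtime
            interval=10,
            timeout=5,
            healthy_threshold=2,
            unhealthy_threshold=3,
        ))

    # During maintenance, point the check at something that returns 200 OK:
    #   set_health_check('gc-web', 'HTTP:8503/maintenance-ok')
    # At the end of the window, restore the real application check:
    #   set_health_check('gc-web', 'HTTP:8080/health')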

But wait, there's more!

We use Sensu to monitor all our hosts and services, and naturally during downtime, when things are purposely stopped, the checks for these things will fail. We have Sensu configured to send critical alerts to Hipchat, and this used to cause Hipchat to fill up with a lot of red during downtime.

So much red that we almost missed an actual problem in there - the chef-client failure at the bottom indicating that setting up a software RAID array didn't work properly. Also, alert fatigue is bad, and training ourselves to ignore alerts is bad, so we needed to come up with a way to make these false (or rather, expected) alerts not happen in the first place.

Sensu deals with failures by sending messages to one or more handlers. We have a group of handlers called the 'default' group that includes Hipchat, email, Datadog, and PagerDuty. Before, this was hard-coded into the Sensu configuration, but to deal with downtime alerting, we made it an attribute in Chef that we could override as needed. 

We created a Chef recipe called sensu::downtime that overrode the list of default handlers to be empty. Failures will still show up on the Sensu dashboard, but they won't go anywhere else. After changing the handlers attribute, the recipe then restarts the sensu-server service so this change takes effect. Adding this recipe to the run list of the Sensu server overrides the default list (all the handlers) with the empty list, and removing it from the run list lets the defaults stay default (also restarting the service so we start getting alerts again). 

But doing that by hand would have been a pain, and one more thing to potentially forget, so PyChef to the rescue! Now, the buttons in the Abbey tool add and remove this recipe from the Sensu server automatically. This means that Sensu is quiet in Hipchat (and PagerDuty) during our scheduled downtime, so we don't get flooded with red messages we don't care about. At the press of a button, most of the downtime pain is automatically taken care of... except for the pesky maintenance itself!
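
Under the hood, that button press amounts to something like the PyChef sketch below. The node name and the exact recipe string are assumptions for illustration; the real implementation in Abbey also handles errors and reports what it changed.

    # Sketch of toggling the sensu::downtime recipe with PyChef.
    # The node name and recipe string are illustrative.
    from chef import autoconfigure, Node

    api = autoconfigure()  # picks up the local knife/chef configuration
    DOWNTIME_RECIPE = 'recipe[sensu::downtime]'

    def silence_sensu(node_name='sensu-server'):
        node = Node(node_name)
        if DOWNTIME_RECIPE not in node.run_list:
            node.run_list.append(DOWNTIME_RECIPE)
            node.save()

    def unsilence_sensu(node_name='sensu-server'):
        node = Node(node_name)
        if DOWNTIME_RECIPE in node.run_list:
            node.run_list.remove(DOWNTIME_RECIPE)
            node.save()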

Thursday, Feb 6, 2014

Speeding up Provisioning

One of the things that we often have to do on the GameChanger tech team, especially now that spring baseball season is approaching, is to bring up more server capacity in AWS to respond to higher traffic. We haven't yet been able to make use of Amazon's autoscaling feature that would handle this for us, so we've been bringing this extra capacity up and down largely by hand (albeit with extensive help from a script we've written to automate away most of the details). 

This process has always been rather slow, meaning that we are slower to respond when traffic starts to rapidly increase, so we started looking into how we could speed up the provisioning process.

On Chef

When we provisioned servers, we started with an essentially blank Amazon Machine Image (AMI, or the template from which EC2 instances are created): a basic Linux installation with no GameChanger code or configuration. All of the configuration and deployment was then done using Chef, starting from this base image. Because Chef was starting from a blank slate, it had to do everything, from creating the user accounts that exist on every server all the way up through the different services that run on our different types of servers (such as Apache on our web servers). This was naturally fairly time-consuming, taking on average 15-20 minutes to bring up one server.

The other problem with doing provisioning entirely with Chef was fragility. If any part of the Chef run failed, the provisioning would stop, leaving the server in a partway-provisioned state that wasn't able to handle any production traffic, and had to be fixed by an engineer who could diagnose the issue with Chef. An unfortunate side-effect of this was that external dependencies, such as software packages that get downloaded from third-party repositories, could block our provisioning process if they were unreachable.

Chef and AMIs

The great thing about Chef is that after it has set something up, the next time it runs it only has to verify that things have stayed in the correct state. If we've created a user account, we don't have to create it again if it's still there. If we've downloaded a package, we don't have to download it again. This means that subsequent Chef runs complete much more quickly than the initial run. 

So we decided to use this to our advantage by creating our own AMIs from servers that had already completed this initial Chef run. We started out by creating our base AMI, which contained only things that were common to every single server in our infrastructure. This consisted of things like user accounts, environment variables, system tools, and so on, with no application-specific code or settings. We were then able to use this newly-created GameChanger base AMI to provision new servers, which cut several minutes off the provisioning time. Now, initial Chef runs on those servers could breeze over those common parts and only spend significant time on the application-specific parts.

Getting More Specialized

We have several main roles that most of our servers fall into. We have groups of servers for the web frontend, the web and mobile APIs, and servers that do various backend data processing. Each server in a group is identical to any other server in the group, so we decided to leverage those commonalities to create even more specific AMIs. Starting with our newly created base image, we extended our AMI-creation script (because of course we aren't going to be doing this by hand!) to leverage our existing Chef roles to create a specialized AMI for each type of application server we use.

Because the AMI creation process essentially takes a snapshot of anything that is on the server when the image is created, we did have to be a bit careful with what got baked into these images. Specifically, we made sure that Chef-client wasn't in the image, to prevent it from getting started a second time on a new server made from this image, and we completely disabled Sensu (our monitoring service) when creating these images, both to prevent re-configuration issues with new servers, and because we don't want to monitor anything on servers that only exist to create AMIs for other servers.
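
For reference, the image-baking step itself is a thin wrapper around EC2's create_image call. The sketch below is illustrative only: the instance ID, role names, and naming scheme are placeholders, and the chef-client and Sensu cleanup is simplified relative to what the real script does.

    # Sketch of baking a role-specific AMI with boto 2.x.
    # Instance ID, role names, and the naming scheme are placeholders.
    import time
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    def bake_ami(builder_instance_id, role):
        # The builder instance has already completed its Chef run for this
        # role, had chef-client removed, and had Sensu disabled (see above).
        name = '{0}-{1}'.format(role, time.strftime('%Y%m%d-%H%M'))
        return conn.create_image(builder_instance_id, name,
                                 description='GameChanger %s image' % role)

    # e.g. bake_ami('i-0123abcd', 'web-frontend')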

The Results

With the exception of our workers (which have many, many queues used to process a variety of tasks), we can use these new customized AMIs to provision any of our servers in under 5 minutes. Overall, this came out to be a 70% improvement across the board. Why? Because almost everything that Chef needs to do has already been done, so it doesn't have to do those things again when it brings up a server from these images. And because our AMIs have all the packages we need already installed, they don't need to be reinstalled during provisioning of production servers, so we're much less dependent on external package repositories than we used to be.

Not only will this allow us to be much faster when we need to respond to increased traffic, resulting in a better experience for our customers, we will also be able to leverage these AMIs to make the shift to using Amazon's autoscaling at some point in the future. Here's looking forward to a busy spring season and a future post on autoscaling here!

Monday, Feb 3, 2014

The right amount of test coverage

Engineers often have poor intuition about what to unit test, so they fall into one of two camps: unit test everything or unit test nothing. Both are unhealthy extremes. Unit tests are important, but it shouldn’t be all or nothing. My principle for deciding what code should be tested is that the harder it is to detect bugs during manual regression testing, the more necessary it is to write automated unit tests.

I’ve been on both sides of this divide. When I worked at Vontu, automated test coverage was measured, and we sought to hit a high target, such as 80%. For a long time, I accepted this as the right way to work, but after a while I started to get the feeling that much of the time I was spending writing tests was wasted.

I also have worked at Amazon.com, which had no institutional policy regarding testing, and at which many teams did no automated testing. Yet Amazon’s availability is very high — for them, not testing everything is working.

My objection to some of the attitudes I’ve encountered is that there is often little logic or principle to them. They seem more based in world view than empiricism. Worse, they are often advocated by people who have always used one method.

I reject the idea that there is a known amount of test coverage you should always strive for. If you have zero customers, your code coverage has produced exactly zero business value. Knowing what should be unit tested will always rely somewhat on intuition, but we can still discuss principles that should guide your team. My philosophy is that test coverage is not valuable in and of itself. It’s a proxy for achieving quality, customer happiness, and business value, among other things.

At GameChanger, we built a sizable customer base with no unit testing. Now, I’m not recommending this approach. The opposite, in fact. But while shooting for 100% is better than shooting for 0%, it’s still a huge waste of time.

In the past few years, as our UI has become increasingly complex, we have shored up our gaps with automated tests (we use Kiwi to specify behavior), and we write unit and functional tests, in advance, for our new features if they exceed our threshold of needing them.

This threshold is something that many engineers inexperienced with testing struggle with. Test writing is not a part of any CS curriculum I know about, and while books and blogs are a decent way to get started, they only get you so far. Without guidance, you can waste a lot of time, and worse still, write tests that miss the point (e.g. you test that some library code is working, rather than the code you’ve written). It’s mentally easier to have a mandate to cover 80% of your codebase than it is to learn subtle things, but there is a lot of nuance to precisely what needs testing.

We have come to believe that the decision to spend the effort required to make code testable and write tests should be based on how hard it is to discover bugs, rather than on a default of always or never unit testing.

Here I am making the assumption that manual testing is a non-negotiable component of your release process. Even the most die-hard test-first adherents agree that you have to use your app to make sure it works. Details about our development and release cycle were published in a previous post about how to ship an app. In it, you can read about how much we value writing automated functional and unit tests in advance. However, we write those tests in anticipation of how much manual testing effort they save.

To delve into how much automated testing should be done, let’s break the cost of bugs into two types: the cost of being bitten, and the cost to discover.

The cost of being bitten by a bug is what happens to your business value when a bug emerges and affects your customers. A bug which prevents GameChanger users from creating teams is a total disaster, while mis-localizing a field as “postal code” vs. “zip code” is trivial.

The cost to discover is what you had to do to find the bug in testing. A bug that prevents creating a new team in our app is very easy to catch in manual testing, which has to be done no matter what. On the other hand, making code testable and writing tests represents a cost.

Cost to discover * cost of being bitten is roughly the equation which calculates the risk you take on when you code up new stories. So even when the cost of something going wrong is great, if the cost to discover is very low, the risk is low, and testing is a lower priority.
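
To make the heuristic concrete, here is a toy version with made-up scores on a 1-to-5 scale; the numbers are purely illustrative, and the shape of the comparison is the point.

    # Toy illustration of the risk heuristic; the scores are invented.
    def risk(cost_to_discover, cost_of_being_bitten):
        return cost_to_discover * cost_of_being_bitten

    # Cannot create a team: catastrophic if shipped, but trivial to spot manually.
    print(risk(1, 5))   # 5  -> automated tests are a lower priority
    # "zip code" vs. "postal code" label: harmless and easy to spot.
    print(risk(1, 1))   # 1  -> don't bother
    # Play-editor state combinations: painful to exercise by hand.
    print(risk(5, 3))   # 15 -> worth making testable and testing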

Example GameChanger Code

In order to make this whole thing concrete, I’m going to show you what I mean with the play editor from the GameChanger Basketball scoring app. I’ll describe our test coverage as it evolved over several releases of the editor to cover cases of made and missed shots being recorded.

Version 3.7

Pictured here is the app immediately after the court area of the screen has been tapped, indicating that a shot was made. On the right is the play editor, showing one play in the play history (made, by player #4), and a new shot, just entered.

The model which is built into the UIViews containing the made/missed button and player number button has only two possible states — shot.made == true or shot.made == false. Detecting an error in manual testing at this stage is trivial; it would not increase my confidence to write a test verifying that the correct PNG file gets displayed for each state.

Version 4.6

In this version, the model still only has shot.made true and false states, but those states lead to more variety in the views. We’ve added a “segment” (in our lingo, a discrete unit of visual info describing an aspect of a play) for rebounds, which can have either or neither team selected.

Making this code testable, and testing it, is still more effort than it’s worth.

Version 4.7

In this version of the app, the play editor moved to the bottom of the screen, and we added a state to the model, so now it has made, missed without rebound, and missed with rebound. It’s still easy to get the app into these states, and we continued to ship this code without unit testing it.

Version 4.8

In v4.8, the number of states exploded as we added support for assists and blocks in advanced scoring mode, while simple mode remained the same as in 4.7.

The play editor would display a segment for adding a block to missed shots and a segment for adding an assist to made shots. Pressing the “add block” or “add assist” button would replace that button with yet another segment, which let you specify the player number for the block or assist. The combinatorics of the values a play’s attributes can have meant lots of states to test.
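
A hypothetical enumeration shows why this kind of thing balloons; the attributes and values below are illustrative stand-ins rather than our actual play model, but the multiplication works the same way.

    # Illustrative state count for a play editor; the attributes are hypothetical.
    from itertools import product

    shot_result = ['made', 'missed']
    mode = ['simple', 'advanced']
    assist = ['none', 'assist', 'assist+player']   # only applies to made shots
    block = ['none', 'block', 'block+player']      # only applies to missed shots

    states = [s for s in product(shot_result, mode, assist, block)
              if not (s[0] == 'made' and s[3] != 'none')
              and not (s[0] == 'missed' and s[2] != 'none')]
    print(len(states))  # 12 -- and every new attribute multiplies the count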

At this point we added testing for this feature; the time to verify its behavior by manual testing would have inflated by ~10x! In retrospect, I’d have preferred that we added testing for v4.7, but no earlier than that.

Version 5.1 (current behavior as of this post)

In v5.1 we added support for 3-point and 4-point plays, but fouls can only be added in the editor for the most recent play. At this point there are three states for missed shots (simple, advanced with block, and advanced without block), and made shots have states for current or historical play, simple or advanced with or without blocks, assists, and fouls. Cost to discover bugs without automated testing has skyrocketed, and we have added a robust suite of tests which look like this.

    context(@"for a historical play segment with no timeouts before it", ^{
        beforeEach(^{
            [[manager stubAndReturn: theValue(NO)] isCurrentPlayManager];
            [[manager stubAndReturn: theValue(NO)] areAllSubsequentRowsTimeouts];
        });

        it(@"has a shot segment and an add assist segment", ^{
            [manager updatePlaySegments];
            [[manager should] havePlaySegmentTypes:
                @[@(GCPlaySegmentTypeShot),
                  @(GCPlaySegmentTypeAddAssist)]];
        });

        context(@"when an assist is added", ^{
            beforeEach(^{
                manager.event.assistAdded = YES;
            });

            it(@"has a shot segment and an assist segment", ^{
                [manager updatePlaySegments];
                [[manager should] havePlaySegmentTypes:
                    @[@(GCPlaySegmentTypeShot),
                      @(GCPlaySegmentTypeAssist)]];
            });
        });
    });

The entire file for our Play Editor specs is just shy of 1k lines, so obviously significant time was invested into writing these tests (and the custom matchers we wrote to increase clarity).

I’m trying to move the testing dialogue away from dogma. I think there are cases where automated testing really pays off, and cases where it doesn’t. When you write a test, you’re gambling that the time you take results in saved time and protects you from customer-impacting events, but it’s silly to act as if every gamble has the same odds.

What I’m trying to do is work out where we should place the threshold, and how we should talk about where to place it. Cost of bug discovery is a major component in my process.

I’d really like to hear what other people think; I suspect that either formally or informally, a lot of people who write unit tests are using thresholds, but probably not talking about them.