Sunday, 25 January 2015

My Dream Project - Part 2

System Design

In this continuation of the My Dream Project series of blog posts, I will discuss the general system design that was implemented. This will cover the high-level design patterns, code strategies and technology stack used. I hope that some of the decisions and reasoning here might help you with similar goals, as either a warning or an endorsement.

The platform needed to be highly available and able to scale to a large volume of concurrent requests, and it needed to ensure that users perceived the site as very quick to respond to actions and requests. To meet this challenge, a number of strategies were developed and deployed to production. Overall, they have had a good degree of success, and where there have been problems I have outlined them below.

Micro Services

One of the main problems we had with the existing code base was coupling. There was coupling at various levels between many different components, and it had already presented itself as a performance bottleneck a number of times, so it was something we definitely wanted to get away from. We chose Micro Services as an architectural style specifically to create concrete boundaries between components, with well-defined cross-cutting concerns as the only common dependency.

For more on micro services, you can read this fairly extensive article.

The client/server architecture we chose was for the client application to access those services directly where CORS (Cross-Origin Resource Sharing) was available. Where it was not available (older versions of IE), we used a simple reverse proxy on a single domain to route requests accordingly. As a first step this seemed to work really well, and there were clear development and deployment benefits to this approach. The main benefit was the ability to scale a single service independently of the others. Usage of the platform was not going to be evenly distributed between services; one of the reasons this architecture style was chosen was to avoid having to scale the whole platform by 100% in order to get an extra 10% of performance from a single part of it.

In hindsight, this was still the best choice, although accessing the services directly proved unmaintainable. There are two downsides: first, you need a custom subdomain for each service, which is a pain when you have 20-30 services; second, getting client performance reporting out is harder when your requests are spread across all those domains. If I needed to do this again, I would definitely start with a single domain that routes requests internally via a reverse proxy (nginx, for example) to keep it simpler. This is in fact a change that has already been made to the platform.
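As a sketch of what that single-domain routing can look like, here is a minimal nginx configuration that maps path prefixes to internal services. The service names, ports and paths are illustrative, not the platform's real ones:

```nginx
# One public domain; nginx fans requests out to the right service
# based on the path prefix. Upstream names are illustrative.
server {
    listen 80;
    server_name example.com;

    location /api/search/ {
        proxy_pass http://search-service.internal:8080/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /api/users/ {
        proxy_pass http://user-service.internal:8080/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

With this in place the client only ever talks to one origin, which side-steps both the CORS problem and the per-service subdomain sprawl.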


Varnish

Varnish is a great tool that I first used on this project. After an initial trial-and-error phase of playing with the service, it proved quite simple to move on to fairly advanced use cases. My initial reaction was that this would be a plug-and-play service, and to an extent that is exactly what you get. However, in order to fully utilise Varnish, you really need a fairly deep understanding of the client/server relationship. A lot of time was spent optimising the Varnish set-up (VCL) to get the most out of it. If you invest time here it will be worth it; a significant proportion of the performance requirement was met just by optimising our usage of Varnish.
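To give a flavour of the kind of VCL tuning involved, here is a minimal, illustrative VCL 4.0 snippet along the lines of what an API cache needs: only cache idempotent methods, strip cookies that would defeat caching, and apply a fallback TTL. The backend address and values are assumptions, not the platform's actual configuration:

```vcl
vcl 4.0;

backend api {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Only cache safe, idempotent methods; everything else goes
    # straight to the backend.
    if (req.method != "GET" && req.method != "HEAD") {
        return (pass);
    }
    # Strip cookies on API reads so responses are actually cacheable.
    unset req.http.Cookie;
}

sub vcl_backend_response {
    # Fall back to a short TTL when the service sends no Cache-Control.
    if (beresp.ttl <= 0s) {
        set beresp.ttl = 30s;
    }
}
```

Most of the real work is in deciding, per endpoint, which requests are safe to cache and for how long; the VCL itself stays small.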

One of the hidden downsides of Varnish is that hosting within Amazon Web Services (AWS) is a bit of a pain point. There is an incompatibility between the way Varnish and AWS handle connections between servers: Varnish resolves a backend's hostname to an IP address once, when the VCL is loaded, while AWS services sit behind DNS names whose IPs can change frequently. To address this issue, you can put Nginx between the two. There is an excellent article here which can be followed to resolve this problem.
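The usual shape of that workaround is a local nginx that re-resolves the AWS hostname at request time (using a variable in `proxy_pass` together with a `resolver` forces a per-request DNS lookup), with Varnish's backend pointed at it. The addresses and hostname below are illustrative:

```nginx
# Varnish's backend points at 127.0.0.1:8081; nginx re-resolves the
# AWS hostname on each request instead of pinning a single IP.
server {
    listen 127.0.0.1:8081;

    resolver 10.0.0.2 valid=30s;  # VPC DNS, with a short cache
    set $backend "internal-api.example.amazonaws.com";  # illustrative

    location / {
        proxy_pass http://$backend;
    }
}
```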

There was one other issue with using Varnish that was specific to our set-up. I have not yet been able to resolve it fully (I will update this post if I do) and I can't find any information online. If you have a number of Varnish servers behind an Elastic Load Balancer (ELB) then there appears to be an idle time-out problem: eventually, you will see intermittent 502 errors in your web browser that come and go.
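One thing worth checking, as an assumption rather than a confirmed fix, is an idle time-out mismatch: an ELB keeps idle connections open for 60 seconds by default, while Varnish's `timeout_idle` parameter closes idle client connections after 5 seconds, so the ELB can try to reuse a connection Varnish has already closed and surface that as a 502. Raising Varnish's idle time-out above the ELB's is one plausible mitigation:

```shell
# Keep idle client connections open longer than the ELB's 60s default,
# so the ELB never reuses a connection Varnish has already closed.
# Values are illustrative.
varnishd -a :6081 \
         -f /etc/varnish/default.vcl \
         -p timeout_idle=75
```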

Client optimisation

This is one aspect of the platform that was at the centre of our thinking from day one. The question was 'How do we ensure a fast and responsive application even if services are slow?'. Part of the answer came from using Micro Services: it meant that parts of the page might load at different speeds, but information would still be coming in. Although this was nowhere near perfect when the system was under pressure, it did mean the user could see things happening and was not left guessing about whether anything was happening at all.

The tactic we used initially was to have a dedicated service for serving our HTML pages; these pages would then initialise themselves and request the information they needed to render fully from the various APIs. Initially, this was a good tactic: we didn't know how long services would take to respond, and it meant that the service serving those HTML pages didn't require a code dependency on the rest of the system. A win-win.

In hindsight, I think I would still have done the same thing as time was tight, but pages are now being served with much more bootstrap information embedded. This reduces the 'loading indicator hell' the platform is currently suffering from. To get around dependencies, we use Edge Side Includes (ESI) to pull markup from various other Micro Services, so that Varnish fully generates the HTML of the page before serving it to the client. This creates a much nicer experience. Some parts of the page still load behind a loading indicator, but the majority is available immediately, and because of ESI the HTML is still served very quickly.
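As a sketch of the ESI side, assuming pages live under a `/pages/` prefix (an illustrative pattern, not the platform's real routing), the VCL only needs ESI processing switched on for the page service's responses:

```vcl
sub vcl_backend_response {
    # Process ESI tags in pages from the page service, so fragments
    # from other services are stitched in before the response leaves
    # the cache.
    if (bereq.url ~ "^/pages/") {
        set beresp.do_esi = true;
    }
}
```

The page service then emits placeholders such as `<esi:include src="/fragments/header"/>`, which Varnish replaces with the (independently cached) markup from the owning service before the page reaches the client.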

Technology Stack and Code

Within the Micro Service architecture, .NET (with the Web API framework) was used for developing the application, Elasticsearch to provide fuzzy searching, and MongoDB as the persistence layer. A big part of the development process was understanding and optimising each of the APIs. The platform uses a CQRS approach for POST/PUT/DELETE type requests, and uses Varnish caching for GET/HEAD requests. On top of that, the code itself was optimised and performance tested to stand up to significant load without the use of Varnish.

Looking at it now, I am not sure there is any specific advice I can give about the strategy. Generally, we looked at how a certain part of the system needed to respond to load, chose the right pattern to achieve that task, and then refactored to optimise that part of the API. This is perhaps fairly obvious, but it can take considerable time and several iterations to get just right.

If you are developing a system now, one piece of advice I would give is not to spend too much time optimising straight away. However, that doesn't mean you shouldn't implement an established performance strategy from the start. For example, if you're putting in a POST endpoint where, after an entity update, various other things need to happen, start with a CQRS approach and keep your code clean. That will get you 90% of the way; if you need more after that, look at it when you need to. Don't start with an API endpoint that calls 5-10 different services to do those tasks before returning. That strategy is just kicking the can down the road. As you develop your system, capture good practices from the current feature and implement them from the ground up in the next feature. Then find time to apply them to existing features.
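To make that concrete, here is a minimal sketch of the idea in Python (the platform itself is .NET; every name here is invented for illustration). The command handler performs the single entity update and publishes an event; side effects live in subscribers, which in production would run asynchronously off a queue rather than inline:

```python
from dataclasses import dataclass


@dataclass
class ProductRenamed:
    """Event describing what happened, not what should happen next."""
    product_id: str
    name: str


class EventBus:
    """Tiny in-process bus; a real system would use a message queue."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event):
        for handler in self._handlers.get(type(event), []):
            handler(event)


def rename_product(store, bus, product_id, name):
    # Command side: one entity update, then an event. No direct calls
    # to search, cache invalidation, email, etc. from the endpoint.
    store[product_id] = name
    bus.publish(ProductRenamed(product_id, name))


# A subscriber keeping a search index in step (a stand-in for a real
# side effect such as reindexing in Elasticsearch).
bus = EventBus()
search_index = {}
bus.subscribe(ProductRenamed,
              lambda e: search_index.update({e.product_id: e.name}))

store = {}
rename_product(store, bus, "p1", "Blue Widget")
print(search_index)  # {'p1': 'Blue Widget'}
```

The endpoint stays fast and simple because it only writes the entity and raises the event; adding a new side effect later means adding a subscriber, not touching the endpoint.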
