Sunday 25 January 2015

My Dream Project - Part 2

System Design

In this continuation of the My Dream Project series of blog posts, I will discuss the general system design that was implemented. This will cover the high level design patterns, code strategies and technology stack used. I hope that some of the decisions and reasoning here might help you with any similar goals, as either a warning or an endorsement.

The platform needed to be highly available and able to scale to a large volume of concurrent requests, while ensuring that users perceived the site as very quick to respond to actions and requests. To meet this challenge a number of strategies were developed and deployed to production. Overall, they have had a good degree of success, and where there have been problems I have outlined them below.

Micro Services

One of the main problems we had with the existing code base was coupling. There was coupling at various levels between many different components. This had already presented itself as a performance bottleneck a number of times and was something we definitely wanted to get away from. We chose Micro Services as an architecture style specifically to create concrete boundaries between components, and to make well-defined cross-cutting concerns the only common dependency.

For more on micro services, you can read this fairly extensive article http://martinfowler.com/articles/microservices.html.

The client/server architecture we chose was that the client application would access those services directly when CORS (Cross-Origin Resource Sharing) was available. When it was not available (older versions of IE), we used a simple reverse proxy on a single domain to route requests accordingly. For a first step this seemed to work really well, and there were clear development and deployment benefits to this approach. The main benefit was the ability to scale a single service independently of the others. Usage of the platform was not going to be evenly distributed between services, and one of the reasons this architecture style was chosen was to avoid having to scale the whole platform by 100% in order to get an extra 10% of performance from a single part of it.

In hindsight, this was still the best choice, however accessing the services directly was not maintainable. There are two downsides: you need a custom subdomain for each service, which is a pain when you have 20-30 services, and getting client performance reporting out is harder when requests are spread across all those domains. If this needed to be done again, I would definitely start with a single domain that routes requests internally via a reverse proxy (nginx, for example) to keep it simpler. This is in fact a change that has already been made to the platform.
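To give a flavour of what that single-domain routing looks like, here is a minimal nginx sketch. The service names, ports and paths are made up for the example; the real platform has many more routes, but the shape is the same: one public domain, with path prefixes forwarded to internal services.

    # Route path prefixes on one public domain to internal services.
    # Hostnames and ports below are hypothetical examples.
    server {
        listen 80;
        server_name app.example.com;

        location /api/users/ {
            proxy_pass http://users-service.internal:8080/;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }

        location /api/search/ {
            proxy_pass http://search-service.internal:8080/;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }

        # Everything else goes to the service that serves the HTML pages.
        location / {
            proxy_pass http://pages-service.internal:8080/;
            proxy_set_header Host $host;
        }
    }

Besides being simpler for clients, this also gives you a single place to collect request logs and timings, which solves the performance reporting problem mentioned above.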

Varnish

Varnish is a great tool that I first used on this project; after the initial trial and error phase of playing with the service it proved quite simple to move on to fairly advanced use cases. My initial reaction was that this would be a plug and play service, and to an extent that is exactly what you get. However, in order to fully utilise Varnish, you really need a fairly deep understanding of the client/server relationship. A lot of time was spent optimising the Varnish setup (VCL) to get the most out of it. If you invest time here it will be worth it: a significant proportion of the performance requirement was met just by optimising our usage of Varnish.
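As an illustration of the kind of tuning involved, here is a small sketch in VCL 4.0 syntax. It is not our production configuration; the paths, TTLs and backend address are invented for the example, but these are the sort of rules that paid off: stripping cookies from requests that do not need them so responses become cacheable, and giving API responses an explicit TTL and grace period.

    vcl 4.0;

    backend default {
        .host = "127.0.0.1";   # example backend address
        .port = "8080";
    }

    sub vcl_recv {
        # Cookies prevent caching; drop them for anonymous GET requests
        # to public API paths (illustrative path only).
        if (req.method == "GET" && req.url ~ "^/api/public/") {
            unset req.http.Cookie;
        }
    }

    sub vcl_backend_response {
        # Give cacheable API responses a short, explicit TTL and allow
        # serving slightly stale content while a fresh copy is fetched.
        if (bereq.url ~ "^/api/public/") {
            set beresp.ttl = 1m;
            set beresp.grace = 30s;
        }
    }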

One of the hidden downsides of Varnish is that hosting within Amazon Web Services (AWS) is a bit of a pain point. There is an incompatibility between the way Varnish and AWS handle connections between servers: Varnish resolves a backend to an IP address and sticks with it, while AWS relies on DNS precisely so that IPs can change frequently. To address this issue, you can use nginx. There is an excellent article here http://blog.domenech.org/2013/09/using-varnish-proxy-cache-with-amazon-web-services-elastic-load-balancer-elb.html which can be followed to resolve this problem.
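The essence of the workaround is to put nginx between Varnish and the ELB, because nginx can be made to re-resolve the ELB's DNS name rather than pinning a single IP. A rough sketch of that piece, with a placeholder ELB hostname and resolver address:

    # Varnish uses 127.0.0.1:8000 as its backend; nginx forwards to the ELB.
    # Using a variable in proxy_pass forces nginx to re-resolve the DNS name
    # via the resolver (honouring the TTL) instead of resolving it once at startup.
    server {
        listen 127.0.0.1:8000;

        location / {
            resolver 10.0.0.2 valid=10s;   # example: the VPC DNS resolver
            set $elb "my-backend-elb.eu-west-1.elb.amazonaws.com";  # placeholder
            proxy_pass http://$elb;
            proxy_set_header Host $host;
        }
    }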

There was one other issue with using Varnish that was specific to our set up. I have not yet been able to resolve it fully (I will update this if I do) and I can't find any information online. If you have a number of Varnish servers behind an Elastic Load Balancer (ELB) then there appears to be an idle timeout problem: eventually you will see intermittent 502 errors in the browser that come and go.

Client optimisation

This is one aspect of the platform that was at the centre of our thinking from day one. The question was 'How do we ensure a fast and responsive application even if services are slow?'. Part of the answer came from using Micro Services: it meant that parts of the page might load at different speeds, but information would still be coming in. Although this was nowhere near perfect when the system was under pressure, it did mean the user could see things happening and was not left guessing about whether anything was happening at all.

The tactic we used initially was to have a dedicated service for serving our HTML pages; these pages would then initialise themselves and request the information they needed to render fully from the various APIs. Initially, this was a good tactic: we didn't know how long services would take to respond, and it meant that the service serving those HTML pages didn't require a code dependency on the rest of the system. A win-win.

In hindsight, I think I would have still done the same thing as time was tight, but now pages are being served with much more bootstrap information embedded. This reduces the 'loading indicator hell' the platform is currently suffering from. To get around dependencies, we use Edge Side Includes (ESI) to pull markup from various other Micro Services in order to fully generate the HTML of the page with Varnish before serving it to the client. This creates a much nicer experience. Some parts of the page still load behind a loading indicator, but the majority is available immediately, and because of ESI the HTML is still served very quickly.
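For anyone unfamiliar with ESI, the idea is simple: the page service emits placeholder tags, and Varnish replaces them with markup fetched from the other services before the response leaves the cache. A minimal sketch, with fragment paths invented for the example:

    <!-- Served by the page service; Varnish fills in the fragments. -->
    <html>
      <body>
        <esi:include src="/fragments/header" />
        <div id="content">
          <esi:include src="/fragments/product-list" />
        </div>
      </body>
    </html>

On the Varnish side you then need to tell it which responses to parse for ESI tags, for example (VCL 4.0 syntax, illustrative path only):

    sub vcl_backend_response {
        # Only parse ESI tags on responses from the page service.
        if (bereq.url ~ "^/pages/") {
            set beresp.do_esi = true;
        }
    }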

Technology Stack and Code

Within the Micro Service architecture, .NET (with the WebAPI framework) was used for developing the application, Elasticsearch to provide fuzzy searching and MongoDB as the persistence layer. A big part of the development process was understanding and optimising each of the APIs. The platform uses a CQRS approach for POST/PUT/DELETE type requests, and Varnish caching for GET/HEAD requests. On top of that, the code itself was optimised and performance tested to stand up to significant load without the use of Varnish.

Looking at it now, I am not sure if there is any specific advice I could give about the strategy. Generally, we looked at how a certain part of the system needed to respond to load, chose the right pattern to achieve that task and then refactored to optimise that part of the API. This is perhaps fairly obvious, but it can take considerable time and several iterations to get just right.

If you are developing a system now, some advice I would give you is that you don't want to spend too much time optimising straight away. However, that doesn't mean you shouldn't implement an established performance strategy from the start. For example, if you're putting in a POST endpoint where, after an entity update, various other things need to happen, start with a CQRS approach and keep your code clean (see the sketch below). That will get you 90% of the way; if you need more after that, look at it when you need to. Don't start with an API endpoint that calls 5-10 different services to do those tasks before returning. That strategy is just kicking the can down the road. As you develop your system, capture good practices from the current feature and implement them from the ground up in the next feature. Then find time to apply them to existing features.
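As a concrete (and heavily simplified) illustration of that advice, here is roughly the shape of a WebAPI POST endpoint under a CQRS approach. The command, bus and controller names are hypothetical, invented for this sketch; the point is that the endpoint validates, records the command and returns, while the downstream effects run in their own handlers rather than inline in the request.

    using System.Net;
    using System.Web.Http;

    // Hypothetical command describing the state change we want to make.
    public class UpdateProfileCommand
    {
        public string UserId { get; set; }
        public string DisplayName { get; set; }
    }

    // Hypothetical abstraction: handlers for search indexing, notifications
    // and so on subscribe to commands and run asynchronously, outside the request.
    public interface ICommandBus
    {
        void Send<TCommand>(TCommand command);
    }

    public class ProfileController : ApiController
    {
        private readonly ICommandBus _commandBus;

        public ProfileController(ICommandBus commandBus)
        {
            _commandBus = commandBus;
        }

        // POST /api/profile
        public IHttpActionResult Post(UpdateProfileCommand command)
        {
            if (!ModelState.IsValid)
            {
                return BadRequest(ModelState);
            }

            // Hand the command off; the other 5-10 services are not called here.
            _commandBus.Send(command);

            // 202 Accepted: the write has been accepted for processing.
            return StatusCode(HttpStatusCode.Accepted);
        }
    }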

Friday 2 January 2015

My Dream Project - Part 1

My Dream Project

Disclaimer: I want to keep this focused on a development/design point of view, so I have explicitly made the decision not to disclose details of the resulting project or the products used.

In 2014 I got to work on what I personally considered to be a dream project of huge significance. There was a tight deadline, and a lot to do to integrate a number of existing products into a web platform that would be exposed to a very large audience. The dream here was that it was a greenfield project to replicate an existing feature set. The scale of the platform required pretty much everything except the products' feature set (and even that was improved/changed) to be re-designed from the ground up.

I wanted to take some time to record my thoughts about the project with the benefit of hindsight. So this is what I learned, why I had to learn it and whether what I did worked. There is a lot to cover, so I am going to do this through a series of posts over the coming weeks/months. This post will start with an introduction to the scope of the task.

The Task

The task was to build a publicly accessible version of my company's products. Up until this point, the products had operated completely privately. The visibility of the platform was quite high and it would be receiving press coverage. It had a large pre-existing user base, and the environment it needed to operate in was much more active than what I had been used to.

The web platform that was being built required us to run with the expectation that we would be seeing approximately 200,000 times more concurrent logged-in and active users than any existing system. These users would expect sub-1-second response times, and this would need to be maintained 24/7, 365 days a year.

I did a lot of research into various technologies, stack choices, infrastructure options and general coding practices. I read a lot about resilience, scalability and performance optimisation for web applications; the problem I found was that not a lot of it actually says specifically and explicitly how that is achieved.

The reason for this, I found - and perhaps you already knew - is that really it is dependent on your stack, your practices, your resources and most importantly your product. There are a variety of options for any given problem, each with their own pros and cons. It's up to you as the developer or system architect to know your tool set, and then select the right tool for the job. So the reason it is never spelled out explicitly is that whatever person A did likely isn't going to work as well - or at all - for person B.

At the time, the products were deployed to a number of web platforms running in various environments, but no single web platform had anything close to even half that kind of activity. This was down to a number of factors. The user environment meant people would interact with the product infrequently during the course of the day. Only a portion of the entire user base would be logged into the platform at any one time. When they were logged in, their activity had long periods of silence as they consumed content. Overall this meant that our platforms were not getting 'pounded' by heavy, consistent usage.

The platform that needed to be built and operated was expecting a much higher volume of constantly active users who never went 'silent'. I remember thinking at the time what a leap that was going to be, and how, as the system architect, I was going to explain the gravity of that task to non-technical people. In the end that part was fairly easy.