from: https://www.youtube.com/watch?v=2WuT2rdLK5A

Advice for any kind of technology.

What is architecture? The dictionary says: "conceptual structure and logical organization of a computer or a computer-based system". Architecture encompasses: space, power, cooling (and more, like legal requirements); hardware and networking (servers, switches, routers, firewalls...); and finally the dynamic applications. The architect focuses on the application to be built, because it does not exist yet.

Not all people do all things. Nobody is aware of everything, and it is hard to do many things well. Isolated decisions can create massive problems: they can lead to unreasonable requirements, stupid decisions and catastrophic failures.

Running operations is serious stuff and requires four things: knowledge (what you learn or have learned); tools (what you use); experience; discipline.

Knowledge is created by reading, studying, training, user groups and participating in one or more communities.

Tools are found by collaborating (with colleagues or the community), by trying new tools, by writing new ones, and finally by practicing during the good times and the bad (using them should be effortless during the latter). You must be able to survive with simple tools (like duct tape), then fix things properly with better tools afterwards. Tools help maintain discipline.

Experience comes from mistakes and risks taken. Those risks should be taken carefully. People who have not learned from their mistakes are not useful. Make mistakes so you can learn from them and not repeat them. You need a way to make errors safely, without impacting production.

Discipline is important in any job and comes from training, study and practice. People want fame and glory; they are cowboys, and discipline is a missing ingredient in our field. A job is not an art. Through practice, you should achieve excellence.

All information (configuration, documentation, schemas...) must be stored in version control. Textual files are easy to deal with; binaries and running software are not. It doesn't matter which tool you use.
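The "text is easy, binaries are hard" point can be illustrated with a small heuristic (my illustration, not from the talk): version-control diffs only work well on files that look like text.

```python
def is_probably_text(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic: treat data as text if a sample contains no NUL bytes
    and decodes as UTF-8. Version-control diffs work well for such files;
    binaries mostly just get stored as opaque blobs."""
    sample = data[:sample_size]
    if b"\x00" in sample:
        return False
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

This is roughly the check tools like `git diff` use to decide whether to show a textual diff at all.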

You must know your systems: use monitoring, collect as much data as possible from systems and processes, look at your systems and run diagnostic tools while things are healthy. You need historical data to compare against.
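Why collect data while things are healthy? Because anomaly detection needs a baseline. A minimal sketch (an illustration, not from the talk) of comparing a current metric against its history:

```python
from statistics import mean, stdev

def is_anomalous(current: float, history: list, threshold: float = 3.0) -> bool:
    """Flag a metric as anomalous when it sits more than `threshold`
    standard deviations from its historical mean. Without data collected
    during healthy operation, there is no baseline to compare against."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold
```

Real monitoring systems use more robust statistics, but the principle is the same: historical information is what makes today's numbers meaningful.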

Management (configuration and provisioning) is done with the tools the craftsman makes or prefers. The version does not matter; it must do the job.

Static content can be served from Akamai (https://www.akamai.com/) or a competitor (https://alternative.me/akamai). Another option is to build our own service with our own optimizations, but there is no real value in creating yet another competitor to existing providers.

Availability, two methods. The "white paper" approach: users reach an HA/LB layer, which forwards requests to multiple available web servers; this method is costly. The other way is peer-based HA: users access the data web server directly, and HA is handled in the back, at the data layer.

An example stack: a web server holding the static content; a reverse proxy-cache in front (Varnish, Apache, Nginx...); redundancy on the IP (VRRP, CARP, wackamole); it must be simple, easy and scalable. Set up the same thing in multiple datacenters.
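The reverse proxy-cache in that stack exists to absorb repeat requests before they reach the backend. A toy in-memory sketch of the idea (an illustration of the TTL behavior, not how Varnish is implemented):

```python
import time

class TTLCache:
    """A toy version of what a reverse proxy-cache (Varnish, nginx...)
    does: serve a stored copy until it expires, and only then hit the
    origin again."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.store = {}          # url -> (expiry_timestamp, body)
        self.origin_hits = 0     # how often we had to bother the backend

    def fetch(self, url: str, origin) -> str:
        entry = self.store.get(url)
        now = time.monotonic()
        if entry and entry[0] > now:
            return entry[1]      # cache hit: origin is not touched
        body = origin(url)       # cache miss: go to the backend
        self.origin_hits += 1
        self.store[url] = (now + self.ttl, body)
        return body
```

Even this naive version shows the payoff: a popular URL costs the backend one request per TTL window, no matter how many clients ask for it.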

How to get the content closer to the user? Put a DNS server at each location, all answering on the same IP address, and announce that network from all data centers using BGP.

IDEA: it could be really interesting to build an intelligent DNS server that answers based on the location of the requester, derived from its IP address; that information is easy to find via Whois or other services on the web. Doing this adds a layer of complexity on top of DNS, but gives easy control over where each user is sent. To make it fast, we would probably also need a way to talk to IP-location services to know exactly where a request comes from. Is it a good idea to add more complexity to DNS, an already complex service? It could solve the problem for people who can't manage BGP; it is a kind of big quick-and-dirty fix. BGP is complex and requires a lot of knowledge, so avoiding it can be a good solution for "simple" setups. To be clear: DNS is used by everybody around the world, while BGP is only used by a few people (and is, in practice, a kind of hidden service).
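A minimal sketch of such a location-aware resolver. Everything here (the prefixes, datacenter names and addresses) is hypothetical, and real implementations use local GeoIP databases rather than per-query Whois lookups, precisely because of the speed concern raised above:

```python
import ipaddress

# Hypothetical prefix-to-datacenter table (documentation address ranges).
PREFIX_TO_DC = {
    ipaddress.ip_network("192.0.2.0/24"): "dc-eu",
    ipaddress.ip_network("198.51.100.0/24"): "dc-us",
}
DC_TO_IP = {"dc-eu": "203.0.113.10", "dc-us": "203.0.113.20"}
DEFAULT_DC = "dc-us"

def resolve(client_ip: str) -> str:
    """Answer a DNS query with the address of the datacenter 'closest'
    to the requester, judged by its source prefix."""
    addr = ipaddress.ip_address(client_ip)
    for net, dc in PREFIX_TO_DC.items():
        if addr in net:
            return DC_TO_IP[dc]
    return DC_TO_IP[DEFAULT_DC]
```

The DNS protocol itself is untouched; only the answer selection becomes location-aware, which is exactly the extra complexity the note is weighing against BGP anycast.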

Dynamic content: telling which optimizations are premature is hard; few people (arguably no one) can do it easily. Don't do work you don't have to. Don't pay to generate the same content twice; generate only what changed. Break the system into two parts, one that changes rapidly and one that changes infrequently; this isolates the costs.
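One way to read "generate only what changed": version each page fragment and only re-render a fragment when its version moves. A minimal sketch (an illustration under that reading, not code from the talk):

```python
class Fragment:
    """Cache a rendered fragment keyed by a version number; re-render
    only when the version changes, then assemble pages from fragments.
    Slow-changing parts (header, footer) are almost never re-rendered."""
    def __init__(self, render):
        self.render = render         # function: version -> html
        self.cached_version = None
        self.cached_html = None
        self.renders = 0             # counts the expensive work done

    def get(self, version) -> str:
        if version != self.cached_version:
            self.cached_html = self.render(version)
            self.cached_version = version
            self.renders += 1
        return self.cached_html
```

Splitting the page this way is the "two parts" idea: the fast-changing fragment pays for its own regeneration without dragging the stable fragments along.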

Caching (see memcache) is a really interesting thing: you can use it to reduce the load on the database, but also the volume of data served to users. This moves the bottleneck to the web server (Apache in our case), but it is easy to deploy many web servers. [NOTE: all examples assume Apache/PHP]
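The usual way memcache reduces database load is the cache-aside pattern: check the cache first, fall back to the database on a miss, invalidate on writes. A sketch with a plain dict standing in for memcached (production code would use a real client library):

```python
class CacheAside:
    """Cache-aside, the common memcache pattern. A dict stands in for
    a real memcached server here."""
    def __init__(self, db: dict):
        self.db = db
        self.cache = {}
        self.db_reads = 0   # counts the expensive trips to the database

    def get(self, key):
        if key in self.cache:
            return self.cache[key]     # served without touching the DB
        value = self.db.get(key)       # cache miss: hit the database
        self.db_reads += 1
        self.cache[key] = value
        return value

    def set(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)      # invalidate so readers see fresh data
```

Note the trade-off the paragraph describes: reads now cost CPU and memory on the web tier instead of queries on the database, and web servers are the easy tier to multiply.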

Databases scale vertically; fragmenting the data means throwing away relational constraints. If you don't need relationships between your data, you can use alternatives like files, CouchDB or cookies. These stores are easier to scale. Do only what is absolutely necessary in the database. Build the thing, see where the pain points are, and fix them afterwards; that is the best way to create something that works.
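The "files" alternative really is this simple when there are no relationships: one document per key, trivially sharded across machines. A sketch (illustration only; CouchDB and friends add replication, indexing and conflict handling on top of this idea):

```python
import json
from pathlib import Path

class FileStore:
    """Relationship-free storage: one JSON file per key. No joins, no
    constraints, so keys can be spread across disks or machines freely."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, doc: dict):
        (self.root / (key + ".json")).write_text(json.dumps(doc))

    def get(self, key: str) -> dict:
        return json.loads((self.root / (key + ".json")).read_text())
```

Which is also why the advice is "do only what is absolutely necessary in the database": everything that fits this model can leave it.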

The network is part of the architecture; people tend to forget it. Firewall state and load-balancing algorithms matter. You must understand the bottleneck. To manage it you can use routing, with algorithms like OSPF (source-based/hash routing). That adds fault tolerance, distributes network load, and it's free. To be clear, the same techniques protect against DDoS attacks.
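The core of source-hash routing over equal-cost paths can be sketched in a few lines (link names are hypothetical; real routers do this in hardware on packet headers):

```python
import hashlib

LINKS = ["uplink-a", "uplink-b", "uplink-c"]  # hypothetical equal-cost paths

def route(src_ip: str, links=LINKS) -> str:
    """Source-hash routing: hash the source address onto one of several
    equal-cost links. The same source always takes the same path (so
    stateful firewalls keep seeing the whole flow), while different
    sources spread the load across links."""
    digest = hashlib.sha256(src_ip.encode()).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]
```

The determinism is the point: round-robin would balance load too, but would scatter one flow's packets across paths and break the firewall state the paragraph warns about.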

Service decoupling is the most overlooked technique for building scalable systems. Break the user transaction into parts, isolate the asynchronous ones, queue the information needed to complete the task, and process the queues (see Erlang). It's called messaging; you can use JMS (https://en.wikipedia.org/wiki/Java_Message_Service), Spread (http://www.spread.org/) or AMQP (https://www.amqp.org/). The typical use-case requires a message queue and a job dispatcher.
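The queue-plus-dispatcher shape can be sketched with an in-process queue and a worker thread (a stand-in for a real broker like AMQP; the `None` sentinel and the job names are my own illustration):

```python
import queue
import threading

def dispatcher(jobs, results):
    """Worker loop: pull messages off the queue and process them until
    a None sentinel arrives. The user transaction only has to enqueue;
    the slow work happens asynchronously."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append("done:" + job)   # stand-in for the real work
        jobs.task_done()

jobs = queue.Queue()
results = []
worker = threading.Thread(target=dispatcher, args=(jobs, results))
worker.start()
for task in ["resize-image", "send-email"]:
    jobs.put(task)        # the synchronous part of the transaction ends here
jobs.put(None)            # sentinel: no more work
worker.join()
```

With a real broker, the producer and the dispatcher live in different processes or machines, which is what actually decouples the services.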

Scaling is hard, performance is easier. Extremely high performance systems tend to be easier to scale because they don't have to scale as much!