Scalability as an Architectural Concern (Part 3)

This article is part of a series that provides practical advice and guidance on how to leverage the Continuous Architecture (CA) approach. “Scalability Part 1” provides a definition of scalability, discusses its importance, its relationship with other quality attributes and the forces affecting it. Part 2 covers what has changed, the types of scalability, and the impact of cloud computing. This article is Part 3 and is the third and final article in this “Scalability as an Architectural Concern” series and discusses the architectural tactics available to deal with scalability requirements. Please note that we are just presenting a summary of scalability tactics here, due to space limitations. A full discussion of these tactics is included in Chapter 5 of the “Continuous Architecture in Practice” book.

Database Scalability

Databases are often the hardest component to scale in a software system. Software architects and engineers are often worried about the ability of their databases to cope with large workloads, especially if those workloads exceed estimates. Common tactics to address this concern include partitioning data across several loosely coupled databases and keeping data sharing and replication needs to a minimum.

Assessing the upper scalability limit of an architecture as early as possible can also be very helpful. An effective approach for doing this is to leverage CA Principle 5, Architect for Build, Test, Deploy, and Operate, and start testing for scalability as early in the development cycle as possible. One option would be to design and build a simplified, bare-bones prototype that implements only a few key transactions, and to run several stress tests, using simulations of their expected transaction mix.

What if, as the test workload increases, some of the databases become bottlenecks and affect the performance of the overall platform? Optimizing queries and reconfiguring the services for more capacity at increased cost, in order to facilitate vertical scaling, may not suffice as the test workloads are increased further. In addition, it may be hard to optimize a database design for both the updates and queries that it needs to support. Given this challenge, one option would be to use a separate database to process queries related to reporting requirements. This database would ingest transactional data and store it in a format optimized to handle the reporting requirements. As the transactional database is updated, the updates would be replicated to the reporting database using the DBMS replication mechanism process. Using this mechanism would reduce the propagation latency because, in most cases, it would be shipping database logs rather than processing events. This is very important for consistency and scalability.  

The data eventually may need to be distributed to cope with volumes beyond what is expected in the future An option to handle additional volumes would be to clone the reporting database and run each instance on a separate node without any major changes to the architecture. However this architectural tactic may not be enough to cope with even greater workload increases.

Data Distribution, Replication, and Partitioning

The next step could be to implement the approach of cloning the compute servers and sharing the databases described under “horizontal scalability” in the “Scalability as an Architectural Concern – Part 2” article. This would involve cloning services that may experience scalability issues, and running them in separate containers.  All of the database updates would be done on one of the instances (master database), and those updates would be replicated to the other databases using the DBMS replication process.

Alternatively, the data could be partitioned for specific services. This would involve splitting the database rows using some criteria, for example, user group. However, using table partitioning increases architectural complexity. Database design changes become more complicated when this capability is used. It is a good idea to use database partitioning only when all other options for scaling the database have been exhausted.

For software systems running in the public cloud, software architects and engineers would probably opt for a managed cloud-based alternative should they need to implement data partitioning or even sharding. This approach would, of course, increase the cost of implementing the solution as well as its ongoing operating cost and possibly a large migration cost to a different database model.

Architectural decisions for scalability have a significant impact on deployment costs. When selecting an architecture to increase scalability, we may need to make a tradeoff with other quality attributes, such as cost in this case. We recommend to follow CA Principle 3, Delay design decisions until they are absolutely necessary, and to consider deferring table partitioning or sharding until there are no other options to deal with scalability issues.

Caching for Scalability

Caching is a powerful technique for solving some performance and scalability issues. It can be thought of as a method of saving results of a query or calculation for later reuse. This technique has a few tradeoffs, including more complicated failure modes, as well as the need to implement a cache invalidation process to ensure that obsolete data is either updated or removed. There are many caching technologies and tools available, and covering all of them is beyond the scope of this article. Here are four common caching techniques:

  • Database object cache: This technique is used to fetch the results of a database query and store them in memory. A database object cache could be implemented using a caching tool such as Redis or Memcached. In addition to providing read access to the data, caching utilities provide their clients with the ability to update data in cache. Implementing a simple data access API could be used to isolate a service from the database object caching tool. This API would first check the cache when data is requested and access the database (and update the cache accordingly) if the requested data isn’t already in cache. It also would ensure that the database is updated when data is updated in cache.
  • Application object cache: This technique stores the results of a service, which uses heavy computational resources, in cache for later retrieval by a client process. This technique is useful because, by caching calculated results, it prevents application servers from having to recalculate the same data.
  • Proxy cache: This technique is used to cache retrieved Web pages on a proxy server so that they can be quickly accessed next time they are requested, whether by the same user or by a different user. Implementing it involves some changes to the configuration of the proxy servers and does not require any changes to the software system code.
  • Precompute cache: This technique stores the results of complex queries on a database node for later retrieval by a client process. For example, a complex calculation using a daily currency exchange rate could benefit from this technique. In addition, query results can be cached in the database object cache if necessary. Precompute cache is different from standard database caching provided by DBMS engines. Materialized views, which are supported by all traditional SQL databases, are a type of precompute cache. Precomputing via database triggers is another example of precompute cache.

Two additional forms of caching may already be present in your system environment. The first one is a static (browser) cache, used when a browser requests a resource, such as a document. The Web server requests the resource from the software system and provides it to the browser. The browser uses its local copy for successive requests for the same resource rather than retrieving it from the Web server. The second one is a content delivery network (CDN) that could provide a cache for static JavaScript code and can be used for caching other static resources in case of a Web tier scalability problem. The benefits of a CDN include caching static content close to the customers regardless of their geographical location.

Using Asynchronous Communications for Scalability

Older software system architectures generally assume that most inter-service interactions are synchronous, which means that for these interactions, code execution will block (or wait) for the request to return a response before continuing, unlike asynchronous inter-service interactions. However, retrieving the results of the service execution for asynchronous inter-service interactions involves extra design and implementation work. In a nutshell, synchronous interactions are simpler and less costly to design and implement than asynchronous ones, but introduce a scalability risk. Since synchronous interactions stop the execution of the requesting process until the software component being invoked completes its execution, they may create bottlenecks as the workload increases. In a synchronous interaction model, there is only one request in flight at a time. In an asynchronous model, there may be more than one request in flight. An asynchronous approach improves scalability if the request-to-response time is long, and there is adequate request processing throughput (requests per second). It can’t help if there is inadequate request processing throughput, although it may help smooth over spikes in request volume. It may become necessary to switch from synchronous interactions to asynchronous ones for software system components that are likely to experience resource contentions as workloads increase.

To avoid a major rework of the application, should this switch become necessary, a good option would be to use a standard library for all inter-service interactions. Initially, this library would implement synchronous interactions, perhaps using the REST architectural style. However, it could be switched to an asynchronous mode for selected inter-service communications should the need arise. To mitigate concurrency issues that may be introduced by this switch, the library could use a message bus for asynchronous requests. In addition, the library could implement a standard interface monitoring approach in order to make sure that inter-service communication bottlenecks are quickly identified. Any message processing logic would be implemented in the services that use the message bus. This approach would decrease the dependency of services on the message bus and enable the use of a simpler communication bus between services.

Additional Application Architecture Considerations

Presenting an exhaustive list of application architecture and software engineering considerations to develop scalable software is beyond the scope of this article. However, the following are a few additional guidelines for ensuring scalability.

Stateless and Stateful Services

Stateless services are usually considered scalable, whereas stateful services are not. But what do we mean by stateless and stateful? Simply put, a stateful service is a service that needs additional data (usually the data from a previous request) besides the data provided with the current request in order to successfully execute that request. That additional data is referred to as the state of the service. User session data, including user information and permissions, is an example of state. There are three places where state can be maintained: in the client (e.g., cookies), in the service instance, or outside the instance. Only the second one is stateful. A stateless service does not need any additional data beyond what is provided with the request, usually referred to as the request payload.

Stateful services need to store their state in the memory of the server they are running on. This isn’t a major issue if vertical scalability is used to handle higher workloads, as the stateful service would execute successive requests on the same server, although memory usage on that server may become a concern as workloads increase. However, cloud based applications use horizontal scalability as their preferred way of scaling. Using this approach, a service instance may be assigned to a different server to process a new request, as determined by the load balancer. This would cause a stateful service instance processing the request not to be able to access the state and therefore not to be able to execute correctly. One possible remedy would be to ensure that requests to stateful services retain the same service instance between requests regardless of the server load. This could be acceptable if an application includes a few seldom-used stateful services, but it may create issues as workloads increase. Another potential remedy would be to use a variant of the first approach for implementing horizontal scalability (see ““Scalability as an Architectural Concern – Part 2”). An option would be to assign a user to a resource instance for the duration of a session based on some criteria, such as the user identification, thereby ensuring that all stateful services invoked during that session execute on the same resource instance. Unfortunately, this could lead to overutilization of some resource instances and underutilization of others, because traffic assignment to resource instances would not be based on their load. It would clearly be better to design and implement stateless services rather than stateful ones, if possible. Stateful exceptions should be carefully reviewed and approved.

Microservices and Serverless Scalability

Microservices evolved from the SOA approach and use simpler integration constructs such as RESTful APIs. The term microservice may be misleading, as it implies that those components should be very small. Among the microservices characteristics, size has turned out to be much less important than loose coupling, stateless design, and doing a few things well. Leveraging a Domain-Driven Design approach, including applying the bounded context pattern to determine the service boundaries, is a good way to organize microservices.

Microservices communicate with each other using a lightweight protocol such as HTTP for synchronous communications or a simple message or event bus that follows the “smart endpoints and dumb pipes” pattern for asynchronous ones. A microservices-based architecture can fairly easily be scaled using any of the three techniques described in our discussion of horizontal scalability.

What about serverless computing, also known as “Function as a Service (FaaS)?” FaaS can be thought of as a cloud-based model in which the cloud provider manages the computing environment that runs customer-written functions as well as manages the allocation of resources. Serverless functions (called lambdas [λ] on Amazon’s AWS cloud) are attractive because of their ability to autoscale, both up and down, and their pay-per-use pricing model. Using serverless computing, software engineers can ignore infrastructure concerns such as provisioning resources, maintaining the software, and operating software applications, and focus on developing application software. Serverless functions can be mixed with more traditional components, such as microservices, or the entire software system can be composed of serverless functions.

Serverless functions do not exist in isolation. Their use depends on using a number of components provided by the cloud vendor, collectively referred to as the serverless architecture. Applications that leverage a serverless architecture provided by a cloud vendor may not be easily ported to another cloud.

From an architectural perspective, serverless functions should have as few responsibilities as possible (one is best) and should be loosely coupled as well as stateless. Serverless functions are event based, although they can also be invoked using a request/reply model through an API gateway. A common mode of operation is to be triggered by a database event, such as a database update, or by an event published on an event bus. Serverless functions tend to be dependent on other services, and these services need to scale as well as they do. In addition, the serverless event-driven model may create design challenges for architects who are not used to this model. If not careful, utilizing serverless functions can create a modern spaghetti architecture with complex and unmanageable dependencies between functions. On the other hand, a serverless architecture can react more rapidly to unexpected workload spikes, as there are no delays in provisioning additional infrastructure.

Measuring Scalability

Monitoring is a fundamental aspect of a software system that is needed to ensure that it is scalable. It is hard to predict exactly how large a transaction volume a software system will have to handle. Determining a realistic transaction mix for that workload may be an even bigger challenge. As a result, it could be difficult o load and stress test a complex software system ln a realistic manner. This does not mean that load and stress testing are not useful or should not be used, especially if an effort has been made to document scalability requirements as accurately as possible based on business estimates. However, testing should not be the only way to ensure that a software system will be able to handle high workloads

Effective logging, metrics, and the associated automation are critical components of a monitoring architecture. They enable the architect to make data-driven scalability decisions when required and to cope with unexpected spikes in the workload. The architect needs to know precisely which software components start experiencing performance issues at higher workloads and to be able to take remedial action before the performance of the whole software system becomes unacceptable. The effort and cost of designing and implementing an effective monitoring architecture is key to making a software system scalable.

Dealing with Scalability Failure

System component failures are inevitable, and a software system must be architected to be resilient. Large Internet-based companies such as Google and Netflix have published information on how they deal with that challenge. But why is a monitoring architecture not sufficient for addressing this issue? Unfortunately, even the best monitoring framework does not prevent failure, although it would send alerts when a component fails and starts a chain reaction that could bring down the whole platform. Software architectures leveraging CA Principle 5, Architect for build, test, deploy, and operate, employ additional tactics to deal with failure.

Examples of these tactics include implementing circuit breakers, throttling, load shedding, autoscaling, and bulkheads as well as ensuring that system components fail fast, if necessary, and performing regular health checks on system components. All the components that deal with failure should be integrated into the monitoring architecture. They are an essential source of information for the monitoring dashboard and for the automation components of that architecture.

The next series of articles in this “Continuous Architecture in Practice” series will discuss performance as an architectural concern, so we hope that you find those articles useful and you will keep on reading them!

For more information on our new book, “Continuous Architecture in Practice”, which discusses scalability in much more detail, please visit our website at

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: