How to survive and thrive during traffic surges
A well-known business saying goes, “You can never have too many website visitors.” We won’t challenge that universal truth, but it conveniently keeps silent about the traffic surges those visitors can bring.
Work-from-home is the new Slashdot effect
Pre-Christmas sales, Black Friday, a highly anticipated product launch (did you also think of the iPhone 12?), a feature on a website with millions of subscribers: any of these events can cause a huge burst of traffic.
In recent days, you may also have noticed problems with a range of in-demand services: glitchy Zoom or Microsoft Teams calls, an unreachable LinkedIn, grainy Netflix videos, and so on.
With a significant part of the population working from home during the shutdown, digital channels have seen a massive increase in web traffic.
The graph below shows how dramatically US internet traffic has grown over the last two months as people spent more time online.
Since this and the other situations mentioned above are likely here to stay, it’s important to be prepared to withstand the next surge.
To equip you for it, we’ve collected proven monitoring, testing, and optimization techniques that let your infrastructure cope with heavy loads.
Traffic surges: So good, it’s bad
As a rule, a traffic surge means more customers, subscribers, and closed deals than before. Unfortunately, it also brings a range of dangers that can hurt your reputation:
- Downtime can negatively affect your website’s quality score and how your site ranks in search results.
- Users dissatisfied with your website’s performance are less likely to return, which translates directly into lost revenue.
- It’s important to tell an ordinary traffic surge apart from, for instance, a DNS amplification DDoS attack, which can result in an entirely unreachable service, loss of confidential data, and a drop in turnover, to name a few consequences.
How to prepare for and handle unexpected traffic surges
Surviving a traffic increase is less about heroic rescue efforts (we’ll review those later) than about mitigating risks beforehand.
Many SaaS-based automated performance monitoring platforms help development and operations teams keep services available at critical moments. Among them are Dynatrace, Datadog, AppDynamics, and many more.
These platforms typically provide a unified environment that automatically traces the network paths your apps run across, helping identify the problems behind atypical response times.
We also recommend combining synthetic and real user monitoring to learn how performance issues affect end users’ experience. This way, you gain full visibility into your product: collecting, searching, and analyzing traces across fully distributed architectures; mapping data flows and cluster services based on their connections; and graphing, dashboarding, and alerting.
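To make the idea of a synthetic check concrete, here is a minimal Python sketch. It assumes a hypothetical `probe` function that scripts one user journey (a real setup would issue HTTP requests from several locations on a schedule); `run_synthetic_check` and `p95_breaches` are illustrative names, not part of any vendor's API. The sketch times repeated runs and raises a flag when the 95th-percentile latency crosses a threshold you choose.

```python
import statistics
import time
from typing import Callable


def run_synthetic_check(probe: Callable[[], object], runs: int = 20) -> list[float]:
    """Call a scripted probe several times and record each latency in milliseconds.

    `probe` is a hypothetical function exercising one user journey
    (for example, loading the login page); it should raise on failure.
    """
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        probe()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def p95_breaches(latencies_ms: list[float], threshold_ms: float) -> tuple[float, bool]:
    """Return the 95th-percentile latency and whether it exceeds the alert threshold."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20, method="inclusive")[-1]
    return p95, p95 > threshold_ms


# Stand-in probe: a real one would, e.g., fetch a page and check the response.
samples = run_synthetic_check(lambda: sum(range(10_000)), runs=20)
p95, alert = p95_breaches(samples, threshold_ms=500)
print(f"p95={p95:.2f} ms, alert={alert}")
```

Alerting on a high percentile rather than the average keeps a handful of very slow requests from hiding behind many fast ones.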
Database performance tuning
If your application relies heavily on its database (DB), any DB slowdown can affect the performance of the entire system. To avoid such bottlenecks, practice database performance tuning regularly as part of a broader performance initiative.
Make proper use of indexes
An index is a data structure, typically a B-tree, that speeds up data retrieval from a database table.
A lack of indexes is one of the most frequent causes of slow queries. When a table has no suitable index, the database has to scan its rows one by one to find those that match the conditions, which becomes extremely time-consuming as the table grows.
Here are the main benefits of indexing:
- Faster data access
- Reduced total number of I/O operations
- Accelerated SELECT queries and reports
- Unique indexes, like primary key and unique key constraints, help to prevent data duplication
Note, however, that the more indexes you add to a table, the slower INSERT, UPDATE, and DELETE queries become, because every index must be updated along with the table data.
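The effect of an index is easy to observe in a query plan. This sketch uses SQLite (through Python's built-in `sqlite3`) as a stand-in for your production database; the `orders` table and the `idx_orders_customer` index are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's query-plan details for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); keep only the readable detail.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)],
)

query = "SELECT id, total FROM orders WHERE customer_id = ?"
before = plan(conn, query, (42,))  # no usable index: full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(conn, query, (42,))   # index lookup instead of a scan

print("before:", before)
print("after: ", after)
```

Before the index, the plan reports a full SCAN of `orders`; afterwards it reports a SEARCH using `idx_orders_customer`. That index is also one more structure every write to `orders` now has to maintain.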
Explore the execution plan
A query plan (or query execution plan) shows the data retrieval methods selected by the SQL Server query optimizer, presents them visually, and helps you decide which indexes to create.
There are estimated and actual execution plans. The estimated execution plan shows what SQL Server would most likely do when executing the query. The actual execution plan is captured when the query runs and reports exactly how many reads were made, how many rows were read, and which joins were performed.
Examining execution plans is one of the essential ways to determine why a specific query is slow or why one query runs faster than another.
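Comparing plans side by side is often the fastest way to see why one query is slower than its near twin. SQL Server exposes estimated plans (for example, with `SET SHOWPLAN_XML ON`) and actual plans (with `SET STATISTICS XML ON`, or the corresponding options in SSMS). The sketch below instead uses SQLite's `EXPLAIN QUERY PLAN`, which reports only the chosen plan, roughly comparable to an estimated plan. The `users` table and its index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")


def plan(sql: str) -> str:
    """Return the optimizer's chosen plan (SQLite's EXPLAIN QUERY PLAN) as text."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Wrapping the indexed column in a function hides it from the plain index.
slow = plan("SELECT id, name FROM users WHERE lower(email) = 'ann@example.com'")
# Comparing the bare column lets the optimizer seek into idx_users_email.
fast = plan("SELECT id, name FROM users WHERE email = 'ann@example.com'")

print("slow:", slow)  # a SCAN of users
print("fast:", fast)  # a SEARCH using idx_users_email
```

The two queries look almost identical, but only the plans reveal that the first one reads every row.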
Use temporary tables wisely
If a task can be written simply, there is no need for temporary tables, since they add complexity to the code. On the other hand, if a procedure can’t be handled with a single query, temporary tables are a reasonable way to hold intermediate results.
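When a procedure does need intermediate results, a temporary table lets you compute them once and reuse them across several steps. The sketch below uses SQLite's `CREATE TEMP TABLE ... AS SELECT`; the tables and the "above-average spender" logic are hypothetical. This toy case could also be written as one query with a CTE, which is exactly the situation in which you would skip the temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO orders (customer_id, total) VALUES (1, 50), (1, 70), (2, 20), (3, 400);
    """
)

# Step 1: stage an intermediate result once, in a connection-private TEMP table.
conn.execute(
    """
    CREATE TEMP TABLE customer_totals AS
    SELECT customer_id, SUM(total) AS spent
    FROM orders
    GROUP BY customer_id
    """
)

# Later steps reuse the staged result instead of recomputing the aggregate.
avg_spent = conn.execute("SELECT AVG(spent) FROM customer_totals").fetchone()[0]
big_spenders = conn.execute(
    "SELECT customer_id FROM customer_totals WHERE spent > ? ORDER BY customer_id",
    (avg_spent,),
).fetchall()

conn.execute("DROP TABLE customer_totals")  # explicit cleanup; it would also vanish on close
print(avg_spent, big_spenders)
```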
Get rid of coding loops
Row-by-row loops slow down the whole sequence of operations. You can avoid them by using a single UPDATE or INSERT statement that covers multiple rows and values. Also, make sure the WHERE clause skips rows whose stored value already matches the new value, so they are not needlessly updated.
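Here is the difference sketched with Python's `sqlite3`; the `products` table is hypothetical. `executemany` runs one prepared INSERT for the whole batch inside a single transaction (a multi-row `VALUES` list is another option), and the UPDATE is one set-based statement whose `price <> ?` condition implements the "don't rewrite unchanged values" advice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")

rows = [(i, 10.0) for i in range(1, 1001)]

# Instead of a Python loop issuing one INSERT per row, hand the whole batch
# to a single prepared statement.
with conn:  # one transaction for the whole batch
    conn.executemany("INSERT INTO products (id, price) VALUES (?, ?)", rows)

# A single set-based UPDATE replaces a read-modify-write loop. The extra
# `price <> ?` condition skips rows whose stored value already matches.
new_price = 12.5
update_sql = "UPDATE products SET price = ? WHERE id <= ? AND price <> ?"
with conn:
    changed = conn.execute(update_sql, (new_price, 500, new_price)).rowcount

with conn:
    # Running the same update again touches nothing.
    unchanged = conn.execute(update_sql, (new_price, 500, new_price)).rowcount

print(changed, unchanged)  # 500 0
```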
Limit the data your queries return
It’s important to consider not only query speed and the load on the database server but also how much data is ultimately sent from the database server to the application across the network.
Selecting only the columns you need instead of using SELECT *, and capping the number of rows with LIMIT, retrieves just the data necessary to meet business requirements. Limiting and specifying the data this way significantly reduces the need for further database optimization down the road.
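A minimal sketch of both habits, using SQLite and a hypothetical `articles` table with a large `body` column: the query names only the columns the page needs and caps the row count with `LIMIT` (SQL Server would use `TOP` or `OFFSET ... FETCH` instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT, published_at TEXT)"
)
conn.executemany(
    "INSERT INTO articles (title, body, published_at) VALUES (?, ?, ?)",
    [(f"Post {i}", "lorem ipsum " * 500, f"2020-05-{i:02d}") for i in range(1, 31)],
)


def latest_headlines(limit: int = 10) -> list[tuple[int, str]]:
    """Fetch only the columns a headline list needs, capped at `limit` rows."""
    # No SELECT *: the large `body` column is never transferred to the application.
    return conn.execute(
        "SELECT id, title FROM articles ORDER BY published_at DESC LIMIT ?",
        (limit,),
    ).fetchall()


print(latest_headlines(3))
```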
Do not misuse SELECT DISTINCT
SELECT DISTINCT removes duplicate rows by comparing every selected field, which typically means sorting or hashing the entire result set and can take considerable processing power. One workaround is to drop DISTINCT and select enough fields (for example, a key column) to make each row unique on its own, so there are no duplicates to eliminate and the number of records is still correct.
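A frequent source of unnecessary DISTINCT is a join that multiplies rows, only for DISTINCT to collapse them again. One common fix, swapped in here alongside the tip above rather than taken from it, is to filter with a non-correlated `IN` subquery (a semi-join) so duplicates never appear. The tables and data are hypothetical, with SQLite standing in for your database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers (id));
    INSERT INTO customers (id, name) VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders (customer_id) VALUES (1), (1), (1), (2);
    """
)

# The join repeats a customer once per order; DISTINCT then has to
# deduplicate the whole joined result.
with_distinct = conn.execute(
    """
    SELECT DISTINCT c.id, c.name
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

# The semi-join only asks whether a customer appears in orders, so each
# customer is returned at most once and there is nothing to deduplicate.
with_semi_join = conn.execute(
    """
    SELECT c.id, c.name
    FROM customers AS c
    WHERE c.id IN (SELECT customer_id FROM orders)
    ORDER BY c.id
    """
).fetchall()

print(with_distinct == with_semi_join, with_semi_join)
```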
Avoid correlated SQL subqueries
Correlated subqueries are those that use values from the parent, or outer, query. Some SQL developers write them in WHERE or SELECT clauses where a join would do. The subquery may then run once for each row returned by the outer query, which degrades overall query performance.
A much more effective SQL performance tuning method is to refactor the correlated subquery as a join.
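The refactoring looks like this in a minimal SQLite sketch with hypothetical `customers` and `orders` tables. Some optimizers already decorrelate simple subqueries on their own, so it is worth checking the execution plan before and after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers (id, name) VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders (customer_id, total) VALUES (1, 10), (1, 15), (2, 40);
    """
)

# Correlated: the inner SELECT references c.id, so it is (logically)
# evaluated once for every customer row.
correlated = conn.execute(
    """
    SELECT c.name,
           (SELECT COALESCE(SUM(o.total), 0) FROM orders AS o WHERE o.customer_id = c.id) AS spent
    FROM customers AS c
    ORDER BY c.name
    """
).fetchall()

# Refactored: aggregate orders once, then join the result to customers.
joined = conn.execute(
    """
    SELECT c.name, COALESCE(t.spent, 0) AS spent
    FROM customers AS c
    LEFT JOIN (
        SELECT customer_id, SUM(total) AS spent
        FROM orders
        GROUP BY customer_id
    ) AS t ON t.customer_id = c.id
    ORDER BY c.name
    """
).fetchall()

print(correlated == joined, joined)
```

The `LEFT JOIN` plus `COALESCE` keeps customers without orders, so both versions return the same rows; a plain inner join would silently drop them.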
Utilize WHERE instead of HAVING to define filters
If you need to filter rows based on conditions, prefer a WHERE clause.
HAVING checks conditions after aggregation, filtering only once SQL has retrieved and grouped the results. For conditions on ordinary columns, it can therefore be much slower than WHERE, so reserve HAVING for conditions on aggregates, such as COUNT(*) > 1.
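A small SQLite sketch of the distinction, using a hypothetical `orders` table. Some optimizers can move simple HAVING conditions into WHERE automatically, but writing the filter in WHERE yourself doesn't depend on that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL);
    INSERT INTO orders (region, total) VALUES
        ('EU', 10), ('EU', 30), ('US', 25), ('US', 5), ('APAC', 50);
    """
)

# Filtering a plain column in HAVING: every region is grouped first,
# and most groups are thrown away afterwards.
having_version = conn.execute(
    "SELECT region, SUM(total) FROM orders GROUP BY region HAVING region = 'EU'"
).fetchall()

# The same filter in WHERE: non-EU rows are discarded before grouping.
where_version = conn.execute(
    "SELECT region, SUM(total) FROM orders WHERE region = 'EU' GROUP BY region"
).fetchall()

# HAVING remains the right tool for conditions on aggregates.
busy_regions = conn.execute(
    "SELECT region FROM orders GROUP BY region HAVING COUNT(*) >= 2 ORDER BY region"
).fetchall()

print(having_version, where_version, busy_regions)
```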
Consider micro(macro)services for scaling bottlenecks
When user demand hits unforeseen peaks, a system’s scalability and fault tolerance are severely tested. Microservices make scaling more efficient here: each service can be scaled independently, and spinning up additional containers is much faster than booting additional virtual machines.
But you can go further.
As the environment matures, macroservices become a more justified choice. In this architecture, an application runs 2-20 individual services, each a medium-sized codebase that serves one business function. In our experience, macroservices can be more resilient and maintainable than microservices.
Add more callbacks
Callbacks enable real-time reporting, which makes them essential for anyone who relies on server-based technologies. When important data is updated live during a spike, developers can respond quickly to changes and prevent performance issues.
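The tip doesn't name a specific mechanism, so here is one possible reading, sketched in Python: an in-process hub (the `MetricsHub` class is hypothetical) that invokes subscribed callbacks the moment a metric is recorded, so alerting logic reacts immediately instead of waiting for the next poll. The metric name and the 1,000 requests-per-second limit are illustrative.

```python
from collections import defaultdict
from typing import Callable

# A callback receives the metric name and its new value.
Callback = Callable[[str, float], None]


class MetricsHub:
    """Hypothetical hub: recording a metric fires every callback subscribed to it."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callback]] = defaultdict(list)

    def subscribe(self, metric: str, callback: Callback) -> None:
        self._subscribers[metric].append(callback)

    def record(self, metric: str, value: float) -> None:
        for callback in self._subscribers[metric]:
            callback(metric, value)


alerts: list[str] = []


def alert_on_spike(metric: str, value: float, limit: float = 1000.0) -> None:
    """Example callback: note any reading above a (hypothetical) limit."""
    if value > limit:
        alerts.append(f"{metric} at {value:.0f}, above {limit:.0f}")


hub = MetricsHub()
hub.subscribe("requests_per_second", alert_on_spike)
for reading in (350.0, 900.0, 1450.0):
    hub.record("requests_per_second", reading)

print(alerts)
```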
Optimize WebSockets bandwidth
WebSocket compression (the permessage-deflate extension) “squeezes” message payloads, considerably reducing the amount of transmitted data. This saves bandwidth and speeds up message delivery, at the cost of some extra CPU and memory per connection.
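WebSocket libraries usually handle permessage-deflate (RFC 7692) for you once it is enabled. To show what happens to each message, this sketch reproduces the transformation with Python's `zlib`: raw DEFLATE, a sync flush, and removal of the trailing `00 00 ff ff` bytes. It uses a fresh compressor per message, which corresponds to the extension's no-context-takeover mode; the sample payload is hypothetical.

```python
import json
import zlib

# permessage-deflate sends raw DEFLATE data and drops the 4-byte
# 0x00 0x00 0xff 0xff tail that a sync flush produces.
_TAIL = b"\x00\x00\xff\xff"


def compress_message(payload: bytes) -> bytes:
    compressor = zlib.compressobj(wbits=-15)  # negative wbits: raw DEFLATE, no zlib header
    data = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    return data[: -len(_TAIL)] if data.endswith(_TAIL) else data


def decompress_message(data: bytes) -> bytes:
    decompressor = zlib.decompressobj(wbits=-15)
    return decompressor.decompress(data + _TAIL)  # restore the stripped tail first


# A chatty, repetitive JSON update, typical of real-time dashboards.
message = json.dumps(
    {"type": "price_update", "items": [{"symbol": "ACME", "price": 101.5}] * 50}
).encode()
wire = compress_message(message)
print(len(message), "->", len(wire), "bytes")
```

Compression pays off most for larger, repetitive payloads like this one; for tiny messages, the CPU cost can outweigh the bytes saved.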
Cutting the extras
Optimizing web content is a tried-and-true way to reduce page load time. By minifying, compressing, and re-encoding its assets, a website can load 50-80 percent faster on average.
Another related approach is pagination: an API returns only part of the results, along with a token the client can use to request more data. This bounds the work done per request, which makes the additional load on a service much easier to estimate and reduces bandwidth usage.
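A minimal sketch of token-based pagination, using SQLite and a hypothetical `items` table. The token is just a base64-encoded JSON object holding the last id served (an illustrative format, not a standard), and the query uses keyset pagination rather than `OFFSET`, so later pages stay cheap.

```python
import base64
import json
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)", [(f"item-{i}",) for i in range(1, 26)])


def encode_token(last_id: int) -> str:
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_token(token: str) -> int:
    return json.loads(base64.urlsafe_b64decode(token.encode()))["after"]


def list_items(
    page_size: int = 10, token: Optional[str] = None
) -> tuple[list[tuple[int, str]], Optional[str]]:
    """Return one page of items plus an opaque token for the next page (None when done)."""
    after = decode_token(token) if token else 0
    # Keyset pagination: seek past the last id seen instead of using OFFSET,
    # so each page costs roughly the same no matter how deep the client goes.
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after, page_size + 1),  # one extra row tells us whether more pages exist
    ).fetchall()
    page = rows[:page_size]
    next_token = encode_token(page[-1][0]) if len(rows) > page_size else None
    return page, next_token


# Client side: keep following the token until the server stops returning one.
token, pages = None, []
while True:
    page, token = list_items(page_size=10, token=token)
    pages.append(page)
    if token is None:
        break

print([len(p) for p in pages])  # [10, 10, 5]
```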
Proactive server management
Balance the load
A load balancer is a kind of reverse proxy that sits in front of a pool of servers. When traffic increases, it distributes incoming requests across those servers to keep each server’s queue as short as possible.
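Real load balancers offer several algorithms; NGINX's `least_conn` and HAProxy's `balance leastconn` are two implementations of the one sketched here. This toy Python version, with hypothetical backend names, shows the core idea of least-connections balancing: each new request goes to the server with the fewest requests in flight.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    active: int = 0  # requests currently in flight


@dataclass
class LeastConnectionsBalancer:
    """Toy balancer: send each request to the backend with the fewest in-flight requests."""

    backends: list[Backend] = field(default_factory=list)

    def acquire(self) -> Backend:
        # min() returns the first backend among ties, so equal loads go to the earliest one.
        target = min(self.backends, key=lambda b: b.active)
        target.active += 1
        return target

    def release(self, backend: Backend) -> None:
        backend.active -= 1


lb = LeastConnectionsBalancer([Backend("app-1"), Backend("app-2"), Backend("app-3")])
first = [lb.acquire().name for _ in range(3)]  # one request each
lb.release(lb.backends[1])                      # app-2 finishes its request
fourth = lb.acquire().name                      # app-2 is now the least loaded
print(first, fourth)  # ['app-1', 'app-2', 'app-3'] app-2
```

Unlike plain round-robin, this approach adapts when some requests take much longer than others, which is common during a surge.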
Add a CDN
A CDN (Content Delivery Network) is a distributed collection of servers that holds copies of your application’s files. When users visit the website, the system directs them to a content server that is located nearby.
CDNs can provide additional speed, security, and failover for services that occasionally experience massive traffic spikes.
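How much a CDN can absorb depends largely on what the origin allows it to cache, which is controlled with standard `Cache-Control` headers. This sketch shows one possible policy. The path prefixes and the assumption that static file names are fingerprinted at build time are hypothetical, and you should tune the lifetimes to your own content.

```python
def cache_headers(path: str) -> dict[str, str]:
    """Pick Cache-Control headers an origin might send so a CDN can cache safely.

    Assumes (hypothetically) that build tooling fingerprints static asset
    names, e.g. app.3f9a1c.js, so their content never changes under one URL.
    """
    if path.startswith("/static/"):
        # Fingerprinted assets: cache for a year at the edge and in browsers.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith("/api/"):
        # Per-user or fast-changing data: keep it out of all caches.
        return {"Cache-Control": "private, no-store"}
    # HTML pages: browsers treat their copy as stale at once, while shared
    # caches such as a CDN may serve theirs for 60 seconds.
    return {"Cache-Control": "public, max-age=0, s-maxage=60"}


for path in ("/static/app.3f9a1c.js", "/api/me", "/pricing"):
    print(path, cache_headers(path))
```

Even a short shared-cache lifetime such as `s-maxage=60` means that during a spike the origin serves each page roughly once per minute per edge location instead of once per visitor.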
How to survive during traffic surges: iTechArt’s master class
In light of recent events, one of our telehealth clients found itself at the center of the action: fivefold growth of its overall customer base, a couple of million potential clients, and a manyfold increase in daily users, all with the same number of medical staff and a system designed to handle a much lighter load.
In this situation, it was vital to act immediately. So, we did the following:
- Assembled a senior-level team responsible solely for the ongoing optimization.
- Tracked the application performance using Datadog APM to identify inefficient requests.
- Reduced the latency of some heavy queries from 60 s to under 30 ms through refactoring, adding relevant indexes, and reducing response size.
- Experimented with different database setups to find the one that scales easily.
- Started considering macroservices as an alternative to harder-to-manage microservices.
- Extended the use of chatbots to decrease the load on the doctors.
These results show that the tips above are genuinely battle-tested.
One last thing
They say that premature optimization is the root of all evil, and sometimes that’s a perfectly legitimate view. Still, ignoring reasonable optimization practices can multiply your problems when a major traffic surge hits.
There is only one way out: treat the procedures discussed here as your golden rule, and analyze your architecture ahead of a scaling incident, mapping out the trade-offs between data consistency, component availability, and tolerance to network partitions.
After all, prevention is always better than the cure.