For the first time in my life, instead of cribbing about how slow and painful the IRCTC ticket booking system was, I had a new-found respect for it - I could finally understand what the power of India truly was when it came to Internet traffic.
The preparations had started about four months ago, with all sorts of defensive armour being planned in detail by both the product and the infrastructure teams. The storm was very real and was going to hit every team with a force we'd never experienced before - this was different from anything we'd seen so far, or so we were promised. TBBD or 'The Big Billion Day' was right around the corner, and what was more worrying was that this time, the mega sale would run for a whole week.
By now, we'd done a lot of app-only sale days and were pretty confident about how the whole thing would play out. Systems had been engineered to handle all sorts of happy and non-happy flows, and we'd perfected every system to a considerable degree of confidence. But unfortunately that was just the base requirement here - TBBD was a different beast altogether. Infrastructure programs had dominated most of the two months prior to the code-freeze date, with mandatory attendance from every team in Flipkart; in parallel, multiple new product constructs were being engineered in the user-path and in the order-path to handle the three dreaded problems of the supply-chain world - over-booking, under-booking and matching promise with fulfillment - all while incurring the least cost possible. After all, the success of a supply-chain product construct can only be measured by how accurately we promise to the customer and, more importantly, how well we stick to that promise.
While the user-path systems were mainly being engineered to handle very high QPS, concurrency and consistency, the order-path systems - consisting mainly of the promise, order-management, fulfillment and logistics systems - were a different story altogether. Each and every order was P0 here; the internal async messaging and queueing system was key to making sure absolutely no message was lost anywhere - it was practically the backbone of the whole supply-chain. There were multiple lessons learnt from last year's TBBD: solving very high QPS problems in the supply-chain world is a different story altogether - high QPS at the website would mean nothing if we did not have the manual capacity to pack or deliver the products. With multiple systems talking to each other, even one very small system could potentially choke the whole supply-chain if it failed under high load. The topmost system had to understand the capacity of every system underneath it and pipe orders accordingly - capacity breaking in any of the underlying systems could mean disaster.
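Purely as an illustration of that idea (none of this is Flipkart's actual code - the system names and capacity numbers are invented), piping orders only as fast as the systems below can absorb them boils down to something like this:

```java
import java.util.Map;
import java.util.concurrent.Semaphore;

// Sketch of capacity-aware order piping: an order is released downstream only
// if every system below has spare capacity; otherwise it stays queued instead
// of choking a slower system further down the chain.
public class CapacityAwareDispatcher {

    // Permits model the in-flight capacity each downstream system has declared
    // (system names and numbers are invented for illustration).
    private final Map<String, Semaphore> capacity = Map.of(
            "warehouse", new Semaphore(500),
            "logistics", new Semaphore(300));

    /** Try to push one order down the chain; false means "leave it in the queue". */
    public boolean tryDispatch(String orderId) {
        Semaphore warehouse = capacity.get("warehouse");
        Semaphore logistics = capacity.get("logistics");
        if (warehouse.tryAcquire()) {
            if (logistics.tryAcquire()) {
                forward(orderId);            // both systems have headroom
                return true;
            }
            warehouse.release();             // roll back: logistics is saturated
        }
        return false;                        // caller re-queues and retries later
    }

    /** Called when a downstream system reports it has finished with an order. */
    public void onCompleted(String system) {
        capacity.get(system).release();      // free up one unit of capacity
    }

    private void forward(String orderId) {
        System.out.println("dispatched " + orderId);   // stand-in for the real hand-off
    }
}
```

Permits are returned only when a downstream system reports it is done with an order, which is what keeps the topmost system from pushing more than the chain can digest.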
The fear of the unknown
What if people did not turn up at one of the warehouses? What if a warehouse dispatched more items than the logistics network could handle? What if a seller was unable to handle the load he had been told to expect? What if there was a 'bandh' in a certain state or city?
There were so many things that were beyond our control, but every system had multiple constructs just to make sure we could constrain these problems as much as possible, using algorithms backed by intelligence from big-data - but then anything could go wrong even after all this. "Almost Unrealistic" NFR (Non Functional Requirement) runs at 5X the projected numbers were conducted across all systems for almost a month - just to find out how each system would behave if absolutely unprecedented traffic hit it.
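To give a rough feel for what such a run boils down to, here is a toy sketch of a load driver firing requests at 5X a projected peak - the endpoint, QPS figure and duration are all made up and have nothing to do with the real NFR tooling:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Toy NFR driver: fire requests at five times a projected peak QPS and count
// failures, to see where a service starts to buckle.
public class NfrRun {
    private static final AtomicLong failures = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        int projectedQps = 2_000;                       // hypothetical projected peak
        int targetQps = projectedQps * 5;               // the "almost unrealistic" 5X target
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://service-under-test.local/ping")).build();

        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(8);
        long periodNanos = 1_000_000_000L / targetQps;  // spacing between requests
        scheduler.scheduleAtFixedRate(() ->
                client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                      .whenComplete((resp, err) -> {
                          if (err != null || resp.statusCode() >= 500) {
                              failures.incrementAndGet();   // tally errors under load
                          }
                      }),
                0, periodNanos, TimeUnit.NANOSECONDS);

        TimeUnit.MINUTES.sleep(10);                     // run for ten minutes
        scheduler.shutdownNow();
        System.out.println("failed requests at 5X load: " + failures.get());
    }
}
```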
Being in small and nimble teams meant every team member was an independent worker - while this is generally very good, TBBD made things a bit complicated: in order to ensure continued 24X7 support in shifts, every engineer had to know, in detail, what every other engineer had worked on. Thus began a series of knowledge-transfer sessions that probed every feature in detail, every remote possibility of it breaking under high load, and the alternatives for each. Multiple alerts were installed on every possible metric - both infra and product - just so that the team would be alerted in time if something was breaking or showed signs of breaking - because any small thing, left unattended, could get compounded to at least 100X in a very short time, screwing up everything underneath it.
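As a rough illustration of what one such alert amounts to (the metric source, threshold and paging hook below are all placeholders, not the real monitoring stack):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy version of one such alert: poll a queue-lag metric every minute and page
// the on-call if it crosses a threshold.
public class QueueLagAlert {
    private static final long MAX_UNPROCESSED = 10_000;   // hypothetical lag limit

    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            long lag = readQueueLag();                     // fetch from the metrics store
            if (lag > MAX_UNPROCESSED) {
                page("order-path queue lag at " + lag + " messages");
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    private static long readQueueLag() {
        return 0;                                          // stubbed out in this sketch
    }

    private static void page(String message) {
        System.out.println("ALERT: " + message);           // stand-in for the paging system
    }
}
```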
Night-outs were becoming more common - 16-hour weekdays and working weekends were the norm till TBBD. The facilities team was asked to beef up its support throughout the night - the infrastructure teams were under a lot of pressure to deliver all that they had planned, because every system would completely bank on them being super solid in every planned or unplanned situation. Employees who stayed with their families often did not even get to see their loved ones for many days - and their families were eagerly waiting for TBBD to be successful, for they knew exactly how much effort was being put into this. There was very high energy in every team to get things out of the way as soon as possible - and at the same time a very high sense of anticipation and concern. The veterans had a good idea of what was out there, but the newer joinees could not fathom why TBBD demanded so much from every team. An organizational plan was put in place detailing all the 'tiger-teams', to ensure absolute, unconditional support at every team level at all times through the week.
The Night of 12th October
The uneasy calm before the storm was evident in the team - we checked and re-calibrated all the alerts, struck off all the last-minute tasks, and updated the playbooks to handle every outage that could potentially happen. Some folks were told to come in late for their 12-hour night shift while the others had the option to leave early and get a good night's sleep to handle the daytime load - but obviously nobody left early that day; everybody was eager to see what would happen at midnight when the deals went live.
All the user-path teams were already camping on the lower floors with huge monitors and projectors to watch live data, while the order-path teams camped mainly on the upper floors with their share of metrics being projected live to the whole floor. Every screen had a different graph - from orders being taken to lags in queues, to breaches, to server stats, to live tweets on #TBBD. The calm was evident - all the music that had been playing a while back had stopped, and every engineer was making sure there were no warnings on absolutely any metric.
The traffic started piling up by 10:30 PM as customers downloaded the app just before the sale kicked off at midnight and had already started buying things. All metrics were being closely monitored and looked good - we were ready for TBBD. Just some more time to go. The wait was frustrating. A quick surprise dance by some of the team members on floor 2 eased the pressure a bit before teams went back to their frontiers. As the clock struck 00:00, there was an announcement on the emergency loudspeaker that this was it!
Wild cheers greeted TBBD 2015 and we were in business!
While the user-path had some initial hiccups with the app, all of the order-path graphs were already soaring - the majority of people who had updated the app to the latest version had already started placing orders, and we could see the inventory being depleted in real time at a super-fast rate - good thing we had ample inventory for all products!
We were witnessing the power of a billion Indians at a scale none of us had seen before.
As we'd been thoroughly prepared for this, systems were throttling at just a little below the NFR numbers from the runs we'd done earlier - so we were doing good! TBBD is practically a race to the finish - for us, the finish was five more days, while the finish for the customers was to get hold of a deal before someone else did. Out-of-stocks are bound to happen when the inventory is always going to be less than the number of people contending for it - what's more important is that we show an item as OOS as soon as it happens; the customer shouldn't be able to order a product that in reality was OOS, unless it's a neck-and-neck race condition. The next hour was very intense, with all the metrics being very closely monitored - nobody dared to even visit the restroom - almost like the television audience of a cricket match, where one is expected to keep doing whatever one is doing as long as things are going well - touch wood, as they say.
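The crux of keeping OOS honest under that kind of contention is an atomic check-and-decrement on the inventory count. Here is a deliberately simplified, in-memory sketch of the idea - the real inventory service is distributed and looks nothing like this:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of contention-safe inventory: many buyers hit the same
// item at once, but a unit is reserved only if stock is still left, and the
// item flips to OOS the instant the count reaches zero.
public class InventoryCounter {

    private final ConcurrentHashMap<String, AtomicInteger> stock = new ConcurrentHashMap<>();

    public void addStock(String itemId, int units) {
        stock.computeIfAbsent(itemId, id -> new AtomicInteger()).addAndGet(units);
    }

    /** Atomically reserve one unit; false means the item is already OOS. */
    public boolean tryReserve(String itemId) {
        AtomicInteger count = stock.get(itemId);
        if (count == null) {
            return false;                          // unknown item - treat as OOS
        }
        while (true) {
            int current = count.get();
            if (current <= 0) {
                return false;                      // sold out - show OOS immediately
            }
            if (count.compareAndSet(current, current - 1)) {
                return true;                       // won the neck-and-neck race
            }
            // another buyer got in between - re-read the count and retry
        }
    }
}
```

The compare-and-set retry loop is what keeps two buyers from both grabbing the last unit - only one of them wins the race, and the other immediately sees OOS.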
Post 1:00 AM we started relaxing a bit, playing some music and reading what people had to say on Twitter - now the problem with Twitter is that people who place orders successfully don't tweet about it immediately, but the moment anyone has a bad experience with the app, they take a screenshot and tweet about it right away - so while there was some negative sentiment initially, we knew the positive sentiment was way larger; it would only come into play once the orders were delivered - then the #thankyou #flipkart tweets would come up!
By 1:30 / 2:00 AM, the day team finally left for home, only to come back by 9 AM - the night team meanwhile tried closing in on the valid, expected issues. Most teams had one or two very minor bugs which probably affected a minuscule number of orders, while keeping the other 99.997% of orders safe and healthy.
Warehouses in various parts of India meanwhile started dispatching items by 2:00 AM and the whole juggernaut was in motion. The night-teams handed over responsibility to the day-teams by 11:00 AM - every engineer in the team had a purpose in the tiger-team: once in a while something would alarm, and someone would practically go 'attack' the issue, resolve it immediately or raise it to the right stakeholders.
As luck would have it, the rest of the days went smoothly, with one or two alarms every night - even systems that failed under high load were able to come back up as if nothing had really happened, and anticipated issues were resolved according to plan with minimal impact on other systems (fallbacks, hot-standby nodes, replication strategies et al.).
All in all, every team demonstrated a very high sense of ownership and commitment, making TBBD a grand success at the organization level. The next two weeks would still be pretty hectic for the logistics teams as they attempted to deliver all the goods to customers before the promised dates - and, as they say, all's well that ends well.
TBBD 2016, here we come !
- Vaidyanathan S is a senior software engineer at Flipkart, handling the fulfillment business. Having experienced two Big Billion Days, he has an excellent understanding of how to build fail-safe, consistent, async message-queue based, high-throughput, low-latency systems.