Scaling Flipkart for Big Billion Days: Behind the Scenes with Vaidyanathan, Senior Engineering Manager at Flipkart

Building & Running One of India’s Largest E-commerce Events at Scale. Hybrid Cloud, TiDB, Kubernetes Operators, DBREs, Platform Engineering

🚀 DevOps Expert Talks | RealOps Podcast 🎙️

Guest: Vaidyanathan (Yd), Senior Engineering Manager, Flipkart
Host: Gourav Shah, Founder of School of DevOps


🔹 Episode Summary:

Every year, Flipkart’s Big Billion Days is one of the largest shopping events in India, drawing millions of users and transactions in just a few days, akin to Black Friday or Cyber Monday in global markets. But have you ever wondered how Flipkart scales its infrastructure to handle massive traffic spikes, millions of queries per second, and petabytes of data?

In this episode, we go behind the scenes with Vaidyanathan, Senior Engineering Manager at Flipkart, as he shares the challenges, strategies, and technologies used to ensure Flipkart’s systems remain highly available, scalable, and resilient under extreme load.

From hybrid cloud architectures to database scaling with Kubernetes, we uncover what it takes to run one of India's largest e-commerce platforms.


🔹 Key Takeaways from the Episode:

1️⃣ Flipkart’s Scale: Business as Usual vs. Big Billion Days

  • Flipkart operates at massive scale even on regular days, but Big Billion Days (BBD) traffic is 6x to 7x higher than normal.

  • At peak, Flipkart processes a massive number of queries per second and stores a huge amount of data! 🚀

  • Managing such high traffic spikes requires careful capacity planning, auto-scaling, and infrastructure optimization.
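
To make the 6x to 7x figure concrete, here is a minimal capacity-planning sketch. It is not Flipkart's actual method; the function name, the baseline traffic, the per-instance throughput, and the 25% headroom are all hypothetical values chosen for illustration.

```python
import math


def instances_needed(baseline_qps, peak_multiplier, per_instance_qps, headroom=0.25):
    """Estimate how many instances are needed to serve peak traffic.

    All inputs are illustrative assumptions, not figures from the episode:
    baseline_qps     - normal-day queries per second
    peak_multiplier  - how many times higher peak traffic is (e.g. 6 or 7)
    per_instance_qps - queries per second one instance can sustain
    headroom         - extra spare capacity as a fraction (0.25 = 25%)
    """
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    peak_qps = baseline_qps * peak_multiplier
    # Round up: a fractional instance still means one more machine.
    return math.ceil(peak_qps * (1 + headroom) / per_instance_qps)


# Hypothetical service: 100,000 QPS on a normal day, 2,000 QPS per instance.
normal_day = instances_needed(100_000, 1, 2_000)
bbd_peak = instances_needed(100_000, 7, 2_000)
print(f"normal day: {normal_day} instances, BBD peak: {bbd_peak} instances")
```

The gap between the two numbers is the capacity that sits idle for most of the year if it is bought outright, which is the motivation for the hybrid approach described next.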

2️⃣ Flipkart’s Hybrid Cloud: FCP (Flipkart Cloud Platform) + Google Cloud

  • Flipkart has built its own cloud platform (FCP), running on private data centers across India.

  • To handle BBD traffic spikes, Flipkart uses a hybrid cloud approach, bursting workloads to Google Cloud Platform (GCP) during peak periods.

  • Why hybrid?

    • Cost savings: No need to buy expensive hardware just for a few days of traffic.

    • Reliability: If one cloud fails, another ensures availability.

    • Flexibility: Different workloads run on different environments as needed.
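
The bursting idea can be sketched as a simple placement rule: serve demand from private capacity first and send only the overflow to the public cloud. This is a simplified illustration under assumed numbers, not a description of Flipkart's scheduler; `Placement` and `place` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    private: int  # capacity units served from the private data centers
    burst: int    # capacity units overflowed to the public cloud


def place(demand, private_capacity):
    """Fill private capacity first; burst only what does not fit."""
    if demand < 0 or private_capacity < 0:
        raise ValueError("demand and private_capacity must be non-negative")
    on_private = min(demand, private_capacity)
    return Placement(private=on_private, burst=demand - on_private)


# Hypothetical units: private capacity of 400, normal demand 100, peak demand 700.
print(place(100, 400))  # everything fits on the private cloud
print(place(700, 400))  # the excess is burst to the public cloud
```

With this rule the public cloud is paid for only while demand exceeds what the private data centers can hold.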

3️⃣ How Flipkart Runs a Private Cloud with Kubernetes & Managed Services

  • Flipkart’s cloud (FCP) is self-built to provide VMs, storage, and managed services just like AWS or GCP.

  • Flipkart uses Kubernetes as a service internally to manage applications, databases, and scaling.

  • Despite running on a private cloud, Flipkart has its own managed services for databases, load balancing, and auto-scaling, similar to AWS RDS & ELB.

  • Google Cloud is treated as just another data center where workloads can burst when needed.

4️⃣ Running Databases at Petabyte Scale on Kubernetes

  • Flipkart does not rely on managed database services from Google Cloud but instead runs its own databases on Kubernetes.

  • Databases used at Flipkart:

    • TiDB – A MySQL-compatible distributed database for seamless horizontal scaling.

    • HBase – A NoSQL database running at scale.

    • Aerospike – A high-performance, low-latency database for caching and real-time analytics.

    • MySQL on Kubernetes – Flipkart is actively working on migrating MySQL to Kubernetes using custom operators.

5️⃣ The Role of Kubernetes Operators in Database Automation

  • Kubernetes Operators help automate database tasks like provisioning, scaling, backups, and failovers.

  • Flipkart uses TiDB Operator, custom MySQL Operators, and HBase Operators to manage its databases at scale.

  • Operators reduce human intervention, making databases behave more like cloud-managed services (e.g., AWS RDS).
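
At the core of any operator is a reconcile loop: compare the desired state with what is actually running and work out the actions that close the gap. The sketch below shows that pattern in plain Python, without the Kubernetes API. `ClusterState` and `reconcile` are hypothetical names, and the logic is far simpler than what TiDB Operator or Flipkart's custom operators do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    replicas: int  # database replicas that currently exist
    healthy: int   # how many of those replicas are healthy


def reconcile(desired_replicas: int, observed: ClusterState) -> List[str]:
    """Return the actions needed to move the observed state toward the desired one."""
    actions = []
    failed = observed.replicas - observed.healthy
    if failed > 0:
        # Failover: unhealthy replicas are replaced without a human in the loop.
        actions.append(f"replace {failed} failed replica(s)")
    diff = desired_replicas - observed.replicas
    if diff > 0:
        actions.append(f"add {diff} replica(s)")       # scale out
    elif diff < 0:
        actions.append(f"remove {-diff} replica(s)")   # scale in
    return actions


# Hypothetical cluster: 5 replicas wanted, 3 exist, 1 of them is unhealthy.
for action in reconcile(5, ClusterState(replicas=3, healthy=2)):
    print(action)
```

A real operator runs this comparison continuously against the cluster, which is why a database managed this way starts to feel like a cloud-managed service.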

6️⃣ Storage Challenges for Databases on Kubernetes

  • Flipkart had to choose between local storage (fast but risky) vs. network storage (safe but costly).

  • Initially, local storage was used, but some workloads are now being migrated to network storage for better availability.

  • Network storage (like EBS) makes failovers easier, reducing downtime in case of node failures.

7️⃣ The Evolution of DevOps into Platform Engineering & DBRE Roles

  • Platform Engineering at Flipkart is responsible for building internal cloud platforms, automation, and scaling strategies.

  • Database Reliability Engineering (DBRE) is an evolving role—DBREs manage database performance, scalability, and automation using Kubernetes.

  • DBREs ensure databases remain performant at Flipkart’s massive scale, working closely with SREs and platform engineers.

  • If you’re a DevOps Engineer, transitioning into DBRE or Platform Engineering could be a highly valuable career move.


🎧 Why Listen to This Episode?

Want to understand how Flipkart handles traffic surges during Big Billion Days? This episode breaks it down.
Curious about running Kubernetes and databases at petabyte scale? Learn from Flipkart’s experience.
Exploring a career shift from DevOps to Platform Engineering or DBRE? Get insights from an industry expert.

👉 Tune in now and take your DevOps career to the next level!

📢 Follow the RealOps Podcast on YouTube & Substack!
🔗 Subscribe for more DevOps Career Talks: [RealOps Podcast]

Taking DevOps to the Next Level

If you are a DevOps professional and want to take your skills to the next level, check out our programs on DevSecOps, Advanced DevOps, MLOps, and more at campus.schoolofdevops.com



Chapters

  • 00:00 Introduction to Flipkart's Big Billion Day Infrastructure

  • 02:46 Scaling Challenges and Solutions for Big Billion Day

  • 05:30 Hybrid Cloud Infrastructure: Flipkart's Approach

  • 08:04 Managed Services and Self-Service Platforms at Flipkart

  • 10:58 Kubernetes and Its Role in Flipkart's Infrastructure

  • 13:45 Database Management and Scalability at Flipkart

  • 16:18 Using Operators for Database Management

  • 18:49 Choosing and Customizing Operators for Databases

  • 28:29 Custom Database Solutions for Kubernetes

  • 32:18 Challenges in Building Scalable Infrastructure

  • 34:47 The Role of Platform Engineering

  • 38:01 Understanding the DBRE Role

  • 43:38 Responsibilities of a Platform Engineering Team

  • 49:51 Chaos Engineering and Resilience Testing

  • 53:19 Outro