Data Engineering

How to Design a Scalable Data Pipeline from Scratch

Building a data pipeline that scales from thousands to millions of records doesn't require complex infrastructure at the start.

Anavii Tech · 12 min read · Apr 21, 2026
Data Pipeline Architecture

Enterprise data pipelines face a fundamental challenge: the moment you scale beyond prototype volumes, the architectural decisions made in development become liabilities. Ingestion bottlenecks emerge not from single points of failure, but from assumptions baked into early-stage design — assumptions that held at 10 GB but collapse at 10 TB.

Core Pattern

The modern ELT pattern addresses scale through a three-layer approach to fault tolerance: distributed ingestion via CDC, decoupled staging with schema-on-read, and dead-letter queuing with automatic alerting.


01

Distributed ingestion via CDC

Rather than polling at intervals that miss intermediate states, change data capture records every mutation in sequence, providing an immutable audit trail that downstream systems can replay independently. Source-system failures do not propagate downstream, because consumers read from the change log rather than from the source itself.

Interval Polling                     Change Data Capture
Misses intermediate state changes    Every mutation captured in sequence
Gap risk during downtime             Immutable, replayable audit trail
No guaranteed ordering               Ordered log semantics
Lossy at high mutation rate          Zero-loss at any mutation rate

The key insight: CDC transforms data pipelines from push-based polling to event-driven consumption. Each data source maintains an ordered log of changes. Consumers read from this log at their own pace, retrying without re-triggering source operations. This removes coupling between source stability and pipeline health.
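
To make the consumption side concrete, here is a minimal sketch of a consumer draining a change log at its own pace. It assumes a Kafka topic fed by a CDC tool and a Debezium-style event envelope; the topic name, consumer group, and the write_to_raw_staging() helper are illustrative assumptions, not a prescribed contract.

```python
# Minimal sketch of an event-driven CDC consumer, assuming a Kafka topic fed by
# a CDC tool and a Debezium-style event envelope. Topic name, consumer group,
# and the write_to_raw_staging() placeholder are illustrative assumptions.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "orders.cdc",                       # assumed topic of ordered change events
    bootstrap_servers="localhost:9092",
    group_id="raw-staging-loader",
    enable_auto_commit=False,           # commit offsets only after a durable write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def write_to_raw_staging(event: dict) -> None:
    """Placeholder: append the raw change event to the staging store."""
    ...

for message in consumer:
    event = message.value
    # Debezium-style envelope: op is c/u/d, 'after' holds the new row state.
    row = event.get("after") or event.get("before")
    write_to_raw_staging({"op": event.get("op"), "row": row,
                          "source_offset": message.offset})
    consumer.commit()   # the consumer sets its own pace; replay = rewind offsets
```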

Implementation approaches vary by database:

  • PostgreSQL WAL (Write-Ahead Logging): Logical decoding exposes changes as ordered events. Tools like Debezium read the log and emit to Kafka or cloud event systems.
  • MySQL binlog: MySQL maintains an ordered transaction log of its own. Replicas consume from the binlog, and CDC tools tap into the same mechanism.
  • Cloud-native (BigQuery, Snowflake): Many cloud warehouses offer built-in change data capture streams, which removes most of the operational overhead of running CDC tooling yourself.
  • Application-level CDC: For systems without native CDC support, application code emits change events to an event log (Kafka) as part of the transaction commit, as sketched below. Higher operational burden, but universally applicable.
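
For the application-level case, the sketch below uses a transactional outbox, a common way to make the event emit atomic with the business write. PostgreSQL via psycopg2, the outbox table, and the event shape are all assumptions chosen for illustration.

```python
# Minimal sketch of application-level CDC using a transactional outbox, assuming
# PostgreSQL via psycopg2. The connection string, outbox table, and event shape
# are illustrative assumptions, not a prescribed contract.
import json
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # assumed connection settings

def update_order_status(order_id: int, new_status: str) -> None:
    with conn:                  # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            # 1. The business write itself.
            cur.execute(
                "UPDATE orders SET status = %s WHERE id = %s",
                (new_status, order_id),
            )
            # 2. The change event, written in the SAME transaction so the emit
            #    can never be lost or duplicated relative to the write.
            cur.execute(
                "INSERT INTO outbox (topic, payload) VALUES (%s, %s)",
                ("orders.cdc",
                 json.dumps({"op": "u", "id": order_id, "status": new_status})),
            )
    # A separate relay process polls the outbox table and publishes each row to
    # Kafka, marking it as sent once the broker acknowledges it.
```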

02

Decoupled staging with schema-on-read

When ingestion and transformation are tightly coupled, a schema change in production forces a complete pipeline re-execution. By implementing a raw staging layer with schema-on-read semantics, the pipeline absorbs source schema evolution without requiring re-ingestion. This separation alone can cut incident recovery time from hours to minutes.

Coupled Pipeline       incident recovery measured in hours
Decoupled Staging      incident recovery measured in minutes

The pattern works like this:

  1. Ingestion layer: CDC emits raw records as JSON/Avro to the raw staging table. No schema validation. Data lands exactly as it came from the source.
  2. Schema definition: Transformation logic lives downstream. You define which fields matter, which are optional, and how each is transformed. This logic can be updated without re-ingesting.
  3. Evolution handling: When the source adds a column, CDC includes it in the raw record. Transformation logic simply ignores it until you update the schema definition. Zero re-ingestion.
  4. Data recovery: If a transformation bug corrupts derived data, re-run the transformation over the same raw staging data. Minutes, not hours.

This decoupling is essential at scale. At 1 million records per second, re-ingesting the full history is prohibitively expensive. Schema-on-read lets you fix bugs in-place.
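
The sketch below shows schema-on-read over a raw staging layer, assuming newline-delimited JSON records and an illustrative SCHEMA mapping: the transformation picks the fields it knows about, ignores everything else, and can be re-run over the same raw data after a fix.

```python
# Minimal sketch of schema-on-read over a raw staging layer. The raw records
# and the SCHEMA mapping are illustrative assumptions.
import json
from typing import Iterable, Iterator

# The schema lives downstream of ingestion: it names the fields that matter
# today and can be changed without touching the raw staging data.
SCHEMA = {"id": int, "status": str, "amount": float}

def transform(raw_records: Iterable[str]) -> Iterator[dict]:
    for line in raw_records:
        record = json.loads(line)           # raw JSON, exactly as ingested
        row = {}
        for field, cast in SCHEMA.items():
            if field in record:             # unknown source columns are ignored
                row[field] = cast(record[field])
        yield row

# Recovering from a transformation bug means fixing SCHEMA (or the logic) and
# re-running transform() over the same raw staging data; nothing is re-ingested.
raw = ['{"id": "7", "status": "shipped", "amount": "19.90", "new_col": "x"}']
print(list(transform(raw)))                 # the unexpected new_col is ignored
```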


03

Dead-letter queuing with alerting

Dead-letter queuing transforms failures from silent data loss into actionable signals. When a record cannot be parsed or a foreign key constraint fails, the pipeline does not halt — it routes the problematic record to a quarantine table, logs the failure reason, and continues processing. Engineering teams receive alerts with enough context to resolve the issue without re-running the entire pipeline.

Incoming record: the pipeline attempts processing.

  • Success path: the record is written to the destination table; the pipeline continues without interruption.
  • Failure path: the record is routed to the quarantine table, the failure reason is logged, and an alert is dispatched; the pipeline continues.

Implementation details (a sketch of the routing logic follows the list):

  • Quarantine table schema: Store the original record (JSON), the transformation step that failed, error message, timestamp, and retry count. This gives engineers full context.
  • Error classification: Distinguish between retriable errors (network timeout) and non-retriable errors (bad data, constraint violation). Retry logic only applies to retriable failures.
  • Alerting thresholds: Alert if error rate exceeds 0.1% or if specific error types spike. This prevents silent data corruption.
  • Manual recovery: Provide tooling to fix and replay records from the quarantine table. A human reviews the record, the team fixes the underlying issue (schema update, data quality fix), then the record is replayed through the pipeline.
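
The routing logic might look like the following sketch. The error taxonomy, retry limit, and quarantine() placeholder are illustrative assumptions; a real pipeline would persist quarantined records to a table rather than print them.

```python
# Minimal sketch of dead-letter routing with error classification. The error
# taxonomy, retry limit, and quarantine() placeholder are illustrative.
import json
import time

RETRIABLE = (TimeoutError, ConnectionError)   # transient, e.g. network issues
MAX_RETRIES = 3

def quarantine(record: dict, step: str, error: Exception, retries: int) -> None:
    """Placeholder: store full context so an engineer can fix and replay."""
    print(json.dumps({
        "record": record,
        "failed_step": step,
        "error": str(error),
        "timestamp": time.time(),
        "retry_count": retries,
    }))

def process(record: dict, transform, load) -> None:
    retries = 0
    while True:
        try:
            load(transform(record))            # success path: write and move on
            return
        except RETRIABLE as exc:               # retriable: back off and try again
            retries += 1
            if retries > MAX_RETRIES:
                quarantine(record, "process", exc, retries)
                return                         # give up on this record only
            time.sleep(2 ** retries)
        except Exception as exc:               # non-retriable: bad data, constraint
            quarantine(record, "process", exc, retries)
            return                             # straight to quarantine, keep going
```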

"The ultimate measure of fault-tolerant pipeline design is not absence of failure — it is bounded recovery time with zero data loss."


04

Petabyte-scale: event-driven autoscaling

Static cluster sizing wastes compute during off-peak hours and starves processing during batch windows. Event-driven autoscaling, triggered by queue depth rather than time-based schedules, aligns infrastructure consumption with actual workload demands — delivering predictable cost alongside consistent SLAs.

The mechanism:

  1. Queue depth monitoring: Continuously measure unprocessed records in the ingestion buffer. This is your signal for workload intensity.
  2. Scaling thresholds: Define lower and upper bounds. If depth exceeds upper bound, spin up more worker processes. If it drops below lower bound, scale down.
  3. Processing rate calibration: Know your processing rate per worker (e.g., 50k records/sec). Scale to keep queue depth stable at a comfortable level (e.g., 2-minute backlog).
  4. Cold start time: Account for container startup, connection pooling, and warm-up time. Add a buffer to prevent oscillation.
  5. Cost optimization: For batch-oriented workloads, schedule expensive workers during batch windows. For streaming, maintain baseline capacity and burst on demand.

Result at petabyte scale: A pipeline that automatically adapts to load spikes, runs efficiently during quiet hours, and maintains predictable recovery time regardless of workload intensity.
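
As a rough illustration of the control loop, the sketch below derives a worker count from queue depth and a measured per-worker rate, with a cooldown to absorb cold starts. The thresholds and platform hooks (get_queue_depth, set_worker_count) are assumptions; managed scalers such as KEDA or cloud autoscaling policies implement the same idea.

```python
# Minimal sketch of queue-depth-driven autoscaling. Thresholds, rates, and the
# platform hooks (get_queue_depth, set_worker_count) are assumptions.
import math
import time

RECORDS_PER_WORKER_PER_SEC = 50_000   # measured processing rate per worker
TARGET_BACKLOG_SECONDS = 120          # keep roughly a 2-minute backlog
MIN_WORKERS, MAX_WORKERS = 2, 64
COOLDOWN_SECONDS = 300                # absorb cold starts, prevent oscillation

def desired_workers(queue_depth: int) -> int:
    # Throughput needed to clear the current backlog within the target window.
    required_rate = queue_depth / TARGET_BACKLOG_SECONDS
    workers = math.ceil(required_rate / RECORDS_PER_WORKER_PER_SEC)
    return max(MIN_WORKERS, min(MAX_WORKERS, workers))

def control_loop(get_queue_depth, set_worker_count) -> None:
    last_scale = 0.0
    while True:
        depth = get_queue_depth()               # signal for workload intensity
        target = desired_workers(depth)
        if time.time() - last_scale >= COOLDOWN_SECONDS:
            set_worker_count(target)            # scale up or down toward target
            last_scale = time.time()
        time.sleep(30)                          # re-evaluate every 30 seconds
```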


Design for recovery from day one

Architecting for bounded recovery time with zero data loss eliminates the technical debt that accumulates when teams retrofit resilience into systems designed for simplicity. The three-layer ELT model is not a migration path — it is the starting point.

At startup stage: Your database holds everything. Implement CDC at the application layer (emit events to a transaction log during commits). Use raw staging. Build dead-letter queuing. You've set the foundation for 1000x growth without architectural rewrites.

At enterprise scale: Your infrastructure is sophisticated — logical decoding in PostgreSQL, binlog in MySQL, native CDC streams in data warehouses. The design principles remain identical. The operational complexity increases, but the architecture does not.

Architecture Summary
  • CDC ingestion: immutable, replayable log; source failures do not propagate downstream.
  • Schema-on-read staging: absorbs source schema evolution without re-ingestion.
  • Dead-letter queuing: failures become actionable signals; the pipeline never halts on bad records.
  • Queue-depth autoscaling: predictable cost alongside consistent SLAs at any scale.
