Data Engineering

Data Pipeline Development Cost: ETL, Real-Time, and Analytics Pipelines

April 2026 · 10 min read

Data pipeline conversations go sideways fast because "data pipeline" means three completely different things to different teams. A team that wants to sync Salesforce into BigQuery has a fundamentally different problem from a team that needs to process payment events in real time, which has a fundamentally different problem from a team that wants clean, queryable models for their business analysts.

Each of these has its own tool category, its own cost structure, and its own failure modes. Treating them as the same problem leads to either overspending (buying a streaming infrastructure for a reporting need that hourly batch ETL would have solved) or underspending (running batch ETL on data that needs to be current in seconds).

This post breaks down the three pipeline types, what each costs to build, what the managed tool options look like at different price points, and where the custom build threshold actually is.

Three Pipeline Types and Their Cost Profiles

Batch ETL is the oldest and most common pattern. Extract data from source systems on a schedule — every hour, every night, every week — transform it into a consistent format, and load it into a data warehouse. Latency is the trade-off you accept: if you can live with data that is up to a day old, batch ETL is dramatically simpler and cheaper than anything real-time.

Real-time streaming processes events as they happen. A payment is initiated, a record is updated, a user clicks a button — the event flows into a pipeline, is processed, and is available downstream within seconds or milliseconds. The classic tools here are Apache Kafka and AWS Kinesis. Real-time is the right choice when your business logic actually requires it — fraud detection, operational dashboards that drive live decisions, alerting. It is the wrong choice when the requirement is really just "we want fresh data" — hourly batch usually serves that fine.

Analytics pipelines sit between raw data and business users. Raw tables from Salesforce, your product database, and your billing system land in the warehouse in whatever format the ETL produced. An analytics pipeline takes that raw data and transforms it into clean, documented, tested models that analysts and BI tools can query. dbt is the dominant tool here. This layer is often skipped by teams that have already built ETL — and then they wonder why analysts do not trust the data.

Managed Tools vs Custom Build: Cost Comparison

The first question for any data pipeline project is whether a managed tool covers your requirements. Building custom is not inherently better — it is a choice you make when the managed option does not fit.

| Tool | Type | Cost | Best for |
| --- | --- | --- | --- |
| Fivetran | Managed ETL | $500–5K/month | 150+ connectors, zero engineering, hands-off maintenance |
| Airbyte | Open-source ETL | Self-host: $100–500/month | Custom connectors, data stays in your infra, lower ongoing cost |
| Stitch | Managed ETL | $100–1,250/month | Simpler setup, fewer connectors, lower entry price than Fivetran |
| Custom Python + dbt | Custom | $10K–50K build | Proprietary sources, complex transforms, full control |
| Kafka + Flink | Streaming | $20K–100K build | Real-time event processing, high throughput, sub-second latency |

Fivetran's value proposition is engineering time saved. The cost of building and maintaining an equivalent connector for a single data source — Salesforce, HubSpot, Stripe — is typically $3K–8K in development time and then $500–2K/year in maintenance as APIs change. At Fivetran's $500–5K/month range, the math works out once you are connecting more than one or two sources. For a team with 3–5 standard SaaS sources, Fivetran typically wins on total cost of ownership within 18 months.
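
To make the break-even concrete, here is a back-of-the-envelope comparison using midpoints of the ranges above. All of the inputs are illustrative assumptions — swap in your own connector count and Fivetran tier.

```python
# Rough 18-month total cost of ownership: custom connectors vs Fivetran.
# All figures are assumptions taken from the ranges quoted in this post.
num_sources = 4
build_per_connector = 5_500           # midpoint of the $3K-8K development cost
maintenance_per_connector_yr = 1_250  # midpoint of $500-2K/year
fivetran_per_month = 1_000            # depends on tier and monthly active rows

months = 18
custom_tco = num_sources * (build_per_connector + maintenance_per_connector_yr * months / 12)
fivetran_tco = fivetran_per_month * months

print(f"Custom connectors: ${custom_tco:,.0f}")    # ~$29,500
print(f"Fivetran:          ${fivetran_tco:,.0f}")  # ~$18,000
```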

Airbyte changes the calculus. As an open-source tool you self-host, the per-connector cost drops to infrastructure only — roughly $100–500/month in compute depending on sync volume. The trade-off is operational overhead: someone on your team owns the deployment, the upgrades, and the debugging when a connector fails. For teams with data engineering resources who also need custom connectors for proprietary systems, Airbyte is the most flexible option.

Stitch is the right choice for smaller teams that find Fivetran's pricing steep but want a managed solution. The connector library is narrower, and the transformation capabilities are more limited, but for straightforward sync use cases it covers the fundamentals at a price that is easier to justify early on.

Build Cost by Pipeline Type

When managed tools do not cover your requirements, the build cost depends primarily on pipeline complexity, the number of source systems, and latency requirements.

| Pipeline type | Cost range | Timeline | Ongoing maintenance |
| --- | --- | --- | --- |
| Batch ETL (3–5 sources) | $10K–30K | 4–8 weeks | $500–2K/month |
| Batch ETL (10+ sources) | $30K–60K | 8–16 weeks | $1K–3K/month |
| Real-time streaming | $30K–100K | 8–20 weeks | $2K–5K/month |
| Analytics layer (dbt) | $15K–40K | 4–10 weeks | $1K–2K/month |
| Full data platform | $80K–250K | 4–9 months | $5K–15K/month |

The batch ETL range for 3–5 sources ($10K–30K) assumes sources with accessible APIs or database exports, moderate data quality issues, and daily or hourly sync frequency. The lower end of that range applies when the sources are well-documented and the transforms are straightforward. The upper end applies when source APIs are poorly documented, data quality is inconsistent, or the transform logic involves complex business rules.

Real-time streaming costs more not because the core engineering is conceptually harder, but because the operational complexity is an order of magnitude higher. Batch failures are visible and recoverable — the job ran, it failed, you can see why, you fix it, you rerun. Stream failures are invisible until something downstream breaks. Every component in a streaming pipeline — producers, brokers, consumers, state stores — needs monitoring, alerting, and a failure playbook. That operational infrastructure is where the cost accumulates.
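
To give one concrete example of that operational surface: even a minimal consumer needs an answer for poison messages. The sketch below is illustrative, assuming the confluent-kafka Python client and a hypothetical process() function — it commits offsets only after successful processing and routes failures to a dead-letter topic so one bad event does not stall the stream. The dead-letter topic then needs its own monitoring and replay story, which is exactly the kind of infrastructure the batch world gets for free.

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-processor",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,   # commit manually, only after successful processing
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])

def process(raw: bytes) -> None:
    ...  # business logic goes here (hypothetical)

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"consumer error: {msg.error()}")  # in production: metric + alert, not print
        continue
    try:
        process(msg.value())
        consumer.commit(message=msg)
    except Exception as exc:
        # Route the poison message to a dead-letter topic instead of blocking the stream,
        # then commit so the consumer moves on. The DLQ needs monitoring of its own.
        producer.produce("payments.dlq", msg.value(), headers=[("error", str(exc).encode())])
        producer.flush()
        consumer.commit(message=msg)
```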

What Drives Cost Up (and What to Watch For)

The cost ranges above have meaningful variance. Five factors explain most of it.

1. Number of data sources

Each source system is an independent integration project. Connecting to a source means understanding its data model, handling authentication, dealing with rate limits, writing extraction logic, and testing edge cases (deleted records, schema changes, pagination). A well-designed source with a clean REST API and good documentation takes 1–2 weeks to connect. A poorly documented legacy system or an internal database with no API can take 3–4 weeks. Multiply that across 10 sources and you understand why 10-source projects cost 2–3x what 3-source projects cost.
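
For a sense of what "extraction logic" looks like in practice, here is a minimal sketch for a hypothetical cursor-paginated REST API — the endpoint path, parameter names, and page size are invented for illustration, but the rate-limit backoff and pagination loop are the parts every connector ends up needing.

```python
import time
import requests

def extract_all(base_url: str, api_key: str):
    """Yield every record from a hypothetical cursor-paginated REST endpoint."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {api_key}"
    cursor = None
    while True:
        params = {"limit": 500}
        if cursor:
            params["cursor"] = cursor
        resp = session.get(f"{base_url}/records", params=params, timeout=30)
        if resp.status_code == 429:
            # Rate limited: back off using the server's hint, then retry the same page.
            time.sleep(int(resp.headers.get("Retry-After", "30")))
            continue
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["data"]
        cursor = payload.get("next_cursor")
        if not cursor:
            break
```

Deleted records, schema changes, and incremental sync state come on top of this — which is why even a "simple" source is a 1–2 week project.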

2. Data quality issues in source systems

This is the most underestimated cost factor in data engineering. Clean data in, clean data out. Messy data in means your pipeline needs to detect, handle, and log every quality issue — duplicate records, null values in required fields, inconsistent date formats, business logic embedded in free-text fields that should have been structured data. Teams that have never audited their source data quality before starting a pipeline project routinely discover that quality issues multiply the transform complexity by 2–3x. A discovery phase that surfaces these issues before build starts is worth every dollar.
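
A lightweight audit before scoping can be as simple as the sketch below (pandas, with illustrative column names — the key column and created_at are assumptions about your schema). The counts it returns become the evidence behind the transform estimates.

```python
import pandas as pd

def audit_source(df: pd.DataFrame, key: str, required: list[str]) -> dict:
    """Quantify the quality issues the transform layer will have to handle."""
    report = {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "nulls_in_required_fields": {col: int(df[col].isna().sum()) for col in required},
    }
    if "created_at" in df.columns:
        # Dates stored as free text are a common source of 2-3x transform complexity.
        parsed = pd.to_datetime(df["created_at"], errors="coerce")
        report["unparseable_dates"] = int(parsed.isna().sum())
    return report
```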

3. Latency requirements

Daily batch is the simplest engineering problem. Hourly batch is slightly more complex. Real-time adds an order of magnitude of complexity because you lose the ability to batch error handling and recovery. If your stakeholders want "fresh data," the first conversation to have is what "fresh" actually means for their decisions. Data that is one hour old is "fresh" for most business reporting use cases. Genuine real-time requirements — sub-minute latency — should be validated against specific use cases before committing to the infrastructure cost.

4. Historical backfill

Building the pipeline that moves data going forward is one project. Loading three years of historical data into the warehouse is a separate project that takes longer than most people expect. Historical backfills often surface data model inconsistencies that did not exist in more recent data — legacy values, renamed fields, deleted records that still have references. Budget the backfill separately from the pipeline build. For large datasets, the backfill can take as long as the pipeline build itself.
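
One pattern that keeps backfills manageable is splitting the history into small, independently re-runnable date windows, so a single failure forces a re-run of one window rather than the whole history. A minimal sketch:

```python
from datetime import date, timedelta

def backfill_windows(start: date, end: date, days: int = 7):
    """Yield (window_start, window_end) pairs covering the historical range."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=days), end)
        yield cursor, window_end
        cursor = window_end

# Run the normal extract/transform/load once per window and record completed
# windows somewhere durable, so the backfill is resumable after a failure.
for window_start, window_end in backfill_windows(date(2023, 1, 1), date(2026, 1, 1)):
    ...  # extract(window_start, window_end); transform(...); load(...)
```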

5. Compliance and governance requirements

GDPR and similar regulations add work that is invisible until legal gets involved. Column-level masking (PII fields must not appear in the analytics layer in identifiable form), audit logging (who accessed what data when), data lineage (where did this value come from and has it been transformed), and retention policies (this data must be deleted after N years and the deletion must be provable) each add engineering work. Teams working in regulated industries should add 20–40% to baseline estimates to cover governance requirements.
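
Column-level masking is usually the first of these to land in code. One common approach — sketched below with illustrative column names, not a definitive implementation — is salted hashing of PII fields before they reach the analytics layer, which keeps values joinable without exposing them. Audit logging, lineage, and provable retention each need their own, separate work.

```python
import hashlib

PII_COLUMNS = {"email", "phone", "full_name"}  # illustrative; driven by your data classification

def mask_row(row: dict, salt: str) -> dict:
    """Pseudonymize PII fields with a salted hash so they stay joinable but unreadable."""
    masked = dict(row)
    for col in PII_COLUMNS.intersection(row):
        if masked[col] is not None:
            masked[col] = hashlib.sha256((salt + str(masked[col])).encode()).hexdigest()
    return masked
```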

Recommended Stacks by Team Size

The right stack depends more on your team's engineering capacity and data volume than on feature preferences. These are the patterns that work reliably in practice.

Small team (under 20 people, under 10 data sources): Fivetran for extraction + BigQuery as the warehouse + dbt Core for transforms + Looker Studio for visualization. Total monthly cost: $500–2K for Fivetran, $50–200 for BigQuery at this scale, $0 for dbt Core (open source), $0 for Looker Studio. This stack requires minimal ongoing engineering and covers most reporting needs for a small team. The limitation is Fivetran's connector library — if your primary data source does not have a Fivetran connector, this path requires a workaround.

Mid-size team (20–100 people): Airbyte (self-hosted) or Fivetran + Snowflake or BigQuery + dbt Core or dbt Cloud + Metabase or Tableau. Total monthly cost: $500–3K depending on Airbyte vs Fivetran, $200–1K for Snowflake or BigQuery at this scale, $0–100 for dbt, $500–2K for Metabase or Tableau. This stack handles a larger number of sources and gives analysts a better self-service BI tool than Looker Studio.

Large or complex (100+ people, proprietary sources, real-time requirements): Custom Python connectors or Airbyte with custom connectors + Apache Airflow for orchestration + Kafka for streaming where required + Snowflake or BigQuery + dbt Cloud + Tableau or Looker. Total monthly cost: $5K–20K for infrastructure and tooling. This stack requires a dedicated data engineering team to build and maintain it.

When to Build Custom: The Decision Threshold

Three signals reliably indicate that a managed ETL tool will not solve your problem:

Your primary data source has no connector. Fivetran covers 150+ sources. Airbyte covers even more through its open-source connector ecosystem. But proprietary sources — your own legacy system, a vendor's non-standard API, an internal database with a custom data model — will not be on either list. When your most important data source requires a custom connector, you are building custom whether you want to or not. The question then becomes whether to build the custom connector inside a managed tool framework (Airbyte supports this) or to build the entire pipeline from scratch.

You need sub-minute latency. Batch ETL tools cannot give you sub-minute data freshness, even at their most frequent sync intervals. If your use case genuinely requires data within seconds of it being created (real-time fraud signals, live inventory updates, operational dashboards driving immediate decisions), you need streaming infrastructure. Be honest about whether your use case requires this. Most teams that think they need real-time actually need fresh-enough, which is hourly batch.

Data must stay within your own VPC. Fivetran and most managed ETL tools process data through their own infrastructure. If your compliance requirements mandate that raw data never leaves your environment, the managed tool options narrow significantly. Airbyte self-hosted is the most practical path if you still want a managed-tool framework. Otherwise, fully custom is the only compliant option.

Analytics Layer: The Layer Teams Skip

The analytics layer — transforming raw warehouse data into clean, tested, documented models that analysts actually trust — is the most consistently underinvested piece of the data stack. Teams spend money on ETL, load data into the warehouse, and then give analysts access to raw tables. The analysts query the raw tables, find inconsistencies, and stop trusting the data. The business goes back to Excel spreadsheets.

dbt (data build tool) is the standard solution. It allows you to define data models in SQL, test them automatically, document them with lineage graphs, and deploy them on a schedule. A dbt project for a mid-size company with 5–10 source systems, 50–200 models, and a CI/CD pipeline typically costs $15K–40K to build well. That is the cost of building the models, writing the tests, documenting the lineage, and setting up the deployment pipeline.

The ongoing maintenance cost is relatively low — dbt models are SQL, and SQL is readable. Changes to source systems need to be propagated into model definitions, and new data needs are added over time, but the operational complexity is far lower than batch ETL or streaming infrastructure.

A Note on Data Warehouse Costs

The cost of the data warehouse itself is often overlooked in pipeline discussions. BigQuery charges for storage ($20/TB/month) and query compute ($5/TB scanned). At small scale — a few GBs of data and moderate query volumes — BigQuery costs almost nothing. A team generating 10–50GB of warehouse data typically spends $20–100/month on BigQuery.
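
A rough worked example under those list prices (the storage size, scan volume, and prices are assumptions to swap for your own numbers):

```python
# Illustrative monthly BigQuery estimate at small scale, using the list prices above.
storage_tb = 0.05            # ~50 GB of warehouse data
tb_scanned_per_month = 5.0   # e.g. ~500 queries averaging 10 GB scanned each
monthly_cost = storage_tb * 20 + tb_scanned_per_month * 5   # storage $/TB/month + scan $/TB
print(f"~${monthly_cost:.0f}/month")   # about $26/month; scans, not storage, dominate
```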

Snowflake uses a credit-based model. A small single-warehouse deployment running 8 hours per day costs roughly $200–500/month. Snowflake becomes cost-competitive with BigQuery at higher data volumes and concurrent query loads, where BigQuery's per-query pricing can escalate quickly if analysts are running expensive scans without query guardrails.

Neither warehouse is dramatically more expensive than the other at typical mid-market scale. The choice is usually driven by technical preferences (SQL dialect, ecosystem integrations) and where your team's existing expertise lies.

Frequently Asked Questions

What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading it into the destination. ELT (Extract, Load, Transform) loads raw data into the destination first, then transforms it there. In practice, ELT is the dominant pattern with modern cloud data warehouses like BigQuery and Snowflake — their compute is cheap enough that transforming in-warehouse is faster and more flexible. Fivetran and Airbyte both use ELT: they load raw data, and dbt handles the in-warehouse transformations. The older ETL pattern is still common when data must be cleaned or anonymized before it lands in the warehouse — for compliance reasons, for example — or when destination storage costs make loading raw data expensive.

When does Fivetran make more sense than a custom pipeline?

Fivetran makes sense when all or most of your source systems have existing connectors, you do not need sub-hourly sync frequency, and your team does not have data engineering resources to build and maintain custom connectors. The monthly cost ($500–5K) is almost always cheaper than the engineering time required to build equivalent connectors. Custom pipelines beat Fivetran when your source system has no connector and the data is proprietary, you need real-time sync, or your data must stay within your own VPC for compliance reasons.

How much does a real-time data pipeline cost to build?

Real-time streaming pipelines typically cost $30K–100K to build depending on throughput, source count, and in-stream transform complexity. A minimum viable real-time pipeline — a Kafka cluster with basic event routing — can be built for $30K–50K in 6–10 weeks. Pipelines with stateful aggregations, complex event joins, or in-stream enrichment push toward the top of that range and beyond, roughly $80K–150K. Ongoing infrastructure and maintenance typically runs $2K–5K/month, compared to $500–2K/month for an equivalent batch pipeline.

Get an Estimate for Your Data Pipeline Project

If you are trying to figure out whether a managed tool or custom build fits your situation — or whether what you need is batch, streaming, or an analytics layer — the fastest path to clarity is a short conversation. Bring the details of your sources, destinations, and latency requirements, and I will tell you what approach makes sense and what it will realistically cost.

Get a free estimate for your data pipeline project →

Free: Data Pipeline Scoping Worksheet

A structured worksheet to define your sources, destination, latency requirements, and data quality expectations before talking to a data engineer. Prevents the most common scoping mistakes.

Related Service

AI Ops Sprint: Data Pipelines in 2 Weeks

Fixed-scope, fixed-price data engineering sprints. We scope your pipeline requirements, build the connectors and transform logic, and hand it off running with documentation.

Learn more →

Related Posts

Python Automation Services: When Scripts Beat Platforms

When Python scripts are the right tool and what they cost to build and run.

Custom Software Development Cost: 2026 Breakdown

What drives custom software costs and how to scope before getting quotes.

Internal Tool Development Cost: Admin Panels, Dashboards, and Ops Tools

Build vs buy for internal tools, with real cost ranges by tool type.


Evgeny Goncharov

Founder, TechConcepts

I build automation tools and custom software for businesses. Previously at a major search platform and Big 4 Advisory. Based in Madrid.


Want a real number for your data pipeline project?

15 minutes. Tell me your sources, destinations, and latency requirements. I’ll tell you whether managed tools cover it and what a build would cost.

Book a Free Call