May 15, 2025
Automating Data Pipelines with Data Sync
In the Getting Started tutorial, you connected a single data source. Now it is time to go further — build automated pipelines that transform, schedule, and monitor data flows across multiple sources.
What You Will Learn
- How to chain multiple sync jobs into a pipeline
- How to add transformations between source and destination
- How to set up error handling and retry logic
- How to monitor pipeline health with alerts
Prerequisites
- Two or more data sources connected in Data Sync
- Completed the Getting Started with Data Sync tutorial
Step 1: Create a Pipeline
In Data Sync, click Pipelines > + New Pipeline. A pipeline is a sequence of sync jobs that run in order, with optional transformations between steps.
Name your pipeline — for example, "Daily Sales ETL" — and add your first step by selecting a connector.
Step 2: Add Transformations
Between the source and Data Hub, you can add transformation steps:
- Field mapping — rename columns to match your data model
- Filtering — exclude records that do not meet criteria
- Enrichment — join data from a second source (e.g., add customer names to order IDs)
- Aggregation — pre-compute sums, averages, or counts
- Deduplication — remove duplicate records based on key fields
Pipeline: Daily Sales ETL
──────────────────────
Step 1: Pull orders from PostgreSQL
Step 2: Filter where status = "completed"
Step 3: Enrich with customer data from Salesforce
Step 4: Aggregate daily revenue by region
Step 5: Load into Data Hub
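The middle steps of the pipeline above can be sketched as plain Python functions chained together. The record fields (status, customer_id, region, amount) and the sample data are illustrative assumptions for this sketch, not the connectors' actual schemas, and loading into Data Hub is left out.

```python
# Sketch of the Daily Sales ETL transformation steps as plain functions.
# Field names (status, customer_id, region, amount) are assumed for the
# example; a real pipeline would take them from the connector schemas.

def filter_completed(orders):
    """Step 2: keep only completed orders."""
    return [o for o in orders if o["status"] == "completed"]

def enrich_with_customers(orders, customers):
    """Step 3: join customer names onto orders by customer_id."""
    by_id = {c["id"]: c["name"] for c in customers}
    return [{**o, "customer_name": by_id.get(o["customer_id"])} for o in orders]

def aggregate_revenue_by_region(orders):
    """Step 4: sum order amounts per region."""
    totals = {}
    for o in orders:
        totals[o["region"]] = totals.get(o["region"], 0) + o["amount"]
    return totals

orders = [
    {"id": 1, "status": "completed", "customer_id": 7, "region": "EU", "amount": 120},
    {"id": 2, "status": "pending",   "customer_id": 8, "region": "US", "amount": 80},
    {"id": 3, "status": "completed", "customer_id": 7, "region": "EU", "amount": 30},
]
customers = [{"id": 7, "name": "Acme"}, {"id": 8, "name": "Globex"}]

completed = filter_completed(orders)
enriched = enrich_with_customers(completed, customers)
print(aggregate_revenue_by_region(enriched))  # {'EU': 150}
```

Notice that each step takes records in and returns records out; that shape is what lets the pipeline insert, reorder, or remove transformations without touching the source or destination.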
Step 3: Schedule the Pipeline
Click Schedule to set when the pipeline runs:
- Cron expression — for precise control (e.g., 0 6 * * * for 6 AM daily)
- Simple scheduler — every N hours/minutes
- Event-triggered — run when new data arrives in a source
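To make the cron option concrete, here is a minimal matcher for five-field cron expressions. It supports only "*" and plain numbers — enough to read 0 6 * * * (minute 0, hour 6, any day) — and is an illustration, not Data Sync's scheduler.

```python
from datetime import datetime

def cron_matches(expr, dt):
    """Return True if dt matches a 5-field cron expression
    (minute, hour, day-of-month, month, day-of-week).
    Supports only '*' and plain numbers."""
    fields = expr.split()
    # Cron counts Sunday as 0; isoweekday() counts it as 7.
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

print(cron_matches("0 6 * * *", datetime(2025, 5, 15, 6, 0)))   # True
print(cron_matches("0 6 * * *", datetime(2025, 5, 15, 18, 0)))  # False
```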
Tip: For pipelines that depend on each other, use Pipeline Chaining — Pipeline B starts automatically when Pipeline A completes successfully.
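The chaining behavior amounts to a simple rule: run pipelines in order, and stop the chain as soon as one fails. A sketch with hypothetical stand-ins for Pipeline A and Pipeline B:

```python
def run_chain(pipelines):
    """Run pipelines in order; stop the chain if one fails.
    Each entry is (name, run), where run is a zero-argument
    callable returning True on success."""
    for name, run in pipelines:
        if not run():
            print(f"{name} failed; downstream pipelines skipped")
            return False
        print(f"{name} succeeded")
    return True

# Hypothetical stand-ins for Pipeline A and Pipeline B:
run_chain([("Pipeline A", lambda: True), ("Pipeline B", lambda: True)])
```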
Step 4: Configure Error Handling
Pipelines can fail — network timeouts, schema changes, rate limits. Configure resilience:
- Retry policy: 3 attempts with exponential backoff
- Partial failure: continue processing remaining records if some fail
- Dead letter queue: failed records are saved for manual review
- Alert on failure: email or Slack notification when a pipeline step fails
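The first three settings above fit together like this sketch: retry each record with exponential backoff, keep going past individual failures, and park records that never succeed in a dead-letter list. The sync_one function is a hypothetical per-record sync call, not a Data Sync API.

```python
import time

def sync_with_retries(records, sync_one, max_attempts=3, base_delay=1.0):
    """Process records with per-record retries, exponential backoff,
    and a dead-letter queue for records that still fail."""
    dead_letter = []
    for record in records:
        for attempt in range(max_attempts):
            try:
                sync_one(record)
                break  # success: move on to the next record
            except Exception:
                if attempt == max_attempts - 1:
                    dead_letter.append(record)  # save for manual review
                else:
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s...
    return dead_letter

def sync_one(record):
    """Hypothetical sync call that rejects one record."""
    if record == "bad":
        raise RuntimeError("schema mismatch")

print(sync_with_retries(["ok", "bad", "ok"], sync_one, base_delay=0))
# ['bad']
```

The key design point is that a failure of one record never aborts the whole run; the pipeline finishes, and the dead-letter queue tells you exactly what needs attention.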
Step 5: Monitor Pipeline Health
The Pipeline Dashboard shows:
- Run history with success/failure status
- Records processed per run
- Average execution time and trends
- Error rates and common failure reasons
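The dashboard metrics above reduce to simple arithmetic over the run history. A sketch, assuming each run is a dict with status, records, and seconds keys (an assumed shape, not the dashboard's actual data model):

```python
def pipeline_health(runs):
    """Summarize a run history: success rate, total records
    processed, and average execution time."""
    total = len(runs)
    ok = sum(1 for r in runs if r["status"] == "success")
    return {
        "success_rate": ok / total,
        "records_processed": sum(r["records"] for r in runs),
        "avg_seconds": sum(r["seconds"] for r in runs) / total,
    }

runs = [
    {"status": "success", "records": 1000, "seconds": 42.0},
    {"status": "success", "records": 1200, "seconds": 48.0},
    {"status": "failure", "records": 0,    "seconds": 5.0},
]
print(pipeline_health(runs))
```

A monitoring agent would evaluate a summary like this on a schedule and alert when success_rate drops or avg_seconds trends upward.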
Set up an Agent Gateway agent to monitor pipeline health and alert you when performance degrades or error rates spike.
Conclusion
You now have robust, automated data pipelines that transform data, run on schedule, and recover from failures. This is the foundation for reliable data intelligence — every dashboard, AI query, and agent depends on clean, timely data flowing through your pipelines.