May 15, 2025
Automating Data Pipelines with Data Sync
In the Getting Started tutorial, you connected a single data source. Now it is time to go further — build automated pipelines that transform, schedule, and monitor data flows across multiple sources.
What You Will Learn
- How to chain multiple sync jobs into a pipeline
- How to add transformations between source and destination
- How to set up error handling and retry logic
- How to monitor pipeline health with alerts
Prerequisites
- Two or more data sources connected in Data Sync
- Completed the Getting Started with Data Sync tutorial
Step 1: Create a Pipeline
In Data Sync, click Pipelines > + New Pipeline. A pipeline is a sequence of sync jobs that run in order, with optional transformations between steps.
Name your pipeline — for example, "Daily Sales ETL" — and add your first step by selecting a connector.
Step 2: Add Transformations
Between the source and Data Hub, you can add transformation steps:
- Field mapping — rename columns to match your data model
- Filtering — exclude records that do not meet criteria
- Enrichment — join data from a second source (e.g., add customer names to order IDs)
- Aggregation — pre-compute sums, averages, or counts
- Deduplication — remove duplicate records based on key fields
Pipeline: Daily Sales ETL
──────────────────────
Step 1: Pull orders from PostgreSQL
Step 2: Filter where status = "completed"
Step 3: Enrich with customer data from Salesforce
Step 4: Aggregate daily revenue by region
Step 5: Load into Data Hub
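The middle steps of the pipeline above can be sketched as plain Python functions chained together. The record fields (status, customer_id, region, amount) and the sample data are illustrative assumptions for this sketch, not the connectors' actual schemas, and loading into Data Hub is left out.

```python
# Sketch of the Daily Sales ETL transformation steps as plain functions.
# Field names (status, customer_id, region, amount) are assumed for the
# example; a real pipeline would take them from the connector schemas.

def filter_completed(orders):
    """Step 2: keep only completed orders."""
    return [o for o in orders if o["status"] == "completed"]

def enrich_with_customers(orders, customers):
    """Step 3: join customer names onto orders by customer_id."""
    by_id = {c["id"]: c["name"] for c in customers}
    return [{**o, "customer_name": by_id.get(o["customer_id"])} for o in orders]

def aggregate_revenue_by_region(orders):
    """Step 4: sum order amounts per region."""
    totals = {}
    for o in orders:
        totals[o["region"]] = totals.get(o["region"], 0) + o["amount"]
    return totals

orders = [
    {"id": 1, "status": "completed", "customer_id": 7, "region": "EU", "amount": 120},
    {"id": 2, "status": "pending",   "customer_id": 8, "region": "US", "amount": 80},
    {"id": 3, "status": "completed", "customer_id": 7, "region": "EU", "amount": 30},
]
customers = [{"id": 7, "name": "Acme"}, {"id": 8, "name": "Globex"}]

completed = filter_completed(orders)
enriched = enrich_with_customers(completed, customers)
print(aggregate_revenue_by_region(enriched))  # {'EU': 150}
```

Notice that each step takes records in and returns records out; that shape is what lets the pipeline insert, reorder, or remove transformations without touching the source or destination.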
Step 3: Schedule the Pipeline
Click Schedule to set when the pipeline runs:
- Cron expression — for precise control (e.g., 0 6 * * * for 6 AM daily)
- Simple scheduler — every N hours/minutes
- Event-triggered — run when new data arrives in a source
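To make the cron option concrete, here is a minimal matcher for five-field cron expressions. It supports only "*" and plain numbers — enough to read 0 6 * * * (minute 0, hour 6, any day) — and is an illustration, not Data Sync's scheduler.

```python
from datetime import datetime

def cron_matches(expr, dt):
    """Return True if dt matches a 5-field cron expression
    (minute, hour, day-of-month, month, day-of-week).
    Supports only '*' and plain numbers."""
    fields = expr.split()
    # Cron counts Sunday as 0; isoweekday() counts it as 7.
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

print(cron_matches("0 6 * * *", datetime(2025, 5, 15, 6, 0)))   # True
print(cron_matches("0 6 * * *", datetime(2025, 5, 15, 18, 0)))  # False
```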
Tip: For pipelines that depend on each other, use Pipeline Chaining — Pipeline B starts automatically when Pipeline A completes successfully.
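The chaining behavior amounts to a simple rule: run pipelines in order, and stop the chain as soon as one fails. A sketch with hypothetical stand-ins for Pipeline A and Pipeline B:

```python
def run_chain(pipelines):
    """Run pipelines in order; stop the chain if one fails.
    Each entry is (name, run), where run is a zero-argument
    callable returning True on success."""
    for name, run in pipelines:
        if not run():
            print(f"{name} failed; downstream pipelines skipped")
            return False
        print(f"{name} succeeded")
    return True

# Hypothetical stand-ins for Pipeline A and Pipeline B:
run_chain([("Pipeline A", lambda: True), ("Pipeline B", lambda: True)])
```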
Step 4: Configure Error Handling
Pipelines can fail — network timeouts, schema changes, rate limits. Configure resilience:
- Retry policy: 3 attempts with exponential backoff
- Partial failure: continue processing remaining records if some fail
- Dead letter queue: failed records are saved for manual review
- Alert on failure: email or Slack notification when a pipeline step fails
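The first three settings above fit together like this sketch: retry each record with exponential backoff, keep going past individual failures, and park records that never succeed in a dead-letter list. The sync_one function is a hypothetical per-record sync call, not a Data Sync API.

```python
import time

def sync_with_retries(records, sync_one, max_attempts=3, base_delay=1.0):
    """Process records with per-record retries, exponential backoff,
    and a dead-letter queue for records that still fail."""
    dead_letter = []
    for record in records:
        for attempt in range(max_attempts):
            try:
                sync_one(record)
                break  # success: move on to the next record
            except Exception:
                if attempt == max_attempts - 1:
                    dead_letter.append(record)  # save for manual review
                else:
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s...
    return dead_letter

def sync_one(record):
    """Hypothetical sync call that rejects one record."""
    if record == "bad":
        raise RuntimeError("schema mismatch")

print(sync_with_retries(["ok", "bad", "ok"], sync_one, base_delay=0))
# ['bad']
```

The key design point is that a failure of one record never aborts the whole run; the pipeline finishes, and the dead-letter queue tells you exactly what needs attention.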
Step 5: Monitor Pipeline Health
The Pipeline Dashboard shows:
- Run history with success/failure status
- Records processed per run
- Average execution time and trends
- Error rates and common failure reasons
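The dashboard metrics above reduce to simple arithmetic over the run history. A sketch, assuming each run is a dict with status, records, and seconds keys (an assumed shape, not the dashboard's actual data model):

```python
def pipeline_health(runs):
    """Summarize a run history: success rate, total records
    processed, and average execution time."""
    total = len(runs)
    ok = sum(1 for r in runs if r["status"] == "success")
    return {
        "success_rate": ok / total,
        "records_processed": sum(r["records"] for r in runs),
        "avg_seconds": sum(r["seconds"] for r in runs) / total,
    }

runs = [
    {"status": "success", "records": 1000, "seconds": 42.0},
    {"status": "success", "records": 1200, "seconds": 48.0},
    {"status": "failure", "records": 0,    "seconds": 5.0},
]
print(pipeline_health(runs))
```

A monitoring agent would evaluate a summary like this on a schedule and alert when success_rate drops or avg_seconds trends upward.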
Set up an Agent Gateway agent to monitor pipeline health and alert you when performance degrades or error rates spike.
Conclusion
You now have robust, automated data pipelines that transform data, run on schedule, and recover from failures. This is the foundation for reliable data intelligence — every dashboard, AI query, and agent depends on clean, timely data flowing through your pipelines.