Azure Data Factory Guide
Build data integration pipelines with Azure Data Factory: linked services, datasets, activities, Mapping Data Flows, triggers, and SSIS.
Prerequisites
- Basic understanding of ETL/ELT concepts
- Azure account with Data Factory permissions
Introduction to Azure Data Factory
Azure Data Factory (ADF) is a fully managed, serverless data integration service for building ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines at scale. It provides a visual interface for orchestrating data movement and transformation across more than 90 built-in connectors, including Azure services, on-premises databases, SaaS applications, and third-party cloud platforms. ADF handles the infrastructure for data pipelines so you can focus on the business logic of data transformation.
ADF pipelines consist of activities (individual processing steps), datasets (references to data stores), linked services (connection configurations), and triggers (schedules and events that start pipeline runs). For complex data transformations, ADF integrates with Mapping Data Flows (visual, code-free transformations executed on Apache Spark), Azure Databricks, Azure HDInsight, and custom Azure Batch activities.
This guide covers the complete ADF workflow: creating a data factory, configuring linked services and datasets, building pipelines with activities, implementing data flows for transformations, setting up triggers for scheduling, using the self-hosted integration runtime for on-premises connectivity, integrating with SSIS packages, monitoring pipeline execution, and cost optimization.
ADF Pricing
ADF pricing has several components. Pipeline orchestration and execution are charged per activity run ($1 per 1,000 activity runs for cloud, $1.50 for self-hosted). Data movement is charged per Data Integration Unit (DIU) hour (starting at $0.25/DIU-hour). Mapping Data Flows are charged per vCore-hour ($0.274 for general purpose compute). The SSIS integration runtime is charged per vCore-hour. There is no charge for idle data factories, pipelines, or triggers. You pay only when pipelines execute.
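As a rough worked example using the list prices above (rates vary by region and change over time, so treat the workload numbers as illustrative assumptions, not a quote), a hypothetical pipeline with 5 cloud activity runs, 0.5 DIU-hours of copy, and 1 general-purpose vCore-hour of data flow per day comes to about $12 per month:

```shell
# Monthly estimate: 30 days of a small daily pipeline (assumed workload numbers)
awk 'BEGIN {
  runs_cost = (5 * 30 / 1000) * 1.00   # $1 per 1,000 cloud activity runs
  copy_cost = (0.5 * 30) * 0.25        # $0.25 per DIU-hour of data movement
  flow_cost = (1 * 30) * 0.274         # $0.274 per vCore-hour of data flow
  printf "$%.2f/month\n", runs_cost + copy_cost + flow_cost
}'
# prints $12.12/month
```

Note that data flow cost dominates even at one vCore-hour per day, which is why the TTL and compute-size tuning discussed later matters.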
Creating a Data Factory
A Data Factory instance is the top-level resource that contains all your pipelines, datasets, linked services, and data flows. Each factory has a unique name and can be connected to a Git repository for version control and collaborative development.
# Create a resource group
az group create --name rg-data-factory --location eastus
# Create the Data Factory
az datafactory create \
--factory-name production-adf \
--resource-group rg-data-factory \
--location eastus
# Configure Git integration (recommended)
az datafactory configure-factory-repo \
--factory-name production-adf \
--resource-group rg-data-factory \
--location eastus \
--factory-git-hub-configuration '{
"accountName": "myorg",
"repositoryName": "adf-pipelines",
"collaborationBranch": "main",
"rootFolder": "/adf",
"type": "FactoryGitHubConfiguration"
}'
# List data factories
az datafactory list \
--resource-group rg-data-factory \
--query '[].{Name: name, Location: location, State: provisioningState}' \
--output table
Linked Services and Datasets
Linked services define the connection information for data stores and compute resources. They are similar to connection strings: they specify where data lives (endpoint URL, server name) and how to authenticate (managed identity, connection string, key vault reference). Datasets represent the structure of data within a linked service: which table, file, or container to access and how to interpret the data format.
# Create a linked service for Azure Blob Storage
az datafactory linked-service create \
--factory-name production-adf \
--resource-group rg-data-factory \
--linked-service-name ls-blob-storage \
--properties '{
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "ls-keyvault",
"type": "LinkedServiceReference"
},
"secretName": "blob-storage-connection-string"
}
}
}'
# Create a linked service for Azure SQL Database
az datafactory linked-service create \
--factory-name production-adf \
--resource-group rg-data-factory \
--linked-service-name ls-sql-database \
--properties '{
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=analyticsdb;",
"authenticationType": "ManagedIdentity"
}
}'
# Create a linked service for on-premises SQL Server
az datafactory linked-service create \
--factory-name production-adf \
--resource-group rg-data-factory \
--linked-service-name ls-onprem-sql \
--properties '{
"type": "SqlServer",
"typeProperties": {
"connectionString": "Server=onprem-db.internal;Database=legacy_erp;Integrated Security=False;",
"userName": "etl_user",
"password": {
"type": "AzureKeyVaultSecret",
"store": {"referenceName": "ls-keyvault", "type": "LinkedServiceReference"},
"secretName": "onprem-sql-password"
}
},
"connectVia": {
"referenceName": "self-hosted-ir",
"type": "IntegrationRuntimeReference"
}
}'
# Create a dataset for a CSV file in Blob Storage
az datafactory dataset create \
--factory-name production-adf \
--resource-group rg-data-factory \
--dataset-name ds-sales-csv \
--properties '{
"type": "DelimitedText",
"linkedServiceName": {"referenceName": "ls-blob-storage", "type": "LinkedServiceReference"},
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "raw-data",
"folderPath": "sales",
"fileName": "*.csv"
},
"columnDelimiter": ",",
"firstRowAsHeader": true,
"quoteChar": "\""
},
"schema": []
}'
# Create a dataset for an Azure SQL table
az datafactory dataset create \
--factory-name production-adf \
--resource-group rg-data-factory \
--dataset-name ds-sales-table \
--properties '{
"type": "AzureSqlTable",
"linkedServiceName": {"referenceName": "ls-sql-database", "type": "LinkedServiceReference"},
"typeProperties": {
"schema": "dbo",
"table": "SalesFact"
}
}'
Use Managed Identity Authentication
Wherever possible, use Azure Managed Identity for linked service authentication instead of connection strings or passwords. Managed Identity eliminates the need to manage credentials: ADF authenticates to Azure services using its system-assigned or user-assigned managed identity. Grant the identity the appropriate RBAC roles (e.g., Storage Blob Data Contributor for Blob Storage, SQL DB Contributor for SQL Database). This is more secure and easier to maintain than storing secrets in Key Vault.
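For example, granting the factory's system-assigned identity read/write access to a Blob Storage account can be scripted; a sketch using the resource names from this guide (the storage account scope is a placeholder you would substitute):

```shell
# Look up the factory's system-assigned managed identity
ADF_PRINCIPAL_ID=$(az datafactory show \
  --factory-name production-adf \
  --resource-group rg-data-factory \
  --query identity.principalId --output tsv)

# Grant it blob read/write on the target storage account (hypothetical scope)
az role assignment create \
  --assignee-object-id "$ADF_PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-data-factory/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```

Once the role assignment propagates, the blob linked service can drop its Key Vault connection string entirely.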
Building Pipelines
A pipeline is a logical grouping of activities that together perform a data processing task. Activities are the processing steps within a pipeline and come in three categories: data movement activities (Copy Activity), data transformation activities (Data Flow, Databricks, HDInsight, Stored Procedure), and control activities (ForEach, If Condition, Switch, Wait, Execute Pipeline, Web).
# Create a simple copy pipeline
az datafactory pipeline create \
--factory-name production-adf \
--resource-group rg-data-factory \
--name pl-sales-etl \
--pipeline '{
"activities": [
{
"name": "CopySalesData",
"type": "Copy",
"inputs": [{"referenceName": "ds-sales-csv", "type": "DatasetReference"}],
"outputs": [{"referenceName": "ds-sales-table", "type": "DatasetReference"}],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {"type": "AzureBlobStorageReadSettings", "recursive": true, "wildcardFileName": "*.csv"}
},
"sink": {
"type": "AzureSqlSink",
"writeBehavior": "upsert",
"upsertSettings": {"useTempDB": true, "keys": ["OrderId"]},
"preCopyScript": "TRUNCATE TABLE staging.RawSales"
},
"enableStaging": false,
"parallelCopies": 4
}
},
{
"name": "RunTransformationProcedure",
"type": "SqlServerStoredProcedure",
"dependsOn": [{"activity": "CopySalesData", "dependencyConditions": ["Succeeded"]}],
"linkedServiceName": {"referenceName": "ls-sql-database", "type": "LinkedServiceReference"},
"typeProperties": {
"storedProcedureName": "dbo.usp_TransformSales",
"storedProcedureParameters": {
"ProcessDate": {"value": {"value": "@pipeline().parameters.processDate", "type": "Expression"}, "type": "String"}
}
}
},
{
"name": "SendCompletionEmail",
"type": "WebActivity",
"dependsOn": [{"activity": "RunTransformationProcedure", "dependencyConditions": ["Succeeded"]}],
"typeProperties": {
"url": "https://prod-logic-app.azurewebsites.net/triggers/manual/invoke?api-version=2022-05-01",
"method": "POST",
"body": {
"subject": "ETL Pipeline Completed",
"message": "Sales ETL pipeline completed successfully for @{pipeline().parameters.processDate}"
}
}
}
],
"parameters": {
"processDate": {"type": "String", "defaultValue": "2026-03-14"}
}
}'
# Trigger a manual pipeline run
az datafactory pipeline create-run \
--factory-name production-adf \
--resource-group rg-data-factory \
--name pl-sales-etl \
--parameters '{"processDate": "2026-03-14"}'
Mapping Data Flows
Mapping Data Flows provide a visual, code-free interface for building complex data transformations that execute on a managed Apache Spark cluster. Data flows support common transformation patterns including joins, aggregations, pivot/unpivot, window functions, derived columns, conditional splits, and schema drift handling. The Spark cluster is provisioned on-demand when the data flow runs and deallocated when it completes.
{
"name": "df-sales-transform",
"properties": {
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"dataset": {"referenceName": "ds-sales-csv", "type": "DatasetReference"},
"name": "salesSource"
},
{
"dataset": {"referenceName": "ds-products", "type": "DatasetReference"},
"name": "productsSource"
}
],
"transformations": [
{
"name": "filterValidSales",
"description": "Remove records with null amounts or dates"
},
{
"name": "joinProducts",
"description": "Enrich sales with product details"
},
{
"name": "calculateMetrics",
"description": "Add derived columns for revenue and margin"
},
{
"name": "aggregateByRegion",
"description": "Aggregate sales by region and product category"
}
],
"sinks": [
{
"dataset": {"referenceName": "ds-sales-summary", "type": "DatasetReference"},
"name": "summaryOutput"
}
],
"script": "source(output(OrderId as string, ProductId as string, Quantity as integer, UnitPrice as decimal(10,2), OrderDate as date, Region as string), allowSchemaDrift: true) ~> salesSource\nsource(output(ProductId as string, ProductName as string, Category as string, Cost as decimal(10,2)), allowSchemaDrift: false) ~> productsSource\nsalesSource filter(!isNull(UnitPrice) && !isNull(OrderDate) && Quantity > 0) ~> filterValidSales\nfilterValidSales, productsSource join(salesSource@ProductId == productsSource@ProductId, joinType:'left', matchType:'exact') ~> joinProducts\njoinProducts derive(Revenue = Quantity * UnitPrice, Margin = (UnitPrice - Cost) * Quantity, MarginPct = round((UnitPrice - Cost) / UnitPrice * 100, 2)) ~> calculateMetrics\ncalculateMetrics aggregate(groupBy(Region, Category, MonthYear = format(OrderDate, 'yyyy-MM')), TotalRevenue = sum(Revenue), TotalMargin = sum(Margin), OrderCount = count(), AvgOrderValue = avg(Revenue)) ~> aggregateByRegion\naggregateByRegion sink(allowSchemaDrift: true, validateSchema: false) ~> summaryOutput"
}
}
}
Data Flow Performance Tuning
Mapping Data Flows spin up a Spark cluster on each run, with a default startup time of 4-5 minutes. To reduce this, enable TTL (Time to Live) on the Azure Integration Runtime, which keeps the cluster warm between runs. Choose the appropriate compute size: General Purpose (4 cores) for small datasets, Memory Optimized for large joins and aggregations, and Compute Optimized for CPU-intensive transformations. Monitor Data Flow debug sessions carefully as they incur compute charges even when idle.
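TTL is configured on a custom Azure Integration Runtime (the default AutoResolve runtime does not keep clusters warm). A sketch, assuming your CLI version accepts data flow properties inside `--compute-properties` as in the ARM schema, with a hypothetical runtime name:

```shell
# Custom Azure IR whose data flow cluster stays warm for 15 minutes between runs
az datafactory integration-runtime managed create \
  --factory-name production-adf \
  --resource-group rg-data-factory \
  --integration-runtime-name dataflow-ir-warm \
  --compute-properties '{
    "location": "eastus",
    "dataFlowProperties": {
      "computeType": "General",
      "coreCount": 8,
      "timeToLive": 15
    }
  }'
```

Reference this runtime from the data flow activity instead of AutoResolveIntegrationRuntime; you pay for the warm window, so size the TTL to your run cadence.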
Triggers and Scheduling
Triggers determine when pipeline runs are initiated. ADF supports three trigger types: Schedule triggers (cron-based schedules), Tumbling window triggers (fixed-size, non-overlapping time intervals ideal for batch processing), and Event-based triggers (fired by Blob Storage events when files are created or deleted).
# Create a schedule trigger (runs daily at 2 AM UTC)
az datafactory trigger create \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-daily-etl \
--properties '{
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2026-03-14T02:00:00Z",
"timeZone": "UTC",
"schedule": {
"hours": [2],
"minutes": [0]
}
}
},
"pipelines": [
{
"pipelineReference": {"referenceName": "pl-sales-etl", "type": "PipelineReference"},
"parameters": {"processDate": "@trigger().scheduledTime"}
}
]
}'
# Create a tumbling window trigger
az datafactory trigger create \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-hourly-aggregation \
--properties '{
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 1,
"startTime": "2026-03-14T00:00:00Z",
"delay": "00:15:00",
"maxConcurrency": 3,
"retryPolicy": {"count": 3, "intervalInSeconds": 300}
},
"pipeline": {
"pipelineReference": {"referenceName": "pl-hourly-aggregation", "type": "PipelineReference"},
"parameters": {"windowStart": "@trigger().outputs.windowStartTime", "windowEnd": "@trigger().outputs.windowEndTime"}
}
}'
# Create an event-based trigger (fires when a blob is created)
az datafactory trigger create \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-new-file-arrived \
--properties '{
"type": "BlobEventsTrigger",
"typeProperties": {
"blobPathBeginsWith": "/raw-data/blobs/incoming/",
"blobPathEndsWith": ".csv",
"ignoreEmptyBlobs": true,
"scope": "/subscriptions/<sub-id>/resourceGroups/rg-data-factory/providers/Microsoft.Storage/storageAccounts/batchdatastorage",
"events": ["Microsoft.Storage.BlobCreated"]
},
"pipelines": [
{
"pipelineReference": {"referenceName": "pl-process-new-file", "type": "PipelineReference"},
"parameters": {"fileName": "@triggerBody().fileName", "folderPath": "@triggerBody().folderPath"}
}
]
}'
# Start a trigger
az datafactory trigger start \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-daily-etl
Self-Hosted Integration Runtime
The Self-Hosted Integration Runtime (SHIR) enables ADF to connect to on-premises data sources and resources behind corporate firewalls. You install the SHIR agent on a Windows machine in your network, and it creates outbound HTTPS connections to ADF. No inbound firewall rules are required. SHIR supports high availability with multiple nodes and auto-update.
# Create a self-hosted integration runtime
az datafactory integration-runtime self-hosted create \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name self-hosted-ir \
--description "On-premises data access"
# Get the authentication keys (used to register nodes)
az datafactory integration-runtime list-auth-key \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name self-hosted-ir
# After installing the SHIR on a Windows machine:
# 1. Download from https://www.microsoft.com/download/details.aspx?id=39717
# 2. Install and register with the auth key
# 3. Verify node status
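The registration step above can also be scripted on the SHIR machine itself using the `dmgcmd` utility that ships with the installer; a sketch, assuming the default install path (the version segment in the path may differ on your machine):

```shell
# Run from an elevated prompt on the SHIR node; <auth-key> comes from list-auth-key above
"C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe" -RegisterNewNode "<auth-key>"
```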
# Check integration runtime status
az datafactory integration-runtime get-status \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name self-hosted-ir
SSIS Integration
ADF can execute existing SQL Server Integration Services (SSIS) packages without rewriting them. The Azure-SSIS Integration Runtime provisions a cluster of Azure VMs that run the SSIS engine, executing your packages with the same behavior as on-premises SSIS. This enables lift-and-shift migration of SSIS workloads to Azure.
# Create an Azure-SSIS integration runtime
az datafactory integration-runtime managed create \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name azure-ssis-ir \
--type Managed \
--description "Azure-SSIS IR for legacy ETL packages" \
--compute-properties '{
"location": "eastus",
"nodeSize": "Standard_D4_v3",
"numberOfNodes": 2,
"maxParallelExecutionsPerNode": 4
}' \
--ssis-properties '{
"catalogInfo": {
"catalogServerEndpoint": "myserver.database.windows.net",
"catalogDatabaseName": "SSISDB",
"catalogAdminUserName": "sqladmin",
"catalogAdminPassword": {"type": "SecureString", "value": "<catalog-admin-password>"},
"catalogPricingTier": "S1"
}
}'
# Start the Azure-SSIS IR (takes ~30 minutes)
az datafactory integration-runtime start \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name azure-ssis-ir
Monitoring and Troubleshooting
ADF provides comprehensive monitoring through the Azure portal, Azure Monitor, and diagnostic logs. The Monitor hub in the ADF portal shows pipeline runs, activity runs, trigger runs, and integration runtime status with filtering and drill-down capabilities.
# List recent pipeline runs
az datafactory pipeline-run query-by-factory \
--factory-name production-adf \
--resource-group rg-data-factory \
--last-updated-after "2026-03-14T00:00:00Z" \
--last-updated-before "2026-03-15T00:00:00Z"
# Get details of a specific pipeline run
az datafactory pipeline-run show \
--factory-name production-adf \
--resource-group rg-data-factory \
--run-id <run-id>
# List activity runs for a pipeline run
az datafactory activity-run query-by-pipeline-run \
--factory-name production-adf \
--resource-group rg-data-factory \
--run-id <pipeline-run-id> \
--last-updated-after "2026-03-14T00:00:00Z" \
--last-updated-before "2026-03-15T00:00:00Z"
# Enable diagnostic logging
az monitor diagnostic-settings create \
--name adf-diagnostics \
--resource /subscriptions/<sub-id>/resourceGroups/rg-data-factory/providers/Microsoft.DataFactory/factories/production-adf \
--workspace /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/log-analytics \
--logs '[{"category": "PipelineRuns", "enabled": true}, {"category": "ActivityRuns", "enabled": true}, {"category": "TriggerRuns", "enabled": true}]'
CI/CD Best Practices
Use Git integration for version control and collaborative development. Develop and test pipelines in a development factory, then use ADF's built-in CI/CD support to publish to staging and production factories via ARM template deployment. Never edit production pipelines directly in the portal. Use parameterized linked services so connection strings differ between environments. Implement automated testing with the ADF SDK to validate pipeline behavior before promotion.
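Promotion to staging or production then becomes an ARM deployment of the templates ADF generates on publish; a sketch, assuming the default file names from the factory's `adf_publish` branch (your parameter file and factory names may differ):

```shell
# Deploy the templates exported from the dev factory's adf_publish branch
az deployment group create \
  --resource-group rg-data-factory \
  --template-file ARMTemplateForFactory.json \
  --parameters ARMTemplateParametersForFactory.json \
  --parameters factoryName=production-adf
```

Stop active triggers before deployment and restart them afterwards so in-flight runs are not orphaned by the update.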
Azure Data Factory is the backbone of modern Azure data platforms. It connects any data source to any destination, transforms data with code-free data flows or custom compute, and orchestrates complex multi-step pipelines. Start with Copy Activities for simple data movement, add Mapping Data Flows for transformations, use triggers for scheduling, and leverage Self-Hosted Integration Runtime for on-premises connectivity. Combine with Azure Synapse Analytics, Databricks, or Azure SQL for a complete data platform.
Key Takeaways
- ADF provides 90+ built-in connectors for Azure services, databases, SaaS apps, and third-party clouds.
- Mapping Data Flows enable code-free Spark-based transformations with visual design.
- Three trigger types support scheduling (cron), tumbling windows (batch intervals), and event-based runs (blob creation).
- Self-Hosted Integration Runtime connects to on-premises data sources without inbound firewall rules.
Frequently Asked Questions
How does ADF pricing work?
You pay per activity run for orchestration, per DIU-hour for data movement, and per vCore-hour for Mapping Data Flows and the SSIS integration runtime. Idle factories, pipelines, and triggers incur no charge.
Can ADF replace SSIS?
ADF runs existing SSIS packages unchanged on the Azure-SSIS Integration Runtime, enabling lift-and-shift migration; new workloads can be built natively with ADF pipelines and data flows.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.