Azure Data Factory Guide
Build data integration pipelines with Azure Data Factory: linked services, datasets, activities, Mapping Data Flows, triggers, and SSIS.
Prerequisites
- Basic understanding of ETL/ELT concepts
- Azure account with Data Factory permissions
Introduction to Azure Data Factory
Azure Data Factory (ADF) is a fully managed, serverless data integration service for building ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines at scale. It provides a visual interface for orchestrating data movement and transformation across more than 90 built-in connectors, including Azure services, on-premises databases, SaaS applications, and third-party cloud platforms. ADF handles the infrastructure for data pipelines so you can focus on the business logic of data transformation.
ADF pipelines consist of activities (individual processing steps), datasets (references to data stores), linked services (connection configurations), and triggers (schedules and events that start pipeline runs). For complex data transformations, ADF integrates with Mapping Data Flows (visual, code-free transformations executed on Apache Spark), Azure Databricks, Azure HDInsight, and custom Azure Batch activities.
This guide covers the complete ADF workflow: creating a data factory, configuring linked services and datasets, building pipelines with activities, implementing data flows for transformations, setting up triggers for scheduling, using the self-hosted integration runtime for on-premises connectivity, integrating with SSIS packages, monitoring pipeline execution, and cost optimization.
ADF Pricing
ADF pricing has several components. Pipeline orchestration and execution are charged per activity run ($1 per 1,000 activity runs for cloud, $1.50 for self-hosted). Data movement is charged per Data Integration Unit (DIU) hour (starting at $0.25/DIU-hour). Mapping Data Flows are charged per vCore-hour ($0.274 for general purpose compute). The SSIS integration runtime is charged per vCore-hour. There is no charge for idle data factories, pipelines, or triggers. You pay only when pipelines execute.
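As a rough worked example using the list prices above (rates vary by region and change over time, so treat the workload numbers as illustrative assumptions, not a quote), a hypothetical pipeline with 5 cloud activity runs, 0.5 DIU-hours of copy, and 1 general-purpose vCore-hour of data flow per day comes to about $12 per month:

```shell
# Monthly estimate: 30 days of a small daily pipeline (assumed workload numbers)
awk 'BEGIN {
  runs_cost = (5 * 30 / 1000) * 1.00   # $1 per 1,000 cloud activity runs
  copy_cost = (0.5 * 30) * 0.25        # $0.25 per DIU-hour of data movement
  flow_cost = (1 * 30) * 0.274         # $0.274 per vCore-hour of data flow
  printf "$%.2f/month\n", runs_cost + copy_cost + flow_cost
}'
# prints $12.12/month
```

Note that data flow cost dominates even at one vCore-hour per day, which is why the TTL and compute-size tuning discussed later matters.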
Creating a Data Factory
A Data Factory instance is the top-level resource that contains all your pipelines, datasets, linked services, and data flows. Each factory has a unique name and can be connected to a Git repository for version control and collaborative development.
# Create a resource group
az group create --name rg-data-factory --location eastus
# Create the Data Factory
az datafactory create \
--factory-name production-adf \
--resource-group rg-data-factory \
--location eastus
# Configure Git integration (recommended)
az datafactory configure-factory-repo \
--factory-name production-adf \
--resource-group rg-data-factory \
--location eastus \
--factory-git-hub-configuration '{
"accountName": "myorg",
"repositoryName": "adf-pipelines",
"collaborationBranch": "main",
"rootFolder": "/adf",
"type": "FactoryGitHubConfiguration"
}'
# List data factories
az datafactory list \
--resource-group rg-data-factory \
--query '[].{Name: name, Location: location, State: provisioningState}' \
--output table
Linked Services and Datasets
Linked services define the connection information for data stores and compute resources. They are similar to connection strings: they specify where data lives (endpoint URL, server name) and how to authenticate (managed identity, connection string, key vault reference). Datasets represent the structure of data within a linked service: which table, file, or container to access and how to interpret the data format.
# Create a linked service for Azure Blob Storage
az datafactory linked-service create \
--factory-name production-adf \
--resource-group rg-data-factory \
--linked-service-name ls-blob-storage \
--properties '{
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "ls-keyvault",
"type": "LinkedServiceReference"
},
"secretName": "blob-storage-connection-string"
}
}
}'
# Create a linked service for Azure SQL Database
az datafactory linked-service create \
--factory-name production-adf \
--resource-group rg-data-factory \
--linked-service-name ls-sql-database \
--properties '{
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=analyticsdb;",
"authenticationType": "ManagedIdentity"
}
}'
# Create a linked service for on-premises SQL Server
az datafactory linked-service create \
--factory-name production-adf \
--resource-group rg-data-factory \
--linked-service-name ls-onprem-sql \
--properties '{
"type": "SqlServer",
"typeProperties": {
"connectionString": "Server=onprem-db.internal;Database=legacy_erp;Integrated Security=False;",
"userName": "etl_user",
"password": {
"type": "AzureKeyVaultSecret",
"store": {"referenceName": "ls-keyvault", "type": "LinkedServiceReference"},
"secretName": "onprem-sql-password"
}
},
"connectVia": {
"referenceName": "self-hosted-ir",
"type": "IntegrationRuntimeReference"
}
}'
# Create a dataset for a CSV file in Blob Storage
az datafactory dataset create \
--factory-name production-adf \
--resource-group rg-data-factory \
--dataset-name ds-sales-csv \
--properties '{
"type": "DelimitedText",
"linkedServiceName": {"referenceName": "ls-blob-storage", "type": "LinkedServiceReference"},
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "raw-data",
"folderPath": "sales",
"fileName": "*.csv"
},
"columnDelimiter": ",",
"firstRowAsHeader": true,
"quoteChar": "\""
},
"schema": []
}'
# Create a dataset for an Azure SQL table
az datafactory dataset create \
--factory-name production-adf \
--resource-group rg-data-factory \
--dataset-name ds-sales-table \
--properties '{
"type": "AzureSqlTable",
"linkedServiceName": {"referenceName": "ls-sql-database", "type": "LinkedServiceReference"},
"typeProperties": {
"schema": "dbo",
"table": "SalesFact"
}
}'
Use Managed Identity Authentication
Wherever possible, use Azure Managed Identity for linked service authentication instead of connection strings or passwords. Managed Identity eliminates the need to manage credentials: ADF authenticates to Azure services using its system-assigned or user-assigned managed identity. Grant the identity the appropriate RBAC roles (e.g., Storage Blob Data Contributor for Blob Storage, SQL DB Contributor for SQL Database). This is more secure and easier to maintain than storing secrets in Key Vault.
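For example, granting the factory's system-assigned identity read/write access to a Blob Storage account can be scripted; a sketch using the resource names from this guide (the storage account scope is a placeholder you would substitute):

```shell
# Look up the factory's system-assigned managed identity
ADF_PRINCIPAL_ID=$(az datafactory show \
  --factory-name production-adf \
  --resource-group rg-data-factory \
  --query identity.principalId --output tsv)

# Grant it blob read/write on the target storage account (hypothetical scope)
az role assignment create \
  --assignee-object-id "$ADF_PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-data-factory/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```

Once the role assignment propagates, the blob linked service can drop its Key Vault connection string entirely.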
Building Pipelines
A pipeline is a logical grouping of activities that together perform a data processing task. Activities are the processing steps within a pipeline and come in three categories: data movement activities (Copy Activity), data transformation activities (Data Flow, Databricks, HDInsight, Stored Procedure), and control activities (ForEach, If Condition, Switch, Wait, Execute Pipeline, Web).
# Create a simple copy pipeline
az datafactory pipeline create \
--factory-name production-adf \
--resource-group rg-data-factory \
--name pl-sales-etl \
--pipeline '{
"activities": [
{
"name": "CopySalesData",
"type": "Copy",
"inputs": [{"referenceName": "ds-sales-csv", "type": "DatasetReference"}],
"outputs": [{"referenceName": "ds-sales-table", "type": "DatasetReference"}],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {"type": "AzureBlobStorageReadSettings", "recursive": true, "wildcardFileName": "*.csv"}
},
"sink": {
"type": "AzureSqlSink",
"writeBehavior": "upsert",
"upsertSettings": {"useTempDB": true, "keys": ["OrderId"]},
"preCopyScript": "TRUNCATE TABLE staging.RawSales"
},
"enableStaging": false,
"parallelCopies": 4
}
},
{
"name": "RunTransformationProcedure",
"type": "SqlServerStoredProcedure",
"dependsOn": [{"activity": "CopySalesData", "dependencyConditions": ["Succeeded"]}],
"linkedServiceName": {"referenceName": "ls-sql-database", "type": "LinkedServiceReference"},
"typeProperties": {
"storedProcedureName": "dbo.usp_TransformSales",
"storedProcedureParameters": {
"ProcessDate": {"value": {"value": "@pipeline().parameters.processDate", "type": "Expression"}, "type": "String"}
}
}
},
{
"name": "SendCompletionEmail",
"type": "WebActivity",
"dependsOn": [{"activity": "RunTransformationProcedure", "dependencyConditions": ["Succeeded"]}],
"typeProperties": {
"url": "https://prod-logic-app.azurewebsites.net/triggers/manual/invoke?api-version=2022-05-01",
"method": "POST",
"body": {
"subject": "ETL Pipeline Completed",
"message": "Sales ETL pipeline completed successfully for @{pipeline().parameters.processDate}"
}
}
}
],
"parameters": {
"processDate": {"type": "String", "defaultValue": "2026-03-14"}
}
}'
# Trigger a manual pipeline run
az datafactory pipeline create-run \
--factory-name production-adf \
--resource-group rg-data-factory \
--name pl-sales-etl \
--parameters '{"processDate": "2026-03-14"}'
Mapping Data Flows
Mapping Data Flows provide a visual, code-free interface for building complex data transformations that execute on a managed Apache Spark cluster. Data flows support common transformation patterns including joins, aggregations, pivot/unpivot, window functions, derived columns, conditional splits, and schema drift handling. The Spark cluster is provisioned on-demand when the data flow runs and deallocated when it completes.
{
"name": "df-sales-transform",
"properties": {
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"dataset": {"referenceName": "ds-sales-csv", "type": "DatasetReference"},
"name": "salesSource"
},
{
"dataset": {"referenceName": "ds-products", "type": "DatasetReference"},
"name": "productsSource"
}
],
"transformations": [
{
"name": "filterValidSales",
"description": "Remove records with null amounts or dates"
},
{
"name": "joinProducts",
"description": "Enrich sales with product details"
},
{
"name": "calculateMetrics",
"description": "Add derived columns for revenue and margin"
},
{
"name": "aggregateByRegion",
"description": "Aggregate sales by region and product category"
}
],
"sinks": [
{
"dataset": {"referenceName": "ds-sales-summary", "type": "DatasetReference"},
"name": "summaryOutput"
}
],
"script": "source(output(OrderId as string, ProductId as string, Quantity as integer, UnitPrice as decimal(10,2), OrderDate as date, Region as string), allowSchemaDrift: true) ~> salesSource\nsource(output(ProductId as string, ProductName as string, Category as string, Cost as decimal(10,2)), allowSchemaDrift: false) ~> productsSource\nsalesSource filter(!isNull(UnitPrice) && !isNull(OrderDate) && Quantity > 0) ~> filterValidSales\nfilterValidSales, productsSource join(salesSource@ProductId == productsSource@ProductId, joinType:'left', matchType:'exact') ~> joinProducts\njoinProducts derive(Revenue = Quantity * UnitPrice, Margin = (UnitPrice - Cost) * Quantity, MarginPct = round((UnitPrice - Cost) / UnitPrice * 100, 2)) ~> calculateMetrics\ncalculateMetrics aggregate(groupBy(Region, Category, MonthYear = format(OrderDate, 'yyyy-MM')), TotalRevenue = sum(Revenue), TotalMargin = sum(Margin), OrderCount = count(), AvgOrderValue = avg(Revenue)) ~> aggregateByRegion\naggregateByRegion sink(allowSchemaDrift: true, validateSchema: false) ~> summaryOutput"
}
}
}
Data Flow Performance Tuning
Mapping Data Flows spin up a Spark cluster on each run, with a default startup time of 4-5 minutes. To reduce this, enable TTL (Time to Live) on the Azure Integration Runtime, which keeps the cluster warm between runs. Choose the appropriate compute size: General Purpose (4 cores) for small datasets, Memory Optimized for large joins and aggregations, and Compute Optimized for CPU-intensive transformations. Monitor Data Flow debug sessions carefully as they incur compute charges even when idle.
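TTL is configured on a custom Azure Integration Runtime (the default AutoResolve runtime does not keep clusters warm). A sketch, assuming your CLI version accepts data flow properties inside `--compute-properties` as in the ARM schema, with a hypothetical runtime name:

```shell
# Custom Azure IR whose data flow cluster stays warm for 15 minutes between runs
az datafactory integration-runtime managed create \
  --factory-name production-adf \
  --resource-group rg-data-factory \
  --integration-runtime-name dataflow-ir-warm \
  --compute-properties '{
    "location": "eastus",
    "dataFlowProperties": {
      "computeType": "General",
      "coreCount": 8,
      "timeToLive": 15
    }
  }'
```

Reference this runtime from the data flow activity instead of AutoResolveIntegrationRuntime; you pay for the warm window, so size the TTL to your run cadence.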
Triggers and Scheduling
Triggers determine when pipeline runs are initiated. ADF supports three trigger types: Schedule triggers (cron-based schedules), Tumbling window triggers (fixed-size, non-overlapping time intervals ideal for batch processing), and Event-based triggers (fired by Blob Storage events when files are created or deleted).
# Create a schedule trigger (runs daily at 2 AM UTC)
az datafactory trigger create \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-daily-etl \
--properties '{
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2026-03-14T02:00:00Z",
"timeZone": "UTC",
"schedule": {
"hours": [2],
"minutes": [0]
}
}
},
"pipelines": [
{
"pipelineReference": {"referenceName": "pl-sales-etl", "type": "PipelineReference"},
"parameters": {"processDate": "@trigger().scheduledTime"}
}
]
}'
# Create a tumbling window trigger
az datafactory trigger create \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-hourly-aggregation \
--properties '{
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 1,
"startTime": "2026-03-14T00:00:00Z",
"delay": "00:15:00",
"maxConcurrency": 3,
"retryPolicy": {"count": 3, "intervalInSeconds": 300}
},
"pipeline": {
"pipelineReference": {"referenceName": "pl-hourly-aggregation", "type": "PipelineReference"},
"parameters": {"windowStart": "@trigger().outputs.windowStartTime", "windowEnd": "@trigger().outputs.windowEndTime"}
}
}'
# Create an event-based trigger (fires when a blob is created)
az datafactory trigger create \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-new-file-arrived \
--properties '{
"type": "BlobEventsTrigger",
"typeProperties": {
"blobPathBeginsWith": "/raw-data/blobs/incoming/",
"blobPathEndsWith": ".csv",
"ignoreEmptyBlobs": true,
"scope": "/subscriptions/<sub-id>/resourceGroups/rg-data-factory/providers/Microsoft.Storage/storageAccounts/batchdatastorage",
"events": ["Microsoft.Storage.BlobCreated"]
},
"pipelines": [
{
"pipelineReference": {"referenceName": "pl-process-new-file", "type": "PipelineReference"},
"parameters": {"fileName": "@triggerBody().fileName", "folderPath": "@triggerBody().folderPath"}
}
]
}'
# Start a trigger
az datafactory trigger start \
--factory-name production-adf \
--resource-group rg-data-factory \
--trigger-name trg-daily-etl
Self-Hosted Integration Runtime
The Self-Hosted Integration Runtime (SHIR) enables ADF to connect to on-premises data sources and resources behind corporate firewalls. You install the SHIR agent on a Windows machine in your network, and it creates outbound HTTPS connections to ADF. No inbound firewall rules are required. SHIR supports high availability with multiple nodes and auto-update.
# Create a self-hosted integration runtime
az datafactory integration-runtime self-hosted create \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name self-hosted-ir \
--description "On-premises data access"
# Get the authentication keys (used to register nodes)
az datafactory integration-runtime list-auth-key \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name self-hosted-ir
# After installing the SHIR on a Windows machine:
# 1. Download from https://www.microsoft.com/download/details.aspx?id=39717
# 2. Install and register with the auth key
# 3. Verify node status
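The registration step above can also be scripted on the SHIR machine itself using the `dmgcmd` utility that ships with the installer; a sketch, assuming the default install path (the version segment in the path may differ on your machine):

```shell
# Run from an elevated prompt on the SHIR node; <auth-key> comes from list-auth-key above
"C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe" -RegisterNewNode "<auth-key>"
```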
# Check integration runtime status
az datafactory integration-runtime get-status \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name self-hosted-ir
SSIS Integration
ADF can execute existing SQL Server Integration Services (SSIS) packages without rewriting them. The Azure-SSIS Integration Runtime provisions a cluster of Azure VMs that run the SSIS engine, executing your packages with the same behavior as on-premises SSIS. This enables lift-and-shift migration of SSIS workloads to Azure.
# Create an Azure-SSIS integration runtime
az datafactory integration-runtime managed create \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name azure-ssis-ir \
--type Managed \
--description "Azure-SSIS IR for legacy ETL packages" \
--compute-properties '{
"location": "eastus",
"nodeSize": "Standard_D4_v3",
"numberOfNodes": 2,
"maxParallelExecutionsPerNode": 4
}' \
--ssis-properties '{
"catalogInfo": {
"catalogServerEndpoint": "myserver.database.windows.net",
"catalogDatabaseName": "SSISDB",
"catalogAdminUserName": "sqladmin",
"catalogAdminPassword": {"type": "SecureString", "value": "<catalog-admin-password>"},
"catalogPricingTier": "S1"
}
}'
# Start the Azure-SSIS IR (takes ~30 minutes)
az datafactory integration-runtime start \
--factory-name production-adf \
--resource-group rg-data-factory \
--integration-runtime-name azure-ssis-ir
Monitoring and Troubleshooting
ADF provides comprehensive monitoring through the Azure portal, Azure Monitor, and diagnostic logs. The Monitor hub in the ADF portal shows pipeline runs, activity runs, trigger runs, and integration runtime status with filtering and drill-down capabilities.
# List recent pipeline runs
az datafactory pipeline-run query-by-factory \
--factory-name production-adf \
--resource-group rg-data-factory \
--last-updated-after "2026-03-14T00:00:00Z" \
--last-updated-before "2026-03-15T00:00:00Z"
# Get details of a specific pipeline run
az datafactory pipeline-run show \
--factory-name production-adf \
--resource-group rg-data-factory \
--run-id <run-id>
# List activity runs for a pipeline run
az datafactory activity-run query-by-pipeline-run \
--factory-name production-adf \
--resource-group rg-data-factory \
--run-id <pipeline-run-id> \
--last-updated-after "2026-03-14T00:00:00Z" \
--last-updated-before "2026-03-15T00:00:00Z"
# Enable diagnostic logging
az monitor diagnostic-settings create \
--name adf-diagnostics \
--resource /subscriptions/<sub-id>/resourceGroups/rg-data-factory/providers/Microsoft.DataFactory/factories/production-adf \
--workspace /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/log-analytics \
--logs '[{"category": "PipelineRuns", "enabled": true}, {"category": "ActivityRuns", "enabled": true}, {"category": "TriggerRuns", "enabled": true}]'
CI/CD Best Practices
Use Git integration for version control and collaborative development. Develop and test pipelines in a development factory, then use ADF's built-in CI/CD support to publish to staging and production factories via ARM template deployment. Never edit production pipelines directly in the portal. Use parameterized linked services so connection strings differ between environments. Implement automated testing with the ADF SDK to validate pipeline behavior before promotion.
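Promotion to staging or production then becomes an ARM deployment of the templates ADF generates on publish; a sketch, assuming the default file names from the factory's `adf_publish` branch (your parameter file and factory names may differ):

```shell
# Deploy the templates exported from the dev factory's adf_publish branch
az deployment group create \
  --resource-group rg-data-factory \
  --template-file ARMTemplateForFactory.json \
  --parameters ARMTemplateParametersForFactory.json \
  --parameters factoryName=production-adf
```

Stop active triggers before deployment and restart them afterwards so in-flight runs are not orphaned by the update.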
Azure Data Factory is the backbone of modern Azure data platforms. It connects any data source to any destination, transforms data with code-free data flows or custom compute, and orchestrates complex multi-step pipelines. Start with Copy Activities for simple data movement, add Mapping Data Flows for transformations, use triggers for scheduling, and leverage Self-Hosted Integration Runtime for on-premises connectivity. Combine with Azure Synapse Analytics, Databricks, or Azure SQL for a complete data platform.
Key Takeaways
- ADF provides 90+ built-in connectors for Azure services, databases, SaaS apps, and third-party clouds.
- Mapping Data Flows enable code-free Spark-based transformations with visual design.
- Three trigger types support scheduling (cron), tumbling windows (batch intervals), and event-based runs (blob creation).
- Self-Hosted Integration Runtime connects to on-premises data sources without inbound firewall rules.
Frequently Asked Questions
How does ADF pricing work?
You pay per activity run for orchestration, per DIU-hour for data movement, and per vCore-hour for Mapping Data Flows and the SSIS integration runtime. Idle factories, pipelines, and triggers incur no charge.
Can ADF replace SSIS?
ADF runs existing SSIS packages unchanged on the Azure-SSIS Integration Runtime, enabling lift-and-shift migration; new workloads can be built natively with ADF pipelines and data flows.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.