A first-hand experience of using Azure Data Factory
Over the last couple of months I’ve been working my first data engineering gig, with a requirement that the ETL platform had to run inside the client’s Azure environment. With this in mind, it felt like a great opportunity to use Azure Data Factory for the first time, so this article is just a brain dump of my experience.
What is Azure Data Factory?
It’s a cloud-based data integration service that lets you create data-driven workflows to orchestrate and automate the movement and transformation of data.
Data Factories are composed of four key components:
- Pipelines: a data factory can have one or more pipelines. A pipeline is just a container for activities.
- Activities: definitions of what actions to perform on your data, e.g. using the copy activity to move data from one data store to another. Activities can also be configured to run conditionally, such as on success or on failure.
- Datasets: representative structures of data within data stores. Activities take zero or more datasets as inputs and produce one or more datasets as outputs.
- Linked services: you can think of these as being similar to connection strings; they define the connection information needed to reach resources outside of Azure Data Factory.
I’ve gotten used to thinking of Azure Data Factory as more of an Extract-and-Load / Transform-and-Load (ELT) platform rather than a traditional Extract-Transform-and-Load (ETL) platform. Unlike typical ETL platforms, the service offers an intuitive drag-and-drop interface for ease of use, but you pay a premium for this luxury, which I’ll get to later. You can also configure an instance of Azure Data Factory using Visual Studio, PowerShell, the .NET API, the REST API or ARM templates.
With Azure Data Factory there are two offerings, managed and self-hosted, each with its own pricing model; I’ll touch on that later in this article.
In our data factory, our main pipeline would orchestrate other pipelines that contained activities to move and transform data between datasets. Overall, the process of configuring data movement was quite a painless experience with the “copy” activity task, which:
- Reads data from the source and determines a source schema
- Uses column mapping on the target data store to match fields from the source dataset to the target schema
- Writes the data to the sink
To give an example of how this works, you could define two datasets connected to their respective linked services in your data workflow, i.e. a REST endpoint and a blob storage location. You would then configure a copy activity to copy data between the two data stores. In this example, Azure Data Factory performs the calls to the REST API and copies the response payload to the desired blob storage location, using a mapping you define to map values from the response payload (source) to fields defined in the blob storage location’s JSON schema (sink).
We sometimes had issues with column mapping, which applies when copying data from source to sink. By default, the copy activity maps source data to the sink by column name, but you can define explicit mappings to customise this based on your requirements. We generally found the experience of explicit mapping to be a little buggy.
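To make the two mapping behaviours concrete, here’s a small Python sketch. This is not ADF code, just an illustration of the logic; the payload and field names are made up:

```python
# Illustrative sketch of how a copy activity maps columns
# (not ADF's actual implementation, just the idea).

def copy_by_name(source_row: dict, sink_schema: list) -> dict:
    """Default behaviour: match source fields to sink columns by name."""
    return {col: source_row.get(col) for col in sink_schema}

def copy_explicit(source_row: dict, mapping: dict) -> dict:
    """Explicit mapping: {source_field: sink_column} pairs defined by hand."""
    return {sink: source_row[src] for src, sink in mapping.items()}

# A response payload from a hypothetical REST source:
payload = {"id": 1, "fullName": "Ada", "created": "2020-01-01"}

# By-name mapping only fills sink columns whose names match the source:
print(copy_by_name(payload, ["id", "name"]))   # → {'id': 1, 'name': None}

# Explicit mapping lets you rename fields on the way to the sink:
print(copy_explicit(payload, {"id": "id", "fullName": "name"}))
# → {'id': 1, 'name': 'Ada'}
```

The second form is what the explicit mapping editor in ADF configures for you, and it’s that editor we found flaky.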
The above is just one example, however; you can copy data between any two of the prebuilt data stores / connectors, and you can find a list here.
Data flows were only released to the general public in October 2019, so they should still be considered new. They’re also a pain in the ass to work with, and super slow. We ended up making a team decision to re-engineer our data pipelines in a way that eliminated the use of data flows altogether, and here’s why: with data flows included, our pipeline took around 28 minutes to complete end-to-end. Without data flows, it now takes just under 10 minutes. That’s roughly 18 minutes of pipeline execution time we’ve been able to shave off just by removing data flows!
For each pipeline that used data flows to perform data transformations, there’d be a ~6 minute cold start while ADF was “acquiring compute” for an Apache Spark cluster. Azure states in its docs that you can avoid this cold start for downstream tasks by configuring a TTL on the integration runtime, but this did not work for us: our pipeline would cold-start every downstream data flow activity. Microsoft, you really need to fix this!
For those interested, we ended up using Python Azure Functions to perform all data transformation activities and that does the job(s) nicely.
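The transformations themselves were nothing exotic; the sketch below shows the kind of pure Python logic we moved out of data flows. The field names and cleaning rules here are invented for illustration; in the real thing, this function sits inside an Azure Function’s entry point and the pipeline invokes it over HTTP:

```python
# A hypothetical example of the kind of transformation we moved from a
# data flow into a Python Azure Function. Field names are made up.

import json

def transform(records: list) -> str:
    """Flatten and clean a batch of records, returning JSON for the sink."""
    cleaned = [
        {
            "id": r["id"],
            "name": r.get("name", "").strip().title(),
        }
        for r in records
        if r.get("id") is not None  # drop rows without a primary key
    ]
    return json.dumps(cleaned)

print(transform([{"id": 2, "name": "  ada lovelace "}, {"name": "no id"}]))
# → [{"id": 2, "name": "Ada Lovelace"}]
```

Because the logic is plain Python, it runs in seconds, is trivially unit-testable, and carries none of the Spark cluster cold-start cost described above.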
The pricing model is really confusing, expensive, and you very quickly learn that there’s a cost associated with everything in the world of Azure Data Factory. Developers should take the time to understand the pricing model before building out data workflows, so that pipelines are optimised for both performance and cost; otherwise you might end up with a very unhappy C-level at your office door.
I’ll try to explain the pricing model as best as I can, so here goes…
In Azure Data Factory, you pay for:
- Read/write and monitoring operations
- Pipeline orchestration and execution
- Data flow debugging and execution
Sounds easy enough, right…? Wrong!
Read/write and monitoring operations
Read / write = $0.50 per 50,000 modified entities (CRUD)
Entities: Datasets, linked services, pipelines, integration runtimes and triggers
Pipeline orchestration and execution
Orchestration = $1 / 1,000 runs
Activity, trigger and debug runs
Execution: (this is where the fun starts)
Data movement activities = $0.25 per DIU-hour
Pipeline activities = $0.005 / hour
External activities = $0.00025 / hour
All activities are prorated by the minute and rounded up
A Data Integration Unit (DIU) is a representative measure of a combination of CPU, memory and network resource allocation.
The default DIU setting on all copy tasks is 4, the minimum you can configure is 2 and the maximum is 256. With that in mind, the cheapest rate you’ll pay for a data movement activity on a managed ADF pipeline hosted by Azure is $0.50 per hour (2 DIUs × $0.25).
If your pipeline runs for 1 second, you’ll be billed for a whole minute. If your pipeline runs for 5 minutes and 1 second, you’ll be billed for 6 minutes. This duration is then multiplied by the DIU rate: at the default 4 DIUs, 6 minutes costs 6/60 × 4 × $0.25 = $0.10.
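The billing rules above boil down to a short calculation. A minimal sketch, using the rates quoted in this article:

```python
# Cost of a single data movement (copy) activity, using the rates
# quoted at the time of writing: $0.25 per DIU-hour, prorated by the
# minute and rounded up.

import math

DIU_RATE = 0.25  # $ per DIU per hour for data movement activities

def copy_activity_cost(duration_seconds: int, dius: int = 4) -> float:
    """Duration is rounded up to whole minutes, then billed per DIU-hour."""
    billed_minutes = math.ceil(duration_seconds / 60)
    return billed_minutes / 60 * dius * DIU_RATE

# 1 second bills as a full minute at the default 4 DIUs:
print(round(copy_activity_cost(1), 4))           # → 0.0167
# 5 minutes 1 second bills as 6 minutes:
print(round(copy_activity_cost(5 * 60 + 1), 2))  # → 0.1
```

It looks cheap per run, but remember this is per activity, per execution, and it sits on top of the orchestration and read/write charges.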
There are other additional costs to data pipelines too. You pay $0.80 per month for each inactive pipeline, and $0.087 per GB for data transferred out of Azure, so make sure you clean up any unwanted pipelines that you’re not using!
Data flow debugging and execution
There are three types of clusters that run the managed Apache Spark service:
- Compute optimised: $0.199 / vCore-hour
- General purpose: $0.268 / vCore-hour
- Memory Optimised: $0.345 / vCore-hour
Charges are prorated by the minute and also rounded up. The minimum cluster size to run a data flow is 8 vCores and the maximum is 272 vCores (256 + 16).
The smallest size you can select is 4 cores + 4 driver cores (8 vCores in total), which brings the minimum hourly cost of running data flows and debug sessions to:
- Compute optimised: $1.592 / hour
- General purpose: $2.144 / hour
- Memory optimised: $2.760 / hour
and the maximum costs (256 cores + 16 driver cores) to:
- Compute optimised: $54.128 / hour
- General purpose: $72.896 / hour
- Memory optimised: $93.840 / hour
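Those figures are simply (vCores × per-vCore-hour rate), which you can sanity-check yourself:

```python
# Sanity-checking the data flow costs above: the price is just the
# total vCores (workers + driver) multiplied by the per-vCore-hour rate.

RATES = {  # $ per vCore-hour, as quoted at the time of writing
    "compute_optimised": 0.199,
    "general_purpose": 0.268,
    "memory_optimised": 0.345,
}

def dataflow_hourly_cost(worker_cores: int, driver_cores: int, tier: str) -> float:
    """Hourly cost of a data flow Spark cluster of the given size and tier."""
    return (worker_cores + driver_cores) * RATES[tier]

# Minimum cluster: 4 worker cores + 4 driver cores = 8 vCores
print(round(dataflow_hourly_cost(4, 4, "general_purpose"), 3))      # → 2.144
# Maximum cluster: 256 worker cores + 16 driver cores = 272 vCores
print(round(dataflow_hourly_cost(256, 16, "memory_optimised"), 3))  # → 93.84
```

Run a debug session on a big cluster and forget about it, and those per-hour numbers add up fast.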
These pipeline costs can be managed, though, by starting, stopping, pausing and scaling services as required; you just have to get a bit clever with your configuration.
For example, if you’re batching large datasets into your SQL database and running the job daily, then before your ADF pipeline executes, scale the SQL database up to a tier with high throughput, and once the pipeline has finished, scale it back down again. There’s a free tip for you all ;)
The interface is really easy to use, and Microsoft has actually done a decent job of making something with a nice user interface for once. At times, however, it has been buggy as hell, such as:
- When starting a data flow debug session, we’d sometimes hit an infinite loop of dialog modals;
- Clicking into a forEach activity would only load 70% of the time;
- (this one you’ll love) if we were working off a git branch other than master, and its linked service config for our SQL database differed from the one on the master branch, the linked service would only read/use the connection defined on the master branch, which made working with different dev/staging databases a nightmare.
Overall, I’ve actually quite enjoyed using Azure Data Factory, it’s just been a little frustrating at times.
Would I use it again though? Probably not…