Setting up Microsoft Data Management Gateway Service for Azure Data Factory

I've got to setup a data gateway to support an Azure Data Factory implementation for a hybrid cloud / on-premise solution.  The process is straight-forward but worth documenting anyway.  ADF is a service available to help us move and transform between different cloud services and endpoints, with the emphasis being on the move.  So think more "ELT" than "ETL", and even then it's transform with a small 't'.

So as a service, it is probably not for everyone, but I think the use cases are still there for shifting data around the cloud, and also moving data between on-premise and cloud data stores.  In my scenario, I am dealing with the traditional bread and butter of SSIS in loading data into SQL databases, but also concerned with moving files to a blob storage account - so you can start to see how whilst looking like a ETL from the outside, the data and services it wants to deal with are suitably different from SSIS.

The Microsoft Data Management Gateway is a additional component you need when you cross the boundary between PaaS and on-premise network.  It is still needed if you are integrating data to and from your Azure virtual network.

The first thing I wanted to sort out was having a suitable naming convention, as right now looking at all the Azure resources give me eye stain with the mixing of prefix, suffix, numbers and environments.  So lets start with setting up the data factory, and then I can get to the data management gateway.  I'm going with suffix's, abbreviations for the resource type, numbers where applicable.  So I need a resource group to start with.  I'll locate this in the same region as my data factory.


Now I can provision my data factory


Since I am pursing a hybrid networking solution I need a data gateway, so if I enter my data factory and review the Contents > Linked Services tile, it will show I have not added a data gateway yet.  In the Actions > Author and deploy tile I can right click Data gateways and select New data gateway like below.


So I will name my gateway sentient-df-1-dg because this gateway can only service this ADF.  It's worth noting that an ADF can have many gateways, however a Data Management Gateway service can connect to only one ADF.  It's worth considering this at the planning and costing stage - all my DMG's are Azure VM's, so they are costing me, all be it they are not big.  The DMG is a proxy for access to resources on the network - any resources no matter how unrelated.  This is significant.  You may, until now, have considered your ADF to belong exclusively to a single application, which means you need a DMG to connect to it.  And if you have 4 environments (dev, uat, staging, prod) then you have 4 x ADF & 4 x DMG's.  You could share a single ADF and DMG - but sharing is not so straightforward when it comes to ALM considerations.  Extrapolate out, say you have 10 applications - ok this is getting silly in terms of the DMG's.  

So I must be thinking about it wrong.  I wouldn't spin up a SSIS server and associated SQL Server license for a small bunch of related packages, I would treat my SSIS as a service and it would host a multitude of unrelated packages - I might have a dev and uat to go with prod, but that's about it.  In that mindset, your planning may be that ADF as a service is part of your core infrastructure, and has no relationship to a particular application.  So I have kind of dumbed it down here in this example, but the consideration is real in the enterprise.

So charging on, once the Data Gateway has been provisioned you will be presented with authentication keys, plus links to install\download the DMG Service.  I didn't find anything untoward in the installation so will jump ahead to registration.



Copy one of your generated keys.  You will need this to pair the installed service with the ADF.  The installed service has a UI, and you can find this under Microsoft Data Management Gateway on the host server.  Just pop in your key and hit register.


And assuming all is well, within a few seconds you will get confirmation.  At this point it's only really firewall considerations that can get in the way.  So as long as you can punch an outbound hole on 443 you will be fine.  (If you deploy linked services that connect to Azure SQL Database, then at run-time you will get a failure on the DMG if you can't talk outbound on 1433).


In the Azure portal, in your data factory the Contents > Linked Services tile will now show you have a Data Gateway and on expanding this you will see it's state.


I did one of these previously and got some event log errors about performance counters, so I made sure I ran the DMG service as a service account and provided that account with suitable permissions to access performance counters and event logs.

The most obvious thing to stand out here is that at no time do you tell the installed DMG service  what it will be connecting to.  That comes in the ADF artifacts you deploy, defining as accessing their target as a source or destination via this data gateway.  So it is a simple proxy.  If you attempt to access a SQL Server  you will have to ensure you can a) access it's host across your network, and b) authenticate with it if you have specified integrated security in your ADF linked service - at that point the service account your DMG is running under will need SQL Server login and database user permissions.

Take a spin through the documentation here to see how it works, and this also goes into firewall requirements in a bit more depth if you are restricting your 443 outbound to certain domains.  But at this point you are ready to start designing your ADF artefacts and are able to move data from network resources to Azure PaaS resource.


Comments

Post a Comment