Moving Data from Snowflake to SQL Server
Moving data out of a warehouse is a less-conventional approach, but sometimes it's necessary. Here's how I pulled a large amount of data out of Snowflake to use in SQL Server.
Snowflake is a cloud-based data lake and data warehouse platform that has been rising rapidly in popularity and adoption ever since its public launch in 2014. As with any data warehouse platform, traditionally it's where data lands and is then analyzed.
However, as the amount of data being gathered and stored in these cloud platforms has continued to grow exponentially, the potential use-cases for such data have exploded right along the way. And as it turns out, the best way to activate that data to meet those use-cases is not always inside of the warehouse. Thus, a new data integration strategy has been born in the data integration space: "Reverse ETL."
In case you didn't catch my post earlier this week on data engineering processes, ETL stands for "Extract Transform Load" and has been a tried-and-true method for taking disparate data and moving it into data warehouses ever since it was created in the 1970s. Reverse ETL essentially means taking that process somewhat in reverse, getting the data out of the warehouse and to a different system. These systems can range from specific business intelligence reporting platforms, to CRM systems like Salesforce, to something as simple as a CSV or Excel file that a downstream team would prefer to work from. New startups in the growing Reverse ETL space include Hightouch and Census, to name just a couple.
My Reverse ETL Problem
Now that you've got a primer on the terminology, I'll get into the details of the specific data integration problem I faced, and how I solved it with Reverse ETL.
At my company, we have a Snowflake instance, but it's owned by a different team. Inside of that Snowflake data warehouse was a data set that could potentially be leveraged by our team to create some additional value by enhancing the data we already have and use. However, we primarily work out of Microsoft SQL Server. I was tasked with analyzing the data available to us in Snowflake, and seeing if there was a way we could marry it with the data we already had in SQL Server in a beneficial way.
My first step was to poke at the data in Snowflake, see what was there, and ensure it was usable data. It did indeed appear to be data we could make use of after my first pass through, but it was a lot of data: 19 tables with a combined 3558 columns, and a total of over a million rows. After some discussion of the data with my team, we settled on two possible integration strategies:
- Move everything over in bulk, staging the data in SQL Server, before analyzing further and then loading the data we actually confirm is useful, or
- Target the data we want on the Snowflake side, extracting only what we need before loading it into SQL Server
Our determination was that while we knew which pieces of data would have definite value in the short-term, other data might prove to be valuable longer-term, and we didn't want to throw that data out just yet. We decided to move everything to the staging tables in SQL Server first, so that if we decided we wanted more data later on, it would be much easier to tie new columns back to the original source.
First Attempt: CSV FIles
If you search the web for "Snowflake to SQL Server," there's only a few results at the start for what we're looking to do. Most of what you'll find in the results relates to moving data from SQL Server to Snowflake, which is what we would typically expect in a normal ETL approach. However, there are a few tutorials on setting up connections to move data from Snowflake to SQL Server, which helped get me started. The approach that seemed best was to set up Snowflake as a linked server, outlined here. The other benefit of this approach was the fact that it had no associated cost, in contrast to some other options I looked at. A paid connector can definitely be a good approach, especially when doing this type of work at a large scale and where high reliability and speed is needed. However, since this was initially a one-off project and we hadn't done any Reverse ETL at our company yet, I chose to attempt to start off with a free option.
Before even attempting to use the linked server, I first explored exporting the data as CSV files that I could later import to SQL Server. Snowflake offers the ability to export data to CSV files in the cloud, which you can then download. I started my attempt with the following commands:
COPY INTO @~/unload/export from (SELECT DISTINCT NODE_1_ID, NODE_2_ID, LABEL_1, LABEL_2 FROM "SOURCE"."SOURCE"."RELATIONSHIPS" WHERE RELATIONSHIP_TYPE = 'MANAGED_BY') file_format=(compression='none' field_optionally_enclosed_by='0x27' null_if=('null')); GET @~/unload file://c:\data\unload;
The first statement is exporting the data to
/unload/export.csv in whatever cloud storage is attached to the Snowflake instance (an AWS S3 bucket in our case). The Snowflake documentation goes into more details here on the
COPY INTO command and the different parameters available for it. The second statement downloads every file in the
/unload folder on the storage system.
I quickly realized I'd run into a problem with this approach. There was a maximum CSV file size for the export, and so each table was broken up into chunks. For the 19 tables I was trying to move, I had 88 total CSV files. Each CSV file also did not have column headers, though I did later find that that can be added via a parameter in the
COPY INTO command. Still, even if I had been able to add headers, processing 88 CSV files did not feel like a fast and efficient approach, especially if this were to end up being something we'd want to run on a schedule. Copying, exporting, importing, cleaning, and processing all those files introduces many potential failure points, and when designing a good data integration pipeline, you want to attempt to optimize for as few of those as possible.
Second Attempt: Linked Server
Since the CSV export/import approach clearly would not be a viable path forward, I shifted to the linked server option. If you've never worked with linked servers in SQL Server before, they essentially let you connect to a different database server and use it like it's an extension of the database server you're already connected to. It can be useful for integrating data from various servers, though it has some drawbacks (particularly in the performance realm).
In case you missed it above, here's the link again for the official instructions on how to set up Snowflake as a linked server in SQL Server. I was pretty much able to follow these instructions verbatim, but if you run into any roadblocks, feel free to reach out and I'll do what I can to help.
Getting the connection made to Snowflake was relatively straightforward. Getting the data migrated once the connection was made, on the other hand, was not.
Show me the Data
Note: Some of the table and column names have been modified and anonymized from the real data source I was working with for this project. For example purposes, however, it should all still make sense and fit together.
I had only worked with linked servers that referenced other SQL Server instances in the past, and for the most part, you can run queries and look at the tables in SQL Server Management Studio just like you can with the main database. In this case, though, I was using the Microsoft OLE DB Provider for ODBC Drivers, and functionality was quite limited by comparison. In my case, I could not even expand the tables in the SSMS tree view to see tables and columns in the database (though I think that particular limitation was tied to my specific database, more on that in a moment).
Against the non-SQL Server linked server, you must use the
OPENQUERY operator to run a query rather than executing it directly. More on
OPENQUERY can be found here at Microsoft's documentation. So, it was time to try out my first query, a simple
SELECT * against the first table I wanted to pull from to make sure the connection was working properly. Here was my query:
SELECT * FROM OPENQUERY(SNOWFLAKE, 'SELECT * FROM SOURCE.SOURCE.RELATIONSHIPS LIMIT 10')
And here's what I got back:
OLE DB provider "MSDASQL" for linked server "SNOWFLAKE" returned message "Requested conversion is not supported.". Msg 7341, Level 16, State 2, Line 1 Cannot get the current row value of column "[MSDASQL].NODE_1_ID" from OLE DB provider "MSDASQL" for linked server "SNOWFLAKE".
Uh-oh. I'm not off to a good start with a very cryptic error message. It looked like it might be related to a data type, which was strange given the fact that I was just running a
SELECT *. However, I'd also gotten a "string or binary data will be truncated" error when trying to expand the table tree, so I was starting to suspect something wasn't matching between the data types in the two systems.
My hunch was accurate, and after further investigation, I found that all textual fields in this particular Snowflake database were stored as
VARCHAR(16777216) (to my absolute horror). This is Snowflake's maximum value for a
VARCHAR, while a
VARCHAR in SQL Server can store a maximum of 8000 characters. Even though the actual values in these fields were far below 16777216 characters, or even 8000 characters, SQL Server seemed to be choking on the type conversion here. To fix this, I had to employ some heavy usage of the
CAST function in Snowflake, as the type casting has to happen there first before it could even get to SQL Server. Since
SELECT * was now off the table, here was my new query, which did work:
SELECT * FROM OPENQUERY (SNOWFLAKE, 'SELECT DISTINCT NODE_1_ID::varchar(8000) as NODE_1_ID, NODE_2_ID::VARCHAR(8000) as NODE_2_ID, NODE_1_LABEL::VARCHAR(8000) as NODE_1_LABEL, NODE_2_LABEL::VARCHAR(8000) as NODE_2_LABEL, METADATA::VARCHAR(8000) as METADATA from SOURCE.SOURCE.RELATIONSHIPS')
I'm close to twenty paragraphs into this post, and we finally have some data displaying in the SQL Server Management Studio results pane. The query was not particularly speedy, though. It returned about 113,000 rows in 17 seconds running through the linked server, compared to 1.25 seconds running the identical query directly against Snowflake.
Note that I chose to do the casting using double-colons, which is allowed in Snowflake. I could have also used
CAST(NODE_1_ID AS VARCHAR(8000)) but eliminating some of the parentheses and extra characters will pay off for scripting some of this later in the post.
Now We're Cooking
Time to take all that data we can now get and slam it into SQL Server. Should be a piece of cake, right? Not quite, as all the type-casting we're going to have to do adds an extra layer of complexity. I ended up writing a SQL Server stored procedure to get the data table by table and write it into our database. Let's step through it, piece by piece:
If you want to see the full code, all in one place, I've put it in a GitHub gist.
USE [STAGE] ALTER PROCEDURE [DBO].[COPY_FROM_SNOWFLAKE] AS BEGIN SET NOCOUNT ON DECLARE @TBL VARCHAR(8000) DECLARE @COL VARCHAR(8000) DECLARE @DATA_TYPE VARCHAR(8000) DECLARE @FIRST_ROW BIT DECLARE @SQL VARCHAR(MAX) DECLARE @SQL_CREATE_TABLE VARCHAR(MAX) DECLARE @SQL_FOR_COLS NVARCHAR(MAX) --CLEAR AND FILL RELATIONSHIPS TABLE DROP TABLE IF EXISTS SOURCE.RELATIONSHIPS SELECT * INTO SOURCE.RELATIONSHIPS FROM OPENQUERY(SNOWFLAKE, 'SELECT DISTINCT NODE_1_ID::VARCHAR(8000) AS NODE_1_ID, NODE_2_ID::VARCHAR(8000) AS NODE_2_ID, NODE_1_LABEL::VARCHAR(8000) AS NODE_1_LABEL, NODE_2_LABEL::VARCHAR(8000) AS NODE_2_LABEL, METADATA::VARCHAR(8000) AS METADATA FROM SOURCE.SOURCE.RELATIONSHIPS')
There's nothing too fancy here so far. At the top, I've named my procedure and used the
SET NOCOUNT ON statement to suppress some of the text SQL Server will otherwise display when it runs a bunch of queries or updates as part of a stored procedure. Here is a good read on what
SET NOCOUNT ON is and why it's a good practice to utilize.
After that, I've declared a number of variables I'll be utilizing later on. Then, I'm dropping my first target table, and recreating it using the query we looked at earlier. This way, if there is old data in there, I can overwrite it with the latest from Snowflake. There are other more sophisticated methods I could use here to incrementally load or update new and changed rows, but as an initial pass, this more brute-force solution will get the job done. Also of note, for this particular table I hardcoded the query, because I'm using it to get the names of the other tables I'll be selecting from. I'll be using a lot of dynamically-generated SQL to generate the remaining tables, so without further ado, let's look at the next section of the procedure:
DROP TABLE IF EXISTS #MYTEMP --GET LIST OF TABLES TO COPY SELECT TBLS INTO #MYTEMP FROM OPENQUERY(SNOWFLAKE, 'SELECT DISTINCT NODE_1_LABEL::VARCHAR(8000) AS TBLS FROM SOURCE.SOURCE.RELATIONSHIPS')
I have 18 more tables to copy and didn't want to put 18 separate
SELECT statements back-to-back, so I selected the table names into a temporary table (after making sure the table was first dropped). I could have hardcoded the table names into a temporary table to iterate through, but in my case, the table names I needed were actually the distinct
NODE_1_LABEL values in the
RELATIONSHIPS table. Dynamically selecting those into the temporary table
#MYTEMP also means that the list of tables I pull can easily grow later to accommodate new data in Snowflake.
SET ROWCOUNT 1 SELECT @TBL = TBLS FROM #MYTEMP WHILE @@ROWCOUNT <> 0 BEGIN SET ROWCOUNT 0
SET ROWCOUNT 1 instructs SQL Server to stop processing queries after a specified number of rows has been returned, in this case 1. The next line will select the first table name I put into my temporary table on the previous step, and move it into the
@TBL variable. Then, I begin a while loop.
@@ROWCOUNT is a special system function in SQL Server which will always return the number of rows affected by the last statement. Here, I'm using it to instruct SQL Server to keep looping as long as there is another table name to select from the temporary table. Next, the loop begins by setting
ROWCOUNT back to 0, which instructs SQL Server to default back to processing all results.
The above is one of several ways to structure a loop through data in a SQL Server stored procedure. Earlier this year, in my post Searching Across Tables in SQL Server, I demonstrated a different method by using a cursor. Here is a good article comparing the difference between using a while loop and a cursor, and the pros and cons to each approach.
--DROP THE TABLE SET @SQL = 'DROP TABLE IF EXISTS SOURCE.['+@TBL+']' EXEC(@SQL)
If you weren't already having fun, here's where things really start to get interesting. I already populated the
@TBL variable with the first table name before entering the loop, so this is first dynamically building and executing SQL to drop the target staging table if it exists.
Note: I could probably truncate the tables instead of dropping and recreating, but I wanted to cover my bases in case any of the table structures change in Snowflake in the future.
If you've never worked with dynamic SQL before, it's an extremely powerful way to programmatically build and execute SQL statements. For something like this where we have 18 tables that all have their own unique structure, it will save a lot of time and code, but it also requires great care and caution. As I put this stored procedure together, I first used
EXEC statements, so that I could validate the dynamically generated SQL and ensure that it would not cause unexpected consequences.
--BUILD THE DYNAMIC SQL TO EXTRACT DATA --PREPEND THE QUERY WITH THE DESTINATION TABLE SET @SQL = CAST('INSERT INTO SOURCE.[' AS VARCHAR(MAX))+@TBL+'] EXECUTE(''SELECT '
After dropping the table, I then started building the SQL for the
INSERT. This is one spot where I hit another roadblock. Up above, when I inserted data from the
RELATIONSHIPS table into SQL Server, I was able to use
SELECT * INTO SOURCE.RELATIONSHIPS FROM OPENQUERY, but here you can see that I'm using
INSERT INTO with
EXECUTE. Why the change? Well, it turns out that running
OPENQUERY against a linked server comes with an 8000-character limit. Since these Snowflake tables are extremely wide, and I can't do a
SELECT * against them because I must type-cast everything, the queries I was generating were far in excess of 8000 characters and were erroring out.
EXECUTE('SQL') AT LINKED_SERVER is the alternative option in this case, but unlike
OPENQUERY, the result cannot be passed into
SELECT * INTO, so I had to dynamically build an
INSERT statement instead.
--BUILD THE SQL TO CREATE TABLE SET @SQL_CREATE_TABLE = CAST('CREATE TABLE SOURCE.[' AS VARCHAR(MAX))+@TBL+'] ('
INSERT requires the table to already exist in SQL Server, and cannot create the table like
SELECT * INTO can, here I also must dynamically create the table, which I start to do here. When first creating the SQL statements for both the insertion and the table creation, I'm casting to
VARCHAR(MAX), as I seemed to have trouble with my strings being truncated if I did not do that.
--GET THE COLUMNS FROM SNOWFLAKE AND DYNAMICALLY CONSTRUCT SELECT STATEMENT WITH VARCHAR CASTS DROP TABLE IF EXISTS ##MYTEMPCOLS SET @SQL_FOR_COLS = ' SELECT * INTO ##MYTEMPCOLS FROM OPENQUERY(SNOWFLAKE, ''SELECT COLUMN_NAME::VARCHAR(8000) AS COLUMN_NAME, DATA_TYPE::VARCHAR(8000) AS DATA_TYPE FROM SOURCE.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = UPPER('''''+@TBL+''''')'')' EXECUTE SP_EXECUTESQL @SQL_FOR_COLS
Here I'm creating a list of columns for the given table, in extremely clunky fashion. First, I'm dropping the global temporary table
##MYTEMPCOLS (in case you're unfamiliar with temporary tables in SQL Server, one pound sign indicates a local temp table and two indicates a global temp table). Then, I'm populating it by using
INFORMATION_SCHEMA in Snowflake to get a list of all columns and their data types in the table I'm currently working on transferring. Enjoy the mess of single quotes on the second last line, as I had to do some serious quote escaping to both include the quotes in the generated SQL itself and concatenate the table name.
The generated SQL would look something like this:
SELECT * INTO ##MYTEMPCOLS FROM OPENQUERY(SNOWFLAKE, 'SELECT COLUMN_NAME::VARCHAR(8000) AS COLUMN_NAME, DATA_TYPE::VARCHAR(8000) AS DATA_TYPE FROM SOURCE.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = UPPER('snowflaketablename')')
In the final line of that previous code block, you may notice that I used
SP_EXECUTESQL rather than the
EXEC function I used earlier to run dynamic SQL. In this case, I had to use
EXEC did not support including
OPENQUERY within the dynamic SQL. This is also why I'm using a global temporary table, instead of a local one, as the
SP_EXECUTESQL stored procedure needs visibility to it. If I used a local temp table, it would be out of scope for
SET @FIRST_ROW = 1 SET ROWCOUNT 1 SELECT @COL = COLUMN_NAME, @DATA_TYPE = DATA_TYPE FROM ##MYTEMPCOLS WHILE @@ROWCOUNT <> 0 BEGIN SET ROWCOUNT 0 IF (@FIRST_ROW = 0) BEGIN SET @SQL += ' ,' SET @SQL_CREATE_TABLE += ' ,' END
The start of this code block is running another while loop, similar to what we saw earlier in the post. I'm setting a
FIRST_ROW variable to true initially, and then I begin to loop through each combination of column name and data type that was pulled from
INFORMATION_SCHEMA. I check to see if I'm looking at the first row of data or not, as for all subsequent rows, I'll include a leading comma in both the
CREATE TABLE statements, but I don't want that comma to prepend the very first column.
SET @SQL += @COL IF (@DATA_TYPE = 'TEXT') BEGIN SET @SQL += '::VARCHAR(1000) AS '+@COL SET @SQL_CREATE_TABLE += @COL + ' VARCHAR(1000)' END IF (@DATA_TYPE = 'FLOAT') SET @SQL_CREATE_TABLE += @COL + ' FLOAT' IF (@DATA_TYPE = 'NUMBER') SET @SQL_CREATE_TABLE += @COL + ' NUMERIC(38,0)' IF (@DATA_TYPE = 'DATE') SET @SQL_CREATE_TABLE += @COL + ' DATE' IF (@DATA_TYPE = 'BOOLEAN') SET @SQL_CREATE_TABLE += @COL + ' BIT' DELETE ##MYTEMPCOLS WHERE COLUMN_NAME = @COL SET @FIRST_ROW = 0 SET ROWCOUNT 1 SELECT @COL = COLUMN_NAME, @DATA_TYPE = DATA_TYPE FROM ##MYTEMPCOLS END
Now, I'm appending the name of the column to the dynamically-generated
INSERT statement before a series of if statements evaluates the data type of the column. If it's a text column, that is where the most work needs to be done. The
INSERT statement will need to be appended with a cast to
VARCHAR(1000); this is where the decision to use the double-colon rather than wrapping the column inside
CAST() makes things a bit easier to work with. Note also that I've dropped down to 1000 characters rather than 8000; with many of these tables being very wide, returning everything as a
VARCHAR(8000) was causing SQL Server to choke on the returned data once again and error out.
The second line inside the text data type if block, as well as the remaining if statements, puts the proper data type after each column in my dynamic
CREATE TABLE statement. I used
INFORMATION_SCHEMA.COLUMNS to gather all of the possible data types in use to keep the if statements limited to what actually needs to be evaluated. The remainder of this block of code wraps up this iteration of the while loop; the current column being looked at is deleted from the temp table,
FIRST_ROW is set to false since that'll only ever be true for the initial run through the loop. Then,
ROWCOUNT is temporarily set to 1 to get the next column from the temp table, and the loop will run again assuming the temp table is not empty.
We're almost there. Here comes the last section of the code:
SET ROWCOUNT 0 --APPEND THE LINKED SERVER TO CLOSE THE QUERY SET @SQL += ' FROM SOURCE.'+@TBL+''') AT SNOWFLAKE' --CLOSE TABLE CREATION STATEMENT SET @SQL_CREATE_TABLE += ')' --EXECUTE THE SQL EXEC(@SQL_CREATE_TABLE) EXEC(@SQL) DELETE #MYTEMP WHERE TBLS = @TBL SET ROWCOUNT 1 SELECT @TBL = TBLS FROM #MYTEMP END SET ROWCOUNT 0 END
Coming out of the inner while loop, I have a
SET ROWCOUNT 0 once again to ensure that any statements we run process all rows. Then, I'm finishing up the dynamic SQL for the actual query from snowflake, by adding the table name I'm selecting from for this iteration of the outer while loop. Finally, that query is concluded by naming the linked server the
EXECUTE() is running against. I then run the dynamic
CREATE TABLE and
INSERT statements, and since neither are using
OPENQUERY I was able to go back to using
EXEC() rather than the
SP_EXECUTESQL stored procedure (though using the stored procedure would have been absolutely fine as well).
After the two statements are executed, the same steps that we saw at the end of the inner loop take place: the current record from the temp table is deleted and the next one is selected if there is one (otherwise the loop concludes). Finally, there's a little cleanup at the end of the stored procedure to set the
ROWCOUNT back to the default 0, and we're done! Whew!
The final stored procedure ran in a little over 25 minutes, successfully loading just over a million rows.
Wrapping Up: Loading the Staged Data and Opportunities for Improvement
Hopefully this has helped provide some insight on how to think through a data integration problem and some of the challenges that you may face along the way. The good news is the data itself was relatively clean, which is usually expected coming from a warehouse. The final 18 tables I copied all contained location data, and some of the address columns were named differently, but normalizing those column names was really all I had to do to prepare the data to be moved from the staging tables to a final location in the DB (at least for now, if we decide we want to utilize additional columns later on, we'll likely need to do additional cleanup and transformation). Things like unique IDs on the rows were handled well, making it easy to process the data and easy to tie back to the staged data if we do want to use more of it down the line.
If I wanted to improve this in the future, the biggest low-hanging fruit would be those data type casts. Casting everything to
VARCHAR(1000) got the job done, but even 1000 characters appears to far exceed the data in any of these columns. That being said, I wanted to err on the side of caution, in terms of avoiding truncating any real data. The performance gain in the extraction due to going any smaller is probably minimal as well, since the massively oversized columns are still being pulled from the Snowflake side. Real adjustments would be gained by adjusting the columns in Snowflake, but since that's not a piece I have control over, I'm working with what I have. In addition, the final tables are more properly-sized, so this is really only a concern with the staging table.
In a situation where we started to use Reverse ETL more, I'd probably consider reaching for a true Reverse ETL tool with standard, maintained connectors to all of these data warehouse systems. Those are more likely to be optimized for performance and reliability versus building something myself with the linked server approach. Still, as a one-off, this got the job done and was a good learning experience. If I was going to use it more going forward, I'd probably clean up a few things in the code, such as renaming some of the variables and reordering some of the table creation dynamic SQL construction to happen before the
INSERT statement creation. Things got a little messy as I added on new variables or sections of code each time I hit a new roadblock.
If you made it all the way to the end, I hope you found this post insightful! Let me know in the comments or on Twitter if you have any questions about my approach.
Enjoyed this post? Consider buying me a coffee to fuel the creation of more content.