Blog March 13, 2019

Extracting Data from Azure Data Lake Store Using Python: Part 2 (The Design Part)

Michael Noto

In Part 1, we covered the technical basics of pulling data from Azure Data Lake Store (ADLS) using Python. We learned a little about how ADLS handles access for external applications, set up the credentials required for a Python application to connect to ADLS, and wrote some Python code to read from files stored on the ADLS. In Part 2, we expand our view to consider several important design decisions for building a production data pipeline between ADLS and another data store.

Small files are not your friends.

In the Part 1 code, we were dealing with one very small file, but what changes when we have much larger files or a high number of small files?

As is well documented for big data systems, Azure Data Lake Store exhibits the “small files” problem--performance suffers when reading many small files, as opposed to only a few large files. This is because a certain amount of overhead is incurred in opening each file; more files to open means more time spent getting to their data, regardless of how small each file is. Microsoft recommends an average file size on ADLS for optimal performance--as of this writing, 256 MB to 2 GB per file. We could take Microsoft’s guidelines at their word, but let’s explore this a bit to see the impact of file size and count first-hand.

In our exercise, we’ll start with a single 25 MB file. Putting it through the same “open-convert-read” code we had back in Step 6 of the Part 1 blog, except with an additional step of reading each data row into a list object post-conversion, results in a cumulative runtime of 6.61 seconds.

Python code for reading and parsing a large ADLS JSON file
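That code appears above as an image; for reference, a minimal sketch of the same approach is below, assuming the service-principal credentials from Part 1 plus a hypothetical store name and file path, with a simple time.time() wrapper standing in for the runtime measurement.

import json
import time
from azure.datalake.store import core, lib

# Hypothetical credentials, store name, and file path -- substitute your own from Part 1
token = lib.auth(tenant_id="<TENANT ID>", client_id="<CLIENT ID>", client_secret="<CLIENT SECRET>")
adls = core.AzureDLFileSystem(token, store_name="<ADLS STORE NAME>")

start = time.time()

# Open the single 25 MB file, convert the raw bytes to a JSON object,
# then read each data row into a list
with adls.open("/landing/large_file.json", "rb") as f:
    file_json = json.loads(f.read().decode("utf-8"))
rows = [row for row in file_json["data"]]

print("Elapsed: {:.2f} seconds".format(time.time() - start))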

If the single 25 MB file were split into 100 separate 250 KB pieces and read in the same way as the original file, we see a real difference in execution time--33.8 seconds, roughly five times more expensive than reading the single file.

Python code for reading and parsing many small ADLS JSON files
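Again, the screenshot above is not reproduced here; a hedged sketch of the looped version follows, reusing the adls connection from the previous sketch and assuming the 100 pieces live in a hypothetical folder.

start = time.time()

rows = []
# Read each of the 100 small files and accumulate their data rows
for path in adls.ls("/landing/small_files"):
    if path.endswith(".json"):
        with adls.open(path, "rb") as f:
            file_json = json.loads(f.read().decode("utf-8"))
        rows.extend(file_json["data"])

print("Elapsed: {:.2f} seconds".format(time.time() - start))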

Not only does the “small files” problem pose an issue for file access, it also impacts storage on the ADLS. Every file takes up a minimum amount of space by default (256 KB at the time of this writing), so if many files are smaller than this limit, a potentially nontrivial amount of extra space on the ADLS will be used up, leading to higher costs. Should this extra space consumption become significant enough, writing a consolidation program that combines small files into larger ones may be a good approach to reduce the non-value-add storage.
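As a rough illustration, a consolidation pass might look like the sketch below. It reuses the adls connection from earlier, assumes hypothetical folder paths, and assumes the files are simple delimited text that can be safely concatenated (JSON documents would need to be merged rather than appended byte-for-byte).

# Combine the small files into one larger file, then remove the originals
small_paths = [p for p in adls.ls("/landing/small_files") if p.endswith(".csv")]

with adls.open("/landing/consolidated/combined.csv", "wb") as out_file:
    for path in small_paths:
        with adls.open(path, "rb") as in_file:
            out_file.write(in_file.read())

for path in small_paths:
    adls.rm(path)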

Whenever possible in your data environment, you will want to maintain a smaller number of larger files to limit both the storage footprint on ADLS and execution time in Python.

What if I need to transform the data while in Python?

For source data that is in JSON, CSV, and many other file formats, the Pandas library offers the extremely useful DataFrame class for manipulating data in an in-memory, table-like object. Operations such as filtering, joining, stacking, and aggregating are all possible with DataFrames. See the Pandas documentation for more information.
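For instance, a DataFrame can be built directly from the sample JSON structure used in the pyodbc INSERT example later in this post (a meta section holding column names and a data section holding row values); the snippet below is a minimal sketch of a filter and an aggregate on that hypothetical data.

import pandas as pd

# Build a DataFrame from the parsed ADLS JSON
column_names = [col["name"] for col in file_json["meta"]["view"]["columns"]]
df = pd.DataFrame(file_json["data"], columns=column_names)

filtered = df[df["col1"] == "r1v1"]      # filtering to matching rows
row_counts = df.groupby("col1").size()   # aggregating a row count per col1 value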

Alternatively, if you only need light transformations or just a sub-selection of your input data, standard Python object types, such as lists, tuples, and dictionaries, may suffice.

What are my options for landing Azure data in an on-premises database?

If your need is to send data from ADLS to an on-premises database, you have several options. The two main ones are to go directly from Python to database (all in-memory) or to first export to a file, followed by executing a bulk load.

  • In-Memory
    • Non-parameterized SQL INSERT statements
    • Parameterized SQL INSERT statements
    • SQLAlchemy
  • To Disk First
    • Bulk load SQL statement

The decision to go in-memory or to-disk should be influenced by the speed required, whether or not the target database has a bulk load capability, and whether or not the data needs to be transformed before landing in the target database.

For workloads requiring speed, you may want to consider the to-disk + bulk load option, as that kind of load will generally outperform executing individual INSERT statements.

When speed is not a huge concern or the data is limited, the in-memory SQL INSERT statement option works really well. This involves constructing a SQL INSERT statement for each row of data to be loaded and executing those statements against the database from Python using a compatible Python database package. Within this option, you can elect to create and run either parameterized INSERT statements or non-parameterized INSERT statements. The former is usually the more prudent choice, especially if you have a mixture of data types across columns or the possibility of adding more columns to the INSERTs in the future. This is because the parameterized option automatically handles wrapping or not wrapping values with single quotes depending on the data type of each column. (It even handles nuances in individual row values, such as NULLs in a string column, in which case no single quotes are used.) Note that, if you encounter errors attempting parameterized statements, it may be due to an out-of-date ODBC driver that does not support them. For SQL Server, only the “SQL Server Native Client…” or “ODBC Driver for SQL Server…” drivers allow parameterized inserts. An example of parameterized insert code using pyodbc is shown below:

import pyodbc

# Sample JSON file contents (i.e., data to be loaded)
file_json = {
    'meta': {'view': {'columns': [
        {'name': 'col1'}, {'name': 'col2'}, {'name': 'col3'}, {'name': 'col4'}, {'name': 'col5'}
    ]}},
    'data': [
        ['r1v1', 'r1v2', 'r1v3', 'r1v4', 'r1v5'],
        ['r2v1', 'r2v2', 'r2v3', 'r2v4', 'r2v5']
    ]
}

# Database information
database_schema_name = "dbo"
database_table_name = "SampleTable"
columns = ["Col1", "Col2", "Col3", "Col4", "Col5"]

# SQL INSERT statement shell (with placeholders)
database_insert_sql = "INSERT INTO {}.{} ({}) VALUES ({})"

# Populate the placeholders with the schema and table name, a comma-delimited string of
# column names, and a comma-delimited string of question marks (one per column) for the values
insert_sql = database_insert_sql.format(
    database_schema_name,
    database_table_name,
    ",".join(columns),
    ",".join("?" for column in columns)
)

# Connect to the database using the correct connection string
conn = pyodbc.connect("<SQL CONNECTION STRING>")
cursor = conn.cursor()

# Execute the INSERT statement for each row (the statement is the first argument and the
# row values that will be substituted into the placeholders are the second argument)
for row in file_json["data"]:
    cursor.execute(insert_sql, row)

# Commit the inserts (pyodbc connections do not autocommit by default)
conn.commit()

If the target database platform allows it (as SQL Server does), you could create and execute multi-row INSERT statements in lieu of standard single-row INSERTs. Keep in mind that there is a limit to the number of rows you can insert in one statement (1,000 per VALUES clause for SQL Server), so be sure to build looping logic that batches the rows if a given run of your data flow could ever exceed that count.
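One way to batch the rows is sketched below, reusing the names from the pyodbc example above. The batch size math assumes SQL Server’s 1,000-row VALUES limit and stays under its roughly 2,100-parameter-per-statement cap; treat it as a sketch rather than a drop-in implementation.

# Batch size must respect both the 1,000-row VALUES limit and the ~2,100 parameter cap
batch_size = min(1000, 2000 // len(columns))

placeholders_per_row = "(" + ",".join("?" for column in columns) + ")"
rows = file_json["data"]

for start in range(0, len(rows), batch_size):
    batch = rows[start:start + batch_size]
    multi_insert_sql = "INSERT INTO {}.{} ({}) VALUES {}".format(
        database_schema_name,
        database_table_name,
        ",".join(columns),
        ",".join(placeholders_per_row for _ in batch)
    )
    # Flatten the batch into a single parameter list that matches the placeholders in order
    params = [value for row in batch for value in row]
    cursor.execute(multi_insert_sql, params)

conn.commit()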

When you need the functionality offered by Pandas DataFrames, you may want to consider using SQLAlchemy or the file-to-bulk-load method. SQLAlchemy is a Python library that provides an object-oriented way to interact with a database, effectively translating Python code into SQL statements compatible with the specified database and executing them on that database. Its functionality can be accessed via Pandas DataFrames. For writing a DataFrame to a database table, use the to_sql method. If the performance is not acceptable, there are ways to speed it up, including dividing your DataFrame into smaller sub-DataFrames and running to_sql on each, or directing the underlying pyodbc connection to use its fast_executemany option (see the pyodbc documentation for more information). If you want to dump to a file rather than use SQLAlchemy, Pandas DataFrames feature the to_csv method, which writes a CSV from your DataFrame in one line. Once the file is created, you can execute a bulk load SQL command (still within your Python program), feeding it the path to the file you just exported. That way, you take advantage of the speed offered by bulk loading.
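Both routes are sketched below under some assumptions: a hypothetical connection string, the sample table and columns from the earlier pyodbc example, and an export path that the SQL Server instance can read (for the BULK INSERT option, the exported file must be visible to the database server, for example on a network share).

import urllib.parse

import pandas as pd
import sqlalchemy

# Option 1: write the DataFrame straight to the database via SQLAlchemy + pyodbc
params = urllib.parse.quote_plus("<SQL CONNECTION STRING>")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))

df = pd.DataFrame(file_json["data"], columns=columns)
df.to_sql("SampleTable", engine, schema="dbo", if_exists="append", index=False)

# Option 2: dump the DataFrame to a CSV first, then bulk load it
df.to_csv("\\\\fileshare\\exports\\sample_table.csv", index=False, header=False)

bulk_load_sql = (
    "BULK INSERT dbo.SampleTable "
    "FROM '\\\\fileshare\\exports\\sample_table.csv' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')"
)
cursor.execute(bulk_load_sql)
conn.commit()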

Whether you go the in-memory route or the to-disk route, you will need to use a Python package that enables connection to your target database for the statement(s) that actually load to the database. Certain databases have their own connection packages (such as MySQLdb for MySQL or psycopg2 for PostgreSQL), but any that can use ODBC as the connection mechanism will work with the pyodbc package, provided you have the appropriate ODBC driver installed.

Make your code modular.

When organizing your application code, you will want to consider the execution method, reusability, modularization, and the skills of those who will be supporting the application.

From the command line, you can execute the application either directly as a script or as a module. The latter requires a particular organization of your files, but it enables the import and use of your application code in other Python scripts and applications.
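A hypothetical layout that supports both styles of execution might look like the sketch below (the package and file names are made up for illustration).

# Project layout for running the pipeline as a module with "python -m adls_pipeline":
#
#   adls_pipeline/
#       __init__.py
#       __main__.py         # entry point executed by "python -m adls_pipeline"
#       adls_client.py      # ADLS connection and read code
#       database_client.py  # database load code

# adls_pipeline/__main__.py
def main():
    # Orchestrate the extract-from-ADLS and load-to-database steps here
    pass

if __name__ == "__main__":
    main()

# Run directly:     python some_script.py
# Run as a module:  python -m adls_pipeline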

Another key decision point is choosing a programming paradigm--object-oriented or procedural. Unlike many other programming languages, such as C++, Java, or C#, Python allows you to use either one or a blend of both. To establish a baseline understanding of these terms: object-oriented code divides attributes and behaviors into separate classes that exist independently from one another but can be used collaboratively to form a complete program. Procedural code contains all the attributes and behaviors of its object-oriented equivalent but is written in a much flatter way, with no concept of class hierarchy, inheritance, or encapsulation--instead of being tied to particular classes, attributes are just variables, and behaviors are just functions that can be used anywhere in the program.

Each paradigm has its merits, depending on the situation. For software development, object-oriented is usually the way to go, but for data integration programs, consider modularizing your code (whether or not that means explicit object orientation) while avoiding excessive hierarchies and abstraction, especially if those responsible for supporting the application have more of a procedural background. If the development/support team has an object-oriented bent, then feel free to lean more in that direction--with Python, you have that flexibility!

In the ADLS-to-database use case, a solid approach would be to create a class for code dealing with connecting and authenticating to the ADLS and another class for code dealing with database interaction, leaving the main code file(s) reserved for the nuances of your particular application. In this way, your ADLS class and your database class can be reused across any number of Python programs and also more easily reused within your main program. If you will be building multiple similar applications that pull from different ADLS sources and/or load to different targets, think about making a generic parent class to contain the common functionality and a child class to contain the specific functionality for each application.
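As a rough illustration, the two reusable classes might be skeletonized as below. The class and method names are hypothetical, and error handling and configuration are omitted.

import json

import pyodbc
from azure.datalake.store import core, lib


class ADLSClient:
    """Wraps ADLS authentication and file reads."""

    def __init__(self, tenant_id, client_id, client_secret, store_name):
        token = lib.auth(tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)
        self.adls = core.AzureDLFileSystem(token, store_name=store_name)

    def read_json(self, path):
        # Open a file on the ADLS and convert its contents to a JSON object
        with self.adls.open(path, "rb") as f:
            return json.loads(f.read().decode("utf-8"))


class DatabaseClient:
    """Wraps database connectivity and row loads."""

    def __init__(self, connection_string):
        self.conn = pyodbc.connect(connection_string)

    def insert_rows(self, insert_sql, rows):
        # Execute a parameterized INSERT for each row, then commit
        cursor = self.conn.cursor()
        for row in rows:
            cursor.execute(insert_sql, row)
        self.conn.commit()
        cursor.close()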

Coding is cool, but what about testing?

Just like any other piece of Python code, you can write unit tests for your methods that call Azure SDK code; however, each time a test is run against Azure, a varying amount of expense may be incurred depending on the actions being performed. Although these costs are fairly minuscule, especially if you keep your test data small, you might want to minimize them. To this end, you can use the Python unittest library’s mocking capabilities to intercept any calls the method you’re testing makes to Azure and directly provide an expected return value. This allows for testing your method’s logic only, as opposed to testing an already-tested ADLS SDK method or requiring test data to actually exist on your ADLS. Furthermore, it prevents costs from piling up due to repeatedly running your unit tests, while still fully testing the intent of your code.

Let's assume you have written a method (get_files) that gets only CSV files contained within the parent directories on an ADLS and returns them as a list. To obtain a list of files and folders on the ADLS, the get_files method calls the listdir method of the ADLS SDK AzureDLFileSystem class. It does so first at the top level (Line 46) and again at the second level (Line 47) to find only those files that exist at the second level. It then adds only the CSV files to a list object (by using the endswith method of the string object) and returns that list.

Example method to unit test
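That method is shown above as an image; the sketch below is a hedged reconstruction of what it might look like, assuming placeholder credentials and a small helper that performs the authentication.

from azure.datalake.store import core, lib


def get_adls_token():
    # Authenticate with the service principal credentials from Part 1 (placeholders here)
    return lib.auth(tenant_id="<TENANT ID>", client_id="<CLIENT ID>", client_secret="<CLIENT SECRET>")


def get_files():
    """Return the CSV files that sit one level below the top-level ADLS folders."""
    adls_connection = core.AzureDLFileSystem(get_adls_token(), store_name="<ADLS STORE NAME>")
    csv_files = []
    # List the top-level folders, then list the contents of each one
    for top_level_item in adls_connection.listdir(""):
        for second_level_item in adls_connection.listdir(top_level_item):
            # Keep only the CSV files found at the second level
            if second_level_item.endswith(".csv"):
                csv_files.append(second_level_item)
    return csv_files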

Below is an example of a Python unit test for this method. The @mock.patch decorators above the test method definition override, or mock, the behavior of the ADLS authentication. When calling mock.patch as a decorator, the class, function, or method being mocked in the decorator can be added as an argument to the test (i.e., mock_auth and mock_adls in our example). Each mocked object can then be given a hard-coded return value for when the object is called during test execution. In this case, we are mocking both the lib.auth method and the core.AzureDLFileSystem class, which come from the ADLS SDK. The first one's return value does not matter since it is not called directly in get_files, so we just set it to an empty string. The second mocked object represents the ADLS connection ("adls_connection" in the get_files method above), so it has a listdir method. In the test, this method can be forced to return made-up results in the format it would return normally if not mocked. Recall that get_files calls listdir twice--once for the top level of the ADLS and once for the second level. As a result, we set two return values in the test--"Folder1" for the top level and "Folder1/File1.csv," "Folder1/File2," and "Folder1/EmptyFolder1" for the second level. Because get_files is only concerned with CSV files at the second level, it will ignore the top-level "Folder1" folder, the second-level non-CSV file, and the second-level empty folder, and return just the one CSV file, which matches the expected result of the test.

Python unittest code test method example for mocking calls to ADLS
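The test itself is also shown above as an image; a hedged sketch of what it might look like follows, assuming get_files lives in a hypothetical module named adls_module that imports lib and core from the ADLS SDK.

import unittest
from unittest import mock

import adls_module  # hypothetical module containing get_files


class TestGetFiles(unittest.TestCase):

    @mock.patch("adls_module.core.AzureDLFileSystem")
    @mock.patch("adls_module.lib.auth")
    def test_get_files_returns_only_csv_files(self, mock_auth, mock_adls):
        # lib.auth is not called directly by get_files, so its return value does not matter
        mock_auth.return_value = ""
        # Force listdir to return made-up results in the shape the real SDK would return:
        # the first call covers the top level, the second call covers the second level
        mock_adls.return_value.listdir.side_effect = [
            ["Folder1"],
            ["Folder1/File1.csv", "Folder1/File2", "Folder1/EmptyFolder1"]
        ]

        result = adls_module.get_files()

        # Only the one CSV file at the second level should be returned
        self.assertEqual(result, ["Folder1/File1.csv"])


if __name__ == "__main__":
    unittest.main()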

By mocking the parts of the code that access ADLS, you prevent unit tests from ever hitting Azure and, thereby, your next billing statement. As long as ADLS code is mocked with return values that match the format of their actual, un-mocked return values, these tests prove useful in validating the logic of your code.

In Closing...

Using Python for an Azure Data Lake Store ETL solution can be a perfect fit, but it takes some careful planning to avoid headache and development churn. Remember to keep in mind the overhead involved with opening many small files from ADLS; factor in speed, bulk load capability, and data transformations when determining how to load to your target database; make your code modular, reusable, and easy to follow; and always be thinking about testing and costs.