
Blog February 20, 2018

Working with AWS from a Data Perspective

Recently, more of my projects have involved data science on AWS, or moving data into AWS for data science, and I wanted to jot down some thoughts on coming from an on-prem background about what to expect from working in the cloud.

For some context, in my day-to-day, I work with a variety of tools, including Spark, R, Hadoop, and machine learning libraries like scikit-learn, and plain vanilla Python, so my experience with AWS is coming from that perspective.

AWS Runs on JSON

In on-prem environments, I have had direct access to data through direct software interfaces. Some examples of these have been JDBC/ODBC SQL connections, secure FTP folders, or direct connections to on-prem (or vendor-managed, behind a company firewall) HDFS volumes through Spark.

The overarching design of AWS is to expose everything through REST-style service interfaces, meaning that instead of dealing with SQL UI tools, direct Spark shell access, or RStudio, I found myself dealing with a lot of command-line scripts that pass JSON data structures as configuration parameters.

To understand why AWS does this, check out this fantastic post on the mandate Jeff Bezos issued around 2002.

> 1) All teams will henceforth expose their data and functionality through service interfaces.
> 2) Teams must communicate with each other through these interfaces.

Currently, these interfaces are all customer-facing and accessible through JSON.

For example, if I wanted to move data that landed in a shared directory into an Oracle database on-prem, I'd probably set up an ETL job, either by writing cron scripts or by using a tool like Pentaho to shuttle data from the directory into Oracle over JDBC. All of this would involve a combination of cron scripts, possibly custom Java for Pentaho, and UI-generated transformation steps.

In AWS, I could set up file movement from S3, the object storage service, by triggering a lambda function (for more on lambdas, read on) that writes to Redshift, AWS's data warehousing service.

My steps would include:

1) Create a bucket:

aws s3api create-bucket --bucket my-bucket --region eu-west-1 --create-bucket-configuration LocationConstraint=eu-west-1

The results would be:

"Location": ""

2) Create a lambda that accesses that bucket:

aws lambda create-function \
--region region \
--function-name helloworld \
--zip-file fileb://file-path/ \
--role role-arn \
--handler helloworld.handler \
--runtime python3.6 \
--profile adminuser

Here's what the POST request syntax looks like behind the scenes, in JSON, for a function whose code lives in an S3 bucket:

{
    "Code": {
        "S3Bucket": "string",
        "S3Key": "string",
        "S3ObjectVersion": "string",
        "ZipFile": blob
    },
    "DeadLetterConfig": {
        "TargetArn": "string"
    },
    "Description": "string",
    "Environment": {
        "Variables": {
            "string": "string"
        }
    },
    "FunctionName": "string",
    "Handler": "string",
    "KMSKeyArn": "string",
    "MemorySize": number,
    "Publish": boolean,
    "Role": "string",
    "Runtime": "string",
    "Tags": {
        "string": "string"
    },
    "Timeout": number,
    "TracingConfig": {
        "Mode": "string"
    },
    "VpcConfig": {
        "SecurityGroupIds": [ "string" ],
        "SubnetIds": [ "string" ]
    }
}

Alternatively, passing it in Python would look something like this.
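A minimal sketch using boto3, the Python SDK. The function name, role ARN, bucket, and zip key below are invented placeholders, and the actual API call is commented out because it needs live AWS credentials:

```python
# Build the same JSON parameter structure the CLI assembles for CreateFunction.
# All names here are hypothetical placeholders.

def create_function_params(name, role_arn, bucket, key):
    """Assemble CreateFunction request parameters as a Python dict."""
    return {
        "FunctionName": name,
        "Runtime": "python3.6",
        "Role": role_arn,
        "Handler": f"{name}.handler",
        "Code": {"S3Bucket": bucket, "S3Key": key},
    }

params = create_function_params(
    "helloworld",
    "arn:aws:iam::123456789012:role/lambda-basic-execution",
    "my-bucket",
    "helloworld.zip",
)
print(params["Handler"])  # helloworld.handler

# With credentials configured, the call itself would be:
# import boto3
# client = boto3.client("lambda", region_name="us-east-1")
# client.create_function(**params)
```

Notice that the SDK parameters are just the JSON request body from above, expressed as a Python dictionary.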

As a result, I've gotten very friendly with all sorts of JSON structures:

  • simple objects, {'key': 'value'}, that map directly to Python dictionaries,
  • arrays that map to dicts with list values, {'key': [value, value, value]}, and
  • my favorite, nested arrays that map to lists of dictionaries: {'key': [{'inner_key': 'inner_value'}]}

Fortunately, the Python dictionary is a workhorse data structure that's easy to loop through and reference. Still, there are a couple of Python dictionary methods that have made working with JSON in AWS much easier for me:

1) items() - accesses keys and values together as you loop through the dictionary. Remembering this syntax makes accessing elements easier.

2) The collections library, particularly OrderedDict and defaultdict, which enforce key order and supply defaults for missing values in your dictionaries (although ordered dictionaries are a default in Python 3.6...kind of)

3) The json library, particularly load and dump. Usually the AWS SDK and command line tools take care of this for you, but there are times when you'll want to create some JSON in the CLI to test. This library is specifically designed to convert Python dictionaries to JSON data structures and vice versa, and is good for understanding the internals of JSON structures relative to your code.
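A quick sketch of all three in action (the response shape here is made up for illustration, loosely modeled on a Lambda configuration):

```python
import json
from collections import OrderedDict, defaultdict

# json.loads turns a JSON string (e.g. a CLI response) into a Python dict
raw = '{"FunctionName": "helloworld", "Environment": {"Variables": {"STAGE": "dev"}}}'
config = json.loads(raw)

# items() loops over keys and values together
for key, value in config["Environment"]["Variables"].items():
    print(key, "=", value)  # STAGE = dev

# defaultdict supplies missing values instead of raising KeyError
counts = defaultdict(int)
for service in ["s3", "lambda", "s3"]:
    counts[service] += 1

# OrderedDict guarantees insertion order, even on older Pythons
ordered = OrderedDict([("first", 1), ("second", 2)])

# json.dumps goes the other way, dict -> JSON string, handy for CLI testing
print(json.dumps(config, indent=2))
```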

You'll be interacting with the AWS APIs in a lot of different ways.

There are three main ways to access AWS, from easiest to most involved:

1) Management Console - UI

2) AWS CLI - Command Line + some tweaking

3) AWS SDKs - Full-on writing code

The contrast between this and on-prem work is that (usually) there's only one way to access a development endpoint if you're working inside a single data center.

These three different interfaces mean that, often, I'm split across three different development modes: making one-off manual changes to IAM in the console, running Linux-file-system-like inventory commands in the AWS CLI (for example, listing bucket contents), and building an application made of several components using boto3 in PyCharm.

In this way, AWS can really divert your attention if you let it. That is, there is no pre-defined workflow like there can be in on-prem environments, and you really have to make cognizant choices about how you set up your processes and why.

For example, if you're looking to create an MLLib job doing linear regression in Spark, in an on-prem environment, you'd SSH into your Spark cluster edge node, and write a script accessing HDFS data, to be run through spark-submit on the cluster.

In AWS, you could potentially do the same thing through EMR. You could spin up an EMR cluster using the management console, the CLI or even the SDK, and run a job any of those three ways as well. No one approach is inherently better than another (at least, Amazon doesn't tell you one is), and that's the issue: you'll have to figure out what combination of flexibility and reproducibility works for your given workflow.

Now, let's take a closer look at each of the three environments:

1) The Management Console, my first point of entry into AWS. You can do many basic things with the services there, such as create lambdas, set triggers, upload files to S3 buckets, set up EC2 instances and much more. However, the UI doesn't let you replicate what you've done or automate certain processes.

2) AWS CLI. It's a Python-based tool that you can install (pip install awscli) and use to run recurring commands. The first time you run it, it will also have you configure your local AWS credentials file, which is a must-have for working with AWS.

For example, you can take a look at all of your S3 buckets:

aws s3 ls

or bootstrap an EMR cluster:

aws emr create-cluster --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

You can customize many of these commands by passing in config files that are stored in - you guessed it - JSON, often in S3 buckets.

Since a lot of these commands are repetitive or need to be run again and again (for example, spinning up multiple clusters or continually moving code to AWS), I've started to wrap CLI commands in bash scripts.
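A sketch of what such a wrapper can look like. The bucket name is a placeholder, and DRY_RUN defaults to on so the script only echoes the aws commands instead of running them:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around repetitive CLI calls.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
set -euo pipefail

BUCKET="${1:-my-bucket}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  echo "+ $*"                    # log the command
  [ "$DRY_RUN" = "1" ] || "$@"   # only execute when DRY_RUN=0
}

run aws s3 sync ./data "s3://$BUCKET/data/"
run aws s3 ls "s3://$BUCKET/data/"
```

The dry-run switch is a cheap way to sanity-check what a script will do before it touches (and bills) your account.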

3) SDKs allow you to develop more complex applications outside of the command line. I really like using boto3, the Python SDK, because the documentation is nicely done. The only pain point is that there are many different ways to do the same thing.

For example, in order to access an S3 bucket, you can call a resource, a client or a session. Just to connect to S3, you can do:

s3 = boto3.resource('s3') # resource
session = boto3.Session(region_name='us-east-1') # session
s3 = boto3.client('s3') # client

At a high level, resources are similar to objects, sessions manage connections between various AWS profiles and clients are per-service connections. Crystal clear?

There are good posts out there on the top-level differences and nuances between them.

May I have some permissions, please?

Since you are pushing data to and from someone else's servers, Amazon is very strict about what the given action can do, and each action has fine-grained security permissions, to the point where sometimes it feels like you're navigating through the bureaucracy of an embassy trying to get a passport.

At a high level of abstraction,

> AWS is composed of collections of resources. An IAM user is a resource. An Amazon S3 bucket is a resource. When you use the AWS API, the AWS CLI, or the AWS Management Console to take an action (such as creating a user), you send a request for that action. Your request specifies an action, a resource, a principal (group, user, or role), a principal account and any other necessary request information. All of this information provides context.

There is also a helpful diagram.

At an even more abstract level, it helps to understand that IAM is the central place that manages permissions, and it grants out users and roles, which can write or read to and from the various services on AWS. There's also the concept of policies, which defines what users and roles can access within the context of a service.
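For instance, a minimal policy document granting read-only access to a single bucket might look like this (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```

Note that the bucket itself and the objects inside it are separate resources, which is why both ARNs appear.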

This was a completely new ballgame for me because other than setting up Jira tickets to get access to Oracle or admin access on my laptop, I never had to think about whether the services I used were secure.

In this context, since everything is abstracted from you onto a different server, every POST/GET request you make needs to be authenticated correctly, and services within AWS also need to talk to each other securely in order to keep track of who's accessing what.

Before you can read the docs, you'll need to find the docs.

AWS has a LOT of documentation, and it's not great. In a way, that makes sense. At the beginning of the post I linked to Bezos's famous edict decreeing that "all teams will henceforth expose their data and functionality through service interfaces." The APIs themselves are extremely well-documented, but what happens after you call an API, and the implications of doing so, are not. That culture seems to have carried through to the modern day: the JSON parameters and CLI calls have good reference documentation, but there are few examples and few explanations, particularly for newer or less-used services.

For example, the documentation for creating an S3 bucket doesn't explicitly have a link to the Python SDK, probably one of the most common ways to create buckets, and points instead to the "Sample Code and Libraries" section, which, after some navigation, has a place to download a link to awspylib, which is not the official Python documentation - Boto is.

Much like the API itself, the documentation is not opinionated, and there is no single, concrete place to get started with anything, no mind map or understanding of how documentation is written and structured. In a way, finding your way through AWS documentation is similar to being dropped into a library without a map, where the books are not ordered alphabetically, and the pages in the books are constantly being rewritten.

It makes sense from a business perspective. Writing documentation is often the hardest and most time-consuming part of development with more returns to the end users than to developers. Since Amazon is the market leader in cloud, it has no incentive to improve it in the way that Digital Ocean has to maintain a competitive edge with friendly and accessible documentation.

All of this is to say, I spent a lot of time banging my head against finding concrete examples of AWS CLI recipes I wanted to follow, a lot of time sharing snippets of code with other developers on my team, and much more time on StackOverflow, in the amazon-web-services tag.

Learning to love lambdas

Lambdas, not to be confused with Python anonymous lambda functions, are fantastic. I've used them in several projects and each time I find a great new use case for them.

In short, they are the glue that holds AWS processes together: ephemeral functions that run in response to given triggers in the AWS ecosystem.

Here's a short list of things they can do (much more in the docs):

  • Run when a file is placed in an S3 bucket and move the file to another location.
  • Process logs with Kinesis streams and store the logs in DynamoDB.
  • Compile a static website based on a new code change and reload the site.
  • Backup files and put them into cold storage.

Think of them as small functions that can be called across the AWS universe.
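As a tiny sketch of the first case above, an S3-triggered handler that moves new files. The bucket names are invented, and the actual copy call is commented out since it needs boto3 and live credentials; the event shape is the real S3 notification format, trimmed down:

```python
# Sketch of a Lambda handler for an S3 "object created" trigger.
# "archive-bucket" and the sample names below are hypothetical.

def keys_from_event(event):
    """Extract (bucket, key) pairs from an S3 notification event."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event["Records"]
    ]

def handler(event, context):
    pairs = keys_from_event(event)
    # With credentials, the move itself would be:
    # s3 = boto3.client("s3")
    # for bucket, key in pairs:
    #     s3.copy_object(Bucket="archive-bucket", Key=key,
    #                    CopySource={"Bucket": bucket, "Key": key})
    return {"moved": len(pairs)}

# A trimmed-down version of the event S3 actually sends to the lambda:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "landing-bucket"},
                "object": {"key": "data/file.csv"}}}
    ]
}
print(handler(sample_event, None))  # {'moved': 1}
```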

Keep in mind the limitations of lambdas:

  • 512 MB of disk space.
  • Execution timeout after 5 minutes.
  • 128 to 1536 MB of memory.
  • boto3 is included as a runtime library.

A really neat thing about them is the temporary file storage (each lambda has 512 MB of non-persistent disk space in its own /tmp), so you can perform file operations inside the lambda at runtime.

I've only begun to explore their full capabilities in my projects, but, coming from environments where communication between services was tedious and required a complete buildout of new ETL tools, lambdas are lightweight, relatively easy to use and one of the few recent development tools that I've found to be fun.

I heard you like services. Here's over 100 of them.

Another interesting service I had the chance to work with is Glue, one of AWS's newest offerings. Glue was introduced in December 2016 and launched to consumers late spring, so it's just over six months hardened at this point.

At launch, it was described as an ETL tool that handles data movement through PySpark on auto-generated EMR instances, creating a metadata catalog of all your data in S3 buckets, and being able to manage those jobs from a single interface.

This is great, except there are already several similar services. It only makes sense that some overlap, given that AWS currently stands at 131 total services. For Glue, the analogous services are EMR, Data Pipeline, Kinesis Analytics and AWS Batch. Which one to use?

It turns out that each one has a slightly different use case, and it takes some playing around to figure out if your specific case is covered by the service, whether your data can move between one and the other correctly.

I found working with Glue to be relatively straightforward, but the real key was trying to figure out which workflow AWS was optimizing the product for. In my case, Glue felt like the right tool if:

A) You have sane and clean S3 bucket structures from which to pull data.

B) You have standard, scheduled data flows.

C) You just want to move files from S3 into Athena-readable Parquet files or similar.

D) You're comfortable with not knowing what your EMR spin-up will look like, or how long it will take.

E) You're comfortable with working with Spark code that's 2 layers removed from Spark (wrapped in EMR code, which is then wrapped in Glue code).

My main takeaway is that each service takes time to evaluate, price and figure out the API for, whereas in an on-prem instance, you're usually limited to a range of 3-4 tools that you have to get creative with.

Testing is harder.

For Lambdas and Glue, and many other services I've worked with, I discovered that testing is harder.

In on-prem, in-house applications, you write some code and then write unit tests against that piece of code. You can also run integration tests to see how those pieces will work with other pieces of code external to the codebase you're working on.

For example, you can unit test PySpark code by creating a cluster locally and using pytest fixtures to recreate a session and assert your expected results.

This is harder to do in AWS, because it's next to impossible to recreate an entire AWS environment, with all the services you need, locally. Say you're trying to put files into an S3 bucket, create a table in Athena, or stream files through Kinesis, and tie those actions together with lambdas.

One way I've tried to fill this gap is with moto, which mocks out (creates dummy copies of) calls to services made through boto, using Python decorators.

For example, you can mock a connection to S3, create a bucket, pass text into that bucket, call that "fake bucket" to return text, and make sure that the text matches what you have in your code, all without accessing AWS.

You can't necessarily test what S3 will pass to other AWS services, but you can at least test parts of your code locally. The rest involves testing job runs in an AWS dev account, but you are racking up costs for each test you run.

This brings me to:

You're paying Amazon money every time you make a typo.

This was hard to wrap my head around since, in an on-prem environment, you pay for servers up front and the costs depreciate over time. You're not paying to spin up a Hadoop cluster or for every job run. In AWS, you are.

So, every time you test something in AWS, and you forget a semicolon or a tab, or point the service to the wrong directory, you're paying money for your mistakes.

Every single action you perform, from loading to storing files, to running models, to saving those models back out, incurs some sort of fractional cost that is hard to keep track of. Because AWS is so flexible, there are many different options to consider for each possible service, each byte of data you move, each user permission, region and availability zone.

It's easy to approximate costs, but harder to pinpoint, and even harder to forecast. The best thing I did was set up Billing Alerts to monitor proactively.


Overall, if I had to describe my experience of working with AWS, I'd say that it requires a different mindset than on-prem. Instead of having everything included and being locked into a particular environment, AWS is a lot more granular. You're trading lock-in for flexibility, with the caveat that what you get may be too flexible, and knowing that AWS doesn't give you any opinions on how to work with what you have. It's up to you to make the right choices, with the consequence that you can make more mistakes along the way, but once you figure out how to best leverage the environment, you can get a lot out of it.