Max's notebook

A collection of sorts

Boto Over Time

05 Jul 2020

If you’ve worked with AWS using Python, then you’ve come across the AWS SDK. The current generation is boto3 and the previous one is boto. You can use both side by side in the same codebase, and after a few incidents caused by exactly that, I never will.

How many ways can you grant an app running on an EC2 instance access to AWS resources?

Here are a few:

dedicated IAM credentials in an app-specific config file
dedicated IAM credentials in the .boto file or in the .aws/ directory
instance profiles or roles

Guess what I inherited? (hint: it was all of them.)
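
A quick inventory makes it easier to see which of these are in play on a given host. The sketch below is a rough starting point, not boto's or boto3's actual resolution logic: it only reports which well-known credential locations exist for the current user. The function name and the `app_config` parameter are made up for illustration; the environment variable names and file paths are the standard ones the SDKs read. Instance profiles are not detected here, since those come from the instance metadata service rather than from a file.

```python
import os
from pathlib import Path


def credential_sources(home=None, environ=None, app_config=None):
    """Return the names of credential locations that exist on this host.

    This only checks for presence. It does not reproduce the order in
    which boto or boto3 would actually pick one of them.
    """
    home = Path(home) if home else Path.home()
    environ = os.environ if environ is None else environ
    found = []

    # Environment variables read by both SDK generations.
    if "AWS_ACCESS_KEY_ID" in environ and "AWS_SECRET_ACCESS_KEY" in environ:
        found.append("environment")

    # Legacy boto config file.
    if (home / ".boto").is_file():
        found.append("~/.boto")

    # Shared credentials and config files under ~/.aws/.
    for name in ("credentials", "config"):
        if (home / ".aws" / name).is_file():
            found.append(f"~/.aws/{name}")

    # Hypothetical app-specific config file (path supplied by the caller).
    if app_config and Path(app_config).is_file():
        found.append(f"app config: {app_config}")

    return found


if __name__ == "__main__":
    for source in credential_sources():
        print(source)
```

Running it as each user that owns a cron or a service is the useful part: more than one line of output means more than one way in.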

Bonus round: Configuration management

This application uses ansible for config management, which suffers from the exact same issue: it uses the same libraries, sometimes both in parallel (for an example, see the s3 module). That made debugging and deploying reliable fixes harder still.

…WAT

The application originally got its access by reading dedicated creds from the config file. While this isn’t ideal (roles with short-lived credentials ftw), I’ve seen it a lot.
Some crons and app functionality needed access to $UNIQUE_SERVICE_SET_1 and didn’t read from the config file, so they read from the running user’s .boto or .aws files.
When a new instance was provisioned, scripts run by cloud_init needed access to $UNIQUE_SERVICE_SET_2, so they read from the environment and got access through the instance profile and role.
ansible… well, ansible DGAF #YOLOSWAG. From the aws_s3 module docs:

Ansible uses the boto configuration file (typically ~/.boto) if no credentials are provided. See https://boto.readthedocs.io/en/latest/boto_config_tut.html

sob

And then? We find it and kill it

Spelunk the App the first: find all the code that loads the IAM creds, and identify the services and calls made
Spelunk the App again: compare these calls against the IAM policy, and patch to match the code when needed
Remove the creds from the config file and cross your fingers
Test: did it work?
Ship it if so, fix it if not
Remove the user creds
Spelunk the config management: find the calls and services, remove unused, patch the policy where required
Remove the non-instance profile creds
Test it again: how about now?
How about the crons? You did check the crons, didn’t you? (Narrator: they did check the crons)
Ship it if so, fix it if not
Find surprise edge cases and cross-service library usage by watching breakage in prod
Cry while fixing and testing and shipping
Express anger at vague error messages
Express gratitude for fast deployments
Go to bed. It was a very long week
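
The "compare these calls against the IAM policy" step can be roughed out in code. This is a minimal sketch under loud assumptions: the set of actions the code uses is gathered by hand, the policy is already loaded as a dict, and only the `Action` lists of `Allow` statements are considered. It ignores `Deny`, `NotAction`, `Resource`, and `Condition`, so it can tell you what is obviously missing, not what is actually permitted. Both function names are hypothetical.

```python
import fnmatch


def allowed_patterns(policy):
    """Collect the Action patterns from the Allow statements of a policy dict."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    patterns = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        patterns.extend(actions)
    return patterns


def uncovered_actions(used, policy):
    """Return the actions in `used` that no Allow pattern matches.

    Action names are compared case-insensitively, with * and ? wildcards.
    """
    patterns = [p.lower() for p in allowed_patterns(policy)]
    return sorted(
        action for action in used
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    )


if __name__ == "__main__":
    # Made-up policy and call list, for illustration only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:Get*", "s3:ListBucket"], "Resource": "*"},
        ],
    }
    used = {"s3:GetObject", "s3:PutObject", "sqs:SendMessage"}
    print(uncovered_actions(used, policy))  # ['s3:PutObject', 'sqs:SendMessage']
```

Anything this prints is a candidate for patching the policy before the old creds come out, which beats finding it through breakage in prod.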
