If you’ve worked with AWS using Python, then you’ve come across the AWS SDK. The current generation is boto3 and the previous one is boto. You can use both side by side in the same codebase, and after a few incidents caused by exactly that, it is something I will never do.
How many ways can you grant an app running on an EC2 instance access to AWS resources?
Here are a few:
dedicated IAM credentials in an app-specific config file
Guess what I inherited? (hint: it was all of them.)
Bonus round: Configuration management
This application uses Ansible for config management, which suffers from the exact same issue because it uses the same libraries, sometimes in parallel too (for an example, see the s3 module). That made debugging and deploying reliable fixes harder still.
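One way to see how far the mixing goes is to list every Python file that imports both libraries. Here is a minimal sketch using only the standard library; the helper names `sdk_imports` and `mixed_sdk_files` are made up for illustration, and it only catches plain `import` statements, not dynamic imports.

```python
import ast
from pathlib import Path


def sdk_imports(source):
    """Return which of {'boto', 'boto3'} a piece of Python source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # "boto.s3.connection" counts as boto; "botocore" counts as neither.
            top = name.split(".")[0]
            if top in ("boto", "boto3"):
                found.add(top)
    return found


def mixed_sdk_files(root):
    """Yield paths under root whose source imports both boto and boto3."""
    for path in sorted(Path(root).rglob("*.py")):
        try:
            found = sdk_imports(path.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            continue  # skip files this interpreter cannot read or parse
        if found == {"boto", "boto3"}:
            yield path
```

Pointing `mixed_sdk_files` at the app checkout and at the playbook and module directories gives a first list of suspects. Files that only parse under Python 2 are silently skipped, so treat the result as a lower bound.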
…WAT
The application originally got its access by reading dedicated creds from the config file. While this isn’t ideal (roles with short-lived credentials ftw), I’ve seen it a lot.
Some crons and app functionality needed access to $UNIQUE_SERVICE_SET_1 and didn’t read from the config file, so they read from that user’s .boto or .aws files.
When a new instance was provisioned, scripts run by cloud_init needed access to $UNIQUE_SERVICE_SET_2, so they read from the environment and got access through the instance profile and role.
Ansible uses the boto configuration file (typically ~/.boto) if no credentials are provided. See https://boto.readthedocs.io/en/latest/boto_config_tut.html
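With that many sources in play, it helps to inventory which ones actually exist on a given host for a given user. A rough sketch follows, assuming the default file locations (`~/.boto`, `~/.aws/credentials`) and the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables. The function name and the `app_config` parameter are hypothetical, and the check only looks for the presence of files, not for what is inside them.

```python
import os
from pathlib import Path

# Environment variables that both boto and boto3 honor for static credentials.
ENV_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def find_credential_sources(env, home, app_config=None):
    """Return the names of static credential sources present for one user.

    env: a mapping such as os.environ
    home: home directory of the user the process runs as
    app_config: optional path to an app-specific config file (hypothetical)
    """
    home = Path(home)
    found = []
    if all(env.get(key) for key in ENV_KEYS):
        found.append("environment")
    if app_config is not None and Path(app_config).is_file():
        found.append("app-config")
    if (home / ".boto").is_file():
        found.append("boto-config")
    if (home / ".aws" / "credentials").is_file():
        found.append("shared-credentials")
    return found


if __name__ == "__main__":
    sources = find_credential_sources(os.environ, Path.home())
    # An empty list only means none of the sources checked here were found.
    print(sources or "none of the checked sources found")
```

Run it once per user that owns a cron or an app process, since each user has its own home directory and environment.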
And then? We find it and kill it
Spelunk the App the first: find all the code that loads the IAM creds, and identify the services and calls made
Spelunk the App again: compare these calls against the IAM policy, and patch the policy to match the code when needed
Remove the creds from the config file and cross your fingers
Test: did it work?
Ship it if so, fix it if not
Remove the user creds
Spelunk the config management: find the calls and services, remove unused, patch the policy where required
Remove the non-instance profile creds
Test it again: how about now?
How about the crons? You did check the crons, didn’t you? (Narrator: they did check the crons)
Ship it if so, fix it if not
Find surprise edge cases and cross-service library usage by watching breakage in prod
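The "compare these calls against the IAM policy" steps can be partly mechanized. Below is a minimal sketch, assuming you have already collected the actions the code uses as strings like `s3:GetObject` and have the policy document loaded as a dict. It only looks at `Action` in `Allow` statements and ignores `Resource`, `Condition`, `NotAction`, and `Deny`, so it is a first pass and not a policy evaluator. The function names are made up for illustration.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list of strings; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def allowed_patterns(policy):
    """Collect lowercased Action patterns from the Allow statements of a policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") == "Allow" and "Action" in statement:
            patterns.extend(a.lower() for a in _as_list(statement["Action"]))
    return patterns


def diff_policy(used_actions, policy):
    """Return (missing, unused), both lowercased.

    missing: actions the code uses that no Allow pattern covers
    unused: Allow patterns that match none of the used actions
    """
    patterns = allowed_patterns(policy)
    # Action names are case-insensitive in IAM, so compare in lowercase.
    used = sorted({a.lower() for a in used_actions})
    missing = [a for a in used if not any(fnmatchcase(a, p) for p in patterns)]
    unused = [p for p in patterns if not any(fnmatchcase(a, p) for a in used)]
    return missing, unused
```

Anything in `missing` is a patch to the policy; anything in `unused` is a candidate for removal, once the crons and the config management have been checked too.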