OpenShift on AWS Caveats
Cloud-based and on-premises OpenShift deployments each have their own unique set of challenges. From a consulting perspective, I generally view cloud as easier in terms of orchestration, but with the possibility of deeper technical issues.
The main challenges people face with OCP on AWS are integration with the cloud provider plugin, registry storage, DNS, and successfully managing the AWS and OpenShift layers in harmony.
Kubernetes Cloud Provider Plugin
Don’t talk to me about Azure
Cloud provider plugins allow Kubernetes to integrate with the platform hosting it. The general objective of these plugins is to add features and increase reliability. At the time of writing, the AWS Kubernetes plugin adds two features: creating Elastic Load Balancers (ELBs) and dynamic storage via Elastic Block Store (EBS) (if you create a PV in Kubernetes, the plugin requests a disk with that amount of storage and attaches it).
This plugin is currently pretty underutilized, but integration is still recommended because of features planned for the future. The case for provisioning ELBs is nullified by the OpenShift Router.
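As a quick illustration of the dynamic storage side, a StorageClass backed by the AWS EBS provisioner looks roughly like this (the class name and zone are placeholders, not anything this cluster requires):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  # Optional: pin volumes to one AZ to match where your nodes run,
  # since EBS volumes cannot cross Availability Zones
  # zone: us-east-1a
```

Any PVC that references this class triggers the plugin to create and attach an EBS volume of the requested size.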
Storage, Stateful Applications, and Limitations
Stateless apps are easy
EBS volumes are block devices. They are not shared storage, and they are bound to their respective Availability Zones. These limitations need to be kept in mind. The first thing this affects is the internal Docker registry when there are multiple replicas of the pod. The recommended workaround is to use an S3 bucket as registry storage. This practice has pretty solid performance, so even if you have another storage solution in place for OCP on AWS, it is still the recommended practice.
To escape the limitations of EBS, you could use NFS (not recommended for anything significant, but fine in a lab) or something more reliable like OpenShift Container Storage (containerized or external).
DNS
In the vast majority of installs in new environments, you will run into DNS issues, and cloud providers are no different. DNS is painful for users new to OCP/AWS to troubleshoot, especially in environments that deviate from standard procedure.
Most guidelines online assume Route53 is being used for DNS. If you're using GovCloud, there is no Route53 available, making problem solving even more interesting. Route53 is easy to manage; branching away from it is where we start running into problems.
Most cloud provider plugins (including AWS) require the Kubernetes NodeName to match whatever the cloud provider has the node registered as. In Amazon, this is often ip-x-y-z-q.ec2.internal. Most people don't care for this because the output of oc get nodes isn't quite as pretty as in most clusters, and it's harder to keep track of nodes:
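For example (hypothetical output; the internal hostnames depend on your VPC CIDR):

```
$ oc get nodes
NAME                        STATUS    ROLES     AGE
ip-10-0-1-10.ec2.internal   Ready     master    1d
ip-10-0-2-20.ec2.internal   Ready     infra     1d
ip-10-0-3-30.ec2.internal   Ready     compute   1d
```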
To check your meta-data hostname:
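From the instance itself, you can query the EC2 instance metadata service, which lives at a fixed link-local address:

```shell
# Ask the instance metadata service for the private DNS hostname
# (returns something like ip-10-0-1-10.ec2.internal)
curl -s http://169.254.169.254/latest/meta-data/local-hostname
```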
In a VPC, the second IP after the network address is reserved for the Amazon DNS server. 169.254.169.253 is also available, but it only returns default values (not usable if custom FQDNs are configured).
So: the installer needs to resolve the nodes via Amazon private DNS, and the hostname needs to be set to what Amazon knows it as. If you use custom DNS but change the hostname, the control plane will fail to come up, because the node ID is based on the name in the Ansible inventory. If you change the hostname to account for this, the cloud provider plugin fails to initialize. If you use only private AWS DNS, the install will fail because the masters cannot verify the install, which requires successfully resolving the load balancer.
There are two solutions to this:
Add the private resolutions to your non-amazon DNS.
Configure dnsmasq to fall back on the Amazon DNS server for private (ec2.internal) routes.
This is a pretty cool workaround a coworker showed me:
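A minimal sketch of the dnsmasq side of that fallback, assuming a 10.0.0.0/16 VPC whose Amazon resolver therefore sits at 10.0.0.2 (the file path and CIDR are placeholders):

```
# /etc/dnsmasq.d/aws-fallback.conf (hypothetical path)
# Forward lookups for the EC2 internal domain to the Amazon VPC resolver
server=/ec2.internal/10.0.0.2

# Reverse lookups for the VPC CIDR also go to the Amazon resolver
rev-server=10.0.0.0/16,10.0.0.2
```

Everything else continues to resolve through your normal upstream DNS, so the installer and the cloud provider plugin both get the answers they expect.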
Registry Storage via S3 Bucket
This feels weird, but it’s pretty cool
This is supported out of the box and can be stood up automatically by the OpenShift installer, provided the S3 bucket exists and you provide the key or have the correct IAM roles in place:
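A sketch of the relevant openshift-ansible inventory variables (the bucket name, region, and credentials are placeholders):

```ini
[OSEv3:vars]
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=AWS_ACCESS_KEY_ID
openshift_hosted_registry_storage_s3_secretkey=AWS_SECRET_ACCESS_KEY
openshift_hosted_registry_storage_s3_bucket=my-registry-bucket
openshift_hosted_registry_storage_s3_region=us-east-1
openshift_hosted_registry_storage_s3_rootdirectory=/registry
```

If the nodes carry an IAM role with S3 access, the access and secret key variables can be left out.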
The generated storage section of the registry configuration looks like this:
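Roughly, it follows the docker-distribution s3 storage driver format (values here are placeholders):

```yaml
storage:
  cache:
    layerinfo: inmemory
  delete:
    enabled: true
  s3:
    accesskey: AWS_ACCESS_KEY_ID
    secretkey: AWS_SECRET_ACCESS_KEY
    region: us-east-1
    bucket: my-registry-bucket
    encrypt: false
    secure: true
    v4auth: true
    rootdirectory: /registry
```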
This is kind of confusing on the Kubernetes side, because the configuration is stored as a secret. oc describe dc docker-registry -n default gives no insight that S3 storage is being used (it shows EmptyDir). The only way to confirm it using kubectl/oc is:
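A sketch, assuming the installer stored the registry configuration in the default registry-config secret under the key config.yml:

```shell
# Dump the registry configuration out of the secret and look for the s3 section
oc get secret registry-config -n default \
  -o jsonpath='{.data.config\.yml}' | base64 -d
```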
Or you can just view your bucket via the AWS console and you’ll see the registry files show up in /registry.
IAM Roles
IAM roles allow or deny access to AWS resources. In this context, we use IAM roles to grant Kubernetes permission to request EBS volumes and to connect to the S3 registry.
This is the role to connect to the registry; attach it to the infra nodes:
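A sketch of the S3 policy for the registry bucket (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::my-registry-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::my-registry-bucket/*"
    }
  ]
}
```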
For the cloud provider plugin, attach this role to Masters:
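A sketch of the master policy, covering the EC2 calls the plugin makes for dynamic EBS provisioning plus the ELB integration:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "ec2:CreateVolume",
        "ec2:DeleteVolume",
        "ec2:CreateTags",
        "elasticloadbalancing:*"
      ],
      "Resource": "*"
    }
  ]
}
```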
All other nodes need:
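The non-master nodes only need to be able to describe EC2 resources so the plugin can identify them; a sketch:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:Describe*",
      "Resource": "*"
    }
  ]
}
```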
Implementation Knowledge Gap
There are tons of well-written Ansible playbooks that build all of the infrastructure from scratch; you just give them a key and they work. But they assume 100% AWS components, are not flexible, and could be deprecated overnight.
The largest challenge we face on the operations side of cloud-hosted OpenShift is the knowledge gap sustained by how fast, and in how many directions, things can change. It is crucial to be able to react and adapt effectively to changes that could come to OpenShift, Kubernetes, AWS, or your organization's architecture.