When bitcoin exchange Coinbase launched two and a half years ago, Heroku, a simple hosting platform, was the right solution, said Brian Armstrong, Coinbase co-founder, writing on Medium. Heroku was more battle-tested than anything the founders could have built on their own.
But the founders realized that they had to build the next version of their infrastructure from the ground up, giving special attention to security.
After one year, the team completed the transition to a new infrastructure that runs inside AWS. The experience gained in developing this solution can serve as a starting point for building production infrastructure in the cloud.
Coinbase currently stores 10% of all bitcoin in circulation. Armstrong noted that the company’s security measures are constantly evolving.
Two Key Security Principles
Two of the key principles they followed are eliminating single points of failure and layering security. Both rest on the idea of not putting all your eggs in one basket: seek consensus and redundancy among different parties. The same concepts are applied in nuclear launches, corporate governance, bank security, certificate authorities and human resources.
One example is the way one secures an administrator account on AWS with a two-factor token controlled by another person.
If one person controls the password to the account, the second-factor token goes to another party. The second factor is stored in a safe deposit box or vault so that it has physical security in addition to cryptographic security. This prevents a single person from accidentally or intentionally destroying the company.
Developers should not need production SSH access for their regular tasks of deploying code, debugging, spinning up new services, etc. But it is hard to remove the need for SSH access entirely. Some employees will always require a way to debug problems.
The Lock Down Process
When someone in the company needs SSH access, the company can apply the following lock down process:
1) Add two-factor authentication to SSH. Each SSH session should require a second factor. Duo two-factor authentication for SSH pushes a request for approval to a phone. Another option is a FIDO U2F key, which is similar to a small hardware security module on a USB stick. The company can also require all SSH sessions to be “pair programmed” by separating the keys between two people.
2) Use special laptops for SSH access. It is important to prevent anyone from using a regular laptop to SSH into production. Most high-profile breaches are due to malware that lands on a laptop through spear phishing. Some people assume that 0-day vulnerabilities or other sophisticated techniques cause hacks. In fact, simple spear phishing, in which someone clicks on a spoofed email link, is a greater cause. Some attackers dedicate six months or more to developing relationships for the purpose of spear phishing. Companies should place dedicated machines in a locked room that only certain individuals can access. There should also be a Dropcam in the room to record who enters and exits. These machines should not be used to browse the Internet or open email, and they should be wiped regularly.
3) Audit SSH access heavily. Establish bastion hosts for all SSH requests. Restrict access to these least-privileged hosts and notify the team whenever they are accessed. PagerDuty can alert people when certain commands are run. To prevent untraceable actions after access, log every action and keystroke going through the bastion host. Coinbase wrote custom software for this portion and may open source it. SSH log storage is equally important, since the logs contain sensitive information. Coinbase runs a separate disaster recovery environment to guarantee that each action in its environment is stored for a minimum of 10 years. Immutable logging provides an audit trail for determining the cause of a breach.
4) Limit SSH access to those who are less likely to steal. Special rules should govern production access. Every employee granted access needs to have a background check. The company should be prepared to seek an arrest warrant if something goes wrong. This can create controversy, but consideration should be given to limiting access to citizens of the country of operation.
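The immutable logging described in step 3 can be illustrated with a hash-chained log, in which each entry commits to the one before it so that any later edit is detectable. This is a minimal sketch only, not Coinbase's custom bastion software; the function names `append_entry` and `verify_chain` and the entry fields are assumptions made for illustration.

```python
import hashlib
import json
import time

# Placeholder "previous hash" for the first entry (an assumption of this sketch).
GENESIS = "0" * 64


def _digest(body: dict) -> str:
    """Hash an entry body in a stable key order."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list[dict], user: str, command: str) -> dict:
    """Append a record that commits to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"ts": time.time(), "user": user, "command": command, "prev": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Return False if any entry was altered, removed or reordered."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only detects tampering. Keeping copies in a separate environment, as the article describes, is what stops an attacker from rewriting the whole chain.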
Coinbase decided to store 98% of customer bitcoin offline in a safe deposit box.
The first version included USB drives and paper backups stored in a safe deposit box at a bank.
The New Cold Storage
Version three of Coinbase’s cold storage looks different. The company generates keys in a secure environment. The keys are split using Shamir’s secret sharing. Every private key consists of different parts. Some subset of the pieces is needed to restore the secret. In this manner, the secret can be recovered if some pieces get lost. The system also requires a quorum of key holders to restore a key.
Coinbase distributes key holders geographically and follows a protocol during key signing ceremonies to verify holders’ identities.
The original post gives an example of generating a two-of-three key, where at least two of the three pieces are needed to recombine the secret, using HashiCorp’s open source Vault project.
A company can require five of 10 pieces or any threshold it chooses.
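The threshold idea behind Shamir’s secret sharing can be sketched in a few lines. The code below is an illustrative toy over a prime field, not Vault’s implementation and not Coinbase’s key ceremony code. The names `split_secret` and `recover_secret` and the choice of prime are assumptions, and a real deployment should rely on an audited library.

```python
import secrets

# A Mersenne prime comfortably larger than a 256-bit secret.
# The field size is an assumption chosen for this sketch.
PRIME = 2**521 - 1


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to get the secret back."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Calling `split_secret(key, 2, 3)` yields three pieces, and passing any two of them to `recover_secret` returns `key`. A single piece on its own reveals nothing about the secret.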
It is important to log everything occurring across all containers in the infrastructure.
Having a good audit trail is critical. Worse than getting hacked is getting hacked and not knowing how it occurred. The only option in that scenario is to hope you have patched the right thing before relaunching.
Proper logging also prevents theft. People are less likely to steal when they believe they may get caught.
High Variety Logs
An environment focused on low latency and high variety logs required a new design for Coinbase’s new infrastructure. To minimize the complexity of logging, the company sought to push all logs through a single place that could be consumed in many ways. Running bitcoin nodes globally required logging endpoints that are accessible across numerous networks.
To reduce the complexity of adding consumers and log producers, Coinbase now pipes each event through a streaming, distributed log that provides flexible processing with an at-least-once guarantee, as well as a multi-day data buffer that can be replayed as required.
The company runs a fleet of Docker containers to process everything in this pipe, performing different evaluations and transformations and transferring data to more permanent homes for search, archival and more.
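At-least-once processing means a consumer may see the same event twice after a failure, so consumers usually checkpoint their position and skip events they have already handled. The sketch below models the replayable buffer as a plain list. The `consume` function, the `id` field and the in-memory `seen_ids` set are assumptions for illustration and do not reflect how Coinbase's consumers are written.

```python
from typing import Callable


def consume(
    buffer: list[dict],
    checkpoint: int,
    handler: Callable[[dict], None],
    seen_ids: set,
) -> int:
    """Process events from `checkpoint` onward and return the new checkpoint.

    If `handler` raises, the caller keeps its old checkpoint and replays from
    there later; `seen_ids` keeps replayed events from being handled twice.
    """
    for offset in range(checkpoint, len(buffer)):
        event = buffer[offset]
        if event["id"] not in seen_ids:
            handler(event)
            seen_ids.add(event["id"])
        checkpoint = offset + 1
    return checkpoint
```

Because the buffer holds several days of events, a consumer that crashes or is newly added can start again from an older checkpoint and catch up without data loss.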
Coinbase built another piece of software that seeks irregularities in the logs that flow through Kinesis. When it detects something, there are three alert levels.
Warnings appear in the infrastructure Slack channel, where the team can observe them for context. Someone attempting to brute force passwords and running into the rate limiting would be an example.
Errors trigger PagerDuty to alert someone and represent a more serious issue calling for immediate attention. Unusual movement of funds would be an example.
Critical issues can trigger a kill switch that shuts down critical services, such as outgoing payment processing. Kill switches need their own key signing ceremonies. Unauthorized access to certain services and machines is an example.
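The three alert levels can be pictured as a small routing table. This is a hedged sketch: the `Severity` enum and `build_router` function are hypothetical names, and the three handlers are plain callables standing in for a chat post, a page and a kill switch rather than real Slack or PagerDuty API calls. Having each level also fire the handlers of the levels below it is an assumption of this sketch, not something the original post states.

```python
from enum import Enum
from typing import Callable

Handler = Callable[[str], None]


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


def build_router(
    post_to_chat: Handler,
    page_on_call: Handler,
    trip_kill_switch: Handler,
) -> Callable[[Severity, str], None]:
    """Return a function that sends a message to the handlers for its severity."""
    handlers = {
        Severity.WARNING: [post_to_chat],
        Severity.ERROR: [post_to_chat, page_on_call],
        # Assumption: critical alerts also notify chat and the on-call person.
        Severity.CRITICAL: [post_to_chat, page_on_call, trip_kill_switch],
    }

    def route(severity: Severity, message: str) -> None:
        for handler in handlers[severity]:
            handler(message)

    return route
```

Keeping the mapping in one table makes it easy to review which events can reach the kill switch.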
Addressing Common Tasks
Deploying new code is one of the most common tasks. Coinbase has developed tools around the idea of consensus. It has a three-phase process in which anyone can propose a change. Consensus is required to apply a change.
A tool called Sauron comments on every pull request, requiring approvals before deploying code into production.
A branch might require a +1 from two developers besides the author. More sensitive services require more approvers. In periods of higher risk, such as when a laptop has been compromised, an employee can dial up the required number of +1s as much as needed system-wide without blocking all deploys. This protects against cases where one or more developers have malware on their laptops.
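The approval rule can be expressed as a small check. This is only a sketch of the consensus idea; how Sauron actually counts approvals is not described in the source, and `approvals_met` and its parameters are hypothetical.

```python
def approvals_met(
    author: str,
    plus_ones: set[str],
    developers: set[str],
    required: int,
) -> bool:
    """Return True when enough known developers other than the author approved."""
    # Ignore +1s from the author and from anyone who is not a known developer.
    independent = (plus_ones & developers) - {author}
    return len(independent) >= required
```

Raising `required` during a period of higher risk tightens every deploy at once without blocking deploys outright.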
Consensus is also used when updating the environment, such as the docker-compose files for launching services.
Coinbase has run entirely on Docker in production for more than one year. Before the new deployment tools, the company began building its own tool called CodeFlow. This tool lets each developer deploy their code by combining a Docker Compose file, a Dockerfile and environment variables for 12-factor applications.
Other topics to consider are: red team drills, bug bounty programs, pen tests with outside firms, working with vendors storing PII, incident response, and educating new developers.
Images from Medium/Brian Armstrong.