The following is a guest post by Janusz Jezowicz, CEO of Speedchecker. The Speedchecker team runs a global distributed measurement network and offers speed test solutions using the Cloudflare platform.
Software companies wishing to offer a public API to third-party developers have many options for doing so securely, reliably and with fast performance. When it comes to cost, however, commercial solutions are expensive, and open source solutions take a lot of time to manage, including keeping servers synchronized. This blog post describes how we successfully moved our API gateway to Cloudflare Workers and reduced our costs by a factor of 10.
Our original solution based on the open source API gateway Kong
When we created the API for our measurement network, we opted for the open source Kong solution for cost reasons. Kong is a great solution with a vibrant community of users and plug-in developers who extend and maintain the platform. It is a good alternative to commercial solutions from companies like Apigee or Mulesoft, which are really aimed at large companies that can afford them. Kong is free and it works. On the other hand, if your company has complex API management requirements, e.g. powerful analytics, access control or user-friendly administration, you will need plug-ins. Kong plug-ins are often not free, and you end up with costs that come close to those of commercial solutions.
At Speedchecker we already have a lot of metrics and access control logic within the API itself, so we liked the basic functionality of Kong. Here you can see a simplified architecture diagram of our API gateway:
Naturally, two instances of Kong are needed if we want to provide a reliable service to our customers. Kong offers flexibility in the database it uses for its API management engine. We experimented with MySQL and moved to PostgreSQL for its better replication support using Bucardo.
We ran this solution for over a year, and in production we learned about the following drawbacks of our architecture:
- Although our Azure Cloud Service is scalable, our Kong instance is not. With increasing load, we were worried that the instance could fail if we did not anticipate a traffic increase and had not scaled the VM accordingly.
- Setting up replication was quite complex, and we had incidents where it failed and we spent days trying to figure out why and how to fix it. During those times we were down to a single live instance and, had it gone down, our API would not have worked.
- We had at least one incident where a rogue actor launched a DDoS attack on our customer-facing website (not the API). Had the attacker targeted our Kong endpoints, we would not have been able to protect our API.
- API management became more complex rather than simpler, which is not how it should be once an API gateway is in place. Our API uses apikey authentication, and a user's apikey gives access to all our APIs. Quotas per user are based not on the number of API calls but on the number of measurement results we produce on behalf of the user. Each API call can produce a different number of measurement results, so the complex quota logic and billing calculations must be performed in the Azure API and not in Kong. This means that the central repository for user apikeys and their quotas is in the Azure API, and we need to keep it synchronized with Kong. For these reasons, many of the plug-ins do not make sense for us to use, because we have already done all the work on the Azure API side.
- While we saved money on a commercial API gateway license, we spent more man-hours on server administration and monitoring our system
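The result-based quota described in the list above can be sketched in a few lines of TypeScript, which also shows why a simple per-call counter in the gateway cannot express it. This is only an illustration: the `UserQuota` shape, its field names and the `chargeResults` function are hypothetical and are not taken from the Speedchecker API.

```typescript
// Hypothetical quota record; the field names are illustrative only.
interface UserQuota {
  apikey: string;
  resultLimit: number; // total measurement results the user may consume
  resultsUsed: number; // measurement results consumed so far
}

// Charge one API call by the number of measurement results it produced,
// not by the fact that a call was made. Returns an updated copy.
function chargeResults(quota: UserQuota, resultsInCall: number): UserQuota {
  if (resultsInCall < 0) {
    throw new RangeError("resultsInCall must be non-negative");
  }
  const remaining = quota.resultLimit - quota.resultsUsed;
  if (resultsInCall > remaining) {
    throw new Error(`Quota exceeded: ${remaining} results remaining`);
  }
  return { ...quota, resultsUsed: quota.resultsUsed + resultsInCall };
}

// Two API calls, but they consume very different amounts of quota.
const start: UserQuota = { apikey: "demo-key", resultLimit: 100, resultsUsed: 0 };
const afterSmall = chargeResults(start, 5);
const afterLarge = chargeResults(afterSmall, 50);
console.log(afterLarge.resultsUsed); // 55
```

Because the number of results is only known to the API that performs the measurements, this accounting naturally lives next to the API rather than in the gateway.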
New API solution with Cloudflare Workers
After Cloudflare announced the Workers feature, we followed it closely and started experimenting with its functionality.
Some of the things that originally stood out for us were:
- For years we have been using the Cloudflare platform for other parts of our infrastructure, and we like it for its features, performance, pricing and reliability. It is usually preferable to build on an existing supplier rather than explore new ones we have no experience with.
- Attractive per-request pricing. With our 30 million API requests per month, Workers would cost us $25. By comparison, Azure API Management would cost us $300 for 2 basic instances, and Apigee would cost $2,500 for the business plan.
- Powerful DDoS protection. Cloudflare includes one of the best DDoS protections available to small businesses in the price.
- Separate DNS failover and status monitoring are not required
- Extensible platform that we can use for any custom logic in the future if our needs change
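To put the prices quoted in the list above side by side, the following TypeScript sketch computes how the alternatives compare with Workers at our volume. The numbers come straight from the list; the per-million figure is only the effective average at 30 million requests and should not be read as Cloudflare's published price.

```typescript
// Monthly prices in USD as quoted above, at 30 million API requests per month.
const requestsPerMonth = 30_000_000;
const monthlyCostUsd = {
  workers: 25,
  azureApiManagement: 300, // 2 basic instances
  apigee: 2500, // business plan
};

// How many times more expensive an alternative is than Workers.
function timesMoreExpensive(alternativeUsd: number): number {
  return alternativeUsd / monthlyCostUsd.workers;
}

// Effective average cost per million requests at this volume (not a list price).
const workersUsdPerMillion =
  monthlyCostUsd.workers / (requestsPerMonth / 1_000_000);

console.log(timesMoreExpensive(monthlyCostUsd.azureApiManagement)); // 12
console.log(timesMoreExpensive(monthlyCostUsd.apigee)); // 100
console.log(workersUsdPerMillion.toFixed(2)); // "0.83"
```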
On the downside, we knew that Cloudflare Workers was still in beta and that we would need to spend some time coding the logic ourselves instead of using a ready-made solution. After brainstorming with our developers, we concluded that, for our situation, Cloudflare Workers is a good fit. Since most of our API management logic already lives in the Azure API, all we really need is an easy and convenient way to protect our origin API. Furthermore, we had to make sure that the new solution was 100% compatible with our Kong solution. I think this concern is common to all API providers when they consider changing their API gateway infrastructure. You never want to discover, during or after a migration, that some API users cannot access the API and have to update their code to work with your new API gateway. For this reason, it was important for us that the endpoints and the authentication stay unchanged, so that the new solution works with a simple DNS change.
After a week of development, we had our first proof of concept ready for migration. The architecture of our new solution is shown in the accompanying diagram.
The typical API call is handled in the following way:
- The user sends the apikey in the HTTP headers or query string of an HTTP request (GET or POST) to the endpoint hosted on Cloudflare Workers
- The Worker reads the apikey and looks it up in its local cache. Workers offers several mechanisms for storing data tables; for our purposes we chose a cache in global memory. Since the data table contains only a list of apikeys, it is not very large and the 128 MB memory limit does not cause problems. Note that each Cloudflare POP has its own cache, which can be problematic for some use cases. In our case it is not: if the apikey is not in the cache, it can be quickly retrieved from our Azure API.
- If the apikey is not found in the local cache, the Worker makes an HTTP subrequest to the Azure API to find out whether the apikey is valid. The response is then stored in the global memory cache, so that subsequent requests with the same apikey avoid the round trip to the Azure API and do not overload our origin.
- If the apikey is not valid, the Worker replies to the API user with an invalid-apikey message and the origin is never hit by the request.
- If the apikey is valid, the Worker forwards the API user's request to our origin and returns the origin's response to the user. In this step we also apply any custom routing logic, because some API calls have different origin endpoints. With Workers, we can easily specify in custom logic which API calls use which endpoints.
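The steps above can be sketched roughly as follows. This is not our production Worker. To keep the example runnable outside the Workers runtime, the request and response are modelled as plain objects, and the two subrequests (validating the key against the Azure API and forwarding to the origin) are passed in as functions; in a real Worker both would be `fetch()` calls. All names, URLs, path prefixes and status codes below are assumptions made for illustration.

```typescript
// Hypothetical request/response shapes standing in for the runtime's own types.
interface ApiRequest {
  path: string;
  headers: Record<string, string | undefined>; // assumes lower-cased header names
  query: Record<string, string | undefined>;
}
interface ApiResponse {
  status: number;
  body: string;
}

// Injected subrequests (assumptions): fetch() calls in a real Worker.
type ValidateKey = (apikey: string) => Promise<boolean>;
type Forward = (originUrl: string, req: ApiRequest) => Promise<ApiResponse>;

// Cache in global memory: separate per POP and possibly evicted at any time.
const keyCache = new Map<string, boolean>();

// Hypothetical routing table: some API calls use a different origin endpoint.
const DEFAULT_ORIGIN = "https://origin.example.com";
const ORIGIN_OVERRIDES: Array<[string, string]> = [
  ["/v1/special/", "https://special-origin.example.com"],
];

// Step 1: the apikey may arrive in a header or in the query string.
function extractApiKey(req: ApiRequest): string | undefined {
  return req.headers["apikey"] ?? req.query["apikey"];
}

// Custom routing logic: choose the origin by path prefix.
function pickOrigin(path: string): string {
  for (const [prefix, origin] of ORIGIN_OVERRIDES) {
    if (path.startsWith(prefix)) return origin;
  }
  return DEFAULT_ORIGIN;
}

async function handleApiCall(
  req: ApiRequest,
  validateKey: ValidateKey,
  forward: Forward,
): Promise<ApiResponse> {
  const apikey = extractApiKey(req);
  if (!apikey) return { status: 401, body: "Missing apikey" };

  // Steps 2 and 3: look in the local cache, fall back to a validation subrequest.
  let valid = keyCache.get(apikey);
  if (valid === undefined) {
    valid = await validateKey(apikey);
    keyCache.set(apikey, valid);
  }

  // Step 4: reject invalid keys without touching the origin.
  if (!valid) return { status: 403, body: "Invalid apikey" };

  // Step 5: forward to the appropriate origin and relay its response.
  return forward(pickOrigin(req.path), req);
}
```

The sketch caches validation results indefinitely; a real deployment would probably want some form of expiry so that revoked keys stop working, which is left out here for brevity.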
As described above, we wanted the architecture change to happen without requiring users to update any of their API code. For this reason, we devised the following approach to migrate to Cloudflare Workers.
- Enable Cloudflare Workers on a staging domain. Run all tests against the production API endpoints and the same tests against the staging domain with Workers enabled. The API endpoints should behave the same way.
- Enable Workers on the production domain with the origin IP still pointing to the live Kong instance. Using the Workers Routes settings, we make sure that the Worker code does not run on any of the live API endpoints.
- Using Worker Routes, we bring API calls onto Workers one method at a time. In case of problems, we can quickly roll back by changing the routes.
- We monitor the Worker Analytics screens, the number of API calls, and the subrequest status codes to ensure that no calls fail.
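The route-by-route rollout in the steps above can be modelled with a small TypeScript sketch. It uses a deliberately simplified matcher in which a trailing `*` matches any suffix; this is a stand-in for illustration and not Cloudflare's actual route matching. The host name and method paths are hypothetical.

```typescript
// Simplified route matching: a trailing "*" matches any suffix,
// otherwise the pattern must match exactly.
function matchesRoute(pattern: string, hostAndPath: string): boolean {
  if (pattern.endsWith("*")) {
    return hostAndPath.startsWith(pattern.slice(0, -1));
  }
  return hostAndPath === pattern;
}

// A request runs through the Worker only if some enabled route matches it.
function handledByWorker(enabledRoutes: string[], hostAndPath: string): boolean {
  return enabledRoutes.some((pattern) => matchesRoute(pattern, hostAndPath));
}

// Workers enabled on the production domain, but no API route runs the Worker yet.
const noRoutes: string[] = [];
// One method migrated; rolling back means removing this route again.
const oneMethod = [...noRoutes, "api.example.com/v1/methodA*"];

console.log(handledByWorker(noRoutes, "api.example.com/v1/methodA/run")); // false
console.log(handledByWorker(oneMethod, "api.example.com/v1/methodA/run")); // true
console.log(handledByWorker(oneMethod, "api.example.com/v1/methodB/run")); // false
```

Requests that match no route keep going straight to the existing gateway, which is what makes the per-method rollout and rollback low-risk.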
During our live migration, everything went smoothly until some of our customers started seeing errors. We had not realized that the Cloudflare firewall applies a rate limit, which prevented our API users from making more than 2000 requests per minute from the same IP. After we raised a ticket with Cloudflare support, the limit was increased and the errors stopped.
We believe that Cloudflare Workers is a good alternative to existing API gateway solutions. For companies that already have an existing code base for authentication and analytics, there is no compelling reason to use a commercial API gateway package when Cloudflare Workers can add a security layer to the APIs at a fraction of the cost of the alternatives. While Workers is a relatively new product from Cloudflare, we already feel comfortable using it in production. We encourage you to explore Workers for your new projects. In addition, if you want to save costs or make your self-hosted solution more robust, Workers is a good alternative that can be implemented without any impact on users or API operations.
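To make the effect of such a limit concrete, here is a small TypeScript sketch of a per-IP limiter using a fixed one-minute window. It only illustrates the behaviour our customers ran into; we do not know how the Cloudflare firewall implements its limit, and the class and its parameters are hypothetical.

```typescript
// Illustrative fixed-window limiter: at most `limit` requests per IP per window.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly counts = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if a request from `ip` at time `nowMs` is allowed.
  allow(ip: string, nowMs: number): boolean {
    const entry = this.counts.get(ip);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.counts.set(ip, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// 2000 requests per minute from the same IP, as in the incident described above.
const limiter = new FixedWindowLimiter(2000, 60_000);
let accepted = 0;
for (let i = 0; i < 2500; i++) {
  if (limiter.allow("203.0.113.7", 0)) accepted += 1;
}
console.log(accepted); // 2000
```

An API client that sends bursts from a single server IP can exceed such a limit easily, which is why it only showed up with some of our heavier customers.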