Wix Media Service Fortifies Its Disaster Recovery Cluster on Google Cloud Platform
Editor’s Note: This article is written by guest author Eugene Olshenbaum. Eugene is the Head of Media Services at Wix, a cloud-based web development platform that makes it easy for everyone to create beautiful websites.
“Web Creation Made Simple”
Wix provides technology that makes it simple for everyone to create a stunning, professional, and functional web presence. It is a cloud-based web development platform that allows users to create HTML5 websites and mobile sites through the use of a powerful online editor. Wix was founded in 2006, has 36 million registered users from all around the world, and is growing rapidly.
This technical case study describes how Wix replicated its services, which run in its managed data centers, to Google Cloud Platform to increase availability and improve disaster recovery. Taking advantage of the features provided by App Engine, Compute Engine, and Cloud Storage, Wix was able to complete the migration in a very short time. This case study also includes key lessons learned during the migration process.
As a website building and serving platform, we want to provide close to 100% uptime for data serving and protect our user data against loss. We originally ran our service in one managed hosting environment. To improve disaster recovery, we added a second environment and ran the two in active/active mode. Later, we added a third data center to run our services in 3x active/active mode.
However, we learned that maintaining three cross-data center replicas was much more complex than managing two, especially with the data centers owned by different ISPs for ISP redundancy. One of the challenges in 3x active/active mode was database replication. To replicate across three data centers, we had to configure our MySQL in a ring topology. The ring would break when one data center went down for a long time or failed completely.
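To illustrate why a ring is fragile, here is a toy model in Python. It is not MySQL configuration, and the data center names are hypothetical. It assumes each data center forwards changes only to the next one in the ring, so a failed node stops every change that would have passed through it.

```python
def propagate(ring, origin, down=frozenset()):
    """Return the set of data centers that receive a write made at `origin`.

    `ring` lists the data centers in replication order: each one forwards
    changes only to the next one, and the last forwards to the first.
    """
    reached = {origin}
    i = ring.index(origin)
    for _ in range(len(ring) - 1):
        i = (i + 1) % len(ring)
        node = ring[i]
        if node in down:
            break  # a failed node stops forwarding, so the chain ends here
        reached.add(node)
    return reached


ring = ["dc1", "dc2", "dc3"]  # hypothetical data center names

print(propagate(ring, "dc1"))                # healthy ring: every node is reached
print(propagate(ring, "dc1", down={"dc2"}))  # dc3 is healthy but never sees the write
print(propagate(ring, "dc3", down={"dc2"}))  # dc3 can still reach dc1
```

With dc2 down, writes made at dc1 never reach dc3 even though both are healthy, so the two surviving replicas drift apart in one direction.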
To address this, instead of implementing 3x active/active mode with our current infrastructure, we decided to run in 2x active/active mode, with the third replica running on an entirely different technology platform. The third replica also added protection against data poisoning, in which a faulty piece of code unintentionally corrupts data and the corruption remains undetected for some time.
We decided to build a full replica of our service on Google Cloud Platform and port all system components to work efficiently on Google Cloud. This replica would be capable of working as a primary server without dependencies on the Wix servers in our managed data centers.
We chose Google Cloud Platform for the following reasons:
Ease of Management
App Engine eliminates the need for system management.
App Engine charges only for actual usage, while we have to estimate the required computing resources in our managed hosting environment.
App Engine automatically scales according to the volume of requests.
Datastore scales with our growth.
Cloud Storage scales with our storage demands.
Speed of Development
App Engine provides all the technology building blocks for application development. One example is Task Queues, which we use heavily.
SPDY protocol is built in. App Engine applications automatically use the SPDY protocol when accessed over SSL by a browser that supports SPDY.
Our system uses the following Cloud Platform components:
Google App Engine
We built the following two applications on App Engine:
A replication supervisor that controls application-level replication between Wix data centers and Google Cloud.
An application server that manages the metadata and provides client API access.
Google Cloud Storage
We used Cloud Storage to store our static media files.
Google Compute Engine
We do all content manipulations, such as image cropping or resizing, on the fly. Although App Engine provides an Image API for image manipulation, we require image processing capabilities such as noise reduction, sharpening, and image filters, which are not provided by the API. We have highly optimized code written in C for this purpose. We developed a set of servers on Compute Engine that handles image manipulation using our code. To optimize performance, we use a high-memory instance type and perform all the I/O on the RAM disk. This is the only component that requires attention to its scalability and health management.
Figure 1 shows the high-level architecture of our serving system on Google Cloud Platform:
Figure 1: High-level architecture of the Wix media serving system
Next, we had to choose a database. Our existing application was built on top of MySQL. MySQL has scaling issues and lacks features we require, such as full text search, arbitrary attributes data structure (image files have completely different sets of attributes than audio or PDF files), and paging based on multiple criteria. Instead of continuing development with MySQL, we decided to use the App Engine Datastore since it fits our fuzzy data models better and could scale with our growth.
Since we wanted to keep the Wix servers in our managed data center and Google Cloud running at the same time, we had to develop bidirectional synchronization between the systems at the application level.
To synchronize data between the Wix managed hosting and Google Cloud Platform systems, we had to build a replication supervisor. In our first attempt, we replicated the metadata and static files synchronously as they were uploaded to our current system (Figure 2).
Figure 2: First version of the replication supervisor
However, replication often failed with a DeadlineExceeded error. App Engine Frontend Instances are designed to respond to a request within 60 seconds. Our initial design included too many operations in a single request, and their execution depended on various external factors such as file size and network jitter.
We learned that, in general, App Engine Frontend handlers should be as small as possible and involve at most one or two Datastore requests. If a handler is more complex, it should be split and chained into smaller asynchronous blocks.
Task Queue was the perfect match for our revised design. Tasks are executed asynchronously and have a 10-minute deadline instead of 60 seconds. Since the time it takes to copy media files varies greatly from request to request, we used task queues to make file replication an asynchronous process. After the metadata is replicated and saved in the App Engine Datastore, our application adds a task to the task queue. The task is then pushed to another handler that handles the copying. If copying is successful, a new task is added to the task queue to notify the initiator that the transaction is complete. If copying fails, the system retries until it completes the task successfully. Figure 3 shows the revised process.
Figure 3: Final design of the replication supervisor
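The chained design can be sketched in plain Python. This is a simulation under stated assumptions, not our production code and not the App Engine API: a deque stands in for the task queue, and the handler names, the file ID, and the simulated failures are all hypothetical. The retry is spelled out explicitly here, whereas a real task queue typically re-runs a failed task on its own.

```python
from collections import deque

queue = deque()  # stands in for an App Engine task queue
log = []         # records each step so the flow can be inspected

# Hypothetical failure injection: the first two copies of "img-1" fail.
failures_left = {"img-1": 2}


def enqueue(handler, **params):
    """Add a task; it runs later, outside the handler that enqueued it."""
    queue.append((handler, params))


def replicate_metadata(file_id):
    # Small first step: save the metadata, then hand the slow copy to a task.
    log.append(("metadata_saved", file_id))
    enqueue(copy_file, file_id=file_id, attempt=1)


def copy_file(file_id, attempt):
    if failures_left.get(file_id, 0) > 0:
        failures_left[file_id] -= 1
        log.append(("copy_failed", file_id, attempt))
        enqueue(copy_file, file_id=file_id, attempt=attempt + 1)  # retry
        return
    log.append(("copied", file_id, attempt))
    enqueue(notify_initiator, file_id=file_id)


def notify_initiator(file_id):
    log.append(("notified", file_id))


def run_queue():
    """Drain the queue, running one task at a time."""
    while queue:
        handler, params = queue.popleft()
        handler(**params)


replicate_metadata("img-1")
run_queue()
for entry in log:
    print(entry)
```

Each handler does one short unit of work and schedules the next, so no single request has to cover the whole replication.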
The App Engine Datastore is designed for scalable web applications. It manages scaling, availability, and replication across multiple data centers automatically. When data is replicated, there is a delay from the time a Datastore write is committed until the change becomes visible in all data centers. If an application has to wait for replication to complete (and data to be consistent across all data centers) before reading the data, its throughput can be affected.
If a process does not require the latest value of a Datastore entry, eventual consistency suffices, and the process can read whichever value is available without waiting. This improves throughput and the latest data will eventually be available.
If a process requires the latest value of a Datastore entry, strong consistency is required, and the process must wait until the latest change is applied across the board. This ensures consistency but may affect throughput.
A good example is a blog application. A reader’s own comment to a blog post should be strongly consistent; that is, they should see their comment after posting it. On the other hand, comments posted by other users can be eventually consistent; that is, the reader just needs to see them eventually.
To balance data throughput and consistency, developers can choose when eventual data consistency is sufficient, and when they require strong consistency for their application.
Coming from MySQL, we had to learn this concept.
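The difference between the two read modes can be shown with a toy model. It is a simplified sketch, not how the Datastore is implemented: the class and method names are hypothetical, a single replica stands in for all remote data centers, and "waiting for replication" is modeled as applying the pending writes before reading.

```python
class ReplicatedStore:
    """Toy store with one primary copy and one lagging replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes committed but not yet visible on the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply all outstanding writes to the replica."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_eventual(self, key):
        # Fast: return whatever the replica has right now, possibly stale.
        return self.replica.get(key)

    def read_strong(self, key):
        # Slower: wait until outstanding writes are applied, then read.
        self.replicate()
        return self.replica.get(key)


store = ReplicatedStore()
store.write("comment:1", "Nice post!")

print(store.read_eventual("comment:1"))  # None: the write is not visible yet
print(store.read_strong("comment:1"))    # Nice post!
print(store.read_eventual("comment:1"))  # Nice post!: the replica has caught up
```

In the blog example, the author's own view of the comment would use the strong read, while other readers could use the eventual read.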
Data objects in the App Engine Datastore are called entities. Entities can be arranged in parent-child relationships to form entity groups. An entity group is a set of tightly linked entities that can be read with strong consistency and updated in a transaction. A query is strongly consistent when it is limited to a single entity group.
For our application, we need to keep track of users and the media they upload for their websites. Querying the user data has to be strongly consistent. So, we created an entity group for each user and their media information.
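As a rough illustration of this layout, the sketch below models entity keys as paths that start at the user, so that all of a user's media falls under one ancestor. It is plain Python, not the Datastore API, and the kinds, IDs, and attributes are hypothetical.

```python
entities = {}  # key path (tuple of (kind, id) pairs) -> properties


def put(key, **props):
    entities[key] = props


def ancestor_query(ancestor, kind):
    """Return keys of the given kind that sit under `ancestor`, sorted."""
    n = len(ancestor)
    return sorted(
        key for key in entities
        if len(key) > n and key[:n] == ancestor and key[-1][0] == kind
    )


alice = (("User", "alice"),)
bob = (("User", "bob"),)

put(alice, name="Alice")
# Media entities are children of their owner; attributes differ by file type.
put(alice + (("Media", "logo.png"),), type="image", width=200, height=80)
put(alice + (("Media", "intro.mp3"),), type="audio", duration_s=30)
put(bob + (("Media", "cv.pdf"),), type="pdf", pages=2)

for key in ancestor_query(alice, "Media"):
    print(key[-1][1], entities[key])
```

The query touches only keys under one user, which is the shape of query that can be answered with strong consistency.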
Datastore limits the frequency of writes to an entity group in order to provide strong consistency. Since our application has a high write rate, we ran into concurrent modification exceptions. To reduce write contention, we decided to use memcache to manage user data updates. Changes are stored in memcache and pushed to the Datastore once a minute using deferred tasks. We understood that we might lose data if the memcache was flushed, but our application is designed specifically with that in mind, and no user data or files are actually lost.
By using Google Cloud Platform, we were able to rewrite our Java-MySQL application as a Python App Engine application that was ready for integration tests in just two weeks.
App Engine provides a rich set of APIs and managed services that freed us from the chore of writing scalability and fault tolerance-related code. We were able to solve our business problem quickly and seamlessly.
We have started serving our production media traffic from Compute Engine as a primary server, and performance has been impressively good and stable.