Using AWS for on-premises WordPress site continuity

Applications running on the LAMP (Linux, Apache, MySQL, and PHP) stack are ubiquitous: WordPress alone represents 38% of all content management systems. Other popular CMS applications such as Drupal and Joomla also run on LAMP, as does Moodle, a widely used learning management system (LMS). Because of the popularity of these applications, public sector organisations such as educational institutions should protect their business continuity by implementing disaster recovery (DR) solutions: the policies, tools, and procedures that support the recovery or continuation of technology infrastructure and systems following a disaster.

Amazon Web Services (AWS) Professional Services created a business continuity solution for on-premises LAMP applications that could eliminate the need for physical backup infrastructure and improve recovery time. The solution—which uses File Gateway, Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon Route 53, Amazon CloudWatch, and AWS Lambda—was recently piloted by Cardiff University.

Cardiff University tests solution

Simon Williams, senior manager (platform and storage) at Cardiff University, outlined the goal of the project: “We wanted to develop a proof-of-concept DR solution for one of the university’s WordPress-based web services to demonstrate the feasibility of failover of the service to a cloud instance in the event of issues with the university’s infrastructure.”

The DR solution for a WordPress microsite was implemented in 10 days. Shortly after the solution was implemented, one of the university’s data centres experienced an uninterruptible power supply (UPS) failure during a scheduled power test, which took down a number of IT services, including the WordPress instance with the DR solution. According to Simon, “One of the few positives to come out of the data centre failure was that the DR worked flawlessly.”

The unplanned, real-life test showed that this solution provides business continuity of LAMP applications with minimal data loss and rapid recovery. The cost of the solution in standby is half the cost of the solution in recovery mode, and in both cases it is less than the cost of physical infrastructure. Additionally, file data is protected by the durability of Amazon Simple Storage Service (Amazon S3) and its replication across Availability Zones (AZs), the MySQL database is protected by multi-AZ redundancy, and the solution can easily be converted to run entirely on AWS as part of a cloud migration strategy.

Simon Williams at Cardiff University said, “From the university’s perspective, perhaps the most valuable part of the engagement was the opportunity it presented for university staff across a number of technical teams to work together with guidance from AWS on a small-scale, ‘real’ cloud migration project with a tangible outcome.”

Disaster recovery requirements for LAMP stack applications

The two key metrics that define the requirements for a DR solution are recovery time objective (RTO) and recovery point objective (RPO):

RTO is the targeted duration of time and a service level within which a system must be restored after a disaster to avoid unacceptable consequences associated with a break in business continuity.
RPO is the maximum targeted period in which data (transactions) might be lost from a system due to a disaster.
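
As a rough illustration of how these two metrics are checked after an incident, the sketch below compares an achieved recovery time and data-loss window against target values. The `Incident` fields, the timestamps, and the 15-minute and 1-minute objectives are hypothetical values chosen for the example, not figures from the Cardiff University pilot.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_replicated_write: datetime  # last change known to be copied off-site
    outage_start: datetime           # moment the primary service went down
    service_restored: datetime       # moment the service was usable again


def meets_objectives(incident, rto, rpo):
    """Compare achieved recovery time and data-loss window with RTO and RPO."""
    recovery_time = incident.service_restored - incident.outage_start
    data_loss_window = incident.outage_start - incident.last_replicated_write
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: 5 seconds of unreplicated writes, 12 minutes of downtime.
start = datetime(2020, 1, 1, 9, 0, 0)
incident = Incident(
    last_replicated_write=start - timedelta(seconds=5),
    outage_start=start,
    service_restored=start + timedelta(minutes=12),
)
report = meets_objectives(incident, rto=timedelta(minutes=15), rpo=timedelta(minutes=1))
print(report["rto_met"], report["rpo_met"])
```

Keeping the two checks separate reflects that a solution can restore service quickly and still lose more data than the RPO allows, or the reverse.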

Continuity is an important part of DR, as a disaster can disrupt not only a single, isolated system but an entire data centre or colocation facility. In this case, the effort to acquire a new system at an alternate location increases downtime, potentially pushing it beyond the key RTO metric.

LAMP stack applications, including WordPress, store user content such as blog posts and comments in a MySQL database, while uploads are stored in the file system. Additionally, WordPress can update its own code, plugins, and themes, so the relevant PHP files are also part of the data set. The entire data set must be backed up and restored as a unit in order to be consistent.

Data in a MySQL database and the file system have to be backed up at exactly the same time to maintain a consistent dataset, but this can be hard to achieve because the two data stores may not be co-located and could be backed up with different systems at different intervals. Restoring the latest pair of these disparate backups may cause issues ranging from the wrong content being displayed to the re-introduction of security vulnerabilities that were patched at the primary site after the backup was taken. This may necessitate restoring from earlier backups until parity is achieved, which increases data loss and recovery time.
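
To make the parity problem concrete, here is a minimal sketch that searches two independent backup schedules for the most recent database and file system backups taken close enough together to be treated as one consistent set. The function name, the backup times, and the 10-minute tolerance are all hypothetical; a real tolerance depends on how quickly the site's content changes.

```python
from datetime import datetime, timedelta


def latest_consistent_pair(db_backups, file_backups, tolerance):
    """Return the newest (db_time, file_time) pair taken within `tolerance`
    of each other, or None if no such pair exists."""
    best = None
    for db_time in db_backups:
        for file_time in file_backups:
            if abs(db_time - file_time) <= tolerance:
                candidate = (db_time, file_time)
                # Rank pairs by their older member: that is the point in time
                # the restored site would actually reflect.
                if best is None or min(candidate) > min(best):
                    best = candidate
    return best


# Hypothetical schedules: the database is backed up every six hours,
# the file system at unrelated times by a different system.
day = datetime(2020, 1, 1)
db_backups = [day, day + timedelta(hours=6), day + timedelta(hours=12)]
file_backups = [day + timedelta(minutes=5), day + timedelta(hours=11)]

pair = latest_consistent_pair(db_backups, file_backups, tolerance=timedelta(minutes=10))
print(pair)
```

In this example the newest backups (12:00 and 11:00) are an hour apart, so the search falls back to the pair taken around midnight, which is the extra data loss the paragraph above describes.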

On-premises WordPress/LAMP application DR solution on AWS

The DR solution for WordPress uses continuous replication of MySQL databases and file systems to deliver durable and highly available storage in AWS and keeps a replacement system in standby, which reduces running costs and cuts RPO and RTO from hours or days to minutes. In the pilot with Cardiff University, time to detect failure was set at 10 minutes, with a 200-second failover time once a failure was detected. These targets were met when the data centre power supply failure occurred.

Figure 1: DR solution in standby

To implement this DR solution, a File Gateway virtual machine must first be deployed on premises, along with an NFS share on the File Gateway. This share must then be mounted on the web server and the file system data moved to it. The File Gateway maintains a local copy of this data for low-latency access and asynchronously uploads it to Amazon S3.

The next step is to create a multi-AZ database in Amazon RDS as a replication target for the primary MySQL database using either binary logs or global transaction identifiers (GTIDs), depending on the version of MySQL. Scheduled backups should be enabled in Amazon RDS. To encrypt the replication traffic, an IPsec VPN connection should be deployed between the on-premises environment and AWS. Existing VPN gateway hardware can be used for this purpose, or a virtual machine can be used as a software VPN gateway appliance.
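
On Amazon RDS for MySQL, replication from an external source is configured through stored procedures rather than a direct `CHANGE MASTER TO` statement. The sketch below builds the call for binary log replication with `mysql.rds_set_external_master`, using the parameter order as documented for that procedure; the helper name, host name, user, and log coordinates are hypothetical. GTID-based replication uses a different procedure, and newer engine versions rename these procedures, so the exact call should be checked against the RDS documentation for the engine version in use.

```python
def external_master_call(host, port, user, password, log_file, log_pos, use_ssl=False):
    """Build a parameterised CALL for binary log replication into RDS for MySQL.

    Returns (sql, params) suitable for a DB-API cursor such as PyMySQL's,
    so the password is never interpolated into the SQL text.
    """
    sql = "CALL mysql.rds_set_external_master(%s, %s, %s, %s, %s, %s, %s)"
    params = (host, port, user, password, log_file, log_pos, 1 if use_ssl else 0)
    return sql, params


# Replication is started with a second procedure once the source is configured.
START_REPLICATION = "CALL mysql.rds_start_replication"

# Hypothetical values; the log file and position come from the on-premises
# primary (for example, from SHOW MASTER STATUS taken with the initial dump).
sql, params = external_master_call(
    host="10.0.0.10",
    port=3306,
    user="repl_user",
    password="example-password",
    log_file="mysql-bin.000031",
    log_pos=107,
)
# With a DB-API connection to the RDS instance:
#   cursor.execute(sql, params)
#   cursor.execute(START_REPLICATION)
print(sql)
```

The SSL flag is left off here on the assumption that the IPsec VPN already encrypts the replication traffic, as described above.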

The solution also deploys a File Gateway on an Amazon EC2 instance and exposes the Amazon S3 bucket containing data replicated from the on-premises file system as an NFS share. A web server is then set up on another Amazon EC2 instance and mounts the NFS share from the File Gateway. Both instances are then stopped so that they do not incur compute costs while in standby.

DNS records in a Route 53 hosted zone point the website domain name to the IP address of the primary web server. A health check monitors the primary web server; if downtime exceeds a set threshold, an alarm is triggered in Amazon CloudWatch, which sends a notification to an Amazon SNS topic, which in turn triggers an AWS Lambda function to perform the failover. Failover also switches resolution of the website domain name to the Elastic IP address of the web server in Amazon EC2, redirecting website traffic from users to the DR environment in the AWS Cloud.
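
One way to express the detection threshold is as a CloudWatch alarm on the Route 53 `HealthCheckStatus` metric. The sketch below only builds the arguments for CloudWatch's `PutMetricAlarm` call; the alarm name, the one-minute period, and the default 10-minute window (borrowed from the pilot's detection time) are assumptions, since the article does not describe how the alarm was actually configured.

```python
def health_alarm_params(health_check_id, sns_topic_arn, detect_seconds=600, period=60):
    """Build PutMetricAlarm arguments that fire after `detect_seconds` of failed checks."""
    if detect_seconds % period:
        raise ValueError("detect_seconds must be a multiple of period")
    return {
        "AlarmName": "primary-web-server-down",  # hypothetical name
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",       # 1 = healthy, 0 = unhealthy
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",
        "Period": period,
        "EvaluationPeriods": detect_seconds // period,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],         # SNS topic that invokes the Lambda function
    }


alarm = health_alarm_params(
    health_check_id="hypothetical-health-check-id",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:dr-failover",
)
# Route 53 health check metrics are published in us-east-1, so with boto3:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
print(alarm["Period"] * alarm["EvaluationPeriods"])
```

Requiring every period in the window to be unhealthy trades a slower reaction for fewer false failovers, which matters here because failback is deliberately not automatic.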

When an event triggers failover, an SNS notification is sent to the administrator in an email and the AWS Lambda function orchestrates the following steps:

Retrieve RDS database credentials and VPC parameters from AWS Systems Manager (SSM).
Detach the VPN Gateway from the VPC to stop replication traffic and prevent the RDS database from becoming inconsistent.
Invert and disable the health check in Route 53 to prevent failback if the primary web server comes back online with state that may be out of date.
Retrieve the wp-config.php configuration file from Amazon S3 and replace the on-premises database credentials with those of the database running in RDS and write the updated file back to S3.
Start the File Gateway and the web server EC2 instances.
Wait for the File Gateway to become available and trigger the cache refresh, so that up-to-date files are presented to the web server instance.
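
The steps above can be sketched as a single function that receives its AWS clients from the caller. This is an illustration under stated assumptions, not the code used in the pilot: the parameter names under the SSM prefix, the `cfg` keys, and the `rewrite_wp_config` helper are hypothetical, while the client methods are the standard boto3 operations for each service.

```python
import re


def rewrite_wp_config(text, settings):
    """Replace the value in each define('KEY', '...') statement named in settings."""
    for key, value in settings.items():
        quoted = value.replace("\\", "\\\\").replace("'", "\\'")
        pattern = r"define\(\s*['\"]" + re.escape(key) + r"['\"]\s*,\s*['\"].*?['\"]\s*\)"
        text = re.sub(pattern, lambda m, k=key, v=quoted: f"define( '{k}', '{v}' )", text)
    return text


def failover(clients, cfg):
    """Run the failover steps; `clients` maps service names to boto3-style clients."""
    ssm, ec2, s3 = clients["ssm"], clients["ec2"], clients["s3"]

    def param(name):
        response = ssm.get_parameter(Name=cfg["ssm_prefix"] + name, WithDecryption=True)
        return response["Parameter"]["Value"]

    # 1. Retrieve database credentials and VPC parameters (hypothetical names).
    db = {key: param(key) for key in ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD")}
    vpc_id, vpn_gateway_id = param("VPC_ID"), param("VPN_GATEWAY_ID")

    # 2. Detach the VPN gateway to stop replication traffic.
    ec2.detach_vpn_gateway(VpcId=vpc_id, VpnGatewayId=vpn_gateway_id)

    # 3. Invert and disable the health check so traffic does not fail back.
    clients["route53"].update_health_check(
        HealthCheckId=cfg["health_check_id"], Inverted=True, Disabled=True
    )

    # 4. Point wp-config.php at the RDS database and write it back to S3.
    location = {"Bucket": cfg["bucket"], "Key": cfg["wp_config_key"]}
    original = s3.get_object(**location)["Body"].read().decode("utf-8")
    s3.put_object(Body=rewrite_wp_config(original, db).encode("utf-8"), **location)

    # 5. Start the File Gateway and web server instances.
    ec2.start_instances(InstanceIds=cfg["instance_ids"])

    # 6. Wait for the instances, then refresh the File Gateway cache.
    #    A running instance is only a proxy for gateway availability; a real
    #    implementation would likely retry the refresh until it succeeds.
    ec2.get_waiter("instance_running").wait(InstanceIds=cfg["instance_ids"])
    clients["storagegateway"].refresh_cache(FileShareARN=cfg["file_share_arn"])
```

In a Lambda handler, `clients` could be built with `boto3.client(name)` for each of `ssm`, `ec2`, `route53`, `s3`, and `storagegateway`. Passing the clients in keeps the sequence testable without AWS access.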

Figure 2: DR solution in failover

At this stage, the failover is complete and the DR website is up and running with up-to-date data. With this solution, we achieved an RTO of 200 seconds and a near-zero RPO, as well as a consistent dataset after recovery.

Security of data at rest is provided by enabling encryption of the S3 bucket, the RDS database, and the Amazon Elastic Block Store (Amazon EBS) volumes. Data in transit is encrypted with SSL for file uploads to Amazon S3 and with the IPsec VPN for MySQL replication. The network perimeter is enforced by security groups on the Amazon EC2 and Amazon RDS instances and by keeping the RDS database in private subnets. Database credentials are kept in SSM Parameter Store and are not hard-coded.

If either of the EC2 instances fails, an Amazon CloudWatch alarm triggers automatic recovery once the failure has persisted for a preset amount of time.
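
A common way to set up this kind of recovery is a CloudWatch alarm on the instance's system status check with the built-in EC2 recover action. The sketch below builds the `PutMetricAlarm` arguments for such an alarm; the choice of metric, the alarm name, and the two-minute window are assumptions, since the article does not say which metric or duration the solution uses.

```python
def recovery_alarm_params(instance_id, region, minutes=2):
    """Build PutMetricAlarm arguments that recover an instance after
    `minutes` consecutive minutes of failed system status checks."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": minutes,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action that moves the instance to healthy hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recovery_alarm_params("i-0123456789abcdef0", "eu-west-2")
# With boto3: boto3.client("cloudwatch", region_name="eu-west-2").put_metric_alarm(**params)
print(params["AlarmActions"][0])
```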

This solution is extendable to multiple co-located systems and may require minor changes to existing applications.

Learn more about our disaster recovery solution for on-premises LAMP stack applications. Contact us or your AWS Account Manager to learn more about AWS Professional Services.

AWS Public Sector Blog
