AWS Big Data Blog

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR

Updated 3/30/2022: Amazon EMR has announced official support of Apache Ranger (link). Open-source plugin support will not be maintained moving forward, and compatibility with the latest versions will not be tested. We recommend that customers move to the Amazon EMR support for Apache Ranger. Ranger Presto plugin support on EMR has been deprecated.

Updated 12/03/2020: Support for EMR 6.1 and Ranger 2.2 has been added. See this for the list of fixes. A new git repo has been created under aws-samples (link) that has the code tied to this blogpost, including the roadmap (link).

Updated 2/14/2020: Updates have been made to support the latest versions of EMR and Apache Ranger 2.0.

Updated 9/26/2018: Updates have been made to support the latest versions of EMR and Apache Ranger.

————————————————–

Role-based access control (RBAC) is an important security requirement for multi-tenant Hadoop clusters. Enforcing this across always-on and transient clusters can be hard to set up and maintain.

Imagine an organization that has an RBAC matrix using Active Directory users and groups. They would like to manage it on a central security policy server and enforce it on all Hadoop clusters that are spun up on AWS. This policy server should also store access and audit information for compliance needs.

In this post, I provide the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Apache Ranger

Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN), and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component (for example, the NameNode).

Architecture

Using the setup in the following diagram, multiple EMR clusters can sync policies with a standalone security policy server. The idea is similar to a shared Hive metastore that can be used across EMR clusters.

EMRRanger_1

Walkthrough

In this walkthrough, three users—analyst1, analyst2, and admin1—are set up for the initial authorization, as shown in the following diagram. Using the Ranger Admin UI, I show how to modify these access permissions. These changes are propagated to the EMR cluster and validated through Hue.


To manage users/groups/credentials, use Simple AD, a managed directory service offered by AWS Directory Service. Then, set up the security policy server (Ranger) and create and configure the EMR cluster. Test the security policies and then update them.

Prerequisites

The following steps assume that you have a VPC with at least two subnets, with NAT configured for private subnets. Also, verify that DNS Resolution (enableDnsSupport) and DNS Hostnames (enableDnsHostnames) are set to Yes on the VPC.
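The two VPC settings can also be checked from the command line. The following is a minimal sketch using the AWS CLI `describe-vpc-attribute` command; the function name `check_vpc_dns` and the VPC ID in the usage comment are placeholders, not part of the original templates.

```shell
# Print both DNS attributes of a VPC and return non-zero unless both are enabled.
# Assumes the AWS CLI is installed and configured for the right region.
check_vpc_dns() {
  local vpc_id="$1"
  local support hostnames
  support=$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsSupport --query 'EnableDnsSupport.Value' --output text)
  hostnames=$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames --query 'EnableDnsHostnames.Value' --output text)
  echo "enableDnsSupport=$support enableDnsHostnames=$hostnames"
  # With --output text, the CLI prints booleans as True or False.
  [ "$support" = "True" ] && [ "$hostnames" = "True" ]
}

# Usage (placeholder VPC ID):
#   check_vpc_dns vpc-xxxxx || echo "Enable both DNS settings before continuing"
```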

If the Ranger server and EMR cluster are created in a private subnet, you will need a bastion host or a VPN connection to access the web UI links (Hue, Ranger).

I have created AWS CloudFormation templates for each step and a nested CloudFormation template for single-click deployment (launch_stack). If you use this nested CloudFormation template, skip to the “Testing the cluster” step after the stack has been successfully created.

It takes the following parameters:

DomainName corp.emr.local
DomainPassword <user input>
DomainShortName <Short name for directory>
Subnet1SimpleAD <Subnet 1>
Subnet2SimpleAD <Subnet 2>
VPC <VPC>
KeyName <EC2 key pair name>
CoreInstanceType <Instance type of the EMR core servers>
CoreInstanceCount <Number of EMR core servers>
MasterInstanceType <Instance type of the EMR master server>
EMRClusterName <EMR cluster name>
EMRLogDir <EMR logging directory, e.g., s3://xxx>
RangerInstanceType <Instance type of the Ranger server>
myDirectoryBaseDN dc=corp,dc=emr,dc=local
myDirectoryBindUser binduser@corp.emr.local
myDirectoryBindPassword <user input>
rangerVersion <Version of Ranger> – Choose among 0.6, 0.7, 1.0, and 2.0
emrReleaseLabel <Version of EMR> – Choose among “emr-5.0.0”, “emr-5.4.0”, “emr-5.16.0”, and “emr-5.17.0”
myDirectoryDefaultUserPassword <user input> (Used as the default password for all users created with the script, e.g., analyst1 and analyst2)

To create each component individually, follow the steps below.

IMPORTANT: The templates use hard-coded usernames and passwords, and open security groups. They are not intended for production use without modification.

Setting up a SimpleAD server

Using simple-ad-template, set up a SimpleAD server. To launch the stack directly through the console, use launch_stack. It takes the following parameters:

DomainName corp.emr.local
DomainPassword <user input>
DomainShortName <Short name for directory>
Subnet1SimpleAD <Subnet 1>
Subnet2SimpleAD <Subnet 2>
VPC <VPC>

CloudFormation output:

SimpleADDomainID d-xxxxx
SimpleADIPAddress X.X.X.X

NOTE: SimpleAD creates two servers for high availability. For the following steps, you can use either of the two IP addresses.
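Each later step consumes outputs from an earlier stack, so it can help to read them from the command line rather than copying them from the console. The sketch below uses `aws cloudformation describe-stacks` with a JMESPath filter; the function name `stack_output` and the stack name in the usage comment are hypothetical.

```shell
# Print the value of one named output of a CloudFormation stack.
# Assumes the AWS CLI is configured for the region where the stack was created.
stack_output() {
  local stack_name="$1" output_key="$2"
  aws cloudformation describe-stacks --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='${output_key}'].OutputValue" \
    --output text
}

# Usage (the stack name is whatever you chose when launching the template):
#   SIMPLE_AD_IP=$(stack_output my-simplead-stack SimpleADIPAddress)
```

The same helper works for the IPAddress output of the Ranger and EMR stacks created in the following sections.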

Setting up the Ranger server

Now that SimpleAD has been created and the users loaded, you are ready to set up the security policy server (Ranger). This runs on a standard Amazon Linux instance and Ranger is installed and configured on startup.

Using the create-ranger-server template, set up this instance. To launch the stack directly through the console, use launch_stack. It takes the following parameters:

InstanceType <Instance type of the Ranger server>
KeyName <EC2 key pair name>
myDirectoryBaseDN dc=corp,dc=emr,dc=local
myDirectoryBindUser binduser@corp.emr.local
myDirectoryBindPassword <user input>
myDirectoryIPAddress <One of the IP addresses of the SimpleAD server>
Subnet <Subnet to place the Ranger server>
rangerVersion <Version of Ranger> – Choose among 0.6, 0.7, 1.0, and 2.0
VPC <VPC>

CloudFormation output:

IPAddress <IP address of the Ranger server>

NOTE: The Ranger server syncs users with SimpleAD and enables LDAP authentication for the Admin UI. The default Ranger Admin password is not changed.

Creating an EMR cluster

Finally, it’s time to create the EMR cluster and configure it with the required plugins. You can use the AWS CLI or emr-template to create and configure the cluster. Not all EMR security configurations are currently supported by CloudFormation.

Using a CloudFormation template to create a cluster

Using the emr-template.template template, set up the EMR cluster with CloudFormation. To launch the stack directly through the console, use launch_stack.

The template uses the IP address of the LDAP server (SimpleAD) for the Hue LDAP configuration. It takes the following parameters:

LDAPServerIP <IP address of the LDAP server>
emrReleaseLabel <Version of EMR> – Choose among “emr-5.0.0”, “emr-5.4.0”, “emr-5.16.0”, and “emr-5.17.0”
CoreInstanceType <Instance type of the EMR core servers>
CoreInstanceCount <Number of EMR core servers>
MasterInstanceType <Instance type of the EMR master server>
EMRClusterName <EMR cluster name>
EMRLogDir <EMR logging directory, e.g., s3://xxx>
KeyName <EC2 key pair name>
RangerHostname <IP address of the Ranger server>
rangerVersion <Version of Ranger> – Choose among 0.6, 0.7, and 1.0. Pick the same version you used to set up the Ranger server
Subnet <Subnet to place the EMR cluster>
VPC <VPC>

CloudFormation output:

IPAddress <IP address of EMR master node>

Using the AWS CLI to create a cluster

aws emr create-cluster --applications Name=Hive Name=Spark Name=Hue --tags 'Name=EMR-Security' \
--release-label emr-5.0.0 \
--ec2-attributes 'SubnetId=<subnet-xxxxx>,InstanceProfile=EMR_EC2_DefaultRole,KeyName=<Key name>' \
--service-role EMR_DefaultRole \
--instance-count 4 \
--instance-type m3.2xlarge \
--log-uri '<s3 location for logging>' \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/scripts/download-scripts.sh","Args":["s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger"],"Name":"Download scripts"}]' \
--steps '[{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/install-hive-hdfs-ranger-policies.sh","<ranger host ip>","s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/inputdata"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"InstallRangerPolicies"},{"Args":["spark-submit","--deploy-mode","cluster","--class","org.apache.spark.examples.SparkPi","/usr/lib/spark/examples/jars/spark-examples.jar","10"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"SparkStep"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/install-hive-hdfs-ranger-plugin.sh","<ranger host ip>","0.6","s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"InstallRangerPlugin"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/loadDataIntoHDFS.sh","us-east-1"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"LoadHDFSData"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/createHiveTables.sh","us-east-1"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"CreateHiveTables"}]' \
--configurations '[{"Classification":"hue-ini","Properties":{},"Configurations":[{"Classification":"desktop","Properties":{},"Configurations":[{"Classification":"auth","Properties":{"backend":"desktop.auth.backend.LdapBackend"},"Configurations":[]},{"Classification":"ldap","Properties":{"bind_dn":"binduser","trace_level":"0","search_bind_authentication":"false","debug":"true","base_dn":"dc=corp,dc=emr,dc=local","bind_password":"<user input>","ignore_username_case":"true","create_users_on_login":"true","ldap_username_pattern":"uid=<username>,cn=users,dc=corp,dc=emr,dc=local","force_username_lowercase":"true","ldap_url":"ldap://<ip address of simple ad server>","nt_domain":"corp.emr.local"},"Configurations":[{"Classification":"groups","Properties":{"group_filter":"objectclass=*","group_name_attr":"cn"},"Configurations":[]},{"Classification":"users","Properties":{"user_name_attr":"sAMAccountName","user_filter":"objectclass=*"},"Configurations":[]}]}]}]}]' \
--name 'SecurityPOCCluster' --region us-east-1

The LDAP-related configuration for Hue is passed using the --configurations option. For more information, see Configure Hue for LDAP Users and the EMR create-cluster CLI reference.

EMR steps are used to perform the following:

  • Install and configure Ranger HDFS and Hive plugins
  • Use the Ranger REST API to update repository and authorization policies.
    NOTE: This step needs to be executed only the first time, when the Ranger server is new. Clusters created later against the same Ranger server do not need to include this step action.
  • Create Hive tables (tblAnalyst1 and tblAnalyst2) and copy sample data.
  • Create HDFS folders (/user/analyst1 and /user/analyst2) and copy sample data.
  • Run a SparkPi job using the spark-submit action to verify the cluster setup.

To verify that all the step actions were executed successfully, view the Step section for the EMR cluster.

o_EMRRanger_3
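The step states shown in the console can also be fetched with the AWS CLI. This is a minimal sketch built on `aws emr list-steps`; the function names and the cluster ID in the usage comment are placeholders.

```shell
# Print "name<TAB>state" for every step on the cluster.
list_step_states() {
  local cluster_id="$1"
  aws emr list-steps --cluster-id "$cluster_id" \
    --query 'Steps[].[Name,Status.State]' --output text
}

# Print only the steps that have not (yet) completed, as "name: state".
# No output means every step finished successfully.
steps_not_completed() {
  list_step_states "$1" | awk -F'\t' '$2 != "COMPLETED" { print $1 ": " $2 }'
}

# Usage (placeholder cluster ID):
#   steps_not_completed j-XXXXXXXXXXXXX
```

Because the steps in this post use ActionOnFailure=CONTINUE, a failed step does not stop the cluster, so it is worth checking each state explicitly.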

NOTE: Cluster creation can take 10 to 15 minutes.

Testing the cluster

Congratulations! You have successfully configured the EMR cluster with the ability to manage authorization policies, using Ranger. How do you know if it actually works? You can test it with HDFS or with Hive queries.

Access Web UI

To test the setup, you need access to the Ranger Admin UI and Hue. Use the following URLs:

  1. Ranger admin UI: http://<ip address of the ranger server>:6080/login.jsp
  2. Hue web UI: http://<EMR master IP>:8888

Follow the steps outlined in this blog to access web interfaces on EMR clusters launched in a private subnet.

NOTE: This setup is not required if these instances are in a public subnet or a VPN connection exists.

The same setup can be used to access the Ranger admin UI.
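If you reach the private subnet through a bastion host, one option is plain SSH local port forwarding. The sketch below is an assumption-laden alternative to the linked instructions, not a copy of them: it assumes the bastion runs Amazon Linux (login user ec2-user) and that its security group can reach ports 6080 and 8888 on the target instances. The function name and all arguments are placeholders.

```shell
# Forward local ports 6080 (Ranger Admin UI) and 8888 (Hue) through a bastion host.
# -N keeps the session open without running a remote command.
open_ui_tunnels() {
  local key_file="$1" bastion_host="$2" ranger_ip="$3" emr_master_ip="$4"
  ssh -i "$key_file" -N \
    -L "6080:${ranger_ip}:6080" \
    -L "8888:${emr_master_ip}:8888" \
    "ec2-user@${bastion_host}"
}

# Usage (placeholder values), then browse to http://localhost:6080/login.jsp
# and http://localhost:8888:
#   open_ui_tunnels ~/my-key.pem <bastion public DNS> <ranger server ip> <EMR master ip>
```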

Using HDFS

Log in to Hue (URL: http://<EMR master IP>:8888) as “analyst1” and try to delete a file owned by “analyst2”. For more information about how to access Hue, see Launch the Hue Web Interface. 

  1. Log in using the “analyst1” credentials (password: <user provided >).
  2. Browse to the /user/analyst2 HDFS directory and move the file “football_coach_position.tsv” to trash.
  3. You should see a “Permission denied” error, which is expected.

o_EMRRanger_4

Using Hive queries

The following queries use tables with data in HDFS. The Apache Ranger plugin is installed with HiveServer2 (where Hue submits queries), so you can use Ranger policies to enable fine-grained SQL-based permissions for users. For more information on Hive and EMR, see Additional Features of Hive on Amazon EMR.

Using the Hue SQL Editor, execute the following query as analyst1:

SELECT * FROM default.tblanalyst1

This should return the results as expected. Now, run the following query:

SELECT * FROM default.tblanalyst2

You should see the following error:

o_EMRRanger_5

This makes sense. User analyst1 does not have SELECT permission on table tblanalyst2.

User analyst2 (default password: <user provided >) should see a similar error when accessing table tblanalyst1. User admin1 (default password: <user provided>) should be able to run any query.
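The same checks can be scripted from the EMR master node instead of Hue. The following is a rough sketch using Beeline against HiveServer2; it assumes HiveServer2 listens on its default port 10000 on the master node and accepts the user name passed with -n without further authentication, which may not hold for your cluster. The function name is hypothetical.

```shell
# Run one Hive query through HiveServer2 as the given user.
# Ranger authorizes the query based on the user name HiveServer2 receives.
run_hive_query_as() {
  local user="$1" query="$2"
  beeline -u 'jdbc:hive2://localhost:10000/default' -n "$user" -e "$query"
}

# Usage on the master node:
#   run_hive_query_as analyst1 'SELECT * FROM default.tblanalyst1 LIMIT 5'   # expected to succeed
#   run_hive_query_as analyst1 'SELECT * FROM default.tblanalyst2 LIMIT 5'   # expected to be denied
```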

Updating the security policies

You have verified that the policies are being enforced. Now, let’s try to update them.

  1. Log in to the Ranger Admin UI server
    • URL: http://<ip address of the ranger server>:6080/login.jsp
    • Default admin username/password: admin/<apache ranger default password>.
  2. View all the Ranger Hive policies by selecting “hivedev”
    o_EMRRanger_6
  3. Select the policy named “Analyst2Policy”
  4. Edit the policy by adding “analyst1” user with “select” permissions for table “tblanalyst2”
    EMRRanger_7
  5. Save the changes.

This policy change is pulled in by the Hive plugin on the EMR cluster. Give it at least 60 seconds for the policy refresh to happen.
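The policies shown in the Admin UI can also be read over Ranger's public REST API, which is handy for confirming a change before waiting on the plugin refresh. This is a minimal sketch using curl with basic authentication; the function name is hypothetical, the password is read from an environment variable you set yourself, and the response format may vary between Ranger versions.

```shell
# List all policies of one Ranger service (for example, hivedev) as JSON.
# RANGER_USER defaults to admin; RANGER_PASSWORD must be set by the caller.
ranger_list_policies() {
  local ranger_host="$1" service="$2"
  curl -s -u "${RANGER_USER:-admin}:${RANGER_PASSWORD:?set RANGER_PASSWORD first}" \
    -H 'Accept: application/json' \
    "http://${ranger_host}:6080/service/public/v2/api/service/${service}/policy"
}

# Usage:
#   export RANGER_PASSWORD='<ranger admin password>'
#   ranger_list_policies <ip address of the ranger server> hivedev
```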

Go back to Hue to test if this change has been propagated.

  1. Log back in to the Hue UI as user “analyst1” (see earlier steps).
  2. In the Hive SQL Editor, run the query that failed earlier:
    SELECT * FROM default.tblanalyst2

This query should now run successfully.

o_EMRRanger_8

Column masking and row filtering

Apache Ranger provides the ability to enable column masking and row filtering.

Suppose we want to allow “analyst1” to view only a subset of rows from the table tblanalyst1 and to mask one of the column values. The steps below show how to set that up:

  1. Log in to the Apache Ranger UI as admin (see earlier steps). URL: http://<ip address of the ranger server>:6080/login.jsp
  2. Choose “hivedev” on the Service Manager screen.

Row level filter

  • Choose “Row Level Filter” tab provided on the top of the page and select “Add New Policy”.
  • Create new policy with the following values:
    • Policy Name: analyst1filter
    • Hive Database: default
    • Hive Table: tblanalyst1
    • Under Row filter conditions:
      • Select user: analyst1
      • Access Types: Select
      • Row Level Filter: page='yelp.com'
    • Choose “Add” to save and enable the policy.

Column Masking

  • Under the same “hivedev Policies” choose the tab “Masking”
  • Choose “Add New Policy”
  • Create new policy with the following values:
    • Policy Name: analyst1mask
    • Hive Database: default
    • Hive Table: tblanalyst1
    • Hive Column: request_begin_time
    • Under Mask conditions:
      • Select User: analyst1
      • Access Types: select
      • Select Masking Option: Partial mask: show first 4
    • Choose “Add” to save and enable the policy.

These policy changes are pulled in by the Hive plugin on the EMR cluster. Give it at least 60 seconds for the policy refresh to happen.

Go back to Hue to test if this change has been propagated.

  1. Log back in to the Hue UI as “analyst1” (see earlier steps).
  2. In the Hive SQL Editor, run the following query:
    SELECT * FROM default.tblanalyst1

This query should run successfully but return only the rows where the “page” column value is “yelp.com”. The column “request_begin_time” should display only its first 4 characters, with the rest masked.

Audits

Can you now find those who tried to access the Hive tables and see if they were “denied” or “allowed”?

  1. Log back in to the Ranger UI as admin (see earlier steps).
    URL: http://<ip address of the ranger server>:6080/login.jsp
  2. Choose Audit and filter by “analyst1”.
    • Analyst1 was denied SELECT access to the tblanalyst2 table.
      o_EMRRanger_9
    • After the policy change, the access was granted and logged.
      o_EMRRanger_10

The same audit information is also stored in Solr for performing more complex and full-text searches. Solr is installed on the same instance as the Ranger server.

  • Open Solr UI:
    http://<ip-address-of-ranger-server>:8983/solr/#/ranger_audits/query
  • Perform a document search
    o_EMRRanger_11

Direct URL: http://<ip-address-of-ranger-server>:8983/solr/ranger_audits/select?q=*%3A*&wt=json&indent=true
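The direct URL above returns every audit document. To narrow the search to one user, the query can be built with curl, as in the sketch below. The field names reqUser and evtTime are assumptions based on the Ranger audit schema for Solr and may differ in your Ranger version; the function name is hypothetical.

```shell
# Fetch the most recent audit events for one user from the ranger_audits collection.
# -G sends the URL-encoded parameters as a query string on a GET request.
solr_audits_for_user() {
  local ranger_host="$1" user="$2" rows="${3:-10}"
  curl -s -G "http://${ranger_host}:8983/solr/ranger_audits/select" \
    --data-urlencode "q=reqUser:${user}" \
    --data-urlencode "sort=evtTime desc" \
    --data-urlencode "rows=${rows}" \
    --data-urlencode "wt=json" \
    --data-urlencode "indent=true"
}

# Usage:
#   solr_audits_for_user <ip-address-of-ranger-server> analyst1 20
```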

Conclusion

In this post, I walked through the steps required to enable authorization and audit capabilities on EMR using Apache Ranger, a centrally managed security policy server. I also covered the steps to automate this using CloudFormation templates.

Stay tuned for more posts about security on EMR. If you have questions or suggestions, please comment below.

For information about other EMR security aspects, see Jeff Barr’s posts:

Last updated June 26, 2017


About the author


Varun Rao is a Big Data Architect for AWS Professional Services.
He works with enterprise customers to define data strategy in the cloud. In his spare time, he tries to keep up with his 2-year-old.

 

 

