Employing Amazon Macie to Discover and Protect Sensitive Data in your Amazon S3-based Data Lake

Introduction

Working with Analytics customers, it’s not uncommon to see data lakes with a dozen or more discrete data sources. Data typically originates from sources both internal and external to the customer. Internal data may come from multiple teams, departments, divisions, and enterprise systems. External data comes from vendors, partners, public sources, and subscriptions to licensed data sources. The volume, velocity, variety, veracity, and method of delivery vary across the data sources. All this data is being fed into data lakes for purposes such as analytics, business intelligence, and machine learning.

Given the growing volumes of incoming data and variations amongst data sources, it is increasingly complex, expensive, and time-consuming for organizations to ensure compliance with relevant laws, policies, and regulations. Regulations that impact how data is handled in a data lake include the Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), Payment Card Industry Data Security Standard (PCI DSS), California Consumer Privacy Act (CCPA), and the Federal Information Security Management Act (FISMA).

Data Lake

AWS defines a data lake as a centralized repository that allows you to store all your structured and unstructured data at any scale. Once in the data lake, you run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Data in a data lake is typically organized or separated by its stage in the analytics process. Incoming data is often referred to as raw data. Data is then processed: cleansed, filtered, enriched, and tokenized if necessary. Lastly, the data is analyzed and aggregated, and the results are written back to the data lake. The analyzed and aggregated data is used to build business intelligence dashboards, reports, and machine learning models, and is delivered to downstream or external systems. The different categories of data (raw, processed, and aggregated) are frequently referred to as bronze, silver, and gold, a reference to their overall data quality or value.

Protecting the Data Lake

Imagine you’ve received a large volume of data from an external data source. The incoming data is cleansed, filtered, and enriched. The data is re-formatted, partitioned, compressed for analytical efficiency, and written back to the data lake. Your analytics pipelines run complex and time-consuming queries against the data. Unfortunately, while building reports for a set of stakeholders, you realize that the original data accidentally included credit card information and other sensitive information about your customers. In addition to being out of compliance, you have the wasted time and expense of the initial data processing, as well as the extra time and expense to replace and re-process the data. The solution — Amazon Macie.

Amazon Macie

According to AWS, Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data stored in Amazon Simple Storage Service (Amazon S3). Macie’s alerts, or findings, can be searched, filtered, and sent to Amazon EventBridge, formerly called Amazon CloudWatch Events, for easy integration with existing workflow or event management systems, or to be used in combination with AWS services, such as AWS Step Functions or Amazon Managed Workflows for Apache Airflow (MWAA) to take automated remediation actions.

Data Discovery and Protection

In this post, we will deploy an automated data inspection workflow to examine sample data in an S3-based data lake. Amazon Macie will examine data files uploaded to an encrypted S3 bucket. If sensitive data is discovered within the files, the files will be moved to an encrypted isolation bucket for further investigation. Email and SMS text alerts will be sent. This workflow will leverage Amazon EventBridge, Amazon Simple Notification Service (Amazon SNS), AWS Lambda, and AWS Systems Manager Parameter Store.

Macie data inspection workflow architecture

Source Code

Using this git clone command, download a copy of this post’s GitHub repository to your local environment.

git clone --branch main --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/macie-demo.git

AWS resources for this post can be deployed using AWS CloudFormation. To follow along, you will need recent versions of Python 3, Boto3, and version 2 of the AWS CLI installed.

Sample Data

We will use synthetic patient data, freely available from the MITRE Corporation. The data was generated by Synthea, MITRE’s open-source, synthetic patient generator that models the medical history of synthetic patients. Synthea data is exported in a variety of data standards, including HL7 FHIR, C-CDA, and CSV. We will use CSV-format data files for this post. Download and unzip the CSV files from the Synthea website.

REMOTE_FILE="synthea_sample_data_csv_apr2020.zip"

wget "https://storage.googleapis.com/synthea-public/${REMOTE_FILE}"

unzip -j "${REMOTE_FILE}" -d synthea_data/

The sixteen CSV data files contain a total of 471,852 rows of data, including column headers.

> wc -l *.csv

      598 allergies.csv
    3,484 careplans.csv
    8,377 conditions.csv
       79 devices.csv
   53,347 encounters.csv
      856 imaging_studies.csv
   15,479 immunizations.csv
   42,990 medications.csv
  299,698 observations.csv
    1,120 organizations.csv
    1,172 patients.csv
    3,802 payer_transitions.csv
       11 payers.csv
   34,982 procedures.csv
    5,856 providers.csv
        1 supplies.csv
  ------------------------------
  471,852 total

Amazon Macie Custom Data Identifier

To demonstrate some of the advanced features of Amazon Macie, we will use three Custom Data Identifiers. According to Macie’s documentation, a custom data identifier is a set of criteria that you define that reflects your organization’s particular proprietary data, for example, employee IDs, customer account numbers, or internal data classifications. We will create three custom data identifiers to detect the specific Synthea-format Patient ID, US driver’s license number, and US passport number columns.

The custom data identifiers in this post use a combination of regular expressions (regex) and keywords. The identifiers are designed to work with structured data, such as CSV files. Macie reports text that matches the regex pattern if any of these keywords are in the name of the column or field that stores the text, or if the text is within the maximum match distance of one of these words in a field value. Macie supports a subset of the regex pattern syntax provided by the Perl Compatible Regular Expressions (PCRE) library.
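To make the combination of regex and keywords more concrete, here is a minimal sketch of how a Patient ID custom data identifier might be defined with Boto3. The UUID-style regex, the keyword list, and the match distance are assumptions chosen for illustration; they are not the values used in this post's CloudFormation template.

```python
import re

# Assumption: Synthea patient IDs are UUID-style strings,
# e.g. 1d604da9-9a81-4ba9-80c2-de3375d59b40
PATIENT_ID_REGEX = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


def build_patient_id_identifier():
    """Return keyword arguments for the macie2 create_custom_data_identifier call."""
    return {
        "name": "Patient ID",
        "description": "Synthea-format patient ID (illustrative values)",
        "regex": PATIENT_ID_REGEX,
        "keywords": ["patient"],       # hypothetical keyword list
        "maximumMatchDistance": 50,    # hypothetical match distance
    }


def create_identifier(request):
    """Create the identifier in Macie and return its ID (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without Boto3

    macie_client = boto3.client("macie2")
    response = macie_client.create_custom_data_identifier(**request)
    return response["customDataIdentifierId"]
```

The regex can be checked locally with Python's re module before it is sent to Macie, keeping in mind that Macie supports only a subset of PCRE syntax.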

Patient ID custom data identifier console

Enable Macie

Before creating a CloudFormation stack with this demonstration’s resources, you will need to enable Amazon Macie from the AWS Management Console, or use the macie2 API and the AWS CLI with the enable-macie command.

aws macie2 enable-macie

Macie can also be enabled for your multi-account AWS Organization. The enable-organization-admin-account command designates an account as the delegated Amazon Macie administrator account for an AWS organization. For more information, see Managing multiple accounts in Amazon Macie.

AWS_ACCOUNT=111222333444

aws macie2 enable-organization-admin-account \
    --admin-account-id ${AWS_ACCOUNT}

CloudFormation Stack

To create the CloudFormation stack with the supplied template, cloudformation/macie_demo.yml, run the following AWS CLI command. You will need to include an email address and phone number as input parameters. These parameter values will be used to send email and text alerts when Macie produces a sensitive data finding.

Please make sure you understand all the potential cost and security implications of creating the CloudFormation stack before continuing.

SNS_PHONE="+12223334444"
SNS_EMAIL="your-email-address@email.com"

aws cloudformation create-stack \
  --stack-name macie-demo \
  --template-body file://cloudformation/macie_demo.yml \
  --parameters ParameterKey=SNSTopicEndpointSms,ParameterValue=${SNS_PHONE} \
  ParameterKey=SNSTopicEndpointEmail,ParameterValue=${SNS_EMAIL} \
  --capabilities CAPABILITY_NAMED_IAM

As shown in the AWS CloudFormation console, the new macie-demo stack will contain twenty-one AWS resources.

CloudFormation stack successfully created

Upload Data

Next, with the stack deployed, upload the CSV format data files to the encrypted S3 bucket, representing your data lake. The target S3 bucket has the following naming convention, synthea-data-<aws_account_id>-<region>. You can retrieve the two new bucket names from AWS Systems Manager Parameter Store, which were written there by CloudFormation, using the ssm API.

aws ssm get-parameters-by-path \
  --path /macie_demo/ \
  --query 'Parameters[*].Value'

Use the following ssm and s3 API commands to upload the data files.

DATA_BUCKET=$(aws ssm get-parameter \
    --name /macie_demo/patient_data_bucket \
    --query 'Parameter.Value' --output text)

aws s3 cp synthea_data/ \
    "s3://${DATA_BUCKET}/patient_data/" --recursive

You should end up with sixteen CSV files in the S3 bucket, totaling approximately 82.3 MB.
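If you prefer Python to the AWS CLI, the same upload could be sketched with Boto3 roughly as follows. The function names are hypothetical, and the patient_data/ prefix simply mirrors the CLI command above.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix="patient_data/"):
    """Map each local CSV file to the S3 key it would be uploaded under."""
    return [
        (path, f"{prefix}{path.name}")
        for path in sorted(Path(local_dir).glob("*.csv"))
    ]


def upload_files(bucket_name, local_dir):
    """Upload every planned file to the bucket (requires AWS credentials)."""
    import boto3  # imported lazily so plan_uploads() works without Boto3

    s3_client = boto3.client("s3")
    for path, key in plan_uploads(local_dir):
        s3_client.upload_file(str(path), bucket_name, key)
```

Calling plan_uploads("synthea_data/") first is a cheap way to confirm that all sixteen files will be picked up before anything is sent to S3.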

Synthea patient data files uploaded to S3

Sensitive Data Discovery Jobs

With the CloudFormation stack created and the patient data files uploaded, we will create two sensitive data discovery jobs. These jobs will scan the contents of the encrypted S3 bucket for sensitive data and report the findings. According to the documentation, you can configure a sensitive data discovery job to run only once for on-demand analysis and assessment, or on a recurring basis for periodic analysis, assessment, and monitoring. For this demonstration, we will create a one-time sensitive data discovery job using the AWS CLI. We will also create a recurring sensitive data discovery job using the AWS SDK for Python (Boto3). Both jobs can also be created from within Macie’s Jobs console.

For both sensitive data discovery jobs, we will include the three custom data identifiers. Each of the custom data identifiers has a unique ID. We will need all three IDs to create the two sensitive data discovery jobs. You can use the AWS CLI and the macie2 API to retrieve the values.

aws macie2 list-custom-data-identifiers --query 'items[*].id'

Next, modify the job_specs/macie_job_specs_1x.json file, adding the three custom data identifier IDs. Also, update your AWS account ID and S3 bucket name (lines 3–5, 12, and 14). Note that since all the patient data files are in CSV format, we will limit our inspection to only files with a csv file extension (lines 18–33).

	{
	  "customDataIdentifierIds": [
	    "custom-data-identifier-id-1",
	    "custom-data-identifier-id-2",
	    "custom-data-identifier-id-3"
	  ],
	  "description": "Review Synthea patient data (1x)",
	  "jobType": "ONE_TIME",
	  "s3JobDefinition": {
	    "bucketDefinitions": [
	      {
	        "accountId": "111222333444",
	        "buckets": [
	          "synthea-data-111222333444-us-east-1"
	        ]
	      }
	    ],
	    "scoping": {
	      "includes": {
	        "and": [
	          {
	            "simpleScopeTerm": {
	              "comparator": "EQ",
	              "key": "OBJECT_EXTENSION",
	              "values": [
	                "csv"
	              ]
	            }
	          }
	        ]
	      }
	    }
	  },
	  "tags": {
	    "Project": "Amazon Macie Demo"
	  }
	}


The above JSON template was generated using the standard AWS CLI generate-cli-skeleton command.

aws macie2 create-classification-job --generate-cli-skeleton
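Rather than editing the job specification by hand, the account-specific values could also be filled in with a short script. The following is a sketch only; the function names are hypothetical, and it assumes the file has the same structure as job_specs/macie_job_specs_1x.json.

```python
import copy
import json


def fill_job_spec(spec, identifier_ids, account_id, bucket_name):
    """Return a copy of the job spec with account-specific values filled in."""
    filled = copy.deepcopy(spec)
    filled["customDataIdentifierIds"] = list(identifier_ids)
    bucket_definition = filled["s3JobDefinition"]["bucketDefinitions"][0]
    bucket_definition["accountId"] = account_id
    bucket_definition["buckets"] = [bucket_name]
    return filled


def update_job_spec_file(path, identifier_ids, account_id, bucket_name):
    """Rewrite the JSON file at path in place with the filled-in values."""
    with open(path) as spec_file:
        spec = json.load(spec_file)
    filled = fill_job_spec(spec, identifier_ids, account_id, bucket_name)
    with open(path, "w") as spec_file:
        json.dump(filled, spec_file, indent=2)
```

The identifier IDs would come from the list-custom-data-identifiers command shown earlier, and the bucket name from Parameter Store.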

To create a one-time sensitive data discovery job using the above JSON template, run the following AWS CLI command. The unique job name will be dynamically generated based on the current time.

aws macie2 create-classification-job \
    --name "SyntheaPatientData_$(date +%s)" \
    --cli-input-json file://job_specs/macie_job_specs_1x.json

In the Amazon Macie Jobs console, we can see a one-time sensitive data discovery job running. With a sampling depth of 100 percent, the job analyzes every eligible object and will take several minutes to run. The samplingPercentage job property can be adjusted to scan any percentage of the data. If this value is less than 100, Macie selects the objects to analyze at random, up to the specified percentage, and analyzes all the data in those objects.

Once the job is completed, the findings will be available in Macie’s Findings console. Using the three custom data identifiers in addition to Macie’s managed data identifiers, there should be a total of fifteen findings from the Synthea patient data files in S3. There should be six High severity findings and nine Medium severity findings. Of those, three are of a Personal finding type, seven of a Custom Identifier finding type, and five of a Multiple finding type, having both Personal and Custom Identifier finding types.
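The same breakdown can be tallied outside the console. The sketch below is one possible approach, assuming each finding carries a severity description and a type string such as SensitiveData:S3Object/Personal; the function names are hypothetical.

```python
from collections import Counter


def summarize_findings(findings):
    """Count findings by severity label and by the last segment of their type."""
    by_severity = Counter(f["severity"]["description"] for f in findings)
    by_type = Counter(f["type"].rsplit("/", 1)[-1] for f in findings)
    return by_severity, by_type


def fetch_findings():
    """Retrieve all Macie findings in the current region (requires AWS credentials)."""
    import boto3  # imported lazily so summarize_findings() works without Boto3

    macie_client = boto3.client("macie2")

    # Collect finding IDs, following pagination tokens
    finding_ids, kwargs = [], {}
    while True:
        page = macie_client.list_findings(**kwargs)
        finding_ids.extend(page["findingIds"])
        if not page.get("nextToken"):
            break
        kwargs["nextToken"] = page["nextToken"]

    # Fetch the finding details in batches of 50 IDs
    findings = []
    for start in range(0, len(finding_ids), 50):
        batch = finding_ids[start:start + 50]
        findings.extend(macie_client.get_findings(findingIds=batch)["findings"])
    return findings
```

Passing the result of fetch_findings() to summarize_findings() returns two counters that can be compared with the totals shown in the Findings console.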

Macie’s Findings console displaying the results of the one-time job

Isolating High Severity Findings

The data inspection workflow we have deployed uses an AWS Lambda function, macie-object-mover, to isolate all data files with High severity findings to a second S3 bucket. The offending files are copied to the isolation bucket and deleted from the source bucket.

	#!/usr/bin/env python3

	# Purpose: Lambda function that moves S3 objects flagged by Macie
	# Author: Gary A. Stafford (March 2021)


	import json
	import logging

	import boto3
	from botocore.exceptions import ClientError

	logger = logging.getLogger()
	logger.setLevel(logging.INFO)

	s3_client = boto3.client('s3')


	def lambda_handler(event, context):
	    logger.info(f'event: {json.dumps(event)}')

	    # Isolation bucket name; replace with the name of your own bucket
	    destination_bucket_name = 'macie-isolation-111222333444-us-east-1'

	    # The finding event identifies the bucket and object that Macie flagged
	    source_bucket_name = event['detail']['resourcesAffected']['s3Bucket']['name']
	    file_key_name = event['detail']['resourcesAffected']['s3Object']['key']
	    copy_source_object = {'Bucket': source_bucket_name, 'Key': file_key_name}

	    logger.debug(f'destination_bucket_name: {destination_bucket_name}')
	    logger.debug(f'source_bucket_name: {source_bucket_name}')
	    logger.debug(f'file_key_name: {file_key_name}')

	    # Copy the flagged object to the isolation bucket
	    try:
	        response = s3_client.copy_object(
	            CopySource=copy_source_object,
	            Bucket=destination_bucket_name,
	            Key=file_key_name
	        )
	        logger.info(response)
	    except ClientError as ex:
	        logger.error(ex)
	        raise  # fail the invocation; the source object is left in place

	    # Delete the object from the source bucket only after a successful copy
	    try:
	        response = s3_client.delete_object(
	            Bucket=source_bucket_name,
	            Key=file_key_name
	        )
	        logger.info(response)
	    except ClientError as ex:
	        logger.error(ex)
	        raise

	    return {
	        'statusCode': 200,
	        'body': json.dumps(copy_source_object)
	    }
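To show which parts of the finding event the function depends on, here is a small sketch that extracts the bucket and key from a trimmed, hypothetical event. The helper names are made up for this example, and the location of the severity label (detail.severity.description) is an assumption about the event structure rather than something the function above uses.

```python
def extract_object_location(event):
    """Return (bucket, key) for the S3 object named in a Macie finding event."""
    resources = event["detail"]["resourcesAffected"]
    return resources["s3Bucket"]["name"], resources["s3Object"]["key"]


def is_high_severity(event):
    """Assumption: the severity label is found at detail.severity.description."""
    return event["detail"].get("severity", {}).get("description") == "High"


# Trimmed, hypothetical finding event; real events contain many more fields
sample_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {
        "severity": {"description": "High"},
        "resourcesAffected": {
            "s3Bucket": {"name": "synthea-data-111222333444-us-east-1"},
            "s3Object": {"key": "patient_data/patients.csv"},
        },
    },
}

bucket, key = extract_object_location(sample_event)
```

An event shaped like sample_event can also be used as a test payload when invoking the Lambda function manually.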


Amazon EventBridge

According to Macie’s documentation, to support integration with other applications, services, and systems, such as monitoring or event management systems, Amazon Macie automatically publishes findings to Amazon EventBridge as finding events. Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale using events generated from your applications, integrated Software-as-a-Service (SaaS) applications, and AWS services.

Each EventBridge rule contains an event pattern. The event pattern is used to filter the incoming stream of events for particular patterns. The EventBridge rule that is triggered when a Macie finding is based on any of the custom data identifiers, macie-rule-custom, uses the event pattern shown below. This pattern examines the finding event for the name of one of the three custom data identifier names that triggered it.

Post’s event rules, shown in the Amazon EventBridge console

The event pattern for macie-rule-custom is listed here. It matches any Macie finding event in which the name of a detected custom data identifier is Patient ID, US Passport, or US Driver License. A second rule, macie-rule-high, handles the High severity findings that trigger the Lambda function described earlier; its event pattern is not reproduced in this post.

{
  "source": [
    "aws.macie"
  ],
  "detail-type": [
    "Macie Finding"
  ],
  "detail": {
    "classificationDetails": {
      "result": {
        "customDataIdentifiers": {
          "detections": {
            "name": [
              "Patient ID",
              "US Passport",
              "US Driver License"
            ]
          }
        }
      }
    }
  }
}
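To illustrate how a pattern like this filters events, here is a deliberately simplified matcher written in Python. It handles only exact-value matching and is not EventBridge's actual implementation; the function name and the sample event are hypothetical.

```python
CUSTOM_RULE_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {
        "classificationDetails": {
            "result": {
                "customDataIdentifiers": {
                    "detections": {
                        "name": ["Patient ID", "US Passport", "US Driver License"]
                    }
                }
            }
        }
    },
}


def matches(pattern, event):
    """Return True if the event satisfies the pattern (exact-value matching only)."""
    if isinstance(pattern, dict):
        # For an array in the event, it is enough that one element matches
        if isinstance(event, list):
            return any(matches(pattern, item) for item in event)
        if not isinstance(event, dict):
            return False
        return all(
            key in event and matches(sub_pattern, event[key])
            for key, sub_pattern in pattern.items()
        )
    # A pattern leaf is a list of allowed values
    if isinstance(event, list):
        return any(value in pattern for value in event)
    return event in pattern


sample_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {
        "classificationDetails": {
            "result": {
                "customDataIdentifiers": {
                    "detections": [{"name": "Patient ID", "count": 10}]
                }
            }
        }
    },
}

is_match = matches(CUSTOM_RULE_PATTERN, sample_event)
```

Every key in the pattern must be present in the event, and each leaf value in the event must appear in the pattern's list of allowed values. Fields that the pattern does not mention, such as count, are ignored.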

Six data files containing High severity findings will be moved to the isolation bucket by the Lambda function, which is triggered by EventBridge.

Isolation bucket containing data files with High severity findings

Scheduled Sensitive Data Discovery Jobs

Data sources commonly deliver data on a repeated basis, such as nightly data feeds. For these types of data sources, we can configure sensitive data discovery jobs to run on a recurring schedule. For this demonstration, we will create a scheduled job using the AWS SDK for Python (Boto3). Unlike the AWS CLI-based one-time job, you don’t need to modify the project’s script, scripts/create_macie_job_daily.py. The Python script will retrieve your AWS account ID and three custom data identifier IDs. The Python script then runs the create_classification_job command.

	#!/usr/bin/env python3

	# Purpose: Create Daily Macie classification job - Synthea patient data
	# Author: Gary A. Stafford (March 2021)

	import logging
	import sys

	import boto3
	from botocore.exceptions import ClientError

	logging.basicConfig(format='[%(asctime)s] %(levelname)s - %(message)s', level=logging.INFO)

	ssm_client = boto3.client('ssm')
	sts_client = boto3.client('sts')
	macie_client = boto3.client('macie2')


	def main():
	    params = get_parameters()
	    account_id = sts_client.get_caller_identity()['Account']
	    custom_data_identifiers = list_custom_data_identifiers()
	    create_classification_job(params['patient_data_bucket'], account_id, custom_data_identifiers)


	def list_custom_data_identifiers():
	    """Returns a list of all custom data identifier ids"""

	    custom_data_identifiers = []

	    try:
	        response = macie_client.list_custom_data_identifiers()
	        for item in response['items']:
	            custom_data_identifiers.append(item['id'])
	        return custom_data_identifiers
	    except ClientError as e:
	        logging.error(e)
	        sys.exit(e)


	def create_classification_job(patient_data_bucket, account_id, custom_data_identifiers):
	    """Create Daily Macie classification job"""

	    try:
	        response = macie_client.create_classification_job(
	            customDataIdentifierIds=custom_data_identifiers,
	            description='Review Synthea patient data (Daily)',
	            jobType='SCHEDULED',
	            initialRun=True,
	            name='SyntheaPatientData_Daily',
	            s3JobDefinition={
	                'bucketDefinitions': [
	                    {
	                        'accountId': account_id,
	                        'buckets': [
	                            patient_data_bucket
	                        ]
	                    }
	                ],
	                'scoping': {
	                    'includes': {
	                        'and': [
	                            {
	                                'simpleScopeTerm': {
	                                    'comparator': 'EQ',
	                                    'key': 'OBJECT_EXTENSION',
	                                    'values': [
	                                        'csv',
	                                    ]
	                                }
	                            },
	                        ]
	                    }
	                }
	            },
	            samplingPercentage=100,
	            scheduleFrequency={
	                'dailySchedule': {}
	            },
	            tags={
	                'Project': 'Amazon Macie Demo'
	            }
	        )
	        logging.debug(f'Response: {response}')
	    except ClientError as e:
	        logging.error(e)
	        sys.exit(e)


	def get_parameters():
	    """Load parameter values from AWS Systems Manager (SSM) Parameter Store"""

	    params = {
	        'patient_data_bucket': ssm_client.get_parameter(Name='/macie_demo/patient_data_bucket')['Parameter']['Value']
	    }

	    return params


	if __name__ == '__main__':
	    main()


To create the scheduled sensitive data discovery job, run the following command.

python3 ./scripts/create_macie_job_daily.py

The scheduleFrequency parameter is set to { 'dailySchedule': {} }. This value specifies a daily recurrence pattern for running the job. The initialRun parameter of the create_classification_job command is set to True. This will cause the new job to analyze all eligible objects immediately after the job is created, in addition to on a daily basis.
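Daily is not the only recurrence the scheduleFrequency parameter accepts; weekly and monthly schedules are also available. The helper below is a hypothetical sketch of how the three forms could be built. The default day values are assumptions chosen for this example.

```python
def schedule_frequency(period, day=None):
    """Build a scheduleFrequency value for create_classification_job."""
    if period == "daily":
        return {"dailySchedule": {}}
    if period == "weekly":
        # dayOfWeek is an upper-case day name, e.g. 'MONDAY'
        return {"weeklySchedule": {"dayOfWeek": day or "MONDAY"}}
    if period == "monthly":
        # dayOfMonth is an integer from 1 to 31
        return {"monthlySchedule": {"dayOfMonth": day or 1}}
    raise ValueError(f"Unsupported period: {period}")
```

For example, passing scheduleFrequency=schedule_frequency("weekly", "SUNDAY") to create_classification_job would run the job once a week on Sundays instead of every day.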

Scheduled sensitive data discovery job in an active/idle state

Conclusion

In this post, we learned how we can use Amazon Macie to discover and protect sensitive data in Amazon S3. We learned how to use automation to trigger alerts based on Macie’s findings and to isolate data files based on the types of findings. The post’s data inspection workflow can easily be incorporated into existing data lake ingestion pipelines to ensure the integrity of incoming data.

This blog represents my own viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.

Amazon Macie, AWS, Data Governance, Data Lake, Data Protection

This entry was posted on March 15, 2021, 6:35 am and is filed under AWS, Bash Scripting, Big Data, Cloud, Enterprise Software Development, Python.


Programmatic Ponderings


Gary Stafford
