[AWS] Detect sensitive data using Macie with fully event-driven architecture in all regions

rex.chun
4 min read · Sep 18, 2022
I decided to call this architecture ‘rainbow’

Summary

It’s possible to build a fully event-driven architecture around AWS Macie. Let me introduce it.

What is Macie?

Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to help you discover, monitor, and protect sensitive data in your AWS environment.

That is AWS’s introduction to Macie. Yes, Macie is a good service for finding sensitive data. But without any programming around it, it’s like the earth without the sun.

Let’s see how to use Macie effectively with some programming.

GitHub Link

How it works

  1. Deploy Macie, S3, Lambda, etc. using Terraform
  2. Execute the Lambda function through AWS’s event-driven services
  • Execute ssm:SendCommand to collect the target files
  • After the commands finish, create a Macie classification job
  • When the job ends, process and parse its results via a subscription filter on the classification job’s CloudWatch Logs group
  • Create a results document and save it to the bucket
  • Send alarms to Slack along the way

Configuration Description

  1. macie.tf : Deploys Macie and creates the export configuration. You can also deploy your own custom data identifiers.
  2. s3.tf
  • macie-result-bucket : The bucket Macie exports its inspection results to, because Macie keeps only up to 15 inspection results by default. Refer to this link.
  • private-info-bucket : Stores the files that need to be scanned, to serve as realistic data. Apply whatever bucket policy you want.
  • detect-result-bucket : Stores the detected results as a CSV file. Apply whatever bucket policy you want.

3. eventbridge.tf : A scheduled rule that triggers the Lambda function to detect sensitive data.

4. cloudwatch-logs.tf : Macie’s job status log group, used to watch for the job ending. When the job ends, the subscription filter sends an event to Lambda so it can parse the data.

5. kms.tf : The key used for Macie’s export configuration.

6. sns.tf : A notification that tells Lambda when a Systems Manager Run Command has finished, so it can create the Macie classification job.

7. lambda.tf : The Lambda function at the heart of the fully event-driven architecture. It accepts many triggers.

8. _variables.tf : Set your Slack webhook URL and channel here to receive this architecture’s job status.

Lambda function Description

Let me walk through the simple but important spots in my code: the complicated(?) logic, API characteristics, and so on. Let’s see!

First, get the available regions and the available instances for the send_command API, then execute send_command() with Linux commands.

  1. Get INSTANCE_ID from the instance metadata.
  2. Find files that match the rules, excluding some paths.
  3. Exit with status 0 so Run Command reports a graceful completion.
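The fan-out above can be sketched roughly as below. This is a minimal sketch, not the repository’s actual code: the path and pattern lists, `build_find_command`, and `send_to_all_regions` are hypothetical names I introduce for illustration; only the boto3 calls (`get_available_regions`, `describe_instances`, `send_command`) are real APIs.

```python
EXCLUDED_PATHS = ["/proc", "/sys", "/var/lib/docker"]  # illustrative only
TARGET_PATTERNS = ["*.csv", "*.log"]                   # illustrative only

def build_find_command(patterns, excluded_paths):
    """Build a shell command that lists candidate files, pruning the
    excluded paths, then exits 0 so Run Command reports success."""
    prune = " -o ".join(f"-path {p} -prune" for p in excluded_paths)
    names = " -o ".join(f"-name '{p}'" for p in patterns)
    return f"find / \\( {prune} \\) -o \\( {names} \\) -print; exit 0"

def send_to_all_regions():
    """Run the find command on every running instance in every region."""
    import boto3  # imported lazily so build_find_command is usable offline

    for region in boto3.session.Session().get_available_regions("ssm"):
        ec2 = boto3.client("ec2", region_name=region)
        ssm = boto3.client("ssm", region_name=region)
        reservations = ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )["Reservations"]
        ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if ids:
            ssm.send_command(
                InstanceIds=ids,
                DocumentName="AWS-RunShellScript",  # stock SSM document for shell scripts
                Parameters={"commands": [build_find_command(TARGET_PATTERNS,
                                                            EXCLUDED_PATHS)]},
            )
```

The trailing `exit 0` matters: without it, a `find` that hits permission errors returns non-zero and Run Command marks the invocation as failed.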

You can ignore the COMMAND_COMPLETED_REGIONS variable for now.

Second, Lambda receives an alarm from SNS (Simple Notification Service) and then runs Macie’s classification job, including all managed identifiers and any custom identifiers.
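A sketch of that second stage, under my own assumptions about the event shape: `parse_run_command_status` and `start_classification_job` are hypothetical names, and the `status`/`commandId` fields I read from the SNS message are assumptions about the SSM notification payload. The Macie calls (`list_custom_data_identifiers`, `create_classification_job` with `managedDataIdentifierSelector="ALL"`) are real boto3 APIs.

```python
import json
import uuid

def parse_run_command_status(event):
    """Pull the Run Command status out of an SNS-triggered Lambda event.
    The SNS message body is a JSON document (field names assumed here)."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message.get("status"), message.get("commandId")

def start_classification_job(account_id, bucket_name, region):
    """Start a one-time Macie job over the scanned-files bucket, with all
    managed identifiers plus every custom identifier in the account."""
    import boto3  # imported lazily so the parser above stays testable offline

    macie = boto3.client("macie2", region_name=region)
    custom_ids = [i["id"] for i in macie.list_custom_data_identifiers()["items"]]
    return macie.create_classification_job(
        clientToken=str(uuid.uuid4()),        # idempotency token
        jobType="ONE_TIME",
        name=f"detect-private-info-{uuid.uuid4().hex[:8]}",
        managedDataIdentifierSelector="ALL",  # all managed identifiers
        customDataIdentifierIds=custom_ids,   # plus the custom ones
        s3JobDefinition={
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    )
```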

Frankly, this code is the most important part of my architecture. Let’s take a look at it slowly.

  1. Lambda receives an alarm from the CloudWatch Logs subscription filter (JOB_COMPLETED).
  2. Get the job_id by parsing the message.
  3. List findings using the job_id, then get the findings using all_finding_ids.
  4. Get the CLASSIFICATION category’s data using a list comprehension.
  5. After checking the status code, check whether the additionalOccurrences key exists. This key is described at this link.
  6. Get the detected results from the sensitive results.
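Steps 1–3 above can be sketched as follows. The base64+gzip envelope is how CloudWatch Logs actually delivers subscription-filter events; the `jobId` field inside each log message and the helper names are my assumptions, while `list_findings` (with the `classificationDetails.jobId` filter criterion) and `get_findings` are real Macie boto3 APIs.

```python
import base64
import gzip
import json

def decode_awslogs_payload(data):
    """CloudWatch Logs delivers subscription-filter events to Lambda as
    base64-encoded, gzip-compressed JSON under event['awslogs']['data']."""
    return json.loads(gzip.decompress(base64.b64decode(data)))

def extract_job_ids(payload):
    """Parse the Macie jobId out of each JSON log event (field name
    assumed from the classification job log schema)."""
    return [json.loads(e["message"])["jobId"] for e in payload["logEvents"]]

def fetch_findings(job_id, region):
    """List this job's finding IDs, then fetch the full findings."""
    import boto3  # imported lazily so the pure helpers stay testable offline

    macie = boto3.client("macie2", region_name=region)
    finding_ids = macie.list_findings(
        findingCriteria={
            "criterion": {"classificationDetails.jobId": {"eq": [job_id]}}
        }
    )["findingIds"]
    if not finding_ids:
        return []
    return macie.get_findings(findingIds=finding_ids)["findings"]
```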

This function branches two ways, custom and managed (sensitive). Finally, it extracts the private info from an object using a function like the one below.

Macie locates matches through several occurrence schemas: cells , pages , records , offsetRanges , and lineRanges. offsetRanges is not available yet.
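Walking those schemas might look like the sketch below. The occurrence field names (`cells`, `lineRanges`, `pages`, `records` and their sub-fields) come from Macie’s finding schema; the function itself and its output format are my own illustration, not the repository’s code.

```python
def describe_occurrences(occurrences):
    """Flatten a Macie finding's `occurrences` object into a list of
    human-readable locations. offsetRanges is skipped, matching the
    note above that it is not available yet."""
    locations = []
    for cell in occurrences.get("cells") or []:        # CSV / spreadsheet hits
        locations.append(f"cell row={cell['row']} column={cell['column']}")
    for rng in occurrences.get("lineRanges") or []:    # plain-text hits
        locations.append(f"lines {rng['start']}-{rng['end']}")
    for page in occurrences.get("pages") or []:        # PDF hits
        locations.append(f"page {page['pageNumber']}")
    for rec in occurrences.get("records") or []:       # JSON / Avro / Parquet hits
        locations.append(f"record {rec['recordIndex']} at {rec['jsonPath']}")
    return locations
```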

I think you can understand this logic, because it is just the documented schema written as code.

Additionally, the Lambda has many other helper functions, like convert_center_string_to_asterisk , stream_to_str , etc. But I think they are simple.
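For example, convert_center_string_to_asterisk presumably masks detected values before they land in the report. The implementation below is my guess at what a function with that name does, not the repository’s code:

```python
def convert_center_string_to_asterisk(value):
    """Mask the middle of a detected value, keeping only the first and
    last characters visible, so raw sensitive data never reaches the
    CSV report or Slack."""
    if len(value) <= 2:
        return "*" * len(value)  # too short to keep anything visible
    return value[0] + "*" * (len(value) - 2) + value[-1]
```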

Now, let’s look at the function’s initialization part. To understand it, you should know how Lambda works: the COMMAND_COMPLETED_REGIONS variable is shared across invocations of a warm execution environment, so I declared it at module level. But this may cause a race condition (I think the probability is very low). If you want to solve this problem, you should use a lock mechanism like a semaphore or mutex.
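The warm-state trick looks like this minimal sketch (the event shape is hypothetical):

```python
# Module scope runs once per execution environment (cold start) and the
# resulting state is then reused by every warm invocation of that
# environment -- which is exactly why the variable sits at the top.
COMMAND_COMPLETED_REGIONS = set()

def handler(event, context):
    region = event.get("region", "unknown")  # hypothetical event field
    # Read-modify-write on shared module state. A single environment
    # handles one invocation at a time, but two *concurrent* environments
    # each get their own copy of this set, so it coordinates work within
    # an environment, not across them -- hence the race-condition caveat.
    COMMAND_COMPLETED_REGIONS.add(region)
    return sorted(COMMAND_COMPLETED_REGIONS)
```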

Results

Slack example
CSV file

Limits

  1. There are many regions but only one bucket region, so the aws s3 cp command’s speed differs from region to region. To solve this, distribute the private-info-bucket region by region.
  2. Cross-account? You can create roles to assume, and modify my Lambda function’s code.
  3. The Lambda timeout (60s) and memory (512MB) may be insufficient. If you have more than 50 or 100 instances, scale up your Lambda function.
  4. This architecture is for Linux only. If you use Windows, update send_command() and the scripts.
  5. This architecture requires an EC2 instance profile. You can refer to this link if you want to force instance profile assignment.
  6. https://acloudguru.com/blog/engineering/how-to-keep-your-lambda-functions-warm
