AWS Data Lake 🌊

Building a modern data pipeline with Terraform, AWS Glue, S3, and Athena.

🔍 Project Overview

This project demonstrates a scalable data lake architecture hosted on AWS. Using Terraform, I provisioned an end-to-end data pipeline including S3 buckets for storage, IAM roles for security, a Glue database for metadata, and a Glue ETL job for processing.

The pipeline enables a classic "Bronze-to-Silver" transformation: raw JSON event data is uploaded to a landing zone, transformed via a PySpark script, and stored in optimized Parquet format for high-performance querying with Amazon Athena.
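Conceptually, the Bronze-to-Silver step reshapes each raw JSON event into an analytics-friendly record. Here is a minimal pure-Python sketch of that transformation, assuming a hypothetical event schema (`event_id`, `event_type`, `user_id`, epoch `timestamp`); the real pipeline does this at scale with PySpark:

```python
import json
from datetime import datetime, timezone

# Hypothetical raw "bronze" event (field names are illustrative)
raw_event = (
    '{"event_id": "e-1001", "event_type": "purchase", '
    '"user_id": "u-42", "timestamp": 1700000000}'
)

def to_silver(raw: str) -> dict:
    """Parse a raw JSON event and derive a human-readable event_time,
    the kind of enrichment applied in the 'silver' layer."""
    event = json.loads(raw)
    event["event_time"] = datetime.fromtimestamp(
        event["timestamp"], tz=timezone.utc
    ).isoformat()
    return event

print(to_silver(raw_event)["event_time"])  # → 2023-11-14T22:13:20+00:00
```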

🛠️ Tech Stack

  • Terraform
  • AWS Glue (ETL)
  • S3 (Data Lake)
  • Amazon Athena
  • Python / PySpark
  • IAM Roles & Policies

🧑🏻‍💻 Transformation & Querying

Glue ETL: PySpark Transformation

A PySpark script that reads the raw table from the Glue Data Catalog and writes it back to S3 as Parquet.

# Glue ETL: Transform raw JSON to Parquet
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the raw table registered in the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="julien_datalake_raw"
)

# Convert the DynamicFrame to a Spark DataFrame for transformation
df = datasource.toDF()

# Write the transformed data to the processed S3 bucket as Parquet
df.write.mode("overwrite").parquet(
    "s3://julien-datalake-processed/events/"
)

Athena: SQL Analysis

Serverless SQL queries run directly against the Parquet files in S3, with no infrastructure to manage.

-- Select the top 10 transformed event records
SELECT 
    event_id, 
    event_type, 
    user_id, 
    from_unixtime(timestamp) as event_time
FROM datalake_db.processed_events
WHERE event_type = 'purchase'
ORDER BY timestamp DESC
LIMIT 10;

🚀 Next Steps & Evolution

To evolve this project into a production-grade system, natural next steps include partitioning the processed data by date, scheduling the Glue job with a trigger, and adding automated data-quality checks.

Skills Demonstrated

  • Data Engineering (ETL Pipelines)
  • Infrastructure as Code (Terraform)
  • Distributed Computing (PySpark)
  • Serverless Analytics (Athena)
  • AWS Security & IAM Best Practices

Source Code

The Terraform modules and ETL scripts are available on my GitHub.

View on GitHub →