Building a modern data pipeline with Terraform, AWS Glue, S3, and Athena.
This project demonstrates a scalable data lake architecture hosted on AWS. Using Terraform, I provisioned an end-to-end data pipeline including S3 buckets for storage, IAM roles for security, a Glue database for metadata, and a Glue ETL job for processing.
The pipeline enables a classic "Bronze-to-Silver" transformation: raw JSON event data is uploaded to a landing zone, transformed via a PySpark script, and stored in optimized Parquet format for high-performance querying with Amazon Athena.
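As an illustrative sketch of the Terraform side (resource names, the raw bucket name, and settings here are assumptions, not the project's actual configuration), the core resources might look like:

```hcl
# Sketch only: names and settings are illustrative assumptions.
resource "aws_s3_bucket" "raw" {
  bucket = "julien-datalake-raw"
}

resource "aws_s3_bucket" "processed" {
  bucket = "julien-datalake-processed"
}

resource "aws_glue_catalog_database" "datalake" {
  name = "datalake_db"
}

resource "aws_glue_job" "etl" {
  name     = "bronze-to-silver"
  role_arn = aws_iam_role.glue_role.arn

  command {
    name            = "glueetl"
    script_location = "s3://julien-datalake-scripts/transform.py"
  }

  glue_version = "4.0"
}
```

A single `terraform apply` then provisions the buckets, catalog database, and ETL job together, which is what makes the pipeline reproducible from scratch.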
The Glue ETL job runs a PySpark script that reads the raw table from the Glue Data Catalog and writes the output back to S3 in Parquet format.
# Glue ETL: transform raw JSON to Parquet
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)

# Read the raw JSON events via the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="julien_datalake_raw",
)

# Convert the DynamicFrame to a Spark DataFrame for transformation
df = datasource.toDF()

# Write the transformed data to the processed S3 bucket as Parquet
df.write.mode("overwrite").parquet("s3://julien-datalake-processed/events/")
With the processed data registered in the catalog, Amazon Athena can query the Parquet files directly in S3 using standard SQL, with no infrastructure to manage.
-- Select the 10 most recent purchase events
SELECT
    event_id,
    event_type,
    user_id,
    from_unixtime("timestamp") AS event_time
FROM datalake_db.processed_events
WHERE event_type = 'purchase'
ORDER BY "timestamp" DESC
LIMIT 10;
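The same query can be submitted programmatically via boto3's Athena client. This is a hedged sketch: the helper function and the results bucket name are assumptions introduced for illustration, and the actual API call is left commented out since it requires AWS credentials:

```python
# Sketch: submitting a query to Athena with boto3.
def athena_request(query: str, database: str, output_s3: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

query = """
SELECT event_id, event_type, user_id
FROM processed_events
WHERE event_type = 'purchase'
LIMIT 10
"""

# "julien-datalake-results" is a hypothetical bucket for query output.
kwargs = athena_request(query, "datalake_db", "s3://julien-datalake-results/")

# With credentials configured, the call would be:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**kwargs)
```

Athena runs queries asynchronously, so a real client would poll `get_query_execution` for completion before fetching results.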
To evolve this project into a production-grade system, I am looking to implement: