What Is the Role?
We need a data engineer who can build and operate production data pipelines on AWS. You’ll work with S3, Glue, and Athena daily — ingesting data from various sources, transforming it into usable formats, and making it queryable for analytics and AI teams. This is a hands-on role where you own the data layer end-to-end.
Key Responsibilities
Pipeline Development:
- Design and build ETL/ELT pipelines using AWS Glue (ETL jobs, crawlers, Data Catalog) and S3 (see the sketch after this list)
- Ingest data from relational databases, APIs, event streams, and flat files
- Implement schema evolution, partitioning strategies, and file format optimization (Parquet, ORC, Iceberg)
- Build orchestrated workflows using Glue Workflows, Step Functions, or Airflow
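To give a concrete flavor of the day-to-day, here is a minimal sketch of a Glue PySpark job that reads a cataloged raw table, cleans it, and writes partitioned Parquet to a cleaned layer. The bucket, database, table, and column names are placeholders, not a real pipeline:

```python
# Minimal AWS Glue ETL job sketch (PySpark). All names are hypothetical.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw data registered in the Glue Data Catalog; the
# transformation_ctx enables job bookmarks so repeated runs
# skip already-processed files.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",                # hypothetical catalog database
    table_name="orders",              # hypothetical source table
    transformation_ctx="raw_orders",
)

# Light cleanup: coerce types, drop rows with null order IDs.
cleaned = raw.resolveChoice(specs=[("order_id", "cast:long")]).filter(
    lambda row: row["order_id"] is not None
)

# Write to the cleaned layer as partitioned Parquet for Athena.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": "s3://example-lake/cleaned/orders/",  # hypothetical bucket
        "partitionKeys": ["order_date"],
    },
    format="parquet",
)

job.commit()  # advances the bookmark so the next run is incremental
```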
Data Lake & Storage:
- Design and maintain S3-based data lake architecture with clear layer separation (raw, cleaned, curated)
- Optimize S3 layout for query performance and cost: partitioning, compaction, and lifecycle policies (example after this list)
- Implement data cataloging and metadata management with Glue Data Catalog
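On the cost side, a lifecycle policy like the boto3 sketch below can tier down and expire raw-layer objects. The bucket name, prefix, and retention windows are illustrative assumptions:

```python
# Sketch: applying a lifecycle policy so raw-layer objects move to
# cheaper storage and eventually expire. Names are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-lake",  # hypothetical data lake bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-tier-down",
                "Filter": {"Prefix": "raw/"},  # raw layer only
                "Status": "Enabled",
                # Infrequent Access after 30 days, Glacier after 90,
                # delete after a year.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```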
Query & Analytics:
- Optimize Athena queries for performance and cost (see the snippet after this list)
- Build views and tables that analytics and BI teams can self-serve from
- Support data modeling for analytics use cases (star schema, dimensional modeling)
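Athena cost and latency are driven largely by how much data a query scans. The snippet below runs a partition-pruned query from Python; the database, table, and results bucket are placeholders:

```python
# Sketch: running a partition-pruned Athena query. Names are hypothetical.
import boto3

athena = boto3.client("athena")

# Filtering on the partition column (order_date) limits how much data
# Athena scans, which is what you pay for.
query = """
SELECT customer_id,
       SUM(amount) AS total_amount
FROM cleaned_db.orders
WHERE order_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP BY customer_id
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "cleaned_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```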
Quality & Operations:
- Implement data quality checks and validation at each pipeline stage (illustrated below)
- Set up monitoring and alerting for pipeline failures and data anomalies (CloudWatch)
- Enforce data access controls, IAM policies, encryption, and governance
- Document data flows, schemas, and pipeline dependencies
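One simple pattern that covers both the quality and monitoring bullets: validate at each stage and publish the measurement as a custom CloudWatch metric, so an alarm can catch anomalies. The namespace, column, and threshold below are hypothetical:

```python
# Sketch: a stage-level quality check that publishes a custom CloudWatch
# metric. A CloudWatch alarm on NullRate can then page on spikes.
import boto3

cloudwatch = boto3.client("cloudwatch")

def check_null_rate(rows: list[dict], column: str, threshold: float) -> bool:
    """Return True if the null rate for `column` is within `threshold`."""
    nulls = sum(1 for row in rows if row.get(column) is None)
    null_rate = nulls / len(rows) if rows else 0.0

    # Publish the measurement whether or not it passes, so dashboards
    # show the trend and an alarm can fire on the threshold.
    cloudwatch.put_metric_data(
        Namespace="DataPipeline/Quality",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "NullRate",
                "Dimensions": [{"Name": "Column", "Value": column}],
                "Value": null_rate,
                "Unit": "None",
            }
        ],
    )
    return null_rate <= threshold
```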
Required Skills
AWS Data Services (Hands-on):
- S3 — data lake storage, lifecycle policies, access control, and layout optimization
- AWS Glue — ETL jobs (PySpark), crawlers, Data Catalog, and job bookmarks
- Athena — writing and optimizing analytical queries over S3 data
- Step Functions or Glue Workflows — pipeline orchestration
- CloudWatch — monitoring, logging, and alerting for data pipelines
- IAM / KMS — data security, encryption, and access management
Data Engineering Fundamentals:
- 2+ years building data pipelines in production
- Strong SQL skills: complex joins, window functions, CTEs, and query optimization (see the sketch after this list)
- Experience with columnar formats (Parquet, ORC) and partitioning strategies
- Understanding of data lake design patterns and layer separation (bronze/silver/gold or raw/cleaned/curated)
- Data modeling for analytics: star schema, wide tables, and dimensional modeling
- Python for ETL scripting and transformations
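To anchor expectations, the sketch below shows the kind of windowed transformation the SQL and Python bullets point at, here in PySpark: keep each customer's most recent order, the same pattern as ROW_NUMBER() OVER (PARTITION BY ...) in SQL. The path and column names are hypothetical:

```python
# Sketch: a window-function transformation in PySpark. Names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("example").getOrCreate()

orders = spark.read.parquet("s3://example-lake/cleaned/orders/")

# Rank each customer's orders by date, newest first, and keep the top row.
w = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())
latest_orders = (
    orders.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```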
General:
- Git and CI/CD for data pipeline code (GitHub Actions, CodePipeline)
- Data quality testing and validation approaches
- Clear communication — can translate business data needs into technical designs
Preferred Skills
- Experience with Apache Iceberg or Delta Lake table formats
- Streaming ingestion with Kinesis or Kafka, and CDC tools (Debezium, DMS)
- Familiarity with Redshift, EMR, or Lake Formation
- Experience supporting ML pipelines and feature stores
- Airflow for pipeline orchestration
- Scala or PySpark beyond basic Glue jobs
- Experience at a consulting or product engineering firm
Personal Qualities
- You care about data quality: bad data downstream bothers you
- You debug methodically: you can trace a pipeline failure from the alert to its root cause
- You think about cost from the start, not as an afterthought
- You document data flows and schemas without being asked
- You're comfortable working across teams (analytics, ML, product)
What We Offer
- Opportunity to work on GenAI and cloud-first projects for diverse clients
- Collaborative engineering culture with mentoring and career growth
- Competitive salary and benefits (location-adjusted)
- Flexible work arrangements