Capstone Integrated Solutions

Senior Data Engineer (AWS)

Remote | Posted 4 days ago

Capnexus is seeking a Senior AWS Data Engineer to lead data architecture, pipeline development, and data integrations on a platform that uses generative AI to automate and modernize enterprise workflows.

Location: Remote

Responsibilities

Participate in data discovery workshops to inventory source systems including property management platforms, marketing channels, and CRM data, and translate findings into data lake architecture requirements.
Design and implement a multi-zone (raw, conformed, enriched, aggregated) enterprise data lake on Amazon S3, with ingest, cleansing, and business layers that include schema versioning, checksum validation, business rule validation, and quarantine/notify workflows on failure.
Build batch and streaming data ingestion pipelines using AWS Glue, Amazon Kinesis, and containerized ingestion applications across CDP, marketing, and property management data sources.
Write PySpark and Python ETL code for AWS Glue jobs to transform, cleanse, and enrich data at scale; apply Apache Iceberg table format for ACID-compliant, schema-evolving data lake tables.
Implement data transformation and orchestration frameworks using AWS Glue ETL and AWS Step Functions; configure AWS Glue Data Catalog with crawlers for automated metadata management and discovery.
Implement AWS Lake Formation for fine-grained data governance including table-level and column-level permissions, data filters, and resource links.
Configure Amazon Athena for serverless SQL querying across the data lake with performance optimization; implement Amazon DynamoDB for sub-second customer profile lookups, with DAX where latency requirements demand it.
Develop and deploy AWS Lambda functions using AWS Lambda Powertools for structured logging, handler routing, and observability; implement error handling patterns including exponential backoff, retries, dead-letter queues, and CloudWatch alarms.
Write and maintain Terraform (or CloudFormation/CDK) modules to provision and deploy AWS data infrastructure as part of the CI/CD pipeline.
Integrate CI/CD pipelines using GitHub Actions for automated deployment of Glue jobs, Lambda functions, and Step Functions workflows with lint checks and validation gates.
Support Azure Data Lake migration: conduct discovery of ADLS assets, schemas, and transformation logic; provision AWS target environments; execute migration via AWS DataSync; perform row-count reconciliation, schema validation, and checksum comparison post-migration.
Design and implement entity resolution pipelines to identify, deduplicate, and merge customer records into unified golden records using deterministic and fuzzy matching with lineage tracking and manual review pathways.
Build and maintain data models to support Customer 360 views and executive analytics dashboards via Amazon QuickSight.
Ensure data quality, validation, and integrity across all pipeline stages; support UAT for data-dependent features.
Collaborate with Full Stack, DevOps/MLOps, and AI/ML team members working with Bedrock and SageMaker; contribute to architecture documentation, pipeline runbooks, and data governance documentation.

Requirements

5+ years of hands-on data engineering experience, including at least 2 years in AWS cloud environments.
Strong proficiency in Python and SQL, plus hands-on PySpark or Scala coding experience for AWS Glue ETL. This is a coding role, not a configuration role.
Hands-on experience with AWS Glue (jobs, crawlers, Data Catalog), AWS Step Functions, AWS Lambda, and Amazon S3 data lake architecture.
Proficiency with AWS Lambda Powertools for structured logging, handler management, and observability in production serverless workloads.
Working knowledge of Apache Iceberg table format including schema evolution, time travel, and partition management.
Hands-on experience with Terraform, AWS CloudFormation, or AWS CDK for infrastructure as code integrated into CI/CD pipelines — candidates who have only consumed pre-made DevOps templates will not meet this requirement.
Experience with AWS Lake Formation for fine-grained access control including table-level and column-level permissions, data filters, and resource links.
Solid understanding of DynamoDB data modeling and key design patterns for sub-second lookups; familiarity with DAX for caching.
Experience with Amazon Athena performance tuning: file formats, partitioning strategies, query optimization, and understanding of when Athena is and is not the right tool.
Experience with GitHub Actions or comparable CI/CD tooling for automated deployment of data pipeline code.
Strong understanding of data quality patterns: schema validation, checksum validation, business rule validation, quarantine workflows, and lineage tracking.
Strong analytical, problem-solving, and communication skills; comfortable working in Agile/Scrum teams alongside AWS Professional Services.