Principal responsibilities
Design and Develop ETL Processes:
- Lead the design and implementation of ETL processes using batch and streaming tools to extract, transform, and load data from various sources into GCP.
- Collaborate with stakeholders to gather requirements and ensure that ETL solutions meet business needs.
Data Pipeline Optimization:
- Optimize data pipelines for performance, scalability, and reliability, ensuring efficient data processing workflows.
- Monitor and troubleshoot ETL processes, proactively addressing issues and bottlenecks.
Data Integration and Management:
- Integrate data from diverse sources, including databases, APIs, and flat files, ensuring data quality and consistency.
- Manage and maintain data storage solutions in GCP (e.g., BigQuery, Cloud Storage) to support analytics and reporting; a load-job sketch follows this section.
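
A minimal illustrative sketch of the kind of load job this responsibility involves, using the google-cloud-bigquery client; the project, bucket, dataset, and table names are placeholders, not part of the role definition:

    # Load a CSV export from Cloud Storage into BigQuery (placeholder names).
    from google.cloud import bigquery

    client = bigquery.Client()

    table_id = "my-project.analytics.daily_orders"   # placeholder destination table
    uri = "gs://my-bucket/exports/daily_orders.csv"  # placeholder source file

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for completion and raise on error

    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")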
GCP Dataflow Development:
- Write Apache Beam-based Dataflow jobs for data extraction, transformation, and analysis, ensuring optimal performance and accuracy; a minimal pipeline sketch follows this section.
- Collaborate with data analysts and data scientists to prepare data for analysis and reporting.
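
A minimal Apache Beam pipeline sketch of the sort referenced above, reading CSV lines from Cloud Storage, transforming them, and writing to BigQuery. The project, bucket, table, and column names are placeholder assumptions; on GCP the job would run with the DataflowRunner:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def parse_order(line):
        # Turn one CSV line into a BigQuery row dict (assumed columns).
        order_id, amount = line.split(",")
        return {"order_id": order_id, "amount": float(amount)}


    options = PipelineOptions(
        runner="DataflowRunner",              # or "DirectRunner" for local testing
        project="my-project",                 # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/temp",  # placeholder
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/orders.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_order)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:analytics.orders",
                schema="order_id:STRING, amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )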
Automation and Monitoring:
- Implement automation for ETL workflows using tools like Apache Airflow or Cloud Composer, enhancing efficiency and reducing manual intervention; an example DAG is sketched after this section.
- Set up monitoring and alerting mechanisms to ensure the health of data pipelines and compliance with SLAs.
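
An illustrative Airflow / Cloud Composer DAG showing the kind of scheduling, retry, SLA, and failure-alert configuration this responsibility covers. The task logic, DAG id, and alert address are placeholders:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def run_daily_etl(**context):
        # Placeholder for the actual extract/transform/load logic.
        print("Running ETL for", context["ds"])


    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "email": ["data-alerts@example.com"],  # placeholder alert address
        "email_on_failure": True,
        "sla": timedelta(hours=2),             # flag runs that exceed the SLA
    }

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        etl = PythonOperator(task_id="run_daily_etl", python_callable=run_daily_etl)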
Data Governance and Security:
- Apply best practices for data governance, ensuring compliance with industry regulations (e.g., GDPR, HIPAA) and internal policies.
- Collaborate with security teams to implement data protection measures and address vulnerabilities.
Documentation and Knowledge Sharing:
- Document ETL processes, data models, and architecture to facilitate knowledge sharing and onboarding of new team members.
- Conduct training sessions and workshops to share expertise and promote best practices within the team.