Course Introduction
Renewal prep
Professional Data Engineer Renewal
Refresh the skills that expire on your Data Engineer badge: pipelines, storage decisions, governance, and operations. Use this outline to target renewal topics fast.
Pipeline refresh
Revisit batch/stream choices and managed services that minimize ops toil.
Storage choices
BigQuery, Bigtable, Spanner, and Cloud SQL: know the defaults and migration cues.
Security & compliance
IAM scopes, CMEK, VPC SC, and lineage checks to keep data safe.
Status
Renewal outline live—more practice and diagrams coming.
Professional Data Engineer Renewal
Focus on the “high-yield” renewal competencies: BigQuery optimization, pipeline design, datastore choices, governance, and ML ops.
Summary
This guide targets the renewal-style questions: pick the right managed service, explain tradeoffs (cost/latency/ops), and apply best practices for performance, reliability, and governance.
Key Concepts
BigQuery performance tuning
Partitioning vs Clustering: partition by date/timestamp for pruning; cluster by high-cardinality IDs to speed aggregations and filters.
Denormalization: prefer nested + repeated fields (STRUCT/ARRAY) to reduce expensive joins.
External tables: query data in GCS or Google Sheets in place, without loading it, when freshness matters more than query speed.
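To make the partitioning, clustering, and nested-field advice concrete, here is a minimal sketch that renders a BigQuery CREATE TABLE statement from Python. The table name, column names, and the helper partitioned_table_ddl are hypothetical and exist only for illustration; the PARTITION BY and CLUSTER BY clauses are standard BigQuery DDL.

```python
def partitioned_table_ddl(table, columns, partition_col, cluster_cols):
    """Render a BigQuery CREATE TABLE statement with partitioning and clustering."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n)\n"
        f"PARTITION BY DATE({partition_col})\n"  # enables partition pruning on date filters
        f"CLUSTER BY {', '.join(cluster_cols)}"  # co-locates rows by high-cardinality ID
    )


# Hypothetical table: names and schema are assumptions, not from the guide.
ddl = partitioned_table_ddl(
    "my_project.analytics.events",
    [
        ("event_ts", "TIMESTAMP"),
        ("customer_id", "STRING"),
        # Nested + repeated field instead of a separate line-items table.
        ("items", "ARRAY<STRUCT<sku STRING, qty INT64>>"),
    ],
    partition_col="event_ts",
    cluster_cols=["customer_id"],
)
print(ddl)
```

Queries that filter on DATE(event_ts) can then skip whole partitions, and filters on customer_id read fewer blocks within each partition.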
Streaming & batch pipelines
Exactly-once: use unique event IDs (Pub/Sub message id or app-generated id) and deduplicate at the sink or within the pipeline.
Orchestration: Composer (Airflow) for dependencies/retries across services; Workflows for lightweight API orchestration.
Scaling tip: avoid a single large gzip file as input; gzip is not splittable, so one worker has to read the whole file and Dataflow cannot parallelize the read.
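The deduplication idea behind exactly-once processing can be sketched in plain Python. This is an illustration under assumptions, not a Dataflow implementation: the in-memory set stands in for whatever keyed state or sink-side constraint a real pipeline would use, and the event_id field name is hypothetical.

```python
def deduplicate(events, seen_ids=None):
    """Yield each event once, keyed by event_id; drop redelivered duplicates."""
    seen = set() if seen_ids is None else seen_ids
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery, already processed
        seen.add(event["event_id"])
        yield event


# Hypothetical stream with one redelivered message.
incoming = [
    {"event_id": "a1", "value": 10},
    {"event_id": "b2", "value": 7},
    {"event_id": "a1", "value": 10},  # redelivery of the first event
]
unique = list(deduplicate(incoming))
total = sum(e["value"] for e in unique)
print(total)
```

In a real streaming job the set of seen IDs would need to be bounded, for example by a time window, so that state does not grow without limit.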
Datastore design
Bigtable row keys: avoid sequential hot-keys; use device_id#timestamp patterns to distribute writes.
Spanner: global horizontal scale + strong consistency for transactional systems.
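As a rough sketch of the device_id#timestamp row-key pattern, the helper below builds keys as bytes. The function name row_key, the millisecond timestamps, and the zero-padding width are assumptions chosen for illustration; padding keeps lexicographic order aligned with time order within one device.

```python
def row_key(device_id: str, event_ts_ms: int) -> bytes:
    """Build a Bigtable-style row key with the device ID first, then the timestamp."""
    # Zero-pad to 13 digits so byte-wise sorting matches numeric time order.
    return f"{device_id}#{event_ts_ms:013d}".encode("utf-8")


# Hypothetical devices writing at the same instant.
ts = 1700000000000
keys = sorted(row_key(d, ts) for d in ["sensor-42", "sensor-07", "pump-3"])
for k in keys:
    print(k)
```

Because the key starts with the device ID, simultaneous writes from many devices land in different parts of the key space instead of piling onto the most recent timestamp range.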
Security & governance
Authorized Views: share aggregated outputs without exposing raw PII tables.
Cloud DLP: inspect/redact sensitive fields in pipelines.
Joinable masking: SHA256 hashing when you need deterministic joins/counts without revealing identifiers.
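A minimal sketch of joinable masking with SHA256, using only the Python standard library. The sample records, the mask_id helper, and the trim-and-lowercase normalization step are assumptions for illustration; the point is that the same identifier always maps to the same token, so joins and distinct counts still work on masked data.

```python
import hashlib


def mask_id(identifier: str) -> str:
    """Return a deterministic SHA256 token for an identifier."""
    # Normalization is an assumption: without it, "Ana@..." and "ana@..." would not join.
    normalized = identifier.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical datasets that share an email identifier.
orders = [
    {"email": "Ana@example.com", "amount": 30},
    {"email": "ana@example.com ", "amount": 12},
]
profiles = [{"email": "ana@example.com", "tier": "gold"}]

masked_orders = [{"user": mask_id(o["email"]), "amount": o["amount"]} for o in orders]
masked_profiles = {mask_id(p["email"]): p["tier"] for p in profiles}

# Join on the masked token; no raw email is needed at this point.
joined = [(o["amount"], masked_profiles[o["user"]]) for o in masked_orders]
print(joined)
```

Plain hashes of guessable identifiers can be reversed by brute force, so whether an unkeyed hash is sufficient depends on the data and the threat model.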
Key Questions
Efficient IoT time-series without hotspots?
Use Bigtable with a row key that puts high-cardinality device_id before timestamp (e.g., device_id#timestamp).
Cheapest way to run a daily Spark job without rewrites?
Use Dataproc ephemeral clusters via Workflow Templates so you pay only for execution time.
Give analysts ML without moving data?
Use BigQuery ML (BQML) to train/predict using SQL directly in BigQuery.
Stop massive query cost spikes by users?
Set BigQuery custom quotas (daily query usage per user or per project) and a maximum bytes billed limit on queries, backed by monitoring and governance.
Fast Dataflow worker startup?
Build a custom container image with all dependencies pre-installed to avoid runtime downloads.
Vocabulary
Time Travel
Query BigQuery data as it existed at any point within the time travel window (7 days by default, configurable from 2 to 7 days).
Datastream
Serverless CDC/replication service for low-downtime migrations.
UNNEST
Flattens arrays into rows for querying nested fields.
Transfer Appliance
Ship large datasets (e.g., 50TB) to Google Cloud when bandwidth is limited.
Lifecycle Management
Automatically transition GCS objects to cheaper storage tiers by age/conditions.
Dataproc Ephemeral Clusters
Spin up a cluster just for a job (Workflow Templates), then delete it to save cost.
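For the Lifecycle Management entry, the sketch below builds a GCS lifecycle configuration as JSON. The age thresholds and the lifecycle_config helper are hypothetical; the rule/action/condition layout follows the JSON format that Cloud Storage accepts for bucket lifecycle rules.

```python
import json


def lifecycle_config(transitions, delete_after_days=None):
    """Build a GCS lifecycle config from (age_days, storage_class) pairs."""
    rules = [
        {
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            "condition": {"age": age_days},
        }
        for age_days, storage_class in transitions
    ]
    if delete_after_days is not None:
        rules.append({"action": {"type": "Delete"}, "condition": {"age": delete_after_days}})
    return {"rule": rules}


# Hypothetical policy: Nearline after 30 days, Coldline after 90, delete after a year.
config = lifecycle_config([(30, "NEARLINE"), (90, "COLDLINE")], delete_after_days=365)
print(json.dumps(config, indent=2))
```

The resulting JSON could be saved to a file and applied to a bucket with the Cloud Storage tooling of your choice.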
Architecture Decision Diagrams
Storage Decision