My Experience: Replicating On-Prem Oracle CDC to BigQuery Using Debezium and Pub/Sub

Lately, I’ve been working on modernizing how we replicate data from our on-premises Oracle database into Google BigQuery for analytics. After a lot of research and trials, I landed on a pretty solid pipeline using Debezium, Apache Kafka, Google Cloud Pub/Sub, and BigQuery.
Here’s how I got it all working—and how you can too.
We needed near real-time data in BigQuery for dashboards, but the data lived in an on-prem Oracle database. Manual extracts were too slow and inconsistent. We needed a Change Data Capture (CDC) setup that could stream data continuously.
The Stack I Used
- Debezium (Oracle Connector) – for CDC
- Apache Kafka – temporary buffer & transport
- Google Cloud Pub/Sub – cloud-native messaging
- Google BigQuery – final destination for analytics
- Dataflow (optional) – to transform/load into BigQuery
Set Up Debezium for Oracle CDC
Debezium is an open-source CDC tool that integrates well with Kafka. For Oracle, I used Debezium’s Oracle connector, which requires setting up:
- LogMiner in Oracle to track changes
- Oracle permissions for mining logs (GRANT SELECT ON V_$LOGMNR_CONTENTS, etc.)
- Kafka Connect running with the Debezium plugin
My connector configuration looked something like this (simplified):
```json
{
  "name": "oracle-connector",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.hostname": "oracle.internal.company.com",
    "database.port": "1521",
    "database.user": "debezium",
    "database.password": "*****",
    "database.dbname": "ORCL",
    "table.include.list": "HR.EMPLOYEES",
    "database.history.kafka.topic": "schema-changes.oracle",
    "database.history.kafka.bootstrap.servers": "kafka:9092"
  }
}
```
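Kafka Connect takes configs like this over its REST API (POST /connectors). Here's a minimal sketch in Python of how I registered it; the Connect hostname and port (connect:8083) are placeholders for your environment:

```python
import json
import urllib.request

# The Debezium connector config from above (password and hostnames are placeholders).
connector = {
    "name": "oracle-connector",
    "config": {
        "connector.class": "io.debezium.connector.oracle.OracleConnector",
        "database.hostname": "oracle.internal.company.com",
        "database.port": "1521",
        "database.user": "debezium",
        "database.password": "*****",
        "database.dbname": "ORCL",
        "table.include.list": "HR.EMPLOYEES",
        "database.history.kafka.topic": "schema-changes.oracle",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
    },
}

def register_request(connect_url: str, payload: dict) -> urllib.request.Request:
    """Build the POST request that registers a connector with Kafka Connect."""
    return urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually deploy it, send the request:
#   urllib.request.urlopen(register_request("http://connect:8083", connector))
```

A successful registration returns 201; a GET on /connectors/oracle-connector/status then shows whether the task is RUNNING.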
Once the connector was live, Debezium began capturing row-level changes from the Oracle tables in real time.
Stream Events to Google Pub/Sub
Since we couldn’t push data from on-prem directly into Pub/Sub, I set up a Kafka Connect sink connector to forward messages to Google Cloud Pub/Sub.
Alternatively, you can use a Kafka-Pub/Sub bridge or build a lightweight custom consumer that reads Kafka topics and publishes messages to Pub/Sub using the Pub/Sub API.
Each Oracle table change was streamed into a specific Pub/Sub topic—one topic per table for simplicity.
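The custom-consumer route boils down to a small forwarding loop plus the one-topic-per-table naming convention. Here's a self-contained sketch: the Kafka consumer and Pub/Sub publisher are injected as plain callables so it runs anywhere, and in practice you would back them with real clients (e.g. confluent-kafka and google-cloud-pubsub; those library choices are my assumption, not a requirement):

```python
from typing import Callable, Iterable, Tuple

def pubsub_topic_for(kafka_topic: str) -> str:
    """Map a Debezium topic name (server.SCHEMA.TABLE) to a Pub/Sub topic ID.

    One Pub/Sub topic per table, e.g. 'oracle.HR.EMPLOYEES' -> 'cdc-hr-employees'.
    (The 'cdc-' prefix is an illustrative convention.)
    """
    parts = kafka_topic.split(".")
    # Drop the server prefix; keep schema and table.
    return "cdc-" + "-".join(p.lower() for p in parts[1:])

def bridge(messages: Iterable[Tuple[str, bytes]],
           publish: Callable[[str, bytes], None]) -> int:
    """Forward (kafka_topic, payload) pairs to Pub/Sub; return the count forwarded."""
    forwarded = 0
    for kafka_topic, payload in messages:
        publish(pubsub_topic_for(kafka_topic), payload)
        forwarded += 1
    return forwarded
```

With google-cloud-pubsub, the `publish` callable would wrap `publisher.publish(topic_path, payload)` and you would poll `messages` from a Kafka consumer; committing Kafka offsets only after publishing succeeds gives you at-least-once delivery.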
Load into BigQuery
Now that the data was flowing into Pub/Sub, the next task was to load it into BigQuery.
I had two options:

Option 1: Use Cloud Dataflow
- Wrote a Dataflow pipeline (Apache Beam) to:
  - Subscribe to the Pub/Sub topic
  - Parse CDC events (Debezium format)
  - Insert into the appropriate BigQuery table
- This approach gave us full control over transformation and error handling.

Option 2: Use Pub/Sub to BigQuery directly
- For simpler cases, you can use a Pub/Sub subscription with a BigQuery sink
- Just make sure to flatten the schema (Debezium's JSON format can be nested)
In my case, I needed some transformation (like field renaming, audit columns), so I went with Dataflow.
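That per-record transform amounts to: unwrap the Debezium envelope, apply field renames, and stamp audit columns. Below is a stdlib-only sketch of that logic (the rename map and audit column names are illustrative, not our actual schema); inside a Beam pipeline it would sit in a `beam.Map` or `DoFn` between the Pub/Sub read and the BigQuery write:

```python
import json
from datetime import datetime, timezone

# Illustrative rename map -- not the real schema.
RENAMES = {"FIRST_NAME": "first_name", "LAST_NAME": "last_name"}

def to_bigquery_row(message: bytes) -> dict:
    """Turn a Debezium change event into a flat BigQuery row.

    Debezium wraps the row image in an envelope:
      {"payload": {"op": "c|u|d", "before": {...}, "after": {...}, "ts_ms": ...}}
    """
    payload = json.loads(message)["payload"]
    # For deletes the row image is in "before"; otherwise it's in "after".
    image = payload["before"] if payload["op"] == "d" else payload["after"]
    # Rename mapped fields; lowercase the rest.
    row = {RENAMES.get(k, k.lower()): v for k, v in image.items()}
    # Audit columns (names are illustrative).
    row["_cdc_op"] = payload["op"]
    row["_cdc_ts"] = payload.get("ts_ms")
    row["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    return row
```

Because the output is already flat, the same function also covers the schema-flattening requirement of the direct Pub/Sub-to-BigQuery option.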
Sample Data Flow
Here’s a simplified view of the data flow:
Oracle DB → Debezium → Kafka → Pub/Sub → Dataflow → BigQuery
Each component plays a key role:
- Debezium captures CDC
- Kafka buffers data
- Pub/Sub streams to the cloud
- Dataflow processes and writes to BigQuery
Results
With this setup:
- Data latency reduced to under 10 seconds
- We eliminated batch jobs and cron-based loads
- Data teams now run queries in BigQuery almost in real time
Notes
- Debezium’s Oracle connector is powerful but needs careful tuning (especially log mining configuration)
- Monitoring is key—use Kafka Connect logs and Pub/Sub dead letter topics
- Handle schema evolution carefully—Debezium provides schema history in the message envelope
Final Thoughts
This pipeline helped us bring legacy on-prem Oracle data into the cloud. If you're dealing with similar requirements and want a cost-effective, open-source, real-time CDC pipeline, the Debezium + Pub/Sub + BigQuery stack is a great choice.
