My Experience: Replicating On-Prem Oracle CDC to BigQuery Using Debezium and Pub/Sub



Description

Learn how I replicated on-prem Oracle CDC data to BigQuery using Debezium and Google Cloud Pub/Sub. A step-by-step guide based on real-world experience.


Lately, I’ve been working on modernizing how we replicate data from our on-premises Oracle database into Google BigQuery for analytics. After a lot of research and trials, I landed on a pretty solid pipeline using Debezium, Apache Kafka, Google Cloud Pub/Sub, and BigQuery.

Here’s how I got it all working—and how you can too.


The Problem

We needed near real-time data in BigQuery for dashboards, but the data lived in an on-prem Oracle database. Manual extracts were too slow and inconsistent. We needed a Change Data Capture (CDC) setup that could stream data continuously.


The Stack I Used

  • Debezium (Oracle Connector) – for CDC
  • Apache Kafka – temporary buffer & transport
  • Google Cloud Pub/Sub – cloud-native messaging
  • Google BigQuery – final destination for analytics
  • Dataflow (optional) – to transform/load into BigQuery

Set Up Debezium for Oracle CDC

Debezium is an open-source CDC tool that integrates well with Kafka. For Oracle, I used Debezium’s Oracle connector, which requires setting up:

  • LogMiner in Oracle to track changes
  • Oracle permissions for mining logs (GRANT SELECT ON V_$LOGMNR_CONTENTS, etc.)
  • Kafka Connect running with the Debezium plugin

My connector configuration looked something like this (simplified):

{
  "name": "oracle-connector",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.hostname": "oracle.internal.company.com",
    "database.port": "1521",
    "database.user": "debezium",
    "database.password": "*****",
    "database.dbname": "ORCL",
    "table.include.list": "HR.EMPLOYEES",
    "database.history.kafka.topic": "schema-changes.oracle",
    "database.history.kafka.bootstrap.servers": "kafka:9092"
  }
}

Once the connector was live, Debezium began capturing row-level changes from the Oracle tables in real time.
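For reference, each change arrives as a Debezium event envelope that looks roughly like this (simplified, with illustrative values; `op` is `c` for insert, `u` for update, `d` for delete, `r` for snapshot reads):

```json
{
  "payload": {
    "before": null,
    "after": {
      "EMPLOYEE_ID": 107,
      "FIRST_NAME": "Ada",
      "LAST_NAME": "Lovelace"
    },
    "source": {
      "connector": "oracle",
      "schema": "HR",
      "table": "EMPLOYEES"
    },
    "op": "c",
    "ts_ms": 1690000000000
  }
}
```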


Stream Events to Google Pub/Sub

Since we couldn’t push data from on-prem directly into Pub/Sub, I set up a Kafka Connect sink connector to forward messages to Google Cloud Pub/Sub.

Alternatively, you can use a Kafka-Pub/Sub bridge or build a lightweight custom consumer that reads Kafka topics and publishes messages to Pub/Sub using the Pub/Sub API.
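A minimal custom bridge along those lines can be sketched as follows. The project ID, bootstrap server, and topic names are hypothetical placeholders, and it assumes the `kafka-python` and `google-cloud-pubsub` packages:

```python
def pubsub_topic_for(kafka_topic: str) -> str:
    """Map a Debezium topic like 'server1.HR.EMPLOYEES' to a Pub/Sub
    topic name like 'cdc-hr-employees' (one topic per table)."""
    return "cdc-" + kafka_topic.split(".", 1)[-1].replace(".", "-").lower()

def run_bridge(project_id: str, bootstrap: str, topics: list) -> None:
    # Third-party imports are kept local so the helper above stays importable.
    from kafka import KafkaConsumer        # pip install kafka-python
    from google.cloud import pubsub_v1     # pip install google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    consumer = KafkaConsumer(*topics, bootstrap_servers=bootstrap)
    for record in consumer:
        topic_path = publisher.topic_path(
            project_id, pubsub_topic_for(record.topic))
        # Forward the raw Debezium JSON bytes unchanged; blocking on
        # .result() keeps per-partition ordering simple in this sketch.
        publisher.publish(topic_path, data=record.value).result()

if __name__ == "__main__":
    run_bridge("my-gcp-project", "kafka:9092", ["server1.HR.EMPLOYEES"])
```

A Kafka Connect sink keeps delivery and retries managed for you; a custom bridge like this only makes sense when you need full control over topic naming or message attributes.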

Each Oracle table change was streamed into a specific Pub/Sub topic—one topic per table for simplicity.


Load into BigQuery

Now that the data was flowing into Pub/Sub, the next task was to load it into BigQuery.

I had two options:

  1. Use Cloud Dataflow
    • Wrote a Dataflow pipeline (Apache Beam) to:
      • Subscribe to the Pub/Sub topic
      • Parse CDC events (Debezium format)
      • Insert into the appropriate BigQuery table
    • This approach gave us full control over transformation and error handling.
  2. Use Pub/Sub to BigQuery directly
    • For simpler cases, you can use a Pub/Sub BigQuery subscription that writes messages straight into a table
    • Just make sure to flatten the schema first (Debezium's JSON envelope is nested)

In my case, I needed some transformation (like field renaming, audit columns), so I went with Dataflow.
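To illustrate that transformation step, here is the kind of flattening logic a pipeline function can apply. This is a sketch; the column renames and audit column names are hypothetical examples, not our exact schema:

```python
import json
from datetime import datetime, timezone

# Hypothetical renames from Oracle column names to BigQuery column names.
RENAMES = {"EMPLOYEE_ID": "employee_id",
           "FIRST_NAME": "first_name",
           "LAST_NAME": "last_name"}

def to_bq_row(event_bytes: bytes) -> dict:
    """Flatten a Debezium change event into a BigQuery-ready row,
    renaming fields and adding audit columns."""
    payload = json.loads(event_bytes)["payload"]
    # For deletes, "after" is null and the old row is in "before".
    source_row = payload["after"] or payload["before"] or {}
    row = {RENAMES.get(k, k.lower()): v for k, v in source_row.items()}
    # Audit columns: the CDC operation and when the change was captured.
    row["_cdc_op"] = payload["op"]  # c / u / d / r
    row["_cdc_ts"] = datetime.fromtimestamp(
        payload["ts_ms"] / 1000, tz=timezone.utc).isoformat()
    return row
```

In a Beam pipeline, a function like this would sit in a `beam.Map` step between the Pub/Sub read and the BigQuery write.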


Sample Data Flow

Here’s a simplified view of the data flow:

Oracle DB → Debezium → Kafka → Pub/Sub → Dataflow → BigQuery

Each component plays a key role:

  • Debezium captures CDC
  • Kafka buffers data
  • Pub/Sub streams to the cloud
  • Dataflow processes and writes to BigQuery

Results

With this setup:

  • Data latency dropped to under 10 seconds
  • We eliminated batch jobs and cron-based loads
  • Data teams now run queries in BigQuery almost in real time

Notes

  • Debezium’s Oracle connector is powerful but needs careful tuning (especially log mining configuration)
  • Monitoring is key—use Kafka Connect logs and Pub/Sub dead letter topics
  • Handle schema evolution carefully—Debezium provides schema history in the message envelope

Final Thoughts

This pipeline let us bring legacy on-prem Oracle data into the cloud. If you’re dealing with similar requirements and want a cost-effective, open-source, real-time CDC pipeline, the Debezium + Pub/Sub + BigQuery stack is a great choice.



