How Oracle JDBC driver handles LOBs: counting the network roundtrips

This blog post is inspired by the misbehaving Kafka sink. When we traced the queries it sends to the database, the results were spectacular. There has to be some serious engineering going on:

Instead of looking at the source of said thing, I created a test case so it is possible to experiment with things in a controlled manner. TL;DR: root cause is described in Oracle documentation, of course, but it takes some fine print reading to figure out what is going wrong here.

Test data

INMTEST@localhost/orclpdb1 22-08-2023 17:03:47> CREATE SEQUENCE seq_json_test START WITH 80000;

Sequence created.

Elapsed: 00:00:00.01
INMTEST@localhost/orclpdb1 22-08-2023 17:03:54> CREATE TABLE json_test (
    id      INTEGER PRIMARY KEY,
    payload CLOB,
    CONSTRAINT ck_payload_json CHECK ( payload IS JSON )
);

Table created.

Elapsed: 00:00:00.02
INMTEST@localhost/orclpdb1 22-08-2023 17:04:01> INSERT INTO json_test
    SELECT
        ROWNUM,
        JSON_OBJECT(* RETURNING CLOB)
    FROM
        all_objects;

66193 rows created.

Elapsed: 00:00:06.30
INMTEST@localhost/orclpdb1 22-08-2023 17:04:16> COMMIT;

Commit complete.

Elapsed: 00:00:00.02

I’m using CLOB for JSON data because character data is easy to spot in the tcpdump output. It’s better to stick to BLOB or JSON data types for all other purposes.

Java code is at the end of this post. Roughly speaking, there are two ways to interact with LOBs, either through streams or reader/writer objects, or as a Clob/Blob. The main difference is that streaming allows to access LOBs without knowing their size or buffering them in the memory. But as it comes out later in the post, specifying the size of the LOB when writing is essential for the performance.

Row prefetch size, and therefore write batch size is set to default (10).

LOB prefetching

Starting with Oracle 11.1, the JDBC driver fetches LOBs with the rest of the row. The parameter oracle.jdbc.defaultLobPrefetchSize controls how many bytes are fetched. The default is 32k, and this is per single LOB. Size -1 turns off LOB metadata and data prefetching, and 0 enables metadata prefetching and turns off data prefetching. Default LOB prefetch size can be set globally, and it is possible to override it for the Statement.

If we set defaultLobPrefetchSize to -1, both metadata operations and reading Clob data create network roundtrip to the server:

Metadata isn’t cached either; each call to Clob.length() triggers a network roundtrip. Another interesting thing here is wait event SQL*Net vector data to client. According to MOS document 2133462.1, Oracle 11.2 introduced “vectored IO” for bulk data transfers, which ignores SDU.

When defaultLobPrefetchSize is smaller than LOB, the rest of the LOB is fetched from the DB when accessed. This means a network roundtrip for each remaining LOB piece.

Writing LOBs

Good news is that JDBC batching works. Avoiding excessive network roundtrips is much more trickier.

Naive case

As a simplest case, I can write back the same LOB I read from the database:

dummy = rset.getClob(2);
writeSt.setObject(1, dummy, OracleType.CLOB);
writeSt.addBatch();

As a result Oracle JDBC driver sends instead of the data LOB handles back to the database:

Trouble with the LOB size

When LOB size is known, Oracle uses either direct binding or stream binding. From a performance perspective, both are fine: statement batching works, and data goes to the database without extra network roundtrips. The documentation warns about possibly broken statement batching and additional network roundtrips, but I haven’t noticed any.

There are overloads for set*Stream() functions that omit the length. These functions use LOB binding no matter what the LOB size is. It might be tempting for the developer to ignore how many bytes are in the stream, but the result is that it takes a minimum of 3 network roundtrips to write one LOB. Oh, and it breaks statement batching: in my test code, Statement.addBatch() call triggers the network roundtrips.

LOBWRITE seems to send a maximum of 32k of data in one roundtrip, so for the large objects, network chatter will proliferate. Extra network roundtrips are not the case with the stream binding.

Note the second LOBWRITE operation in the end:

WAIT #0: nam='asynch descriptor resize' ela= 22 outstanding #aio=0 current aio limit=0 new aio limit=128 obj#=-1 tim=14811764719105
WAIT #0: nam='Disk file operations I/O' ela= 201 FileOperation=2 fileno=203 filetype=6 obj#=-1 tim=14811764719278
LOBTMPCREATE: type=TEMPORARY LOB,bytes=1,c=852,e=1072,p=0,cr=0,cu=2,tim=14811764719541
WAIT #0: nam='SQL*Net message to client' ela= 3 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764719610
WAIT #0: nam='SQL*Net message from client' ela= 872 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764720509
LOBPGSIZE: type=TEMPORARY LOB,bytes=8132,c=28,e=28,p=0,cr=0,cu=0,tim=14811764720591
WAIT #0: nam='SQL*Net message to client' ela= 3 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764720638
WAIT #0: nam='SQL*Net message from client' ela= 13575 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764734232
WAIT #0: nam='SQL*Net more data from client' ela= 58 driver id=1413697536 #bytes=8132 p3=0 obj#=-1 tim=14811764734999
WAIT #0: nam='SQL*Net more data from client' ela= 46 driver id=1413697536 #bytes=8082 p3=0 obj#=-1 tim=14811764735294
WAIT #0: nam='SQL*Net more data from client' ela= 9 driver id=1413697536 #bytes=8032 p3=0 obj#=-1 tim=14811764735386
WAIT #0: nam='SQL*Net more data from client' ela= 38 driver id=1413697536 #bytes=7982 p3=0 obj#=-1 tim=14811764735491
WAIT #0: nam='SQL*Net more data from client' ela= 10 driver id=1413697536 #bytes=39 p3=0 obj#=-1 tim=14811764735619
WAIT #0: nam='SQL*Net more data from client' ela= 4 driver id=1413697536 #bytes=7893 p3=0 obj#=-1 tim=14811764735641
WAIT #0: nam='SQL*Net more data from client' ela= 5 driver id=1413697536 #bytes=7843 p3=0 obj#=-1 tim=14811764735700
WAIT #0: nam='SQL*Net more data from client' ela= 8 driver id=1413697536 #bytes=7793 p3=0 obj#=-1 tim=14811764735809
WAIT #0: nam='SQL*Net more data from client' ela= 24 driver id=1413697536 #bytes=7743 p3=0 obj#=-1 tim=14811764735929
LOBWRITE: type=TEMPORARY LOB,bytes=32528,c=1891,e=1836,p=0,cr=0,cu=28,tim=14811764736184
WAIT #0: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764736275
WAIT #0: nam='SQL*Net message from client' ela= 10329 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764746630
WAIT #0: nam='SQL*Net more data from client' ela= 96 driver id=1413697536 #bytes=8132 p3=0 obj#=-1 tim=14811764747109
WAIT #0: nam='SQL*Net more data from client' ela= 13 driver id=1413697536 #bytes=8082 p3=0 obj#=-1 tim=14811764747304
WAIT #0: nam='SQL*Net more data from client' ela= 9 driver id=1413697536 #bytes=8032 p3=0 obj#=-1 tim=14811764747398
WAIT #0: nam='SQL*Net more data from client' ela= 9 driver id=1413697536 #bytes=7982 p3=0 obj#=-1 tim=14811764747484
WAIT #0: nam='SQL*Net more data from client' ela= 11 driver id=1413697536 #bytes=39 p3=0 obj#=-1 tim=14811764747617
WAIT #0: nam='SQL*Net more data from client' ela= 4 driver id=1413697536 #bytes=7893 p3=0 obj#=-1 tim=14811764747639
WAIT #0: nam='SQL*Net more data from client' ela= 16 driver id=1413697536 #bytes=7843 p3=0 obj#=-1 tim=14811764748351
WAIT #0: nam='SQL*Net more data from client' ela= 8 driver id=1413697536 #bytes=6873 p3=0 obj#=-1 tim=14811764748486
LOBWRITE: type=TEMPORARY LOB,bytes=28002,c=1512,e=1777,p=0,cr=1,cu=42,tim=14811764748594
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764748657
WAIT #0: nam='SQL*Net message from client' ela= 2574 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=14811764751255
LOBTMPCREATE: type=TEMPORARY LOB,bytes=1,c=22,e=22,p=0,cr=0,cu=0,tim=14811764751342

Test code

It contains some Spring magic, connection to the database is handled by Spring. I modified fetch size and LOB prefetch size through the connect string, which looks something like this:

jdbc:oracle:thin:@my-dev-env:1521/orclpdb1?defaultRowPrefetch=10&oracle.jdbc.defaultLobPrefetchSize=1000

@Component
@Profile({ProfileNames.LOB2})
public class Lob2 {

    @Autowired
    DataSource ds;

    @Value("${db.checker.lobs.query}")
    String query;

    @Value("${db.checker.lobs.set_size}")
    boolean setSize;

    @Value("${db.checker.lobs.naive_write}")
    boolean beNaive;

    static final String TRACE_ENABLE = "call dbms_monitor.session_trace_enable()";
    static final String TRACE_DISABLE = "call dbms_monitor.session_trace_disable()";
    static final String TRACE_PAYLOAD = "select payload from V$DIAG_SESS_SQL_TRACE_RECORDS";
    static final String WRITEBACK = "insert into json_test(id, payload)"
                                       + " values(seq_json_test.nextval, :1)";
    static final long MAX_LENGTH = 60000;

    void callStatement(Connection c, String query) throws SQLException {
      Statement stmt = c.createStatement();
      stmt.execute(query);
      stmt.close();
    }
    void printTrace(Connection c) throws SQLException {
      String dummy;
      Statement stmt = c.createStatement();
      ResultSet rset = stmt.executeQuery(TRACE_PAYLOAD);
      while (rset.next()) {
        dummy = rset.getString(1);
        System.out.println(dummy);
      }
      stmt.close();
    }

    String embiggen(String str) {
      StringBuilder sb = new StringBuilder("[");
      while (sb.length() < MAX_LENGTH) {
        sb.append(str + ",");
      }
      sb.append("{}]");
      return sb.toString();
    }
    public void doStuff() {
      Clob dummy;
      long len;
      try {
        Connection conn = ds.getConnection();
        conn.setAutoCommit(false);
        callStatement(conn, TRACE_ENABLE);

        PreparedStatement readSt = conn.prepareStatement(query);
        PreparedStatement writeSt = conn.prepareStatement(WRITEBACK);

        //Won't be visible in V$DIAG_SESS_SQL_TRACE_RECORDS
        callStatement(conn, "call sys.dbms_system.ksdwrt(1, 'Before executeQuery')");

        ResultSet rset = readSt.executeQuery();
        callStatement(conn, "call sys.dbms_system.ksdwrt(1, 'Before ResultSet processing')");

        while (rset.next()) {
          dummy = rset.getClob(2);
          len = dummy.length(); // comes from the metadata
          if (beNaive) {
            writeSt.setObject(1, dummy, OracleType.CLOB);
          } else {
            String s = embiggen(dummy.getSubString(1, (int) len));
            if (setSize) {
              writeSt.setCharacterStream(1, new StringReader(s), s.length());
            } else {
              writeSt.setCharacterStream(1, new StringReader(s));
            }
          }
          writeSt.addBatch();
        }
        callStatement(conn, "call sys.dbms_system.ksdwrt(1, 'ResultSet done')");
        writeSt.executeBatch();
        writeSt.close();
        readSt.close();
        conn.commit();
        callStatement(conn, TRACE_DISABLE);
        printTrace(conn);
        conn.close();
      } catch (SQLException e) {
        throw new RuntimeException(e);
      }
    }
  }
}

Oracle In-Memory for virtual columns

If you want to materialize the results of a computation in Oracle, function-based indexes and materialized views are the standard answer. However, in-Memory provides the third option: in-memory virtual columns and expressions. While virtual expressions somewhat supersede in-memory virtual columns, they provide a relatively simple introduction to how virtual expression infrastructure works.

As per documentation, every IMCU has an associated In-memory Expression Unit (IMEU). IMEUs are populated with IMCUs; for a virtual column or expression to be loaded to in-memory, the table must be (re)populated.

Setup

Let’s have a look at how this thing works. First, some test data:

INMTEST@localhost/orclpdb1> CREATE TABLE transactions (
  2      id                 INTEGER PRIMARY KEY,
  3      customer_id        INTEGER NOT NULL,
  4      account_id         INTEGER NOT NULL,
  5      transaction_type   NUMBER(2) NOT NULL,
  6      external_reference VARCHAR2(256),
  7      amount             NUMBER(36, 2) NOT NULL,
  8      created_ts         TIMESTAMP NOT NULL,
  9      system_ts          TIMESTAMP
 10  )
 11      PARTITION BY RANGE (
 12          created_ts
 13      ) INTERVAL ( numtodsinterval(1, 'DAY') ) ( PARTITION p_first
 14          VALUES LESS THAN ( TIMESTAMP '2020-03-19 00:00:00.00' )
 15      );

Table created.

Elapsed: 00:00:00.03
INMTEST@localhost/orclpdb1> INSERT /*+ parallel enable_parallel_dml APPEND */ INTO transactions
  2      WITH gen AS (
  3          SELECT
  4              level id
  5          FROM
  6              dual
  7          CONNECT BY
  8              level < 5000
  9      )
 10      SELECT
 11          ROWNUM,
 12          mod(1 + g1.id * g2.id, 36000),
 13          mod(1 + g1.id * g2.id, 36000),
 14          mod(g1.id, 8),
 15          sys_guid(),
 16          mod(1 + g1.id * g2.id, 36000) + round(g1.id /(1 + g2.id), 4),
 17          systimestamp AT TIME ZONE 'CET' - dbms_random.value(0, 366),
 18          systimestamp
 19      FROM
 20               gen g1
 21          CROSS JOIN gen g2;

24990001 rows created.

Elapsed: 00:11:27.71
INMTEST@localhost/orclpdb1 03-04-2023 09:50:46> commit;

Commit complete.

Elapsed: 00:00:00.38

Virtual column disabled for in-memory

To test if in-memory virtual columns have any effect, I’m using a relatively slow, deterministic function, and I will create a virtual column based on that function. My expectation is that once the virtual column is loaded to the in-memory, queries will become faster since function calls are avoided. The second thing that should happen is once the virtual column is enabled for in-memory, repopulating the in-memory area will take longer.

INMTEST@localhost/orclpdb1 03-04-2023 09:39:18> CREATE OR REPLACE FUNCTION slow_round (
  2              n NUMBER, m NUMBER
  3  ) RETURN NUMBER
  4      AUTHID DEFINER
  5      DETERMINISTIC
  6      PARALLEL_ENABLE
  7  IS
  8      PRAGMA udf;
  9      l_dummy PLS_INTEGER;
 10  BEGIN
 11      for i in 0 .. 400 loop
 12          l_dummy := mod(i, 255);
 13      end loop;
 14      RETURN round(n, m);
 15  END slow_round;
 16  /

Function created.

Elapsed: 00:00:00.01
INMTEST@localhost/orclpdb1 03-04-2023 09:50:46> alter table transactions add amount_2      AS (slow_round(amount, 2));

Table altered.

Elapsed: 00:00:00.02
INMTEST@localhost/orclpdb1 03-04-2023 09:50:46> alter table transactions inmemory;

Table altered.

Elapsed: 00:00:00.22
INMTEST@localhost/orclpdb1 03-04-2023 09:50:46> alter table transactions no inmemory (amount_2);

Table altered.

Elapsed: 00:00:00.01

My test environment has only on table marked for in-memory, so I can just reload all the segments:

INMTEST@localhost/orclpdb1 03-04-2023 09:50:47> declare
  2      l_out NUMBER;
  3  begin
  4      l_out := DBMS_INMEMORY_ADMIN.POPULATE_WAIT('NONE', force => true);
  5      dbms_output.put_line('populate_wait -> ' || l_out);
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:56.53

A little check if everything was populated:

INMTEST@localhost/orclpdb1 03-04-2023 09:51:43> SELECT
  2      populate_status,
  3      COUNT(*)
  4  FROM
  5      gv$im_user_segments
  6  WHERE
  7      segment_name = 'TRANSACTIONS'
  8  GROUP BY
  9      populate_status;

POPULATE_STAT   COUNT(*)
------------- ----------
COMPLETED            367

Elapsed: 00:00:00.01
INMTEST@localhost/orclpdb1 03-04-2023 09:51:43>
INMTEST@localhost/orclpdb1 03-04-2023 09:51:43> SELECT
  2      SUM(inmemory_size),
  3      SUM(bytes),
  4      SUM(bytes_not_populated)
  5  FROM
  6      gv$im_user_segments
  7  WHERE
  8      segment_name = 'TRANSACTIONS'
  9  ;

SUM(INMEMORY_SIZE) SUM(BYTES) SUM(BYTES_NOT_POPULATED)
------------------ ---------- ------------------------
        2783707136 3078619136                        0

Elapsed: 00:00:00.01
INMTEST@localhost/orclpdb1 03-04-2023 09:51:43> SELECT
  2      column_name,
  3      obj_num,
  4      segment_column_id,
  5      inmemory_compression
  6  FROM
  7      v$im_column_level
  8  WHERE
  9      table_name = 'TRANSACTIONS';

COLUMN_NAME                                                                                                                          OBJ_NUM SEGMENT_COLUMN_ID INMEMORY_COMPRESSION
--------------------------------------------------------------------------------------------------------------------------------- ---------- ----------------- --------------------------
AMOUNT_2                                                                                                                              100347                0 NO INMEMORY
<snip>

Query to check how slow is our slow rounding function:

INMTEST@localhost/orclpdb1 03-04-2023 09:51:43> set autotrace on
INMTEST@localhost/orclpdb1 03-04-2023 09:51:43> SELECT
  2      SUM(amount_2) amount
  3  FROM
  4      transactions
  5  WHERE
  6      trunc(created_ts) = trunc(systimestamp - 128)
  7  ;

    AMOUNT
----------
1219916975

Elapsed: 00:00:19.07

Execution Plan
----------------------------------------------------------
Plan hash value: 4195887397

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     1 |    24 |  7331  (58)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE              |              |     1 |    24 |            |          |       |       |
|   2 |   PARTITION RANGE ALL        |              |   249K|  5857K|  7331  (58)| 00:00:01 |     1 |1048575|
|*  3 |    TABLE ACCESS INMEMORY FULL| TRANSACTIONS |   249K|  5857K|  7331  (58)| 00:00:01 |     1 |1048575|
-------------------------------------------------------------------------------------------------------------

Virtual column enabled for in-memory

Next I will add column AMOUNT_2 to the in-memory and repopulate the segments.

INMTEST@localhost/orclpdb1 03-04-2023 09:52:21> alter table transactions inmemory (amount_2);

Table altered.

Elapsed: 00:00:00.01
INMTEST@localhost/orclpdb1 03-04-2023 09:52:21>
INMTEST@localhost/orclpdb1 03-04-2023 09:52:21> set serverout off
INMTEST@localhost/orclpdb1 03-04-2023 09:52:21> declare
  2      l_out NUMBER;
  3  begin
  4      l_out := DBMS_INMEMORY_ADMIN.POPULATE_WAIT('NONE', force => true);
  5      dbms_output.put_line('populate_wait -> ' || l_out);
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:39:09.75
INMTEST@localhost/orclpdb1 03-04-2023 10:31:31> SELECT
  2      populate_status,
  3      COUNT(*)
  4  FROM
  5      gv$im_user_segments
  6  WHERE
  7      segment_name = 'TRANSACTIONS'
  8  GROUP BY
  9      populate_status;

POPULATE_STAT   COUNT(*)
------------- ----------
COMPLETED            365
STARTED                2

Elapsed: 00:00:00.01

Reloading the segments took whopping 39 minutes, and somehow POPULATE_WAIT gave up before all segments got repopulated. But for our test case this should be ok:

INMTEST@localhost/orclpdb1 03-04-2023 10:31:31> SELECT
  2      SUM(amount_2) amount
  3  FROM
  4      transactions
  5  WHERE
  6      trunc(created_ts) = trunc(systimestamp - 128)
  7  ;

    AMOUNT
----------
1219916975

Elapsed: 00:00:06.85

Execution Plan
----------------------------------------------------------
Plan hash value: 4195887397

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     1 |    24 |  7331  (58)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE              |              |     1 |    24 |            |          |       |       |
|   2 |   PARTITION RANGE ALL        |              |   249K|  5857K|  7331  (58)| 00:00:01 |     1 |1048575|
|*  3 |    TABLE ACCESS INMEMORY FULL| TRANSACTIONS |   249K|  5857K|  7331  (58)| 00:00:01 |     1 |1048575|
-------------------------------------------------------------------------------------------------------------

Amazingly it works; with the in-memory virtual column, our query took around 7 seconds vs 19 seconds without.

Package DBMS_JSON made useful

As of Oracle 21c, there are two ways to map JSON data into SQL data types and structures:

Virtual columns through JSON_VALUE function.
Views or materialized views with function JSON_TABLE.

You can do all the typing by yourself or use DBMS_JSON.ADD_VIRTUAL_COLUMNS and DBMS_JSON.CREATE_VIEW procedures to make the computer do it for you. Unfortunately, the DBMS_JSON package uses JSON datatypes, which is suboptimal with SQL. But of course, there’s a way around it.

SETUP

First some table to keep our JSON data

CREATE TABLE transactions (
    payload json NOT NULL,
        id AS ( JSON_VALUE(payload, '$.ID' RETURNING NUMBER(38) ERROR ON ERROR ERROR ON MISMATCH) )
            CONSTRAINT transaction_pk PRIMARY KEY,
        transaction_ts AS ( JSON_VALUE(payload, '$.TRANSACTION_TS.timestamp()' ERROR ON ERROR ERROR ON MISMATCH) )
)
    PARTITION BY RANGE (
        transaction_ts
    ) INTERVAL ( numtodsinterval(1, 'DAY') ) ( PARTITION p_first
        VALUES LESS THAN ( TIMESTAMP '2021-02-26 00:00:00.00' )
    );

I’m using JSON_VALUE to create virtual columns for the primary key and partitioning. I know id’s are integers and transaction_ts are timestamps so that I can add proper data types in JSON_VALUE.

And some nonsense data:

INSERT INTO transactions ( payload )
    WITH gen AS (
        SELECT
            ROWNUM id
        FROM
            dual
        CONNECT BY
            level < 2000
    )
    SELECT
        JSON_OBJECT(*)
    FROM
        (
            SELECT
                ROWNUM                                   id,
                systimestamp - dbms_random.value() * 355 transaction_ts,
                sys_guid()                               external_reference,
                mod(g2.id, 1023)                         customer_id,
                mod(g1.id, 1023)                         account_id,
                mod(g2.id, 121) + round(dbms_random.value(), 2) amount,
                (
                    SELECT
                        JSON_ARRAYAGG(obj)
                    FROM
                        (
                            SELECT
                                    JSON_OBJECT(
                                        'id' VALUE ROWNUM,
                                        'customer_acc' VALUE mod(g1.id, 1023),
                                        'system_acc' VALUE 0,
                                        'amount' VALUE mod(g2.id, 121)
                                    ) obj
                            FROM
                                dual
                            CONNECT BY level < 1 + mod(g1.id, 5)
                        )
                )                                        transaction_legs
            FROM
                     gen g1 CROSS JOIN gen g2
        );

Search index just to create a data guide:

CREATE SEARCH INDEX ix_trans_payload_search ON transactions ( payload ) FOR JSON;

Rest of this blog post will focus on the three properties of the JSON document: ID, AMOUNT and TRANSACTION_TS, this is the relevant part of the hierarchical data guide:

{
    "o:path" : "$.ID",
    "type" : "number",
    "o:length" : 8,
    "o:preferred_column_name" : "PAYLOAD$ID"
  },
  {
    "o:path" : "$.AMOUNT",
    "type" : "number",
    "o:length" : 8,
    "o:preferred_column_name" : "PAYLOAD$AMOUNT"
  },
{
    "o:path" : "$.TRANSACTION_TS",
    "type" : "string",
    "o:length" : 32,
    "o:preferred_column_name" : "PAYLOAD$TRANSACTION_TS"
  },

JSON datatypes for ID and AMOUNT are float, even if they know that ID is an integer and AMOUNT has two digits from the decimal point. On the other hand, TRANSACTION_TS is a string since JSON does not have datetime data types.

DBMS_JSON.ADD_VIRTUAL COLUMNS and CREATE_VIEW procedures

The curious thing about these procedures is that you can’t specify your own data types or conversion methods. They take the data type from the data guide. Let’s create another virtual column for the TS_TRANSACTION property.

DECLARE
    l_dataguide CLOB := dbms_json.get_index_dataguide('TRANSACTIONS', 'PAYLOAD', dbms_json.format_hierarchical);
BEGIN
    dbms_json.add_virtual_columns('TRANSACTIONS', 'PAYLOAD', l_dataguide);
END;

The result is underwhelming. Our timestamp has VARCHAR2 datatype, integers accept floating point numbers, and because of NULL ON ERROR clause, these virtual columns do not enforce the schema.

Introducing the proper data types

JSON data guide is a JSON document, and editing it before passing it to the DBMS_JSON procedures is relatively easy. And as it appears, I can use data types returned by the type() SQL/JSON path expression item method (what a mouthful!). So let’s try it.

DECLARE
    l_dataguide CLOB;
BEGIN
    SELECT
        JSON_TRANSFORM(dbms_json.get_index_dataguide('TRANSACTIONS', 'PAYLOAD', dbms_json.format_hierarchical),
                       SET '$.properties.TRANSACTION_TS.type' = 'timestamp')
    INTO l_dataguide
    FROM
        dual;
    dbms_json.add_virtual_columns('TRANSACTIONS', 'PAYLOAD', l_dataguide);
END;

Magically our PAYLOAD$TRANSACTION_TS has become a TIMESTAMP:

What does not work is providing precision for the data types. For example, adding precision to the TIMESTAMP column will give ORA-40594: invalid data guide: TRANSACTION_TS has an invalid type timestamp(3). Unfortunately, I couldn’t find a way to change NULL ON ERROR clause, so you can’t use those virtual columns to enforce constraints or to partition the table.