Reading Oracle hanganalyze dumps for fun and profit

On Monday afternoon, out of the blue, people started complaining about the database being slow and Important Things being delayed. The database in question is a fairly busy RAC cluster with a fair amount of data and instances for processing. Indeed, on checking there were massive waits on the SS enqueue. While the queries involved were either heavy on sorting or abusing cursor-duration temporary tables, there was no obvious reason why they would wait on the sort segment enqueue.

So let’s try ASH wait chains. Some of the chains in the output stuck out. (The following output was generated later from DBA_HIST_ACTIVE_SESS_HISTORY; at the time I didn’t think the whole incident merited a blog post.)
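
(The proper tool for this is Tanel Poder’s ash_wait_chains.sql; the rough idea, as a sketch, is a hierarchical walk over blocking_session within each ASH sample:)

select level as depth,
       sys_connect_by_path(program || ' ' || event, ' -> ') as chain
from dba_hist_active_sess_history
where sample_time > systimestamp - interval '1' hour
connect by nocycle prior blocking_session = session_id
       and prior blocking_inst_id = instance_number
       and prior sample_id = sample_id;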

 -> DW_USER_REPORT:(oracle) enq: SS - contention [mode=6]  -> [idle blocker 3,295,65220]
 -> DW_USER_REPORT:(Jnnn) enq: SS - contention [mode=6]  -> DW_USER_REPORT:(oracle) DFS lock handle
 -> DW_USER_REPORT:(oracle) enq: SS - contention [mode=6]  -> [idle blocker 10,486,63050] 
 -> DW_USER_REPORT:(oracle) enq: SS - contention [mode=6]  -> DW_USER_REPORT:(Pnnn) DFS lock handle

First, there are sessions blocked by an idle session. And the second weird thing: there are cross-instance calls (DFS lock handle) going nowhere, when there should be a background process handling the call listed as the blocker. So the immediate suspicion is that the final blocker is idle and not present in the ASH snapshot(s).

One way to check idle blockers is GV$SESSION, but fortunately there’s a more comprehensive way with less typing involved: the hanganalyze dump and its little cousin, V$WAIT_CHAINS.
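
For completeness, the GV$SESSION route looks roughly like this (a minimal sketch; GV$SESSION exposes the final blocker columns):

select w.inst_id, w.sid, w.event, w.seconds_in_wait,
       w.final_blocking_instance, w.final_blocking_session,
       b.status as blocker_status, b.event as blocker_event
from gv$session w
join gv$session b
  on b.inst_id = w.final_blocking_instance
 and b.sid = w.final_blocking_session
where w.final_blocking_session_status = 'VALID';

An idle final blocker shows up with blocker_status 'INACTIVE' and an idle wait event.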

V$WAIT_CHAINS

It is populated by DIA0, every 3 seconds for local chains and every 10 seconds for global chains (check MOS document 1428210.1 for details). As it is a local view, it works even when the global (GV$) views are not responsive. In our case it yielded a slightly longer chain:

'rdbms ipc message'<='DFS lock handle'<='enq: SS - contention'<='PX Deq Credit: send blkd'
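
(The chain comes straight out of V$WAIT_CHAINS; a minimal sketch of the query:)

select chain_id, instance, sid, sess_serial#,
       blocker_is_valid, blocker_instance, blocker_sid,
       wait_event_text, in_wait_secs, num_waiters
from v$wait_chains
order by chain_id;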

Here we have the final blocker, which happens to be idle. Unfortunately, V$WAIT_CHAINS contains only a small subset of what’s present in a hanganalyze dump, so let’s do some dumping.

Hanganalyze dumps

By default, hanganalyze seems to include only the local instance, which isn’t very useful in RAC. So the trick is to get the chains from all instances:

SYS @ dwh21:>oradebug setmypid
Statement processed.
SYS @ dwh21:>oradebug setinst all
Statement processed.
SYS @ dwh21:>oradebug -g all hanganalyze 3
Hang Analysis in /u01/app/oracle/diag/rdbms/dwh2/dwh21/trace/dwh21_diag_1877.trc

A heavily edited snippet from the hang analysis dump is at the end of this post.

While Oracle thinks the naughty chains could be 1 to 3, the interesting chain happens to be the 5th. Reading from right to left:

  • The first and second sessions are from instance #1, executing the same parallel query. The first is the consumer waiting for rows from the second.
  • The second session is waiting on the SS (Sort Segment) enqueue, which is held by a session on instance #3. That session is executing an INSERT … SELECT statement, but somehow ended up doing a cross-instance call (DFS lock handle). According to the dump, the third session is blocking 37 other sessions; I would expect intersecting wait chains with sessions waiting on the CI enqueue, but this is not the case.

    Another fun fact: the stack dump of the second session seems to point to a hash join trying to spill something to disk (ktstgrowseg()+4094 <-ktstadr_addextent()+853 <-ktrsexec()+454 <-ktataddextent()+124 <-kxhfmGetExtent()+87 <-kxhfWrite()+755 <-kxhfNewBuffer()+756 <-qerhnGetNewBuffer()+70<-ksxb1bqb()+170)
  • The fourth and final session is DBW0 on instance #2. This is an idle session waiting for work (rdbms ipc message).

Why an idle database writer would block something is a good question indeed. In our case the blocker moved around between the instances, so it wasn’t one wedged instance screwing up the whole cluster. There is a similar bug for LGWR (MOS 2699312.1), but nothing about DBW0. Another mystery is why the database writer is involved in sort segment management in the first place.
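
While the incident is ongoing, the enqueue holders can also be spotted directly (a minimal sketch; sessions with LMODE > 0 hold the SS enqueue, those with REQUEST > 0 are waiting for it):

select inst_id, sid, type, id1, id2, lmode, request, ctime
from gv$lock
where type = 'SS'
order by lmode desc, ctime desc;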

Back to our original problem

Sort segments are maintained on a per-instance basis in RAC, and space allocated to one instance isn’t visible to the others. This means ORA-01652 errors are handled differently in RAC: instead of returning the error to the caller, a cross-instance call is initiated first, to see if any other instance can free allocated but unused temp space. (GV$TEMP_EXTENT_POOL gives a per-instance overview of what is allocated and what is used.)
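
Something along these lines gives the lay of the land (a sketch):

select inst_id, tablespace_name, file_id,
       extents_cached, extents_used,
       round(bytes_cached/1024/1024) as mb_cached,
       round(bytes_used/1024/1024) as mb_used
from gv$temp_extent_pool
order by inst_id, file_id;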

In our case, a subset of the instances exhausted the available temp space and somehow the cross-instance calls got stuck. As a result, queries using temp segments became really slow.

Snippet from hanganalyze dump

*** 2020-11-02T21:05:01.992759+01:00
PORADEBUG BEGIN ORIGINATING INST:7 SERIAL:0 PID:29440
PORADATA COMMAND:hanganalyze 3
2020-11-02 21:05:02.000 :kjzdregsw(): Session slot[0] (kjzddbx) switches to a new action, cbctx:0x7f33e41db160 clnup:0xb02c200 flag:3 sessid:0
PORADATA TRACEFILE /u01/app/oracle/diag/rdbms/<redacted>
PORADEBUG END ORIGINATING INST:7 SERIAL:0 PID:29440
****************************************************************
*** 2020-11-02T21:05:05.768950+01:00
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): <ten instances, redacted>
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 21:05:05 ]
      NOTE: scheduling delay has not been sampled for 0.445928 secs
    0.000000 secs from [ 21:05:01 - 21:05:06 ], 5 sec avg
    0.000000 secs from [ 21:04:05 - 21:05:06 ], 1 min avg
    0.000000 secs from [ 21:00:06 - 21:05:06 ], 5 min avg
  vktm time drift history
===============================================================================
 
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'PX Deq: Table Q Normal'<='PX Deq Credit: send blkd'
     Chain 1 Signature Hash: 0x507d111
 [b] Chain 2 Signature: 'gc cr multi block request'<='PX Deq Credit: send blkd'
     Chain 2 Signature Hash: 0xf7d43820
 [c] Chain 3 Signature: 'gc cr multi block request'<='enq: WF - contention'
     Chain 3 Signature Hash: 0x806234ed
 
===============================================================================
Non-intersecting chains:
<snip>
 
-------------------------------------------------------------------------------
Chain 5:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (<redacted>)
                   os id: 23221
              process id: 65, oracle@oc1n11.example.com (P00D)
              session id: 16
        session serial #: 16422
             module name: 0 (DBMS_SCHEDULER)
    }
    is waiting for 'PX Deq Credit: send blkd' with wait info:
    {
                      p1: 'sleeptime/senderid'=0x10010012
                      p2: 'passes'=0x66
                      p3: 'qref'=0x19d2f26a98
            time in wait: 0.924122 sec (last interval)
            time in wait: 51.343573 sec (total)
           timeout after: never
                 wait id: 4014518
                blocking: 0 sessions
          current sql_id: 3430640901
             current sql: With player_info as (
                 <snip>
             short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-epoll_wait()+19<-ipcor_wseti_wait()+94<-ipclw_wait()+520<-ksxpwait_ipclw()+919<-ksxpwait_int()+155<-ksxpwait()+319<-ksliwat()+2324<-kslwaitctx()+200<-ksxprcv_int()+5361<-ksxprcvimdwctx()+95<-kxfpqidqr()+13080<-kxfprienq()+7059<-kxfpqrenq()+252<-kxfqeqb()+127<-kxfqbroadcastall()+1238<-kxfqenq()+971<-qertqoRop()+197<-qerhnProbeInner()+1143<-kdstf110010100000000km()+910<-kdsttgr()+2208<-qertbFetch()+1089<-qergiFetch()+602<-rwsfcd()+113<-qerhnF
            wait history:
              * time between current wait and wait #1: 0.000000 sec
              1.       event: 'PX Deq: reap credit'
                 time waited: 0.000006 sec
                     wait id: 4014621         
              * time between wait #1 and #2: 0.000000 sec
              2.       event: 'PX Deq Credit: send blkd'
                 time waited: 0.573351 sec (last interval)
                 time waited: 50.419445 sec (total)
                     wait id: 4014518          p1: 'sleeptime/senderid'=0x10010012
                                               p2: 'passes'=0x65
                                               p3: 'qref'=0x19d2f26a98
              * time between wait #2 and #3: 0.000000 sec
              3.       event: 'PX Deq: reap credit'
                 time waited: 0.000009 sec
                     wait id: 4014620         
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (<redacted>)
                   os id: 23282
              process id: 120, oracle@oc1n11.example.com (P00I)
              session id: 293
        session serial #: 31198
             module name: 0 (DBMS_SCHEDULER)
    }
    which is waiting for 'enq: SS - contention' with wait info:
    {
                      p1: 'name|mode'=0x53530006
                      p2: 'tablespace #'=0x5b
                      p3: 'dba'=0x2
            time in wait: 51.222013 sec
           timeout after: never
                 wait id: 864202
                blocking: 5 sessions
          current sql_id: 3430640901
             current sql: With player_info as (
                          <snip>
             short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-epoll_wait()+19<-ipcor_wseti_wait()+94<-ipclw_wait()+520<-ksxpwait_ipclw()+919<-ksxpwait_int()+155<-ksxpwait()+319<-ksliwat()+2324<-kslwaitctx()+200<-kjuscv_wait()+1650<-kjuscv()+1433<-ksipcon()+561<-ksqcmi()+10488<-ksqcnv()+809<-ktstspacerls()+538<-ktstgrowseg()+4094<-ktstadr_addextent()+853<-ktrsexec()+454<-ktataddextent()+124<-kxhfmGetExtent()+87<-kxhfWrite()+610<-kxhfNewBuffer()+756<-qerhnGetNewBuffer()+70<-ksxb1bqb()+170<-qerhn_kxhrPac
            wait history:
              * time between current wait and wait #1: 0.000039 sec
              1.       event: 'enq: SS - contention'
                 time waited: 0.000574 sec
                     wait id: 864201           p1: 'name|mode'=0x53530001
                                               p2: 'tablespace #'=0x5b
                                               p3: 'dba'=0x2
              * time between wait #1 and #2: 0.000677 sec
              2.       event: 'buffer busy waits'
                 time waited: 0.002828 sec
                     wait id: 864200           p1: 'file#'=0xf7
                                               p2: 'block#'=0x2
                                               p3: 'class#'=0xd
              * time between wait #2 and #3: 0.000159 sec
              3.       event: 'buffer busy waits'
                 time waited: 0.001735 sec
                     wait id: 864199           p1: 'file#'=0xf2
                                               p2: 'block#'=0x2
                                               p3: 'class#'=0xd
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 3 (<redacted>)
                   os id: 22495
              process id: 103, oracle@oc1n13.example.com (P008)
              session id: 1148
        session serial #: 9960
             module name: 0 (DBMS_SCHEDULERc8-447d-99a0-a003dd18ae783))
    }
    which is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0xe
                      p3: 'id2'=0x2
            time in wait: 53.325184 sec
           timeout after: never
                 wait id: 66290
                blocking: 37 sessions
          current sql_id: 1159177187
             current sql: INSERT 
                   INTO <snip>
                        SELECT ...
             short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-epoll_wait()+19<-ipcor_wseti_wait()+94<-ipclw_wait()+520<-ksxpwait_ipclw()+919<-ksxpwait_int()+155<-ksxpwait()+319<-ksliwat()+2324<-kslwaitctx()+200<-kjusuc()+4645<-ksigeti()+1478<-ksbcic_int()+4785<-ktstspacerls()+1227<-ktstgrowseg()+4094<-ktstadr_addextent()+853<-ktrsexec()+454<-ktataddextent()+124<-kxhfmGetExtent()+87<-kxhfWrite()+755<-kxhfNewBuffer()+756<-qerhnGetNewBuffer()+70<-ksxb1bqb()+170<-qerhn_kxhrPack()+1585<-qerhnSplitBuild()+3
            wait history:
              * time between current wait and wait #1: 0.000170 sec
              1.       event: 'DFS lock handle'
                 time waited: 0.001276 sec
                     wait id: 66289            p1: 'type|mode'=0x43490005
                                               p2: 'id1'=0xe
                                               p3: 'id2'=0x3
              * time between wait #1 and #2: 0.000032 sec
              2.       event: 'DFS lock handle'
                 time waited: 0.001159 sec
                     wait id: 66288            p1: 'type|mode'=0x43490005
                                               p2: 'id1'=0xe
                                               p3: 'id2'=0x1
              * time between wait #2 and #3: 0.000039 sec
              3.       event: 'DFS lock handle'
                 time waited: 0.000611 sec
                     wait id: 66287            p1: 'type|mode'=0x43490005
                                               p2: 'id1'=0xe
                                               p3: 'id2'=0x5
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (<snip>)
                   os id: 16370
              process id: 32, oracle@oc1n12.example.com (DBW0)
              session id: 568
        session serial #: 8812
    }
    which is waiting for 'rdbms ipc message' with wait info:
    {
                      p1: 'timeout'=0x79
            time in wait: 0.876014 sec
      heur. time in wait: 0.876268 sec
           timeout after: 0.333986 sec
                 wait id: 4994326
                blocking: 38 sessions
          current sql_id: 0
             current sql: <none>
             short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-semtimedop()+10<-skgpwwait()+187<-ksliwat()+2224<-kslwaitctx()+200<-ksarcv()+320<-ksbabs()+602<-ksbrdp()+1123<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
            wait history:
              * time between current wait and wait #1: 0.000003 sec
              1.       event: 'rdbms ipc message'
                 time waited: 0.000082 sec
                     wait id: 4994325          p1: 'timeout'=0x79
              * time between wait #1 and #2: 0.000004 sec
              2.       event: 'rdbms ipc message'
                 time waited: 0.000165 sec
                     wait id: 4994324          p1: 'timeout'=0x79
              * time between wait #2 and #3: 0.000012 sec
              3.       event: 'rdbms ipc message'
                 time waited: 0.036154 sec
                     wait id: 4994323          p1: 'timeout'=0x7c
    }
 
Chain 5 Signature: 'rdbms ipc message'<='DFS lock handle'<='enq: SS - contention'<='PX Deq Credit: send blkd'
Chain 5 Signature Hash: 0xb4402dff
-------------------------------------------------------------------------------
<snip>
 
===============================================================================
Intersecting chains:
 -------------------------------------------------------------------------------
Chain 13:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (<redacted>)
                   os id: 23258
              process id: 118, oracle@oc1n11.example.com (P00G)
              session id: 107
        session serial #: 17672
             module name: 0 (DBMS_SCHEDULER)
    }
    is waiting for 'PX Deq Credit: send blkd' with wait info:
    {
                      p1: 'sleeptime/senderid'=0x10010012
                      p2: 'passes'=0x66
                      p3: 'qref'=0x19d2f27260
            time in wait: 0.924493 sec (last interval)
            time in wait: 51.344392 sec (total)
           timeout after: never
                 wait id: 3580786
                blocking: 0 sessions
          current sql_id: 3430640901
             current sql: With player_info as (
     select <snip>
             short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-epoll_wait()+19<-ipcor_wseti_wait()+94<-ipclw_wait()+520<-ksxpwait_ipclw()+919<-ksxpwait_int()+155<-ksxpwait()+319<-ksliwat()+2324<-kslwaitctx()+200<-ksxprcv_int()+5361<-ksxprcvimdwctx()+95<-kxfpqidqr()+13080<-kxfprienq()+7059<-kxfpqrenq()+252<-kxfqeqb()+127<-kxfqbroadcastall()+1238<-kxfqenq()+971<-qertqoRop()+197<-qerhnProbeInner()+1143<-kdstf110010100000000km()+910<-kdsttgr()+2208<-qertbFetch()+1089<-qergiFetch()+602<-rwsfcd()+113<-qerhnF
            wait history:
              * time between current wait and wait #1: 0.000000 sec
              1.       event: 'PX Deq: reap credit'
                 time waited: 0.000007 sec
                     wait id: 3580889         
              * time between wait #1 and #2: 0.000000 sec
              2.       event: 'PX Deq Credit: send blkd'
                 time waited: 0.572950 sec (last interval)
                 time waited: 50.419892 sec (total)
                     wait id: 3580786          p1: 'sleeptime/senderid'=0x10010012
                                               p2: 'passes'=0x65
                                               p3: 'qref'=0x19d2f27260
              * time between wait #2 and #3: 0.000000 sec
              3.       event: 'PX Deq: reap credit'
                 time waited: 0.000098 sec
                     wait id: 3580888         
    }
    and is blocked by 'instance: 1, os id: 23282, session id: 293',
    which is a member of 'Chain 5'.
 
Chain 13 Signature: 'rdbms ipc message'<='DFS lock handle'<='enq: SS - contention'<='PX Deq Credit: send blkd'
Chain 13 Signature Hash: 0xb4402dff
-------------------------------------------------------------------------------


Setting network timeouts with Oracle JDBC

With Statement and its subinterfaces it is possible to set query timeouts with the setQueryTimeout method. For queries this is fine, but it does not help with the rest of the database operations, for example commits and metadata operations. One way to address the issue is to set a timeout for socket reads through the Connection.setNetworkTimeout method. The documentation talks about rare network conditions, but if you’re blessed with RAC there are additional error scenarios that look like long reads from the socket. Another positive thing about setNetworkTimeout is that it allows millisecond resolution for the timeout, while setQueryTimeout supports only seconds.

Let’s have an example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;


import org.junit.runner.RunWith;
import org.junit.Test;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest
public class JDBCStatement {

        protected final Logger logger = LoggerFactory.getLogger(getClass());

        @Autowired
        DataSource dataSource;

        @Value("${spring.datasource.use_autocommit}")
        private boolean autocommit;

        static final String DEL_STATEMENT = "BEGIN DELETE FROM TX; DELETE FROM T; END;";
        static final String INS_STATEMENT = "INSERT INTO T VALUES(?, ?, ?)";
        static final String CLIENT_INFO = "BEGIN DBMS_SESSION.SET_IDENTIFIER(?); END;";
        static final String CLIENT_NAME = "C_RK";
        static final int BATCH_SIZE = 1000000;

        @Autowired
        ThreadPoolTaskExecutor ex;

        @Test
        public void singleStatement() {
                try (
                        Connection conn = dataSource.getConnection();
                        PreparedStatement ds = conn.prepareStatement(DEL_STATEMENT);
                        PreparedStatement is = conn.prepareStatement(INS_STATEMENT);
                ) {
                        ds.execute();

//                      Just for testing
//                      conn.setNetworkTimeout(Executors.newFixedThreadPool(1), 10000);

//                      Sets 10s timeout
                        conn.setNetworkTimeout(ex.getThreadPoolExecutor(), 10000);
                        for(int i = 0; i <= BATCH_SIZE; i++){
                                is.setInt(1, i);
                                is.setInt(2, 1);
                                is.setString(3, "BOOYA!." + i);
                                is.execute();
                                // Commit in a loop is to help out with testing. Don't try this at home!
                                conn.commit();
                        }
                }
                catch (SQLException ex) {
                        throw new RuntimeException("", ex);
                }

        }
}

Testing

The easiest way for me is to stop the Oracle log writer during the test run. This makes the commit hang on the log file sync wait event, and our timeout fires. So, in the database, while the test case is running:

SYS@roadkill 12-03-2021 13:45:23> !ps -u oracle|grep lgwr
30317 ?        00:03:36 ora_lgwr_roadki

SYS@roadkill 12-03-2021 13:45:36> oradebug setospid 30317
Oracle pid: 13, Unix process pid: 30317, image: oracle@roadkill (LGWR)
SYS@roadkill 12-03-2021 13:46:38> oradebug suspend
Statement processed.
SYS@roadkill 12-03-2021 13:46:49> select state, seq#, event from v$session where username ='C_RK';

STATE                     SEQ# EVENT
------------------- ---------- ----------------------------------------------------------------
WAITING                  27879 log file sync
Elapsed: 00:00:00.01
SYS@roadkill 12-03-2021 13:47:03>

(There’s a way to run oradebug from an event handler, but I haven’t got that one working ;))

After a few seconds our timeout fired:

java.lang.RuntimeException:
        at <redacted>.JDBCStatement.singleStatement(JDBCStatement.java:77)
Caused by: java.sql.SQLRecoverableException: IO Error: Socket read timed out
        at <redacted>.JDBCStatement.singleStatement(JDBCStatement.java:71)
Caused by: oracle.net.nt.TimeoutInterruptHandler$IOReadTimeoutException: Socket read timed out
        at <redacted>.JDBCStatement.singleStatement(JDBCStatement.java:71)


Reducing chattiness of JDBC applications: batching be good!

When reusing a PreparedStatement with a new set of bind variables, the SQL statement and first row are sent to the database with the first PreparedStatement.execute*() call. This involves one network roundtrip, and each consecutive call triggers another roundtrip. These network roundtrips can add up fast when operating on high-latency networks. (Say hi to the public cloud!) Let’s consider the following snippet. (Autocommit is set to false everywhere unless otherwise noted.)

        static final String DEL_STATEMENT = "DELETE FROM T;";
        static final String INS_STATEMENT = "INSERT INTO T VALUES(?, ?)";

        static final int BATCH_SIZE = 1000;
        @Test
        public void rowByRow() {
                try (
                        Connection conn = dataSource.getConnection();
                        PreparedStatement ds = conn.prepareStatement(DEL_STATEMENT);
                        PreparedStatement is = conn.prepareStatement(INS_STATEMENT);
                ) {

                        ds.execute();
                        for(int i = 0; i <= BATCH_SIZE; i++){
                                is.setInt(1, i);
                                is.setString(2, "BOOYA!." + i);
                                is.execute();
                        }
                        conn.commit();
                }
                catch (SQLException ex) {
                        throw new RuntimeException("", ex);
                }

        }

As expected, inserting one row will involve one network roundtrip:

    localhost.44672 > localhost.ncube-lm: Flags [P.], cksum 0xfeb8 (incorrect -> 0xe4c3), seq 5010:5154, ack 5775, win 2478, options [nop,nop,TS val 104535079
3 ecr 1045350771], length 144
E...k.@.@..P...............2...\..      ........
>N..>N.s...........^...)......................................INSERT INTO T VALUES(:1 , :2 ).............................i....... .......i......BOOYA!.0
12:42:35.064825 IP (tos 0x0, ttl 64, id 3758, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.ncube-lm > localhost.44672: Flags [.], cksum 0xfe28 (incorrect -> 0x516e), seq 5775, ack 5154, win 2497, options [nop,nop,TS val 1045350793 ecr 1045350793], length 0
E..4..@.@..................\......      ..(.....
>N..>N..
12:42:35.065721 IP (tos 0x0, ttl 64, id 3759, offset 0, flags [DF], proto TCP (6), length 145)
    localhost.ncube-lm > localhost.44672: Flags [P.], cksum 0xfe85 (incorrect -> 0xe9f0), seq 5775:5868, ack 5154, win 2497, options [nop,nop,TS val 1045350794 ecr 1045350793], length 93
E.....@.@.-................\......      ........
>N..>N.....].............0.........................................%.}..u.................%.}.u......
12:42:35.066227 IP (tos 0x0, ttl 64, id 27618, offset 0, flags [DF], proto TCP (6), length 134)
    localhost.44672 > localhost.ncube-lm: Flags [P.], cksum 0xfe7a (incorrect -> 0x671b), seq 5154:5236, ack 5868, win 2478, options [nop,nop,TS val 1045350794 ecr 1045350794], length 82
E...k.@.@.........................      ..z.....
>N..>N.....R.......^... ..........................................................BOOYA!.1
12:42:35.066666 IP (tos 0x0, ttl 64, id 3760, offset 0, flags [DF], proto TCP (6), length 145)
    localhost.ncube-lm > localhost.44672: Flags [P.], cksum 0xfe85 (incorrect -> 0xe63e), seq 5868:5961, ack 5236, win 2497, options [nop,nop,TS val 1045350795 ecr 1045350794], length 93
E.....@.@.-.......................      ........
>N..>N.....].............0.........................................%.}..v.................%.}.v......
<and so on...>

A great way to reduce network roundtrips is executing DML in a single batch:

        static final String DEL_STATEMENT = "BEGIN DELETE FROM TX; DELETE FROM T; END;";
        static final String INS_STATEMENT = "INSERT INTO T VALUES(?, ?)";
        static final int BATCH_SIZE = 1000;
        @Test
        public void batchStatement() {
                try (
                        Connection conn = dataSource.getConnection();
                        PreparedStatement ds = conn.prepareStatement(DEL_STATEMENT);
                        PreparedStatement is = conn.prepareStatement(INS_STATEMENT);
                ) {
                        ds.execute();

                        for(int i = 0; i <= BATCH_SIZE; i++){
                                is.clearParameters();
                                is.setInt(1, i);
                                is.setString(2, "BOOYA!." + i);
                                is.addBatch();
                        }
                        int[] r = is.executeBatch();
                        conn.commit();
                }
                catch (SQLException ex) {
                        throw new RuntimeException("", ex);
                }
        }

And indeed, all 1k rows are sent with the initial SQL statement. (Due to packet size limits the batch can be split into multiple packets, introducing additional network roundtrips. The 1k rows here were actually sent in two packets/roundtrips.)

    localhost.45610 > localhost.ncube-lm: Flags [P.], cksum 0x1e29 (incorrect -> 0x2ccc), seq 5010:13202, ack 5774, win 2478, options [nop,nop,TS val 1049319975 ecr 1049319910], length 8192
E. 4.|@.@.8E.........*.. ...q+3...      ..).....
>.Z'>.Y... ........^...).......................................INSERT INTO T VALUES(:1 , :2 )..............................i.......,.......i......BOOYA!.0.....BOOYA!.1.....BOOYA!.2.....BOOYA!.3.....BOOYA!.4.....BOOYA!.5.....BOOYA!.6.....BOOYA!.7...        .BOOYA!.8...
.BOOYA!.9....   BOOYA!.10....   BOOYA!.11....   BOOYA!.12....   BOOYA!.13....   BOOYA!.14....   BOOYA!.15....   BOOYA!.16....   BOOYA!.17....   BOOYA!.18....
   BOOYA!.19....   BOOYA!.20....   BOOYA!.21....   BOOYA!.22....   BOOYA!.23....   BOOYA!.24....   BOOYA!.25....   BOOYA!.26....   BOOYA!.27....   BOOYA!.28....   BOOYA!.29....   BOOYA!.30...    BOOYA!.31...!   BOOYA!.32..."   BOOYA!.33...#   BOOYA!.34...$   BOOYA!.35...%   BOOYA!.36...&   BOOYA!.37...'   BOOYA!.38...(   BOOYA!.39...)   BOOYA!.40...*   BOOYA!.41...+   BOOYA!.42...,   BOOYA!.43...-   BOOYA!.44....   BOOYA!.45.../   BOOYA!.46...0   BOOYA!.47...1   BOOYA!.48...2   BOOYA!.49...3   BOOYA!.50...4   BOOYA!.51...5   BOOYA!.52...6   BOOYA!.53...7   BOOYA!.54...8   BOOYA!.55...9   BOOYA!.56...:   BOOYA!.57...;   BOOYA!.58...<   BOOYA!.59...=   BOOYA!.60...>   BOOYA!.61...?   BOOYA!.62...@   BOOYA!.63...A   BOOYA!.64...B   BOOYA!.65...C   BOOYA!.66...D   BOOYA!.67...E   BOOYA!.68...F   BOOYA!.69...G   BOOYA!.70...H   BOOYA!.71...I   BOOYA!.72...J   BOOYA!.73...K   BOOYA!.74...L   BOOYA!.75...M   BOOYA!.76...N   BOOYA!.77...O   BOOYA!.78...P   BOOYA!.79...Q   BOOYA!.80...R   BOOYA!.81...S   BOOYA!.82...T   BOOYA!.83...U   BOOYA!.84...V   BOOYA!.85...W   BOOYA!.86...X   BOOYA!.87...Y
   BOOYA!.88...Z   BOOYA!.89...[   BOOYA!.90...\   BOOYA!.91...]   BOOYA!.92...^   BOOYA!.93..._   BOOYA!.94...`   BOOYA!.95...a   BOOYA!.96...b   BOOYA!.97...c   BOOYA!.98...d   BOOYA!.99....
<and so on...>

In most real-world situations transactions contain multiple DML statements executed in a specific order. As of Oracle JDBC 19.3 or newer, PreparedStatement and CallableStatement do not support more than one SQL statement in a batch. An attempt to add another throws java.sql.SQLFeatureNotSupportedException: Unsupported feature: addBatch.

Fortunately there’s a way to combine multiple DML statements into a single one. Consider this:

        static final String DEL_STATEMENT = "BEGIN DELETE FROM TX; DELETE FROM  T; END;";
        static final String INS_STATEMENT = "BEGIN INSERT INTO T VALUES(?, ?); INSERT INTO TX VALUES(?, ?, ?); INSERT INTO TX VALUES(?, ?, ?); END;";
        static final int BATCH_SIZE = 10000;
        @Test
        public void compoundStatement() {
                try (
                        Connection conn = dataSource.getConnection();
                        CallableStatement csDelete = conn.prepareCall(DEL_STATEMENT);
                        CallableStatement csInsert = conn.prepareCall(INS_STATEMENT);
                ) {

                        csDelete.execute();

                        // One row for illustration; i would normally come from a loop,
                        // as in the batched version below
                        int i = 0;

                        csInsert.setInt(1, i);
                        csInsert.setString(2, "BOOYA!." + i);
                        csInsert.setInt(3, 2*i);
                        csInsert.setInt(4, i);
                        csInsert.setString(5, "FUBAR." + i);
                        csInsert.setInt(6, 2*i + 1);
                        csInsert.setInt(7, i);
                        csInsert.setString(8, "FUBAZ." + i);

                        csInsert.execute();
                        conn.commit();
                }
                catch (SQLException ex) {
                        throw new RuntimeException("", ex);
                }

        }

The magic incantation in INS_STATEMENT is actually an anonymous PL/SQL block:

BEGIN
    INSERT INTO t VALUES (?,?);
    INSERT INTO tx VALUES (?,?,?);
    INSERT INTO tx VALUES (?,?,?);
END;

As expected, the statement and data are sent to the database in a single network packet:

14:20:57.714505 IP (tos 0x0, ttl 64, id 47883, offset 0, flags [DF], proto TCP (6), length 401)
    localhost.46060 > localhost.ncube-lm: Flags [P.], cksum 0xff85 (incorrect -> 0x0031), seq 5024:5373, ack 5789, win 2478, options [nop,nop,TS val 105125344
3 ecr 1051253279], length 349
E.....@.@..Y................J]G...      ........
>...>......].......^...)...v...............................BEGIN INSERT INTO T VALUES(:1 , :2 ); INSERT INTO TX VALUES(:3 , :4 , :5 ); INSERT INTO TX VALUES(:
6 , :7 , :8 ); END;.............................i.......>........i..............i..............i.......>........i..............i..............i.......>.......
.i......BOOYA!.0.....FUBAR.0......FUBAZ.0

Batching is good, and SmartDB is cool, so what about combining the two? Can we send an anonymous PL/SQL block and a batch of data in one network packet? Let’s try.

        static final String DEL_STATEMENT = "BEGIN DELETE FROM TX; DELETE FROM  T; END;";
        static final String INS_STATEMENT = "BEGIN INSERT INTO T VALUES(?, ?); INSERT INTO TX VALUES(?, ?, ?); INSERT INTO TX VALUES(?, ?, ?); END;";       
        @Test
        public void batchedCompoundStatement() {
                try (
                        Connection conn = dataSource.getConnection();
                        CallableStatement csDelete = conn.prepareCall(DEL_STATEMENT);
                        CallableStatement csInsert = conn.prepareCall(INS_STATEMENT);
                ) {
                        csDelete.execute();
                        for(int i = 0; i <= BATCH_SIZE; i++){
                                csInsert.clearParameters();
                                csInsert.setInt(1, i);
                                csInsert.setString(2, "BOOYA!." + i);
                                csInsert.setInt(3, 2*i);
                                csInsert.setInt(4, i);
                                csInsert.setString(5, "FUBAR." + i);
                                csInsert.setInt(6, 2*i + 1);
                                csInsert.setInt(7, i);
                                csInsert.setString(8, "FUBAZ." + i);
                                csInsert.addBatch();
                        }
                        int[] r = csInsert.executeBatch();
                        conn.commit();
                }
                catch (SQLException ex) {
                        throw new RuntimeException("", ex);
                }
        }

Unfortunately, computer says no. Rows are sent to the database slow-by-slow:

14:20:57.714505 IP (tos 0x0, ttl 64, id 47883, offset 0, flags [DF], proto TCP (6), length 401)
    localhost.46060 > localhost.ncube-lm: Flags [P.], cksum 0xff85 (incorrect -> 0x0031), seq 5024:5373, ack 5789, win 2478, options [nop,nop,TS val 105125344
3 ecr 1051253279], length 349
E.....@.@..Y................J]G...      ........
>...>......].......^...)...v...............................BEGIN INSERT INTO T VALUES(:1 , :2 ); INSERT INTO TX VALUES(:3 , :4 , :5 ); INSERT INTO TX VALUES(:
6 , :7 , :8 ); END;.............................i.......>........i..............i..............i.......>........i..............i..............i.......>.......
.i......BOOYA!.0.....FUBAR.0......FUBAZ.0
14:20:57.716987 IP (tos 0x0, ttl 64, id 59405, offset 0, flags [DF], proto TCP (6), length 135)
    localhost.ncube-lm > localhost.46060: Flags [P.], cksum 0xfe7b (incorrect -> 0xeeba), seq 5789:5872, ack 5373, win 2497, options [nop,nop,TS val 105125344
5 ecr 1051253443], length 83
E.....@.@.Ta............J]G....d..      ..{.....
>...>......S................        .............................../.......................
14:20:57.717010 IP (tos 0x0, ttl 64, id 47884, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.46060 > localhost.ncube-lm: Flags [.], cksum 0xfe28 (incorrect -> 0x747e), seq 5373, ack 5872, win 2478, options [nop,nop,TS val 1051253445 ecr
1051253445], length 0
E..4..@.@..................dJ]HO..      ..(.....
>...>...
14:20:57.717640 IP (tos 0x0, ttl 64, id 47885, offset 0, flags [DF], proto TCP (6), length 159)
    localhost.46060 > localhost.ncube-lm: Flags [P.], cksum 0xfe93 (incorrect -> 0x5471), seq 5373:5480, ack 5872, win 2478, options [nop,nop,TS val 1051253446 ecr 1051253445], length 107
E.....@.@..I...............dJ]HO..      ........
>...>......k.......^    .. .......................................................BOOYA!.1.......FUBAR.1.......FUBAZ.1
14:20:57.719178 IP (tos 0x0, ttl 64, id 59406, offset 0, flags [DF], proto TCP (6), length 135)
    localhost.ncube-lm > localhost.46060: Flags [P.], cksum 0xfe7b (incorrect -> 0xebf7), seq 5872:5955, ack 5480, win 2497, options [nop,nop,TS val 1051253447 ecr 1051253446], length 83
E.....@.@.T`............J]HO......      ..{.....
>...>......S................        ......................      ......../............   ..........
14:20:57.719560 IP (tos 0x0, ttl 64, id 47886, offset 0, flags [DF], proto TCP (6), length 159)
    localhost.46060 > localhost.ncube-lm: Flags [P.], cksum 0xfe93 (incorrect -> 0x4dab), seq 5480:5587, ack 5955, win 2478, options [nop,nop,TS val 1051253447 ecr 1051253447], length 107
E.....@.@..H................J]H...      ........
>...>......k.......^
.. .......................................................BOOYA!.2.......FUBAR.2.......FUBAZ.2

It does not matter if we use a stored procedure instead of an anonymous PL/SQL block, or PreparedStatement instead of CallableStatement. The behavior remains exactly the same: batching does not work as expected.


Mixing TIMESTAMPs with TIMESTAMPs WITH TIME ZONE considered harmful

TIMESTAMP and TIMESTAMP WITH TIME ZONE are different data types: they have different internal representations and their behavior has subtle differences. But somehow Oracle still sees fit to provide implicit data type conversion between the two. So let’s check how implicit data type conversion works in real life.

First some test data:

PRIITP@testpdb > CREATE TABLE t (
    id           INTEGER PRIMARY KEY,
    created      TIMESTAMP NOT NULL,
    created_tz   TIMESTAMP WITH TIME ZONE NOT NULL
);
Table created.

Elapsed: 00:00:00.03
PRIITP@testpdb > insert into t values (1, systimestamp, systimestamp);

1 row created.

Elapsed: 00:00:00.01
PRIITP@testpdb > select * from t;

        ID CREATED                                                                     CREATED_TZ
---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
         1 21-SEP-20 11.27.35.281367 AM                                                21-SEP-20 11.27.35.281367 AM +00:00

Now, some comparisons between the TIMESTAMP column and a TIMESTAMP WITH TIME ZONE literal.

PRIITP@testpdb > select * from t where created > TIMESTAMP '2020-09-21 10:27:35 +03:00';

        ID CREATED                                                                     CREATED_TZ
---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
         1 21-SEP-20 11.27.35.281367 AM                                                21-SEP-20 11.27.35.281367 AM +00:00

Still fabulous. But there’s a way to bring in some fun: let’s adjust the session time zone.

PRIITP@testpdb > alter session set time_zone = '06:00';

Session altered.

Elapsed: 00:00:00.01
PRIITP@testpdb > select * from t where created > TIMESTAMP '2020-09-21 10:27:35 +03:00';

no rows selected

Elapsed: 00:00:00.00

Oops. Somehow our data disappeared. A hint at why such mysterious behavior happens comes from the execution plan, specifically from the ‘filter’ section:

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    41 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |     1 |    41 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(SYS_EXTRACT_UTC(INTERNAL_FUNCTION("CREATED"))>TIMESTAMP'
              2020-09-21 07:27:35.000000000')

First, Oracle converts the TIMESTAMP WITH TIME ZONE literal into a TIMESTAMP literal in UTC. This is all well and good: internally, operations across different time zones are done in UTC only, so the conversion into UTC is expected.

Much more interesting is this part: SYS_EXTRACT_UTC(INTERNAL_FUNCTION("CREATED")). There is no way to know which time zone those TIMESTAMPs are in, so Oracle takes a leap of faith and assumes they are in the session time zone. This is exactly how SYS_EXTRACT_UTC operates on TIMESTAMPs: since time zone information isn’t available, it assumes the session time zone. Unfortunately this does not explain the nested INTERNAL_FUNCTION; it looks like a data type conversion, but one isn’t actually needed.

While doing the data type conversion, Oracle probably ends up with a construct like sys_extract_utc(from_tz(created, '06:00')), and with an expression like this our query, of course, does not return anything:

PRIITP@testpdb > select created, sys_extract_utc(from_tz(created, '06:00')) from t;

CREATED                                                                     SYS_EXTRACT_UTC(FROM_TZ(CREATED,'06:00'))
--------------------------------------------------------------------------- ---------------------------------------------------------------------------
21-SEP-20 11.27.35.281367 AM                                                21-SEP-20 05.27.35.281367 AM

To me this seems like a violation of POLA. Since lossless data type conversion isn’t always possible between TIMESTAMP and TIMESTAMP WITH TIME ZONE, Oracle should throw an error instead of assuming things and making queries nondeterministic. And speaking of assumptions, treating TIMESTAMPs as UTC would make much more sense than the session time zone, at least to me.
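
The defensive workaround is to make the conversion explicit, so the predicate no longer depends on the session time zone. A sketch, assuming the application really stores UTC in CREATED:

select *
from t
where from_tz(created, 'UTC') > TIMESTAMP '2020-09-21 10:27:35 +03:00';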


Measuring latency in Python: implementation details

Getting the correct time and measuring things in the sub-millisecond range is a hairy and complicated topic. Most of the time it boils down to low-level details in the hardware, but sometimes the tools you have to use add their own weird twist. Let’s consider Python 3 as an example. According to the documentation, things couldn’t be simpler: we have the time package and five functions to return the desired reading:

  • time.monotonic() — Return the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards. The clock is not affected by system clock updates. (Well, not exactly, see later.)
  • time.perf_counter() — Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration.
  • time.process_time() — Return the value (in fractional seconds) of the sum of the system and user CPU time of the current process.
  • time.thread_time()  — Return the value (in fractional seconds) of the sum of the system and user CPU time of the current thread. 
  • time.time() — Return the time in seconds since the epoch as a floating point number. 

Now let’s check what the underlying implementation is across the platforms. First on FreeBSD 12.1:

priitp@roadkill2:~ % python3.7
Python 3.7.8 (default, Jul 19 2020, 23:46:27)
[Clang 8.0.1 (tags/RELEASE_801/final 366581)] on freebsd12
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> [print("{}: {}".format(ci, time.get_clock_info(ci))) for ci in ('monotonic', 'perf_counter', 'process_time', 'thread_time', 'time')]
monotonic: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=2.8e-07)
perf_counter: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=2.8e-07)
process_time: namespace(adjustable=False, implementation='clock_gettime(CLOCK_PROF)', monotonic=True, resolution=0.01)
thread_time: namespace(adjustable=False, implementation='clock_gettime(CLOCK_THREAD_CPUTIME_ID)', monotonic=True, resolution=1.0000000000000002e-06)
time: namespace(adjustable=True, implementation='clock_gettime(CLOCK_REALTIME)', monotonic=False, resolution=2.8e-07)

On Linux I have a slightly older Python, so no thread_time here:

[pripii@priitp-roadkill ]$ python3
Python 3.6.8 (default, Apr  2 2020, 13:34:55)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> [print("{}: {}".format(ci, time.get_clock_info(ci))) for ci in ('monotonic', 'perf_counter', 'process_time', 'time')]
monotonic: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
perf_counter: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
process_time: namespace(adjustable=False, implementation='clock_gettime(CLOCK_PROCESS_CPUTIME_ID)', monotonic=True, resolution=1e-09)
time: namespace(adjustable=True, implementation='clock_gettime(CLOCK_REALTIME)', monotonic=False, resolution=1e-09)

And since code I’m involved with can theoretically run on Windows as well, let’s include it for the sake of lulz:

Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> [print("{}: {}".format(ci, time.get_clock_info(ci))) for ci in ('monotonic', 'perf_counter', 'process_time', 'time')]
monotonic: namespace(adjustable=False, implementation='GetTickCount64()', monotonic=True, resolution=0.015625)
perf_counter: namespace(adjustable=False, implementation='QueryPerformanceCounter()', monotonic=True, resolution=1e-07)
process_time: namespace(adjustable=False, implementation='GetProcessTimes()', monotonic=True, resolution=1e-07)
time: namespace(adjustable=True, implementation='GetSystemTimeAsFileTime()', monotonic=False, resolution=0.015625)

On both FreeBSD and Linux, time.monotonic() and time.perf_counter() use clock_gettime(CLOCK_MONOTONIC), with a reported resolution of 1ns on Linux and 280ns on FreeBSD. But on Windows time.monotonic() gives you a whopping 16ms resolution. So when measuring latencies in microseconds, time.perf_counter() is the way to go, with the loss of some precision on Windows, where its resolution is 100ns.

While CLOCK_MONOTONIC doesn’t suffer from jumps backward in time, it can be incrementally adjusted by adjtime(3) and NTP. And while it is monotonic, consecutive calls can return the same value, and it does not count time while the system was suspended. The value returned by CLOCK_MONOTONIC represents seconds either from system boot (Linux) or from the epoch (FreeBSD).

CLOCK_REALTIME represents seconds since the epoch and has all the problems a broken timer can have: it can be adjusted by a privileged user or by NTP/PTP/whatnot, its readings can jump backward and forward in time, and it has limited accuracy. So unless you’re interested in slightly wrong wall clock time, time.time() isn’t that interesting.

Does precision matter?

Since Python isn’t exactly the fastest language humanity has invented, why such a panic about the milli- and nanoseconds? Let’s check how long calling the various time.* functions takes. The following is from a Linux system with a 1900MHz CPU and kvm-clock as the system clock source.

#!/usr/bin/python3

from hdrh.histogram import HdrHistogram
import time

loops = 1000

def measure(timefunc):
        h = HdrHistogram(1, 10000, 4)
        for l in range(0, loops):
                start = time.perf_counter()
                dummy = timefunc()
                h.record_value(1000000000*(time.perf_counter() - start))
        return h

if __name__ == '__main__':
        for func in (time.monotonic, time.perf_counter, time.process_time, time.time):
                print("Histogram for {}".format(func))
                hist = measure(func)
                [print("\t{}th = {}ns".format(ct, hist.get_value_at_percentile(ct))) for ct in (50, 98, 99, 99.9, 99.99)]

And the winner is…

Histogram for <built-in function monotonic>
        50th = 324ns
        98th = 356ns
        99th = 413ns
        99.9th = 5288ns
        99.99th = 7132ns
Histogram for <built-in function perf_counter>
        50th = 350ns
        98th = 389ns
        99th = 396ns
        99.9th = 1569ns
        99.99th = 14392ns
Histogram for <built-in function process_time>
        50th = 1289ns
        98th = 1599ns
        99th = 2730ns
        99.9th = 10459ns
        99.99th = 21132ns
Histogram for <built-in function time>
        50th = 315ns
        98th = 346ns
        99th = 354ns
        99.9th = 570ns
        99.99th = 1700ns

Median call times are around 300ns, except for time.process_time(), and the tail values run into microseconds. So for events taking hundreds of microseconds, calling time.perf_counter() or time.monotonic() is just fine.

See also

For the general background: Measuring Latency in Linux.


How to add Oracle specific SQL extensions with Liquibase

Now that sqlcl has acquired support for Liquibase, it might feel unnecessary to use vanilla LB with Oracle. But still, bundling sqlcl with applications takes extra effort, and sometimes you’re stuck with vanilla Liquibase for unrelated reasons. And what’s more, adding SQL extensions like global temporary tables, index-organized tables and partitioning is actually quite straightforward with LB.

Let’s start with a simple CREATE TABLE statement:

CREATE TABLE partitioning_test (
    id           INTEGER
        PRIMARY KEY USING INDEX local,
    username     VARCHAR2(128) NOT NULL,
    created      TIMESTAMP WITH TIME ZONE NOT NULL,
    created_by   VARCHAR2(32) NOT NULL,
    updated      TIMESTAMP WITH TIME ZONE,
    updated_by   VARCHAR2(32)
)
    PARTITION BY HASH ( id ) PARTITIONS 16;

To turn this statement into a Liquibase changeSet I will:

  • Create a barebones changeSet without any funny extensions.
  • Check with the updateSQL command what kind of statement Liquibase generates from that changeSet.
  • Modify the generated statement with the modifySql sub-tag to my exact liking.

Our initial changeSet looks like this:

    <changeSet author="priitp" id="part-test">
        <createTable tableName="PARTITIONING_TEST">
            <column name="ID" type="INTEGER" >
                <constraints primaryKey="true"/>
            </column>
            <column name="USERNAME" type="VARCHAR2(128)">
                <constraints nullable="false"/>
            </column>
            <column name="CREATED" type="TIMESTAMP WITH TIME ZONE">
                <constraints nullable="false"/>
            </column>
            <column name="CREATED_BY" type="VARCHAR2(32)">
                <constraints nullable="false"/>
            </column>
            <column name="UPDATED" type="TIMESTAMP WITH TIME ZONE"/>
            <column name="UPDATED_BY" type="VARCHAR2(32)"/>
        </createTable>
    </changeSet>

From this changeSet, Liquibase generates the following SQL statement (formatted for better readability):

CREATE TABLE priitp.partitioning_test (
    id           INTEGER NOT NULL,
    username     VARCHAR2(128) NOT NULL,
    created      TIMESTAMP WITH TIME ZONE NOT NULL,
    created_by   VARCHAR2(32) NOT NULL,
    updated      TIMESTAMP WITH TIME ZONE,
    updated_by   VARCHAR2(32),
    CONSTRAINT pk_partitioning_test PRIMARY KEY ( id )
);

Partitioning the table is really simple: just append the partitioning clause to the end of the statement:

        <modifySql dbms="oracle">
                <append value=" PARTITION BY HASH (id) PARTITIONS 16 "/>
        </modifySql>

Changing the primary key clause is trickier; we have to actually modify it:

<replace replace="PRIMARY KEY (ID)" with="PRIMARY KEY (ID) USING INDEX LOCAL"/>

With those additions the final changeSet looks like this:

<changeSet author="priitp" id="part-test">
        <createTable tableName="PARTITIONING_TEST">
            <column name="ID" type="INTEGER" >
                <constraints primaryKey="true"/>
            </column>
            <column name="USERNAME" type="VARCHAR2(128)">
                <constraints nullable="false"/>
            </column>
            <column name="CREATED" type="TIMESTAMP WITH TIME ZONE">
                <constraints nullable="false"/>
            </column>
            <column name="CREATED_BY" type="VARCHAR2(32)">
                <constraints nullable="false"/>
            </column>
            <column name="UPDATED" type="TIMESTAMP WITH TIME ZONE"/>
            <column name="UPDATED_BY" type="VARCHAR2(32)"/>
        </createTable>
        <modifySql dbms="oracle">
                <append value=" PARTITION BY HASH (id) PARTITIONS 16 "/>
                <replace replace="PRIMARY KEY (ID)" with="PRIMARY KEY (ID) USING INDEX LOCAL"/>
        </modifySql>
    </changeSet>

And the generated SQL statement (once again, formatted for readability) contains all the things we want:

CREATE TABLE partitioning_test (
    id           INTEGER NOT NULL,
    username     VARCHAR2(128) NOT NULL,
    created      TIMESTAMP WITH TIME ZONE NOT NULL,
    created_by   VARCHAR2(32) NOT NULL,
    updated      TIMESTAMP WITH TIME ZONE,
    updated_by   VARCHAR2(32),
    CONSTRAINT pk_partitioning_test PRIMARY KEY ( id )
        USING INDEX local
)
    PARTITION BY HASH ( id ) PARTITIONS 16;

Caveats and Features

In addition to appending to and modifying the statement, the modifySql sub-tag supports prepending and replacing with regular expressions.

modifySql operates on all SQL statements generated by the changeSet, including statements coming from sqlFile. For example, if I add a createSequence tag to the changeSet, the result is something funny:

CREATE SEQUENCE PART_TEST_SEQ
PARTITION BY HASH (id) PARTITIONS 16 ;


Fast Connection Failover in Action

Oracle Fast Application Notification (FAN) makes it possible for a database client to be aware of what is happening in the database cluster. The main use case for FAN is Fast Connection Failover, which can help the database client a great deal when it comes to acting on problems in the database layer. Following is a short explanation of how FCF works, using Oracle Universal Connection Pool as an example. Some of the discussion should be relevant to other Oracle clients as well, since FCF works in a similar way whether the client is thick, thin, or ODP.NET.

Connection Pooling 101

The Oracle database is optimized to work with long-running sessions; creating a new connection is expensive in terms of latency and resources. On the other hand, a web service operating in the request-response model can have expected response times in single-digit milliseconds, so creating a new session for each and every request is time-consuming and not especially smart.

The solution is to create a pool of database connections in the application server. Initially the connection pool creates a bunch of available connections, which application code can borrow. When the application code is done with the database, it returns the borrowed connection to the pool, and the returned connection is marked as available.

Another nice thing about the connection pool is that, if properly configured, it acts as a circuit breaker between the application and the database. The maximum number of connections the pool can create effectively restricts the number of sessions that can be created in the database, which helps limit the damage in case of a sudden and unexpected rush hour.

Unfortunately these nice properties introduce some side effects. From a high availability standpoint the connection pool creates two big problems. First, it is impossible to tell without testing whether a connection to the database has died, and in case of networking trouble testing those connections can take ages. Second, there is no way to load balance the connections: the pool does not know which nodes in the cluster are overloaded and which could provide better response times for the application.

UCP: handling the outages

When UCP receives a message from ONS, it inspects the connections in the pool and takes action if needed. The whole process is rather simple:

  1. UCP receives an FCF message from ONS.
  2. UCP inspects the connections in the pool, and if a connection’s service is no longer available in the cluster, UCP kills the physical connection.

The main difference between planned and unplanned outages is how borrowed connections are handled. In case of unplanned downtime UCP kills all affected connections, including connections borrowed by application code. If the application tries to use a reaped connection, it receives a java.sql.SQLException.

In case of planned downtime UCP acts only on available connections. Borrowed connections stay intact; they are closed when the application returns them to the pool. To avoid logon storms, a draining period can be set with the oracle.ucp.PlannedDrainingPeriod parameter.
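
Putting it together, enabling FCF on the pool from the earlier sketch takes a couple of calls. Again a sketch: the ONS host names are made up, and I am assuming oracle.ucp.PlannedDrainingPeriod is set as a JVM system property (in seconds), as the dotted name suggests:

import java.sql.SQLException;
import oracle.ucp.jdbc.PoolDataSource;

public class FcfSetup {
    static void enableFcf(PoolDataSource pds) throws SQLException {
        // UCP subscribes to FAN events and starts acting on up/down messages
        pds.setFastConnectionFailoverEnabled(true);
        // Explicit ONS endpoints; with 12.1.0.2+ this often happens automatically
        pds.setONSConfiguration("nodes=oc3n11:6200,oc3n12:6200,oc3n13:6200");
        // Assumption: spread planned-maintenance draining over 5 minutes
        System.setProperty("oracle.ucp.PlannedDrainingPeriod", "300");
    }
}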

UCP: load balancing and connection affinity

With FAN, load in the database cluster becomes observable for the client. This, in turn, enables connection-time load balancing on the client side. Work can be distributed either by throughput or by response time. Load balancing messages are configured on the server side, on the service involved (the whole thing uses the RAC Load Balancing Advisory in the background). If UCP receives the relevant messages, load balancing is done automatically.
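
On the server side this could look something like the following srvctl call; the database and service names are made up:

srvctl modify service -db dwh2 -service fan_test -clbgoal SHORT -rlbgoal SERVICE_TIME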

As mentioned, load balancing happens when a new connection is created, which means applications with a steady number of connections do not benefit from the load balancing capabilities. To avoid this issue one can set timeouts on the connections, as in the sketch after this list:

  • InactiveConnectionTimeout — how long an available connection can stay idle before it is removed from the connection pool.
  • MaxConnectionReuseTime — how long a connection can be used before it is gracefully removed from the pool.
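
A sketch of both, with made-up values in seconds:

import java.sql.SQLException;
import oracle.ucp.jdbc.PoolDataSource;

public class PoolTimeouts {
    static void configure(PoolDataSource pds) throws SQLException {
        // Remove connections that sit idle in the pool for over 10 minutes
        pds.setInactiveConnectionTimeout(600);
        // Gracefully retire connections after an hour of total use
        pds.setMaxConnectionReuseTime(3600);
    }
}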

Unfortunately there are applications which borrow connections and never return them to the pool. For these greedy applications there are two more timeouts that operate on borrowed connections, shown in the sketch after this list:

  • AbandonedConnectionTimeout — a connection is considered abandoned when the borrowed connection hasn’t done any database calls within the set timeout. When the timeout expires, dangling local transactions are either cancelled or rolled back, and the connection is closed.
  • TimeToLiveConnectionTimeout — sets an upper limit on how long a connection can stay borrowed by the application. When the timeout expires, the borrowed connection is forcibly closed.
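
Again a sketch, with made-up values:

import java.sql.SQLException;
import oracle.ucp.jdbc.PoolDataSource;

public class BorrowTimeouts {
    static void configure(PoolDataSource pds) throws SQLException {
        // Consider a borrowed connection abandoned after 5 minutes without calls
        pds.setAbandonedConnectionTimeout(300);
        // Hard limit: a connection may stay borrowed for at most 30 minutes
        pds.setTimeToLiveConnectionTimeout(1800);
    }
}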

Setting those two timeouts can cause SQLExceptions to be thrown in the application when the application code tries to use a borrowed but already aborted connection. So, beware!

NB! There’s quite a lot more to optimizing UCP behavior. This was just a quick exposé of the parameters relevant to FCF behavior.

And where is the cool demo?

The test client we used for testing and debugging Fast Connection Failover unfortunately contains proprietary bits, and making it shareable probably isn’t worth the pain. Another problem is that when it comes to FAN events and FCF, UCP is a black box and does not log anything, even on DEBUG level. All you can divine of the actions taken by UCP is your application’s behavior. In that sense, testing your application under failure conditions is essential.

The maximum that is possible to coax out of UCP is a few lines when FCF is enabled:

2020-05-19 13:32:09,295 DEBUG [main] o.ucp.jdbc oracle.ucp.jdbc.PoolDataSourceImpl:createPool:oracle.ucp.jdbc.PoolDataSourceImpl@7098b907:Connection pool instance is created
2020-05-19 13:32:10,514 DEBUG [main] oracle.ucp :::Heuristically determine whether to enable FAN
2020-05-19 13:32:10,515 DEBUG [main] oracle.ucp :::RAC/GDS 12.x, FAN is heuristically enabled

Why UCP talks here about RAC/GDS 12.x is another mystery, since JDBC, UCP and the database cluster are all version 19.3.


Oracle Notification Service

Fast Application Notification uses ONS as a transport mechanism. ONS receives a subset of the events propagated in Clusterware by evmd, and the RLB events generated in the database by MMON. It runs as a part of nodeapps; there is one ons daemon per cluster node. This is the ONS topology from a 5-node cluster:


[grid@oc3n13.example.com ~]$ onsctl debug comp=ons[topology]
HTTP/1.1 200 OK
Connection: close
Content-Type: text/html
Response:

== oc3n13.example.com:6200 19597 (56) 20/01/12 17:01:36 ==
Build: ONS_19.0.0.0.0_LINUX.X64_190725 2019/7/25 13:0:12 UTC
Home: /u01/app/grid/product/19c

======== ONS ========

           IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
                          10.69.128.133  6200 5e16ce70 00000009 00000008

Connection Topology: (6)

                IP                      PORT   VERS  TIME
--------------------------------------- ----- ----- --------
                          10.69.128.136  6200     5 5e16ce74
                           **                           10.69.128.132 6200
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.135 6200
                          10.69.128.135  6200     5 5e16ce71
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.132 6200
                           **                           10.69.128.136 6200
                          10.69.128.134  6200     5 5e16ce70
                           **                           10.69.128.133 6200
                           **                           10.69.128.132 6200
                           **                           10.69.128.135 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.136 6200
                          10.69.128.133  6200     5 5e16ce70
                           **                           10.69.128.132 6200
                           **                           10.69.128.135 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.136 6200
                          10.69.128.132  6200     5 5e16ce72
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.135 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.136 6200
                          10.69.128.131  6200     5 5e16ce73
                           **                           10.69.128.135 6200
                           **                           10.69.128.132 6200
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.136 6200

Configured Nodes: (5)

                ADDRESS                    PORT   FLAGS
---------------------------------------- -------- -----
                                  oc3n11     6200 00000
                                  oc3n12     6200 00000
                                  oc3n14     6200 00000
                                  oc3n15     6200 00000
                                  oc3n16     6200 00000

ONS forwards database status change events and RLB events to subscribers, primarily to FAN clients. In the following listing the client with id 1fd is my test client, which subscribes to events matching “database/event/”:

[grid@oc3n13.example.com ~]$ onsctl debug comp=ons[client,subscribe]
HTTP/1.1 200 OK
Connection: close
Content-Type: text/html
Response:
 
== oc3n13.example.com:6200 19597 (56) 20/01/09 15:18:08 ==
Build: ONS_19.0.0.0.0_LINUX.X64_190725 2019/7/25 13:0:12 UTC
Home: /u01/app/grid/product/19c
 
======== ONS ========
 
           IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
                          10.69.128.133  6200 5e16ce70 00000009 00000008
 
Client connections: (6)
 
   ID            CONNECTION ADDRESS              PORT  FLAGS  SNDQ REF PHA SUB
-------- --------------------------------------- ----- ------ ---- --- --- ---
       0                                internal     0 00044a    0   1  IO   1
       2                                     ::1 19308 40041a    0   1  IO   1
       3                                     ::1 19310 40041a    0   1  IO   1
       4                                     ::1 19312 40041a    0   1  IO   0
     1fd                     ::ffff:10.100.68.84 57068 48042a    0   1  IO   1
 request                                     ::1 28244 400e1a    0   1  IO   0
 
Subscriptions:
 
  Count 3 Evaluators 0 Updaters 0
 
    Ref 1: "eventType=ONSupdate"
    Ref 2: "database/event/host"
    Ref 1: "database/event/"

Client side configuration

In the good old days ONS configuration used to be hardcoded on the client side, but since Oracle 12.1.0.2.0 FAN is enabled automatically when appropriate, and ONS configuration happens behind the scenes. But sometimes it is good to know where the client is connecting to, especially with large clusters. In the case of JDBC and UCP it is possible to fetch the ONS configuration from the database connection. The following is a heavily abridged example:

import java.sql.SQLException;
import java.util.Properties;
import javax.sql.DataSource;
import oracle.jdbc.internal.OracleConnection;

// Snip...
DataSource ds;
// Some magic that creates DataSource using UCP

// try-with-resources ensures the connection is returned to the pool
try (OracleConnection conn = (OracleConnection) ds.getConnection()) {
    Properties props = conn.getServerSessionInfo();
    // AUTH_ONS_CONFIG holds the ONS endpoints the server sent to the client at logon
    String conf = props.getProperty("AUTH_ONS_CONFIG");
    logger.info("AUTH_ONS_CONFIG = " + conf);
}
catch (SQLException e) {
    // Handle the exception
}


Some FAN clients, for example UCP, automatically register with three ONS servers, even if there are more servers available in the cluster.

Connecting to ONS

Sometimes it’s good to connect to ONS directly, either for monitoring or to debug Fast Application Notification/Fast Connection Failover. To that end there is a public API available, but I haven’t had much success with it. Fortunately, there’s fanWatcher, which uses an undocumented API from ons.jar.

Debugging ONS client

ons.jar checks for the oracle.ons.debug system property; if it is set to true (case insensitive), debug logging is enabled and the results are sent to stdout. The easiest way to set the parameter is probably from the command line, with the -Doracle.ons.debug=true switch. ons.jar uses java.util.logging (JUL) for logging, so in theory it should be possible to send the debug log into your Java logging framework of choice.
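
For example, routing the ONS debug log through the jul-to-slf4j bridge could look like the sketch below. This assumes the bridge is on the classpath and that ons.jar loggers live under the oracle.ons namespace, which the class names in the output below suggest:

import java.util.logging.Level;
import java.util.logging.Logger;
import org.slf4j.bridge.SLF4JBridgeHandler;

public class OnsLogBridge {
    // Keep a strong reference so the JUL logger settings are not garbage collected
    private static final Logger ONS_LOGGER = Logger.getLogger("oracle.ons");

    public static void init() {
        // Remove JUL's default console handlers, then route records to SLF4J
        SLF4JBridgeHandler.removeHandlersForRootLogger();
        SLF4JBridgeHandler.install();
        // ons.jar logs at FINE/FINEST, so the logger must let those levels through
        ONS_LOGGER.setLevel(Level.ALL);
    }
}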

On initialization ons.jar creates the notification network and sends subscription messages to the ONS servers.

FINE: Creating notification network (implicit) : {ss:0;ff:;conf:[{max:3;a:true;n:["10.69.128.151:6200";]},{max:3;a:true;n:["10.69.128.152:6200";]},{max:3;a:true;n:["10.69.128.153:6200";]},]}
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork scanExpandNodeList
FINEST: 10.69.128.151:6200 resolves to 10.69.128.151:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager connect
FINEST: Trying node : 10.69.128.151:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node register
FINEST: Network oracle.ons.NotificationNetwork@b404f35 is registering at node {address: 10.69.128.151:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.Node connect
FINE: Creating connection to node 10.69.128.151:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork scanExpandNodeList
FINEST: 10.69.128.152:6200 resolves to 10.69.128.152:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager connect
FINEST: Trying node : 10.69.128.152:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node register
FINEST: Network oracle.ons.NotificationNetwork@b404f35 is registering at node {address: 10.69.128.152:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager onNodeUp
FINE: ONS node up : {address: 10.69.128.153:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINE: Status message :
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager onNodeUp
FINE: ONS node up : {address: 10.69.128.151:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork demand
FINE: ONS network up request : 0
Jan 18, 2020 4:07:13 PM oracle.ons.Subscriber register
FINE: ONS Registering ONSSubscription : { Subscription : " "; Id : 1 }
Jan 18, 2020 4:07:13 PM oracle.ons.Node$ServerSubscriptionProxy
FINEST: creating proxy: fakeId=1
Jan 18, 2020 4:07:13 PM oracle.ons.Node$ServerSubscriptionProxy
FINEST: creating proxy: fakeId=2
Jan 18, 2020 4:07:13 PM oracle.ons.Node$ServerSubscriptionProxy
FINEST: creating proxy: fakeId=3
Jan 18, 2020 4:07:13 PM oracle.ons.Node addSubscriber
FINEST: new proxy: fakeId=2
Jan 18, 2020 4:07:13 PM oracle.ons.Node addSubscriber
FINEST: new proxy: fakeId=3
Jan 18, 2020 4:07:13 PM oracle.ons.Node addSubscriber
FINEST: new proxy: fakeId=1
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.153:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;SubscriberID:2;{})
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINE: Status message :
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.152:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.151:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;SubscriberID:1;{})
Jan 18, 2020 4:07:13 PM oracle.ons.Subscriber
Jan 18, 2020 4:07:13 PM oracle.ons.Node connect
FINE: Creating connection to node 10.69.128.152:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork scanExpandNodeList
FINEST: 10.69.128.153:6200 resolves to 10.69.128.153:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager connect
FINEST: Trying node : 10.69.128.153:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node register
FINEST: Network oracle.ons.NotificationNetwork@b404f35 is registering at node {address: 10.69.128.153:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.Node connect
FINE: Creating connection to node 10.69.128.153:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.153:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.152:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.151:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;instanceName:dbInstance_oc3n11.example.com_6200;{hostName=oc3n11.example.com})
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;instanceName:dbInstance_oc3n12.example.com_6200;{hostName=oc3n12.example.com})
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINE: Status message :
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;instanceName:dbInstance_oc3n16.example.com_6200;{hostName=oc3n16.example.com})
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager onNodeUp
FINE: ONS node up : {address: 10.69.128.152:6200}

An interesting thing to note is that the same message is received from all the ONS server nodes:

FINER: Notification message on node 10.69.128.151:6200 of type event
(event origin:00000000.00000000.00000000.0a458085!6200;stamp:000000005e216052.000010d4;Version:5;eventType:database/event/servicemetrics/fan_test.example.com;generatingComponent:database/rac/service;eventId:10767@oc3n13.example.com:0:4757:1579360646;Content-Length:200;SubscriberID:2;{})
Jan 18, 2020 4:17:27 PM oracle.ons.Node processMessage
FINEST: Message : (event origin:00000000.00000000.00000000.0a458085!6200;stamp:000000005e216052.000010d4;Version:5;eventType:database/event/servicemetrics/fan_test.example.com;generatingComponent:database/rac/service;eventId:10767@oc3n13.example.com:0:4757:1579360646;Content-Length:200;SubscriberID:3;{})
Jan 18, 2020 4:17:27 PM oracle.ons.Node processMessage
FINEST: Message : (event origin:00000000.00000000.00000000.0a458085!6200;stamp:000000005e216052.000010d4;Version:5;eventType:database/event/servicemetrics/fan_test.example.com;generatingComponent:database/rac/service;eventId:10767@oc3n13.example.com:0:4757:1579360646;Content-Length:200;SubscriberID:1;{})


Oracle Fast Application Notification: Introduction

We used to give, together with Ilmar Kerm, a presentation about how database failover looks from the application perspective. I planned to write a paper on the subject, but of course not every plan is meant to succeed. So at this point it’s probably better to cut the losses and publish a bunch of blog posts instead. This is the introduction.

***

When people think of loss of availability, the first thing that usually comes to mind is some cosmic event wiping out the data center with its surrounding areas. Or a biblical flood that washes away everything powered by electricity. Or someone’s evil fat finger that deletes important data. But in modern environments the main source of chaos tends to be planned events.

For example, in heavily regulated industries the main source of grief can be mandatory (security) patching and software upgrades. Oracle still can’t do major version upgrades online, and probably never will. And of course there’s “legacy” — ancient errors in infrastructure that have to be rebuilt.

None of the availability technologies Oracle provides are exactly new; both RAC and DataGuard had their beginnings back in the ’90s (whatever Oracle marketing believes these days). As the party line goes, once you’ve implemented Oracle MAA, you’re highly available. Well, not exactly…

For the application connecting to the database, things are not as simple as they might be. The application still connects to a specific database instance, whether it’s one instance in a RAC cluster or a DataGuard installation. If this database instance is shut down, or a role transition happens, connections to the affected instance will break, and applications can start behaving in strange and wonderful ways. The problems are magnified by connection pools, which can keep around a number of available connections. How long cleaning up the dead network connections takes depends on the TCP retransmission timeout on your host. The impact of waiting on dead network connections depends on the application, but in most cases waiting 127s just to clear out dead connections is way too much.


Transparent Application Failover

Oracle’s first attempt to help applications out was TAF. It is a feature of the Oracle OCI driver (OCI meaning Oracle Call Interface, not that cloud thingie). It can detect dead connections, and it silently restarts SELECT statements if needed. Transactions are, of course, lost. Another limitation of TAF is that the JDBC thin driver does not support it. But instead of adding TAF support to the thin driver, Oracle came out with something new.


Fast Application Notification

FAN is a messaging mechanism that can notify subscribers of state changes in the database layer. It provides visibility into the Oracle database cluster: if the state of any of the services, instances or nodes changes, a status message is published. By now all major Oracle client drivers support it as a client, there’s a FAN API, and there are server-side callouts. On the database side RAC, DataGuard and GDS support it.

Direct knowledge of what happens in the database layer can be really helpful for the application. Most obviously, those dreaded network timeouts can be avoided if the application knows which node is down. The same goes for services and service members. And if the application knows which database nodes are overloaded, new connections can be sent to the nodes with less load.


Oracle Notification Service

In 11.2 FAN used Advanced Queuing as the message transport mechanism, but starting with 12c everything is standardized on ONS. It operates in publish-subscribe mode: clients subscribe to the specific event type(s).

This FAN message was generated in an Oracle 12.1 cluster and is slightly edited for better readability:

(event origin:00000000.00000000.00000000.0a45800e!6200;
stamp:5b9b8241.000f6b2d;
Version:1.0;
eventType:database/event/service;
affectedComponents:;
affectedNodes:;
generatingComponent:database/rac/service;
generatingProcess:7239@oc1n13.example.com;
generatingNode:oc1n13.example.com;
eventId:28468@oc1n14.example.com:0:1010440:1541760309;
creationTime:1541760309;
clusterId:dbInstance_oc1n13.example.com_6200;
clusterName:dbInstance_oc1n13.example.com_6200;
instanceId:dbInstance_oc1n13.example.com_6200;
instanceName:dbInstance_oc1n13.example.com_6200;
LocalOnly:false;
numberOfProperties:11;
Content-Length:210;
SubscriberID:1;
{reason=USER, database=ptecdb2, instance=ptecdb24, event_type=SERVICEMEMBER, timezone=+01:00, service=fan_test.example.com, db_domain=pte.example.com, host=oc1n12, timestamp=2018-11-09 11:45:09, status=down})

Every FAN message has a header and a body. The header contains, among other cool things, the cluster id and name; this is rather handy if the application is receiving messages from multiple clusters through GDS.

The event type in the message body is either service/service member, node, instance or database. Additionally, FAN messages can contain service metrics. The reason is one of

  • USER,
  • FAILURE,
  • member_leave,
  • public_nw_down, or
  • BOOT.

USER denotes some action from a database management tool: it does not matter whether the user executing the tool is a carbon-based entity or OEM. When a cluster node boots up and the services which were available before the restart become operational again, the generated FAN messages will have reason code BOOT.

The status in the message body is either

  • up,
  • down,
  • nodedown or
  • not_restarting.

If a resource has failed to restart three times during the past 30 minutes, the resulting FAN message will have status code not_restarting.


Fast Connection Failover

Sending messages is really cool, but something has to receive them and take action as well. FCF is a preconfigured client-side integration which is transparent to the application. An FCF client automatically subscribes to the specific events, and database sessions are matched with events using connection signatures. It acts based on event type, status and reason.

status=down, reason=FAILURE — Something failed in the database server; FCF aborts the affected connections immediately and dead connections are removed from the pooled clients.

status=down, reason=PLANNED — Denotes planned maintenance; FCF retires the affected sessions as they become unused.

status=up — A service or service member has resumed; new connections are sent to the specific service or to the node where the service member became available.

Load % — Enables connection-time load balancing and QoS features.

By now all major Oracle client drivers (JDBC, UCP, OCI, ODP.NET) support FCF, and starting with 12.2, FCF can be enabled by the client driver automatically when appropriate.
