Mixing TIMESTAMPs with TIMESTAMPs WITH TIME ZONE considered harmful

TIMESTAMP and TIMESTAMP WITH TIME ZONE are different data types: they have different internal representations, and their behavior differs in subtle ways. Yet somehow Oracle still feels like providing implicit data type conversion between the two. So let’s check how that implicit conversion works in real life.

First some test data:

PRIITP@testpdb > CREATE TABLE t (
    id           INTEGER PRIMARY KEY,
    created      TIMESTAMP NOT NULL,
    created_tz   TIMESTAMP WITH TIME ZONE NOT NULL
);
Table created.

Elapsed: 00:00:00.03
PRIITP@testpdb > insert into t values (1, systimestamp, systimestamp);

1 row created.

Elapsed: 00:00:00.01
PRIITP@testpdb > select * from t;

        ID CREATED                                                                     CREATED_TZ
---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
         1 21-SEP-20 11.27.35.281367 AM                                                21-SEP-20 11.27.35.281367 AM +00:00

Now for some comparison between a TIMESTAMP column and a TIMESTAMP WITH TIME ZONE literal:

PRIITP@testpdb > select * from t where created > TIMESTAMP '2020-09-21 10:27:35 +03:00';

        ID CREATED                                                                     CREATED_TZ
---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
         1 21-SEP-20 11.27.35.281367 AM                                                21-SEP-20 11.27.35.281367 AM +00:00

Still fabulous. But there’s a way to bring in some fun: let’s adjust the session time zone.

PRIITP@testpdb > alter session set time_zone = '06:00';

Session altered.

Elapsed: 00:00:00.01
PRIITP@testpdb > select * from t where created > TIMESTAMP '2020-09-21 10:27:35 +03:00';

no rows selected

Elapsed: 00:00:00.00

Oops. Somehow our data disappeared. A hint about where this mysterious behavior comes from is in the execution plan, specifically in the ‘filter’ section:

Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    41 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T    |     1 |    41 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(SYS_EXTRACT_UTC(INTERNAL_FUNCTION("CREATED"))>TIMESTAMP'
              2020-09-21 07:27:35.000000000')

First, Oracle converts the TIMESTAMP WITH TIME ZONE literal into a TIMESTAMP literal in UTC. Which is all well and good, since internally operations across different time zones are done in UTC only, so the conversion into UTC is expected.

Much more interesting is this part: SYS_EXTRACT_UTC(INTERNAL_FUNCTION("CREATED")). There is no way to know which time zone those TIMESTAMPs are in, so Oracle takes a leap of faith and assumes the session time zone. This is exactly how SYS_EXTRACT_UTC operates on TIMESTAMPs: since time zone information isn’t available, it assumes the value is in the session time zone. Unfortunately this does not explain the nested INTERNAL_FUNCTION; it looks like a data type conversion, but one isn’t actually needed.

While doing the data type conversion Oracle probably ends up with a construct like sys_extract_utc(from_tz(created, '06:00')), and with an expression like this our query, of course, does not return anything:

PRIITP@testpdb > select created, sys_extract_utc(from_tz(created, '06:00')) from t;

CREATED                                                                     SYS_EXTRACT_UTC(FROM_TZ(CREATED,'06:00'))
--------------------------------------------------------------------------- ---------------------------------------------------------------------------
21-SEP-20 11.27.35.281367 AM                                                21-SEP-20 05.27.35.281367 AM

To me this looks like a violation of the principle of least astonishment (POLA). Since lossless data type conversion isn’t always possible between TIMESTAMPs and TIMESTAMPs WITH TIME ZONE, Oracle should throw an error instead of assuming things and making queries nondeterministic. And speaking of assumptions, TIMESTAMPs in UTC make much more sense than the session time zone, at least to me.
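Until that happens, the workaround is to make the time zone explicit yourself. A minimal sketch, assuming you know (as in our test data) that the CREATED column holds UTC timestamps:

select *
from t
where from_tz(created, 'UTC') > TIMESTAMP '2020-09-21 10:27:35 +03:00';

Both sides of the comparison are now TIMESTAMP WITH TIME ZONE, so the session time zone never enters the picture and the result is deterministic.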


Measuring latency in Python: implementation details

Getting correct time and measuring things in the sub-millisecond range is a hairy and complicated topic. Most of the time it boils down to low-level details in the hardware, but sometimes the tools you have to use add their own weird twist. Let’s consider Python 3 as an example. According to the documentation things couldn’t be simpler. What we have is the time module and five functions to return the desired reading:

  • time.monotonic() — Return the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards. The clock is not affected by system clock updates. (Well, not exactly, see later.)
  • time.perf_counter() — Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration.
  • time.process_time() — Return the value (in fractional seconds) of the sum of the system and user CPU time of the current process.
  • time.thread_time()  — Return the value (in fractional seconds) of the sum of the system and user CPU time of the current thread. 
  • time.time() — Return the time in seconds since the epoch as a floating point number. 

Now let’s check what the underlying implementation is across the platforms. First on FreeBSD 12.1:

priitp@roadkill2:~ % python3.7
Python 3.7.8 (default, Jul 19 2020, 23:46:27)
[Clang 8.0.1 (tags/RELEASE_801/final 366581)] on freebsd12
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> [print("{}: {}".format(ci, time.get_clock_info(ci))) for ci in ('monotonic', 'perf_counter', 'process_time', 'thread_time', 'time')]
monotonic: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=2.8e-07)
perf_counter: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=2.8e-07)
process_time: namespace(adjustable=False, implementation='clock_gettime(CLOCK_PROF)', monotonic=True, resolution=0.01)
thread_time: namespace(adjustable=False, implementation='clock_gettime(CLOCK_THREAD_CPUTIME_ID)', monotonic=True, resolution=1.0000000000000002e-06)
time: namespace(adjustable=True, implementation='clock_gettime(CLOCK_REALTIME)', monotonic=False, resolution=2.8e-07)

On Linux I have a slightly older Python, so no thread_time here:

[pripii@priitp-roadkill ]$ python3
Python 3.6.8 (default, Apr  2 2020, 13:34:55)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> [print("{}: {}".format(ci, time.get_clock_info(ci))) for ci in ('monotonic', 'perf_counter', 'process_time', 'time')]
monotonic: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
perf_counter: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
process_time: namespace(adjustable=False, implementation='clock_gettime(CLOCK_PROCESS_CPUTIME_ID)', monotonic=True, resolution=1e-09)
time: namespace(adjustable=True, implementation='clock_gettime(CLOCK_REALTIME)', monotonic=False, resolution=1e-09)

And since the code I’m involved with can theoretically run on Windows as well, let’s include it for the sake of lulz:

Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> [print("{}: {}".format(ci, time.get_clock_info(ci))) for ci in ('monotonic', 'perf_counter', 'process_time', 'time')]
monotonic: namespace(adjustable=False, implementation='GetTickCount64()', monotonic=True, resolution=0.015625)
perf_counter: namespace(adjustable=False, implementation='QueryPerformanceCounter()', monotonic=True, resolution=1e-07)
process_time: namespace(adjustable=False, implementation='GetProcessTimes()', monotonic=True, resolution=1e-07)
time: namespace(adjustable=True, implementation='GetSystemTimeAsFileTime()', monotonic=False, resolution=0.015625)

On both FreeBSD and Linux, time.monotonic() and time.perf_counter() use clock_gettime(CLOCK_MONOTONIC), with a resolution of 1 ns on Linux and 280 ns on FreeBSD. But on Windows time.monotonic() gives you a resolution of a whopping 15.6 ms. So when measuring latencies in microseconds, time.perf_counter() is the way to go, with the caveat that on Windows its resolution is 100 ns.

While CLOCK_MONOTONIC doesn’t suffer from jumps backward in time, it can be incrementally adjusted by adjtime(3) and NTP. And while it is monotonic, consecutive calls can return the same value, and it does not count time while the system was suspended. The value returned by CLOCK_MONOTONIC counts seconds either from system boot (Linux) or from the epoch (FreeBSD).

CLOCK_REALTIME represents seconds since the epoch and has all the problems a broken timer can have: it can be adjusted by a privileged user, or by NTP/PTP/whatnot, its readings can jump backward and forward in time, and it has limited accuracy. So unless you’re interested in slightly wrong wall clock time, time.time() isn’t that interesting.

Does precision matter?

Since Python isn’t exactly the fastest language humanity has invented, why such panic about the milli- and nanoseconds? Let’s check how long calls to the various time.* functions take. The following is from a Linux system with a 1900 MHz CPU and kvm-clock as the system clock source.

#!/usr/bin/python3

from hdrh.histogram import HdrHistogram
import time

loops = 1000

def measure(timefunc):
    # Track call times from 1 ns up to 10 s with 4 significant digits.
    # (The original upper bound of 10000 ns is too low: record_value()
    # drops out-of-range values, so the slowest calls would never show
    # up in the percentiles below.)
    h = HdrHistogram(1, 10**10, 4)
    for _ in range(loops):
        start = time.perf_counter()
        timefunc()
        h.record_value(int(1000000000 * (time.perf_counter() - start)))
    return h

if __name__ == '__main__':
    for func in (time.monotonic, time.perf_counter, time.process_time, time.time):
        print("Histogram for {}".format(func))
        hist = measure(func)
        for ct in (50, 98, 99, 99.9, 99.99):
            print("\t{}th = {}ns".format(ct, hist.get_value_at_percentile(ct)))

And the winner is…

Histogram for <built-in function monotonic>
        50th = 324ns
        98th = 356ns
        99th = 413ns
        99.9th = 5288ns
        99.99th = 7132ns
Histogram for <built-in function perf_counter>
        50th = 350ns
        98th = 389ns
        99th = 396ns
        99.9th = 1569ns
        99.99th = 14392ns
Histogram for <built-in function process_time>
        50th = 1289ns
        98th = 1599ns
        99th = 2730ns
        99.9th = 10459ns
        99.99th = 21132ns
Histogram for <built-in function time>
        50th = 315ns
        98th = 346ns
        99th = 354ns
        99.9th = 570ns
        99.99th = 1700ns

Median call times are around 300 ns, except for time.process_time(), and the high percentiles reach into microseconds. So for events taking hundreds of microseconds, calling time.perf_counter()/time.monotonic() is just fine.
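One more implementation detail: since Python 3.7 each of these functions has a *_ns sibling (time.perf_counter_ns() and friends) that returns integer nanoseconds, sidestepping float rounding on long uptimes. A minimal sketch:

import time

start = time.perf_counter_ns()
time.sleep(0.001)            # stand-in for the event being measured
elapsed_ns = time.perf_counter_ns() - start
print("elapsed: {}ns".format(elapsed_ns))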

See also

For the general background: Measuring Latency in Linux.


How to add Oracle specific SQL extensions with Liquibase

Now that sqlcl has acquired support for Liquibase, it might feel unnecessary to use vanilla LB with Oracle. But bundling sqlcl with applications takes extra effort, and sometimes you’re stuck with vanilla Liquibase for unrelated reasons. What’s more, adding SQL extensions like global temporary tables, index-organized tables and partitioning is actually quite straightforward with LB.

Let’s start with a simple CREATE TABLE statement:

CREATE TABLE partitioning_test (
    id           INTEGER
        PRIMARY KEY USING INDEX local,
    username     VARCHAR2(128) NOT NULL,
    created      TIMESTAMP WITH TIME ZONE NOT NULL,
    created_by   VARCHAR2(32) NOT NULL,
    updated      TIMESTAMP WITH TIME ZONE,
    updated_by   VARCHAR2(32)
)
    PARTITION BY HASH ( id ) PARTITIONS 16;

To turn this statement into a Liquibase changeSet I will:

  • Create a barebones changeSet without any funny extensions.
  • Check with the updateSQL command what kind of statement Liquibase generates from that changeSet (see the example after this list).
  • Modify the generated statement with the modifySql sub-tag to my exact liking.
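For the second step, the invocation might look like this (a sketch assuming Liquibase 3.x flag spelling, a changelog named changelog.xml, and connection details already present in liquibase.properties):

$ liquibase --changeLogFile=changelog.xml updateSQL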

Our initial changeSet looks like this:

<changeSet author="priitp" id="part-test">
    <createTable tableName="PARTITIONING_TEST">
        <column name="ID" type="INTEGER">
            <constraints primaryKey="true"/>
        </column>
        <column name="USERNAME" type="VARCHAR2(128)">
            <constraints nullable="false"/>
        </column>
        <column name="CREATED" type="TIMESTAMP WITH TIME ZONE">
            <constraints nullable="false"/>
        </column>
        <column name="CREATED_BY" type="VARCHAR2(32)">
            <constraints nullable="false"/>
        </column>
        <column name="UPDATED" type="TIMESTAMP WITH TIME ZONE"/>
        <column name="UPDATED_BY" type="VARCHAR2(32)"/>
    </createTable>
</changeSet>

From this, Liquibase generates the following SQL statement (formatted for better readability):

CREATE TABLE priitp.partitioning_test (
    id           INTEGER NOT NULL,
    username     VARCHAR2(128) NOT NULL,
    created      TIMESTAMP WITH TIME ZONE NOT NULL,
    created_by   VARCHAR2(32) NOT NULL,
    updated      TIMESTAMP WITH TIME ZONE,
    updated_by   VARCHAR2(32),
    CONSTRAINT pk_partitioning_test PRIMARY KEY ( id )
);

Partitioning the table is really simple: just append the partitioning clause to the end of the statement:

<modifySql dbms="oracle">
    <append value=" PARTITION BY HASH (id) PARTITIONS 16 "/>
</modifySql>

Changing the primary key clause is trickier; we have to actually modify the generated SQL:

<replace replace="PRIMARY KEY (ID)" with="PRIMARY KEY (ID) USING INDEX LOCAL"/>

With those additions the final changeSet looks like this:

<changeSet author="priitp" id="part-test">
    <createTable tableName="PARTITIONING_TEST">
        <column name="ID" type="INTEGER">
            <constraints primaryKey="true"/>
        </column>
        <column name="USERNAME" type="VARCHAR2(128)">
            <constraints nullable="false"/>
        </column>
        <column name="CREATED" type="TIMESTAMP WITH TIME ZONE">
            <constraints nullable="false"/>
        </column>
        <column name="CREATED_BY" type="VARCHAR2(32)">
            <constraints nullable="false"/>
        </column>
        <column name="UPDATED" type="TIMESTAMP WITH TIME ZONE"/>
        <column name="UPDATED_BY" type="VARCHAR2(32)"/>
    </createTable>
    <modifySql dbms="oracle">
        <append value=" PARTITION BY HASH (id) PARTITIONS 16 "/>
        <replace replace="PRIMARY KEY (ID)" with="PRIMARY KEY (ID) USING INDEX LOCAL"/>
    </modifySql>
</changeSet>

And the generated SQL statement (once again, formatted for readability) contains all the things we want:

CREATE TABLE partitioning_test (
    id           INTEGER NOT NULL,
    username     VARCHAR2(128) NOT NULL,
    created      TIMESTAMP WITH TIME ZONE NOT NULL,
    created_by   VARCHAR2(32) NOT NULL,
    updated      TIMESTAMP WITH TIME ZONE,
    updated_by   VARCHAR2(32),
    CONSTRAINT pk_partitioning_test PRIMARY KEY ( id )
        USING INDEX local
)
    PARTITION BY HASH ( id ) PARTITIONS 16;

Caveats and Features

In addition to appending to and modifying the statement, the modifySql sub-tag supports prepending, and replacing with regular expressions.

modifySql operates on all SQL statements generated by the changeSet, including statements coming from sqlFile. For example, if I add a createSequence tag to the changeSet, the result is something funny:

CREATE SEQUENCE PART_TEST_SEQ
PARTITION BY HASH (id) PARTITIONS 16 ;
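One way around this is to keep such statements in their own changeSet, since modifySql is scoped to the changeSet it appears in. A sketch, with an illustrative changeSet id:

<changeSet author="priitp" id="part-test-seq">
    <createSequence sequenceName="PART_TEST_SEQ"/>
</changeSet>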


Fast Connection Failover in Action

Oracle Fast Application Notification makes it possible for a database client to be aware of what is happening in the database cluster. The main use case for FAN is Fast Connection Failover, which can help a database client a great deal when it comes to reacting to problems in the database layer. What follows is a short explanation of how FCF works, using Oracle Universal Connection Pool as an example. Some of the discussion should be relevant to other Oracle clients as well, since FCF works in a similar way whether the client is thick, thin, or ODP.NET.

Connection Pooling 101

Oracle database is optimized to work with long-running sessions; creating a new connection is expensive in terms of latency and resources. On the other hand, a web service operating in a request-response model can be expected to respond within single-digit milliseconds. Creating a new session for each and every request is time-consuming and not especially smart.

The solution is to create a pool of database connections in the application server. Initially the connection pool creates a bunch of available connections, which application code can borrow. When the application code is done with the database, it returns the borrowed connection to the pool, where it is marked as available again.

Another nice thing about a properly configured connection pool is that it acts as a circuit breaker between the application and the database. The maximum number of connections the pool can create effectively restricts the number of sessions in the database, which helps to limit the damage in case of a sudden and unexpected rush hour.
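In UCP terms, the borrow/return cycle and the circuit breaker look roughly like this (a sketch; the URL, credentials and pool sizes are placeholders):

import java.sql.Connection;
import java.sql.SQLException;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class PoolDemo {
    public static void main(String[] args) throws SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//db.example.com:1521/some_service");
        pds.setUser("scott");
        pds.setPassword("tiger");
        pds.setInitialPoolSize(5);
        pds.setMaxPoolSize(20);   // the circuit breaker: at most 20 sessions in the database

        try (Connection conn = pds.getConnection()) {   // borrow a connection from the pool
            // ... do the database work ...
        }   // close() marks the connection as available again instead of tearing it down
    }
}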

Unfortunately these nice properties introduce side effects. From a high availability standpoint the connection pool creates two big problems. First, it is impossible to tell without testing whether a connection to the database has died, and in case of networking troubles testing those connections can take ages. Second, there is no way to load balance the connections: the connection pool does not know which nodes in the cluster are overloaded and which could provide better response times for the application.

UCP: handling the outages

When UCP receives a message from ONS, it inspects the connections available in the pool and takes action if needed. The whole process is rather simple:

  1. UCP receives an FCF message from ONS.
  2. UCP inspects the connections in the pool, and if a connection’s service is no longer available in the cluster, UCP kills the physical connection.

The main difference between planned and unplanned outages is how borrowed connections are handled. In case of unplanned downtime UCP kills all the affected connections, including connections borrowed by the application code. If the application tries to use a reaped connection, it receives a java.sql.SQLException.

In case of planned downtime UCP acts only on available connections. Borrowed connections stay intact; they are closed when the application returns them to the pool. To avoid logon storms, a draining period can be set with the oracle.ucp.PlannedDrainingPeriod parameter.

UCP: load balancing and connection affinity

With FAN, the load in the database cluster becomes observable to the client. This, in turn, enables connection-time load balancing on the client side. Work can be distributed either by throughput or by response time. Load balancing messages are configured on the server side, on the service involved (the whole thing uses the RAC Load Balancing Advisory in the background). If UCP receives the relevant messages, load balancing is done automatically.

As mentioned, load balancing happens when a new connection is created, which means applications with a steady number of connections do not benefit from it. To avoid this issue one can set timeouts on the connections:

  • InactiveConnectionTimeout — how long an available connection can sit idle before it is removed from the connection pool.
  • MaxConnectionReuseTime — how long a connection can be used before being gracefully removed from the pool.

Unfortunately there are applications which borrow connections and never return them to the pool. For these greedy applications there are two more timeouts that operate on borrowed connections (see the sketch after the list):

  • AbandonConnectionTimeout — a connection is considered abandoned when the borrowed connection hasn’t made any database calls within the set timeout. When the timeout expires, dangling local transactions are either cancelled or rolled back, and the connection is closed.
  • Time-to-Live — sets an upper limit on how long a connection can be borrowed by the application. When the timeout expires, the borrowed connection is forcibly closed.
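As a rough sketch, these timeouts might be set on a UCP pool like this (the values, in seconds, are purely illustrative):

// pds is the PoolDataSource from the earlier sketch
pds.setInactiveConnectionTimeout(300);    // remove available connections idle for 5 minutes
pds.setMaxConnectionReuseTime(1800);      // gracefully retire connections after 30 minutes
pds.setAbandonedConnectionTimeout(60);    // reclaim borrowed connections idle for 60 seconds
pds.setTimeToLiveConnectionTimeout(600);  // hard cap: a connection may stay borrowed for 10 minutes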

Setting those two timeouts can cause SQLExceptions to be thrown in the application when application code tries to use a borrowed but already aborted connection. So, beware!

NB! There’s quite a lot more to optimizing UCP behavior; this was just a quick exposé of the parameters relevant to FCF.

And where is the cool demo?

The test client we used for testing and debugging Fast Connection Failover unfortunately contains proprietary bits, and making it shareable probably isn’t worth the pain. Another problem is that, when it comes to FAN events and FCF, UCP is a black box and does not log anything, even at DEBUG level. All you can divine of the actions taken by UCP comes from your application’s behavior. In that sense, testing your application under failure conditions is essential.

The most you can coax out of UCP is a few lines if FCF is enabled:

2020-05-19 13:32:09,295 DEBUG [main] o.ucp.jdbc oracle.ucp.jdbc.PoolDataSourceImpl:createPool:oracle.ucp.jdbc.PoolDataSourceImpl@7098b907:Connection pool instance is created
2020-05-19 13:32:10,514 DEBUG [main] oracle.ucp :::Heuristically determine whether to enable FAN
2020-05-19 13:32:10,515 DEBUG [main] oracle.ucp :::RAC/GDS 12.x, FAN is heuristically enabled

Why UCP talks here about RAC/GDS 12.x is another mystery, since JDBC, UCP and the database cluster are all version 19.3.


Oracle Notification Service

Fast Application Notification uses ONS as the transport mechanism. ONS receives a subset of the events propagated in Clusterware by evmd, plus the RLB events generated in the database by MMON. It runs as a part of nodeapps; there is one ons daemon per cluster node. This is the ONS topology of a 5-node cluster:

[grid@oc3n13.example.com ~]$ onsctl debug comp=ons[topology]
HTTP/1.1 200 OK
Connection: close
Content-Type: text/html
Response:

== oc3n13.example.com:6200 19597 (56) 20/01/12 17:01:36 ==
Build: ONS_19.0.0.0.0_LINUX.X64_190725 2019/7/25 13:0:12 UTC
Home: /u01/app/grid/product/19c

======== ONS ========

           IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
                          10.69.128.133  6200 5e16ce70 00000009 00000008

Connection Topology: (6)

                IP                      PORT   VERS  TIME
--------------------------------------- ----- ----- --------
                          10.69.128.136  6200     5 5e16ce74
                           **                           10.69.128.132 6200
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.135 6200
                          10.69.128.135  6200     5 5e16ce71
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.132 6200
                           **                           10.69.128.136 6200
                          10.69.128.134  6200     5 5e16ce70
                           **                           10.69.128.133 6200
                           **                           10.69.128.132 6200
                           **                           10.69.128.135 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.136 6200
                          10.69.128.133  6200     5 5e16ce70
                           **                           10.69.128.132 6200
                           **                           10.69.128.135 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.136 6200
                          10.69.128.132  6200     5 5e16ce72
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.135 6200
                           **                           10.69.128.131 6200
                           **                           10.69.128.136 6200
                          10.69.128.131  6200     5 5e16ce73
                           **                           10.69.128.135 6200
                           **                           10.69.128.132 6200
                           **                           10.69.128.133 6200
                           **                           10.69.128.134 6200
                           **                           10.69.128.136 6200

Configured Nodes: (5)

                ADDRESS                    PORT   FLAGS
---------------------------------------- -------- -----
                                  oc3n11     6200 00000
                                  oc3n12     6200 00000
                                  oc3n14     6200 00000
                                  oc3n15     6200 00000
                                  oc3n16     6200 00000

ONS forwards database status change events and RLB events to subscribers, primarily to FAN clients. In the following listing, the client with id 1fd is my test client, which subscribes to events matching “database/events”:

[grid@oc3n13.example.com ~]$ onsctl debug comp=ons[client,subscribe]
HTTP/1.1 200 OK
Connection: close
Content-Type: text/html
Response:
 
== oc3n13.example.com:6200 19597 (56) 20/01/09 15:18:08 ==
Build: ONS_19.0.0.0.0_LINUX.X64_190725 2019/7/25 13:0:12 UTC
Home: /u01/app/grid/product/19c
 
======== ONS ========
 
           IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
                          10.69.128.133  6200 5e16ce70 00000009 00000008
 
Client connections: (6)
 
   ID            CONNECTION ADDRESS              PORT  FLAGS  SNDQ REF PHA SUB
-------- --------------------------------------- ----- ------ ---- --- --- ---
       0                                internal     0 00044a    0   1  IO   1
       2                                     ::1 19308 40041a    0   1  IO   1
       3                                     ::1 19310 40041a    0   1  IO   1
       4                                     ::1 19312 40041a    0   1  IO   0
     1fd                     ::ffff:10.100.68.84 57068 48042a    0   1  IO   1
 request                                     ::1 28244 400e1a    0   1  IO   0
 
Subscriptions:
 
  Count 3 Evaluators 0 Updaters 0
 
    Ref 1: "eventType=ONSupdate"
    Ref 2: "database/event/host"
    Ref 1: "database/event/"

Client side configuration

In the good old days the ONS configuration used to be hardcoded on the client side, but since Oracle 12.1.0.2.0 FAN is enabled automatically when appropriate and the ONS configuration happens behind the scenes. Still, sometimes it is good to know where the client is connecting to, especially with large clusters. With JDBC and UCP it is possible to fetch the ONS configuration from the database connection. The following is a heavily abridged example:

import java.sql.SQLException;
import java.util.Properties;
import javax.sql.DataSource;
import oracle.jdbc.internal.OracleConnection;

// Snip...
DataSource ds;
// Some magic that creates DataSource using UCP

try {
    OracleConnection conn = (OracleConnection)ds.getConnection();
    Properties props = conn.getServerSessionInfo();
    String conf = props.getProperty("AUTH_ONS_CONFIG");
    logger.info("AUTH_ONS_CONFIG = " + conf);
}
catch (SQLException e) {
    // Handle the exception
}


Some FAN clients, UCP for example, automatically register with three ONS servers, even if there are more servers available in the cluster.
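If the automatic configuration is not an option, for example with an older driver, UCP also accepts an explicit ONS node list. A sketch with made-up host names:

// pds is the PoolDataSource from the UCP examples above; FCF is enabled explicitly here
pds.setFastConnectionFailoverEnabled(true);
pds.setONSConfiguration("nodes=oc3n11.example.com:6200,oc3n12.example.com:6200,oc3n13.example.com:6200");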

Connecting to ONS

Sometimes it’s good to connect to ONS directly, either for monitoring or to debug Fast Application Notification/Fast Connection Failover. To that end there is a public API, but I haven’t had much success with it. Fortunately there’s fanWatcher, which uses an undocumented API from ons.jar.

Debugging ONS client

ons.jar checks the oracle.ons.debug system property; if it is set to true (case insensitive), debug logging is enabled and the output goes to stdout. The easiest way to set the parameter is probably from the command line, with the -Doracle.ons.debug=true switch. ons.jar uses java.util.logging (JUL), so in theory it should be possible to route the debug log into your Java logging framework of choice.

On initialization ons.jar creates the notification network and sends subscription messages to the ONS servers.

FINE: Creating notification network (implicit) : {ss:0;ff:;conf:[{max:3;a:true;n:["10.69.128.151:6200";]},{max:3;a:true;n:["10.69.128.152:6200";]},{max:3;a:true;n:["10.69.128.153:6200";]},]}
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork scanExpandNodeList
FINEST: 10.69.128.151:6200 resolves to 10.69.128.151:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager connect
FINEST: Trying node : 10.69.128.151:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node register
FINEST: Network oracle.ons.NotificationNetwork@b404f35 is registering at node {address: 10.69.128.151:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.Node connect
FINE: Creating connection to node 10.69.128.151:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork scanExpandNodeList
FINEST: 10.69.128.152:6200 resolves to 10.69.128.152:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager connect
FINEST: Trying node : 10.69.128.152:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node register
FINEST: Network oracle.ons.NotificationNetwork@b404f35 is registering at node {address: 10.69.128.152:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager onNodeUp
FINE: ONS node up : {address: 10.69.128.153:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINE: Status message :
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager onNodeUp
FINE: ONS node up : {address: 10.69.128.151:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork demand
FINE: ONS network up request : 0
Jan 18, 2020 4:07:13 PM oracle.ons.Subscriber register
FINE: ONS Registering ONSSubscription : { Subscription : " "; Id : 1 }
Jan 18, 2020 4:07:13 PM oracle.ons.Node$ServerSubscriptionProxy
FINEST: creating proxy: fakeId=1
Jan 18, 2020 4:07:13 PM oracle.ons.Node$ServerSubscriptionProxy
FINEST: creating proxy: fakeId=2
Jan 18, 2020 4:07:13 PM oracle.ons.Node$ServerSubscriptionProxy
FINEST: creating proxy: fakeId=3
Jan 18, 2020 4:07:13 PM oracle.ons.Node addSubscriber
FINEST: new proxy: fakeId=2
Jan 18, 2020 4:07:13 PM oracle.ons.Node addSubscriber
FINEST: new proxy: fakeId=3
Jan 18, 2020 4:07:13 PM oracle.ons.Node addSubscriber
FINEST: new proxy: fakeId=1
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.153:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;SubscriberID:2;{})
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINE: Status message :
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.152:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.151:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;SubscriberID:1;{})
Jan 18, 2020 4:07:13 PM oracle.ons.Subscriber
Jan 18, 2020 4:07:13 PM oracle.ons.Node connect
FINE: Creating connection to node 10.69.128.152:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationNetwork scanExpandNodeList
FINEST: 10.69.128.153:6200 resolves to 10.69.128.153:6200
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager connect
FINEST: Trying node : 10.69.128.153:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node register
FINEST: Network oracle.ons.NotificationNetwork@b404f35 is registering at node {address: 10.69.128.153:6200}
Jan 18, 2020 4:07:13 PM oracle.ons.Node connect
FINE: Creating connection to node 10.69.128.153:6200
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.153:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.152:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINER: Notification message on node 10.69.128.151:6200 of type status
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;instanceName:dbInstance_oc3n11.example.com_6200;{hostName=oc3n11.example.com})
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;instanceName:dbInstance_oc3n12.example.com_6200;{hostName=oc3n12.example.com})
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINE: Status message :
Jan 18, 2020 4:07:13 PM oracle.ons.Node processMessage
FINEST: Message : (status Result:success;Message:;Version:5;instanceName:dbInstance_oc3n16.example.com_6200;{hostName=oc3n16.example.com})
Jan 18, 2020 4:07:13 PM oracle.ons.NotificationManager onNodeUp
FINE: ONS node up : {address: 10.69.128.152:6200}

An interesting thing to note is that the same message is received from all the ONS server nodes:

FINER: Notification message on node 10.69.128.151:6200 of type event
(event origin:00000000.00000000.00000000.0a458085!6200;stamp:000000005e216052.000010d4;Version:5;eventType:database/event/servicemetrics/fan_test.example.com;generatingComponent:database/rac/service;eventId:10767@oc3n13.example.com:0:4757:1579360646;Content-Length:200;SubscriberID:2;{})
Jan 18, 2020 4:17:27 PM oracle.ons.Node processMessage
FINEST: Message : (event origin:00000000.00000000.00000000.0a458085!6200;stamp:000000005e216052.000010d4;Version:5;eventType:database/event/servicemetrics/fan_test.example.com;generatingComponent:database/rac/service;eventId:10767@oc3n13.example.com:0:4757:1579360646;Content-Length:200;SubscriberID:3;{})
Jan 18, 2020 4:17:27 PM oracle.ons.Node processMessage
FINEST: Message : (event origin:00000000.00000000.00000000.0a458085!6200;stamp:000000005e216052.000010d4;Version:5;eventType:database/event/servicemetrics/fan_test.example.com;generatingComponent:database/rac/service;eventId:10767@oc3n13.example.com:0:4757:1579360646;Content-Length:200;SubscriberID:1;{})


Oracle Fast Application Notification: Introduction

Ilmar Kerm and I used to give a presentation about how database failover looks from the application’s perspective. I planned to write a paper on the subject, but of course not every plan is meant to succeed. So at this point it’s probably better to cut the losses and publish a bunch of blog posts instead. This is the introduction.

***

When people think of loss of availability, the first thing that usually comes to mind is some cosmic event wiping out the data center along with the surrounding areas. Or a biblical flood that washes away everything powered by electricity. Or someone’s evil fat finger that deletes important data. But in modern environments the main sources of chaos tend to be planned events.

For example, in heavily regulated industries the main source of grief can be mandatory (security) patching and software upgrades. Oracle still can’t do major version upgrades online, and probably never will. And of course there’s “legacy”: ancient errors in infrastructure that have to be rebuilt.

None of the availability technologies Oracle provides is exactly new; both RAC and DataGuard had their beginnings back in the ’90s (whatever Oracle marketing believes these days). As the party line goes, once you’ve implemented Oracle MAA, you’re highly available. Well, not exactly…

For the application connecting to the database, things are not as simple as they might be. The application still connects to a specific database instance, whether it’s one instance in a RAC cluster or a DataGuard installation. If this database instance is shut down, or a role transition happens, connections to the affected instance break, and applications can start behaving in strange and wonderful ways. The problems are magnified by connection pools, which can keep a number of available connections around. How long cleaning up dead network connections takes depends on the TCP retransmission timeout on your host. What impact waiting on dead network connections has depends on the application, but in most cases waiting 127 seconds just to clear out dead connections is way too much.

Transparent Application Failover

Oracle’s first attempt to help applications out was TAF, a feature of the Oracle OCI driver (OCI meaning Oracle Call Interface, not that cloud thingie). It can detect dead connections, and it silently restarts SELECT statements if needed. Transactions are, of course, lost. Another limitation of TAF is that the JDBC thin driver does not support it. But instead of adding TAF support to the thin driver, Oracle came up with something new.

Fast Application Notification

FAN is a messaging process that notifies subscribers of state changes in the database layer. It provides visibility into the Oracle database cluster: if the state of any service, instance or node changes, a status message is published. By now all major Oracle client drivers support it as clients, there’s a FAN API, and there are server-side callouts. On the database side, RAC, DataGuard and GDS support it.
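For the record, a server-side callout is just an executable dropped into $GRID_HOME/racg/usrco/ on every cluster node; Clusterware runs it with the FAN payload passed as command-line arguments. A trivial sketch of a callout that logs every event (the log file location is illustrative):

#!/bin/sh
# Minimal FAN callout: append every incoming event to a log file.
echo "$(date) $*" >> /tmp/fan_events.log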

Direct knowledge of what happens in the database layer can be really helpful for the application. Most obviously, those dreaded network timeouts can be avoided if the application knows which node is down. The same goes for services and service members. And if the application knows which database nodes are overloaded, new connections can be sent to the nodes with less load.

Oracle Notification Service

In 11.2 FAN used Advanced Queuing as the message transport mechanism, but starting with 12c everything is standardized on ONS. It operates in publish-subscribe mode; clients subscribe to the specific event type(s).

The following FAN message was generated in an Oracle 12.1 cluster, slightly edited for better readability:

(event origin:00000000.00000000.00000000.0a45800e!6200;
stamp:5b9b8241.000f6b2d;
Version:1.0;
eventType:database/event/service;
affectedComponents:;
affectedNodes:;
generatingComponent:database/rac/service;
generatingProcess:7239@oc1n13.example.com;
generatingNode:oc1n13.example.com;
eventId:28468@oc1n14.example.com:0:1010440:1541760309;
creationTime:1541760309;
clusterId:dbInstance_oc1n13.example.com_6200;
clusterName:dbInstance_oc1n13.example.com_6200;
instanceId:dbInstance_oc1n13.example.com_6200;
instanceName:dbInstance_oc1n13.example.com_6200;
LocalOnly:false;
numberOfProperties:11;
Content-Length:210;
SubscriberID:1;
{reason=USER, database=ptecdb2, instance=ptecdb24, event_type=SERVICEMEMBER, timezone=+01:00, service=fan_test.example.com, db_domain=pte.example.com, host=oc1n12, timestamp=2018-11-09 11:45:09, status=down})

Every FAN message has a header and a body. Among other cool things the header contains the cluster ID and name, which is rather handy if an application receives messages from multiple clusters through GDS.

The event type in the message body is service/service member, node, instance or database. Additionally, FAN messages can contain service metrics. The reason is one of:

  • USER,
  • FAILURE,
  • member_leave,
  • public_nw_down, or
  • BOOT.

USER denotes some action from a database management tool: it does not matter whether the user executing the tool is a carbon-based entity or OEM. When a cluster node boots up and the services which were available before the restart become operational again, the generated FAN messages will have the reason code BOOT.

The status in the message body is one of:

  • up,
  • down,
  • nodedown or
  • not_restarting.

If a resource has failed to restart three times during the past 30 minutes, the resulting FAN message will have the status code not_restarting.

Fast Connection Failover

Sending messages is really cool, but something has to receive them and take action as well. FCF is a preconfigured client-side integration which is opaque to the application. The FCF client automatically subscribes to the specific events, and database sessions are matched with events using connection signatures. It acts based on event type, status and reason.

status=down, reason=FAILURE — Something failed in the database server; FCF aborts the affected connections immediately, and dead connections are removed from the pooled clients.

status=down, reason=PLANNED — Denotes planned maintenance, so FCF retires affected sessions as they become unused.

status=up — A service or service member has resumed, so new connections are sent to the specific service or node where the service member became available.

Load % — enables connection time load balancing and QoS features.

By now all major Oracle client drivers (JDBC, UCP, OCI, ODP.NET) support FCF, and starting with 12.2 FCF can be enabled by the client driver automatically when appropriate.
