Friday, July 24, 2015

Maximize insert throughput in Oracle RAC system (1 of 3)

Many systems use surrogate primary keys generated by a sequence generator. It's simple, easy, and guarantees unique values. Unfortunately, in a multithreaded environment it leads to issues with concurrent updates of shared memory structures. The problem is additionally magnified in a distributed environment that has to maintain cache coherency. This analysis presents the problem and aims to find solutions.
The basic nature of a database sequence is to deliver a monotonically ascending integer value. It's good for us humans, as it creates an impression of the right order of things. Typically we think in a single-threaded way: I'm going to put objects in order and stamp them with unique identifiers: 1, 2, 3, .... I will do it by taking one object after another, writing a number on it, and putting it in a storage slot. That's fine, but imagine that you try to speed up the work by asking 5 colleagues to help you. The single-threaded approach will not work. Each of you will start fighting for identifiers, and each of you will try to put an object into the same slot to keep things in the required order.

The same applies to computer systems, but somehow we typically overlook it. To speed up the work, we use parallel workers. Parallel workers, having some work to do, get a work identifier from shared logic generating subsequent numbers. Quite commonly there is one such generator per object type, e.g. invoices get ascending numbers. To be honest, it's the easiest way of generating identifiers - taking the next value of some integer guarantees uniqueness. On the other hand, massive generation of invoice identifiers will typically use some form of prefix, for example the customer segment.

It's hard to believe, but sometimes one sequence may be used for a whole system. I saw it once in my life. The developer asked about this weird situation told me that he had received a requirement specifying that each object in the system should be stamped with a unique identifier. Oh no! It cannot be true, I said to myself. This little, unusual situation opened my mind to a non-standard perception of surrogate identifiers. One of the known techniques is to use a UUID. I understood that it just has to be a unique value, and it's up to our imagination how to construct it.

Computer systems keep information in data structures making it possible to quickly look up required information. For databases, storing information on disk drives, the typical data structure is a B-Tree or one of its variations, such as the B+Tree. It's a general data structure invented in 1971 by Bayer and McCreight [12]. By definition it's a sorted data structure, which means that each insertion goes to its proper place in the sort order. With a monotonically ascending identifier, every new value is the largest one so far, so all insertions go to the rightmost leaf [1].

Picture. Add & split effect of adding the largest number to a B-Tree [6]

Oracle database uses a variation of the B-Tree called the B+Tree. In this variation different types of nodes are used to keep (a) pointers and (b) data. Data nodes are called "leaf nodes". If you are interested in details of the B+Tree implementation in Oracle, look into the great article by Gómez [15].

Picture. B+Tree used in the Oracle database [15]

This nature of the B-Tree/B+Tree creates a situation where massive data insertion performed by multiple threads, using a monotonically ascending sequence, leads to a performance problem known as "index leaf block contention". From a technical perspective, multiple threads are blocked on access to a single memory block, with additional possible delays due to block split operations.
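
To make the pattern concrete, below is a minimal, illustrative sketch of the kind of load that triggers this contention: several threads inserting rows whose primary key comes from one monotonically ascending sequence. The connect string, the table (orders) and the sequence (orders_seq) are assumptions, not taken from any real system.

Code. Multithreaded insert driven by a single ascending sequence (illustrative)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Minimal sketch of the problematic pattern: many threads inserting rows whose
// primary key comes from one monotonically ascending sequence. With enough
// threads, all inserts target the same rightmost index leaf block.
public class AscendingInsertDemo {
    public static void main(String[] args) {
        final String url = "jdbc:oracle:thin:@//dbhost:1521/APPSVC"; // hypothetical connect string
        for (int t = 0; t < 8; t++) {
            new Thread(() -> {
                try (Connection con = DriverManager.getConnection(url, "app", "secret");
                     PreparedStatement ps = con.prepareStatement(
                             "INSERT INTO orders (id, payload) VALUES (orders_seq.NEXTVAL, ?)")) {
                    con.setAutoCommit(false);
                    for (int i = 0; i < 100_000; i++) {
                        ps.setString(1, "row " + i);
                        ps.executeUpdate(); // every thread appends to the same rightmost leaf block
                    }
                    con.commit();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}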

Picture. Unbalanced insertion into B-Tree

The problem is additionally magnified in an Oracle RAC environment due to the need to maintain cache coherency between RAC nodes. Once the application uses blind, round-robin load balancing to distribute load between RAC nodes, Cache Fusion starts utilizing the Interconnect to transport buffers between the nodes.

Picture. Unbalanced insertion into a B-Tree on a load balanced RAC system [2]

In the worst case, when the application inserts rows using multiple connections balanced among RAC nodes, the rightmost index block is permanently moved among nodes, only to add a single row to it after each transfer. It creates high-frequency data movement inside Cache Fusion, which is visible in Oracle wait events. On an untuned system you will see cluster wait events at the top of the list.

The problem
To summarize the introduction above, massive inserts with a monotonically ascending sequence in a RAC environment lead to two major problems:
1. multi-process/multi-threaded insert leads to index leaf block contention,
2. insert from multiple RAC nodes leads to heavy use of the Interconnect.

Picture. Root cause diagram of slow data insert in WebLogic/Oracle RAC environment



Desire of data affinity
To operate flawlessly, a RAC system requires that an insert stream, owned by a single connection, writes rows into an exclusively accessed physical structure. Such a requirement is related to the stability of buffer blocks, as no other connection will attempt to change, from other nodes, the blocks being populated. It's a special form of data-to-physical-location affinity. It's a quite fragile situation, as any request retrieving the latest data will disturb it, but this kind of affinity is good enough for data load.

Picture. Data affinity and distant primary keys as a remedy for block contention and Interconnect burning [4]

Reading data
Unfortunately, reading data in a RAC environment is related to increased transfer of undo blocks, where the previous versions of changed, but not yet committed, data are kept. To guarantee consistent reads, Oracle must ship this data to the reading instance. It's important to remember that the Interconnect is used to transport both data and undo blocks. Reading data may be even more complicated. To keep reads consistent, the Oracle RDBMS may decide to apply undo to a changed block before shipping the data block to the requesting node. Finally, you may see not only latency related to buffer and undo transport, but also latency related to preparing the required version of blocks.


Picture. Interconnect used by both data and undo transfers [5]

Potentially good solutions
To minimize index leaf block contention, primary key values must be spread across multiple blocks. One may achieve this using a Reverse Key index or a hash partitioned index. Both techniques will help in a single-node, multithreaded environment. However, both introduce a costly side effect, e.g. eliminating the possibility of using range scans. It's expected with hash partitioning, but the Reverse Key index may surprise a few of us. Additionally, a huge Reverse Key index, which does not fit into the buffer cache, will lead to I/O contention [10]. Moreover, the Reverse Key technique does not help in a multi-node RAC system. It will provide a high level of randomization in block access, but it's a matter of luck whether a buffer block will not have to be transported over the Interconnect during data append. Hash partitioning seems to work fine in the cluster, however I've found no information about the compatibility of the hash algorithm with RAC systems. It's interesting to find out whether inserted rows are stored in partitions pinned to a RAC node. Preparing a test sounds like a very good idea. The Oracle Real World Performance Group showed in one of their videos that a partitioned index provides reduced performance [10].

Good solutions
One of the good solutions is based on a sequence with a cache and no ordering, where each RAC instance requesting sequence values receives its own range of integers. With a big enough (bigger than the number of rows in an index leaf block?) segment size, i.e. the segment of integers assigned per cache refill, it's guaranteed that the same index blocks will not be shared among RAC nodes. There is a problem, however, with dynamic connections load balanced by the application server among the RAC nodes. The same thread executing subsequent INSERT requests will be connected to a different database node each time. With such a configuration the application never knows at which RAC node the data is really being inserted.
You may check database connection attributes somehow, but most probably it will involve a trip to the database. You can call the sequence's NEXTVAL each time, but most probably (hmmm... it may be cached in the client driver) it's a network call as well. On the other hand, a sequence cache seems to be the best solution for sessions connected to one of the RAC nodes in a stable way. It's good that in WebLogic we may achieve this by pinning a connection to the execution thread. Using a load balanced (round-robin) Multi Data Source, connections will be spread across all RAC nodes and will be kept until the thread pool is shrunk.

Another good solution is based on a before-insert trigger - node affinity is guaranteed, as the sequence is used at the database level - but it's not the most performant setup. We would rather not use triggers, especially since middleware such as Hibernate needs to be aware of primary key values before inserting data.


Desire of a Smart Primary Key
It looks like the solution is a primary key that is more complex than just a monotonically increasing integer. A primary key must be a unique and not-too-big identifier - that's all we require from this field. In a multithreaded and multi-node environment each thread must write information to exclusively accessed data structures. To guarantee this we need to stop using a monotonically ascending integer as the unique identifier. It's the first thing to do, as we know this kind of identifier is a major source of issues in a multithreaded/multi-node environment.

A good identifier should guarantee distribution of data-changing client requests among different blocks. We can guarantee it by using partitioned database objects or by leveraging the sorted and layered nature of the B-Tree. In each case we need a smart key based on elements guaranteeing uniqueness of the whole key and compatibility with the underlying physical data structures. Let's focus on the B+Tree. It's a sorted data structure, thus introducing mutually exclusive ranges of identifiers will lead to putting data in different sets of blocks. There will be some contention at the beginning of data adding, but after the initial tree splits, each range of identifiers will start using its own exclusive set of blocks.
A WebLogic/RAC environment runs multiple application server nodes, each executing many threads. Additionally, there are multiple database nodes. Assuming load balancing at each level, such an environment requires a lot of synchronization. The first thing to be done, to stabilize data paths, is to pin WebLogic threads to RAC nodes, to eliminate the need to use the RAC Interconnect for the stream of data generated by each single thread.

Picture. Multiple threads talking to a database in a load balanced and pinned mode

With the above, stable data paths it's possible to force each thread to write data to exclusively accessed data structures by leveraging physical aspects of the B-Tree. To achieve this, the system should use primary keys formed from three elements: node id, thread id, and sequence. Let's call it a smart key.
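
A minimal sketch of how such a smart key could be composed is shown below. The node id is assumed to come from application configuration and the sequence value from a database sequence; the fixed digit widths are purely illustrative, not a recommendation.

Code. Composing a smart key from node id, thread id, and sequence (illustrative)

public final class SmartKey {

    // Illustrative layout only: node id in the most significant digits, then the
    // thread id, then the sequence value. The digit widths are an assumption.
    public static long build(int nodeId, long threadId, long sequenceValue) {
        return nodeId * 1_000_000_000_000_000L             // node id: most significant digits
                + (threadId % 1_000) * 1_000_000_000_000L  // thread id: next 3 digits
                + sequenceValue;                            // sequence: up to 12 digits
    }

    public static void main(String[] args) {
        // node 7, thread 42, sequence value 123456 -> 7042000000123456
        System.out.println(build(7, 42, 123_456L));
    }
}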

Picture. Primary key constructed with an integer prefixed by node and thread number

By using a smart key with node and thread id at the front of the primary key, the B-Tree index will naturally get balanced after some time of initial contention. Node id and thread id, being the most significant part of the primary key, will split all the blocks without the need to partition indices. Note that this solution works for both multiple threads and multiple nodes. Using the WebLogic ability of pinning connections to threads, each thread gets a stable connection to a RAC node. Load balancing distributes the load during the first connection to the database, and then the same network path from application thread to RAC node is guaranteed for the whole lifetime of a connection. The system will stabilize after the initial data transfers over the Interconnect, servicing each properly addressed request from the local buffer cache. Most important is that inserts performed at a high rate by threads will not influence each other. Such preparation of a primary key eliminates both index leaf block and global cache contention. Such a technique is proposed by the Oracle Real World Performance group [10]. It was reported as being used by CERN [11] to pump data into RAC at the highest possible speed.


Table. Comparison of different techniques mitigating issues related to leaf block/global cache contention


Engineer a hash partitioning compatible key
Intuition tells us that identifier keys will be spread among partitions - possibly using a modulo function. Having 2 partitions, the key with value 1 will go to bucket #1, value 2 to bucket #2, 3 again to bucket #1, 4 to #2, and so on. It's true only to some extent. The Oracle documentation says that the number of partitions is a different parameter than the hash partitioning itself [16]. So it's not a simple modulo - the system is really hashing the value of the primary key with a hash function.

Code. Example of table with hash partitioned index 

It's a good question what the purpose of a hash function is. There is a very good and historical write-up about hash functions by Bob Jenkins - an Oracle engineer focusing on this subject [14]. The document explains the complexity and the purpose of hashing - creating a well-distributed 32-bit number representation of a variable. The hash function increases the entropy of a given piece of data, spreading values uniformly. Knowing this, it's clearer that row distribution among partitions is made after computing the hash value of the primary key. We have two steps: hashing and routing. It's possible that Oracle applies a modulo function to the hash value, as there is the known "power of 2" rule associated with hash partitioning [18], and a known fast modulo algorithm when computing it for "power of 2" values [17]. The "power of 2" rule is important to guarantee equal distribution of data among all the partitions. To be honest, I do not understand why, but it is something out of scope of this elaboration.
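
To make the two steps tangible, here is a toy sketch of "hash, then route". The mixing function below is a generic one, not Oracle's internal hash, and the partition count is assumed to be a power of two so that the modulo reduces to a bit mask.

Code. Two-step distribution - hashing followed by routing (illustrative, not Oracle's algorithm)

// Toy illustration of the two-step distribution: (1) hash the key, (2) route the
// hash to a partition. The mixing function is generic, not Oracle's internal hash;
// with a power-of-two partition count the modulo reduces to a cheap bit mask.
public final class HashRoutingDemo {

    static int mix(long key) {                     // step 1: hashing
        long h = key * 0x9E3779B97F4A7C15L;        // multiply by a large odd constant
        h ^= (h >>> 32);                           // fold the high bits into the low bits
        return (int) h;
    }

    static int partitionOf(long key, int partitions) {  // step 2: routing
        // partitions must be a power of two for the mask trick to equal modulo
        return mix(key) & (partitions - 1);
    }

    public static void main(String[] args) {
        int partitions = 8;
        for (long key = 1; key <= 10; key++) {
            System.out.printf("key %d -> partition %d%n", key, partitionOf(key, partitions));
        }
    }
}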

Picture. Engineering primary key to control location of data in partitions is not a good idea 

Anyway, it shows that it's not easy, maybe not even possible, and simply pointless - as it goes against the very idea of hashing - to engineer a special primary key for hash partitioning. To control in which partition a row will be stored, it would be required to reverse engineer the hash function. It not only sounds too tricky, but it would be a real Oracle hack - not good for an enterprise application. If partitioning should be used in the application, we should go back to the roots of partitioning use cases and use it for operational purposes, not to fight with the modern systems' devil - the monotonically ascending primary key.

Picture. Modern systems' devil - monotonically ascending primary key

Other indices
There is no silver bullet. Unfortunately, the "sequence devil" cannot be eliminated with one universal tool. We can influence primary keys; it's not a problem with surrogate keys, and possibly natural keys may be changed as well. However, there is a real problem with secondary indexes on natural values. One of the good examples is a timestamp. If you need to index the timestamp, e.g. in an audit table, you have a problem. On one hand, you can use a hash partition, but it will eliminate the possibility of using index range scans - it may be quite an important issue for the audit table, because typically you want to have a range scan on a timestamp. A Reverse Key index has similar limitations, introducing bigger space consumption. If you are OK with this - go with one of the listed solutions; if not, you need to change your application to write data from a single thread. As usual, the only really good solution is to design systems understanding both business and technology. Design without awareness of both leads to unavoidable problems.

Design - WebLogic, data, and sequence
The design of a proper solution is based on preparing the right identifier pattern built out of node, process, and sequence. We must guarantee that a connection, once load balanced, stays stable, thus WebLogic must be configured with "pin connection to thread". Ideally the database connection pool size must stay untouched, which means that min is equal to max and shrinking of connections is disabled. The system must keep a dedicated database connection for each thread.

Please find below a list of technical notes on building a smart primary key in a WebLogic/RAC/Hibernate environment:

1. WebLogic pins database connection to thread
-> once joined, the RAC node always stays pinned to this thread
-> initial round-robin load balancing distributes connections among cluster nodes

2. System provides a continuous sequence by means of NEXTVAL
-> the sequence should use a cache big enough to limit calls to the database to about one per few seconds
-> a big cache may lead to sequence gaps in case of frequent system restarts
-> each WebLogic node may use its own sequence, as the WLS node id is unique and makes the overall identifier unique
-> note that thread id is not unique, thus threads must share a single sequence or ranges of sequence values
-> continuous sequence added to the end of primary key guarantees that each stream of data will be added:
-> to one block
-> to subsequent blocks
-> blocks will never be shared among threads

3. Each thread gets unique identifier
-> thread identifier taken from its name: "ExecuteThread: '2' for queue: 'weblogic.kernel.System'"
-> thread identifier may be taken from Thread.getId() - the faster method should be chosen
-> thread identifier is used as a distributing factor for the B-Tree structure
-> final format for a single-node, but multithreaded, distributed primary key: thread_id (+) seq

4. Each node gets unique identifier
-> node identifier is taken from application configuration
-> node identifier is used to pin data stream to RAC node
-> the final smart identifier looks as follows: node_id (+) thread_id (+) seq

5. Configuration of Hibernate
-> standard sequence generator is synchronized - "public synchronized Serializable generate(...)"
-> synchronized method to get sequence value limits performance
-> with hundreds of threads it may be a limiting factor
-> synchronization may be eliminated by storing the sequence value per thread (see the generator sketch after this list)
-> each thread has to use the named sequence for each entity
-> unique value will be guaranteed by node id / thread id
-> in case of system restart, new sequence range will be taken from database 
-> unique value will be guaranteed by new sequence range
-> note that thread id is not unique:
-> it's duplicated on each WLS node
-> combined with node id gives unique value
-> combined with node id and named sequence gives system wide unique value
-> uniqueness will be maintained also following the restarts

6. Configuration of application
-> each node must own exclusive range of node identifiers
-> it's a good idea to have it in a shared data structure
-> it may be a table with e.g. 100 rows of pairs wls_node_name, system_node_id per primary key
-> finally it will be sequence_name (String), wls_node_name (String), system_node_id (Integer)
-> this configuration will be read once during system startup
-> the sequence generator may be notified somehow about a change of the mapping, to avoid a system restart in case of adding nodes
-> it's possible to add new identifiers for a newly added WLS node and keep the system unbalanced for some time
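
The notes above can be folded into a generator along the following lines. This is a minimal sketch, not a drop-in Hibernate IdentifierGenerator: the node id, the range size, and the way a sequence range is fetched are assumptions to be adapted to the real environment.

Code. Per-thread smart key generator (sketch)

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a per-thread smart key generator following the notes above.
// Assumptions: NODE_ID comes from application configuration, each range fetch
// stands in for one call to a database sequence defined with a matching increment,
// and the key layout is node (+) thread (+) seq as in the earlier sketch.
public final class SmartKeyGenerator {

    private static final int NODE_ID = 1;            // from application configuration
    private static final long RANGE_SIZE = 50_000;   // sequence values handed out per DB round trip

    // Per-thread state: [next value, limit of the current range]; no synchronization needed.
    private static final ThreadLocal<long[]> range =
            ThreadLocal.withInitial(() -> new long[]{0L, 0L});

    public static long nextKey() {
        long[] r = range.get();
        if (r[0] >= r[1]) {                          // current range exhausted
            long start = fetchNextRangeStart();      // one call to the database sequence
            r[0] = start;
            r[1] = start + RANGE_SIZE;
        }
        long seq = r[0]++;
        long threadId = Thread.currentThread().getId() % 1_000;
        // node_id (+) thread_id (+) seq packed into one long, layout as in the earlier sketch
        return NODE_ID * 1_000_000_000_000_000L
                + threadId * 1_000_000_000_000L
                + seq;
    }

    // Stand-in for "SELECT smart_seq.NEXTVAL FROM dual" executed over the pinned connection;
    // a real implementation would fetch the next exclusive range from the database.
    private static final AtomicLong fakeSequence = new AtomicLong(0);

    private static long fetchNextRangeStart() {
        return fakeSequence.getAndAdd(RANGE_SIZE);
    }
}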

Scalability of the system
The presented key generation approach works well in a WLS/RAC environment only to some extent. In case of extending the WebLogic cluster with new nodes, the system will be unbalanced, as the original identifiers were built on the original node identifiers. It may be good for short-lived data, however it's not good for long-lived objects. Imagine that during the first period, with 2 nodes, each of them created 100 objects. After adding a new node to the system we have a situation where the first two nodes own 100 objects each, and the newly added node keeps 0.
The remedy for this is to use sets of node identifiers assigned to physical nodes. Assuming WLS scalability up to 100 nodes, a 2-node system should already start using all 100 identifiers. Let's say the first WLS will use 1-50, and the second one 51-100. After adding one more node, the id distribution will change to roughly 1-33, 34-66, 67-100. Each node will receive an equal load. This is important for potential distribution of reads and data processing - following this scheme each WLS will handle a potentially equal load.
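
A small sketch of this idea, assuming a fixed pool of 100 logical node identifiers divided as evenly as possible among the WLS nodes currently in the cluster (the exact split boundaries are a detail; any even division works):

Code. Dividing a pool of node identifiers among WLS nodes (illustrative)

// Splits a fixed pool of logical node identifiers (1..POOL_SIZE) as evenly as
// possible among the physical WLS nodes currently in the cluster.
public final class NodeIdPool {

    static final int POOL_SIZE = 100;   // assumed maximum cluster size

    // Returns {firstId, lastId} (inclusive) owned by node `nodeIndex` (0-based).
    static int[] rangeFor(int nodeIndex, int nodeCount) {
        int base = POOL_SIZE / nodeCount;
        int remainder = POOL_SIZE % nodeCount;
        // the first `remainder` nodes get one extra identifier
        int first = 1 + nodeIndex * base + Math.min(nodeIndex, remainder);
        int size = base + (nodeIndex < remainder ? 1 : 0);
        return new int[]{first, first + size - 1};
    }

    public static void main(String[] args) {
        for (int nodes : new int[]{2, 3}) {
            System.out.print(nodes + " nodes:");
            for (int i = 0; i < nodes; i++) {
                int[] r = rangeFor(i, nodes);
                System.out.print(" " + r[0] + "-" + r[1]);
            }
            System.out.println();   // e.g. "2 nodes: 1-50 51-100", "3 nodes: 1-34 35-67 68-100"
        }
    }
}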

RAC scalability is guaranteed by its nature and by the load balancing of the WebLogic Multi Data Source. However, to work in such an environment, the application should always rely on known primary keys and dispatch requests to the nodes owning the primary key ranges. In case of wrong routing, a node should be able to dispatch the request to the proper node or to an error queue. It may be implemented with a Service Bus located in front of the application and different work queues assigned to each node. At the JMS level it may probably be implemented with message selectors as well, but it may not work properly in a distributed messaging system.
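
As an illustration of such routing, the sketch below extracts the node id prefix from a smart key (using the illustrative digit layout from the earlier examples) and picks the work queue owned by that node. The queue names and the error-queue fallback are assumptions.

Code. Routing a request by the node id encoded in the smart key (illustrative)

import java.util.Map;

// Sketch of primary-key-based request routing: the node id encoded in the most
// significant digits of the smart key selects the work queue owned by that node.
// Queue names are hypothetical; unknown node ids fall back to an error queue.
public final class KeyRouter {

    private static final Map<Integer, String> QUEUE_BY_NODE = Map.of(
            1, "jms/workQueueNode1",
            2, "jms/workQueueNode2");

    static int nodeIdOf(long smartKey) {
        return (int) (smartKey / 1_000_000_000_000_000L);   // strip thread id and sequence
    }

    static String queueFor(long smartKey) {
        String queue = QUEUE_BY_NODE.get(nodeIdOf(smartKey));
        return queue != null ? queue : "jms/errorQueue";     // wrong routing goes to an error queue
    }

    public static void main(String[] args) {
        long key = 2_042_000_000_123_456L;                   // node 2, thread 42, seq 123456
        System.out.println(queueFor(key));                   // jms/workQueueNode2
    }
}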

Picture. Work distribution in data affinity aware application

Other problems
Distributed data insertion is related to more problems. One of them is too frequent commits. Oracle, by definition, forces the LGWR process to flush transaction data into the redo files during each commit. In an environment protected by synchronous Data Guard, the data must additionally be pushed to the standby system via the LNSn process. LGWR is able to group multiple transactions into one I/O request [19], however small and frequent commits should be considered a point of additional analysis. In an enterprise environment LGWR typically maintains multiple copies of the redo log file, which impacts database performance and is visible to clients as additional latency during the commit operation [20].
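
One common mitigation is simply to group rows into batches and commit once per batch instead of once per row. A minimal JDBC sketch, with a hypothetical table and batch size, is shown below.

Code. Batched insert with one commit per batch (illustrative)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Minimal sketch of reducing commit frequency: rows are grouped into JDBC batches
// and committed once per batch, so the redo flush (and the wait for a synchronous
// standby) happens once per 500 rows instead of once per row.
// Table name, column list and batch size are assumptions.
public class BatchedInsert {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/APPSVC", "app", "secret")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO audit_log (id, payload) VALUES (?, ?)")) {
                int batchSize = 500;
                for (int i = 1; i <= 10_000; i++) {
                    ps.setLong(1, i);
                    ps.setString(2, "event " + i);
                    ps.addBatch();
                    if (i % batchSize == 0) {
                        ps.executeBatch();
                        con.commit();            // one redo flush per batch, not per row
                    }
                }
                ps.executeBatch();               // flush the remainder, if any
                con.commit();
            }
        }
    }
}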


Picture. Synchronous Data Guard adds latency to commits


Another problem is the latency of the RAC private interconnect. Oracle RAC uses the interconnect to transfer data blocks (data) and messages, e.g. to request locks (signaling). The former requires bandwidth, the latter - low latency.


Picture. Block request message and transfer [21]


Latency is additionally increased by the need to provide consistent reads - due to this, Oracle sometimes needs to modify block data using undo information. Before each block transfer the RAC node needs to flush the redo log via LGWR. This problem is emphasized in an environment pushing data from multiple threads to the same blocks. It's basically the problem of a monotonically ascending sequence. In such a system, network latency is a minor problem.

Picture. Block request message, block preparation, and transfer [21]

The above problems may easily be overlooked in enterprise-level software, but they are critical for the performance of a RAC-based system. There is an open question how to eliminate the transfer of undo blocks during reads, and the effect of applying undo to blocks to achieve consistent reads. Can it be done by limiting the select with a predicate like insert timestamp - 2 s? Does strict data affinity eliminate this by definition?

Other sources of knowledge
Look into the references, where I've collected the sources of information used to write this article. In the below set of quite interesting articles and notes there is a last position - a link to the Oracle Real-World Performance Group's videos about database performance. It's in the last position, but the first to watch [22]. There are two historical, must-read documents: about the B-Tree [12] and hashing [14]. Enjoy!

###

References
11. Real life experiences of RAC scalability with a 6 node cluster, Grancher, CERN, 2008, http://www.oracleracsig.org/pls/apex/rac_sig.download_my_file?p_file=1001900 
12. Organization and Maintenance of Large Ordered Indexes, Bayer, McCreight, Springer-Verlag, 1972, http://www.minet.uni-jena.de/dbis/lehre/ws2005/dbs1/Bayer_hist.pdf
13. Practical Performance Management for Oracle RAC, Oracle, 2008, http://www.oracle.com/technetwork/products/clustering/overview/s298716-oow2008-perf-130776.pdf 
14. A Hash Function for Hash Table Lookup, Jenkins, Oracle, 1997, http://burtleburtle.net/bob/hash/doobs.html 
15. Oracle B-Tree Index: From the concept to Internals, Gómez, ToadWorld, 2014, http://www.toadworld.com/platforms/oracle/w/wiki/11001.oracle-b-tree-index-from-the-concept-to-internals.aspx 
16. Creating Hash-Partitioned Tables and Global Indexes, Oracle, http://docs.oracle.com/cd/E18283_01/server.112/e16541/part_admin001.htm#i1006508 
17. Fast algorithm to find a number modulo a power of two, Wikipedia, https://en.wikipedia.org/wiki/Power_of_two#Fast_algorithm_to_find_a_number_modulo_a_power_of_two 
20. Overview of the Online Redo Log, Oracle 11g, http://docs.oracle.com/cd/E25178_01/server.1111/e25789/physical.htm#i1006187 
21. Practical Performance Management for Oracle RAC, Oracle, 2008, http://www.oracle.com/technetwork/products/clustering/overview/s298716-oow2008-perf-130776.pdf  
22. Real-World Performance Tuning Techniques, Oracle, 2014, http://www.oracle.com/goto/oll/rwp 

Change log
1. 24.07 - fixed paragraph "Reading data" to proper interpretation of undo block location
