├── .gitignore ├── SecurityRolesPermissions.png ├── Design-Wiki-Rules.md ├── Hot-cache-via-Debezium.asciidoc ├── Remote-Admin-Client-Library.md ├── InfinispanAPIObject.adoc ├── Spring-5-features,-ideas-and-integration.md ├── Cluster-Registry.md ├── Alias-Caches.md ├── XSite-Failover-for-Hot-Rod-clients.md ├── DataStore.adoc ├── Dynamic-JMX-exposer-for-Configuration.md ├── Fine-grained-security-for-caches.md ├── Graceful-shutdown-&-restore.md ├── Conflict-resolution.md ├── TopologyId.asciidoc ├── Clustered-cache-configuration-state.asciidoc ├── Kubernetes-CLI.adoc ├── Task-Execution-Design.md ├── Smoke-Testsuite.md ├── Cache-Store-Subsystems.md ├── Create-Cache-over-HotRod.md ├── TransportSecurity.asciidoc ├── Lock-Reordering-For-Avoiding-Deadlocks.asciidoc ├── Authorization.adoc ├── CacheMultimap.adoc ├── Custom-Cache-stores-(deployable).md ├── README.md ├── Server Restructuring.md ├── Distributed-Stream-Sorting.md ├── Infinispan-CLI.asciidoc ├── Near-Caching.md ├── Off-Heap-Data-Container.md ├── Total-Order-non-Transactional-Cache.md ├── Multimap-As-A-First-Class-Data-Structure.asciidoc ├── Off-Heap-Implementation.md ├── Deelog:-direct-integration-with-Debezium.asciidoc ├── Incremental-Optimistic-Locking.asciidoc ├── Remote-Iterator.md ├── scaling-without-state-transfer.asciidoc ├── cluster-backup-tool.md ├── Asymmetric-Caches-and-Manual-Rehashing-Design.asciidoc ├── Schemas-API.adoc ├── Optimistic-Locking-In-Infinispan.asciidoc ├── ServerNG.md ├── Single_port.adoc ├── Scattered-Cache-design-doc.md ├── Remote Command Handler.md ├── Clustered-listeners.md ├── A-continuum-of-data-structure-and-query-complexity.asciidoc ├── Conflict-resolution-perf-improvements.md ├── RAC-implementation.md ├── Handling-cluster-partitions.md ├── Multi-tenancy-for-Hotrod-Server.asciidoc └── Infinispan-query-language-syntax-and-considerations.asciidoc /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | *.iml 3 | .eclipse 4 | .project 5 | 6 | -------------------------------------------------------------------------------- /SecurityRolesPermissions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/infinispan/infinispan-designs/HEAD/SecurityRolesPermissions.png -------------------------------------------------------------------------------- /Design-Wiki-Rules.md: -------------------------------------------------------------------------------- 1 | The design wiki is meant to hold design documentation. 2 | 3 | # Writing effective designs 4 | Designs should be written in Markdown (supports syntax highlight) or AsciiDoc (use external syntax highlighter such as [gist] (https://gist.github.com) until github [fixes](https://github.com/gollum/gollum/issues/280) the highlighting). 5 | 6 | # What to include 7 | Try and make sure your design docs have references to JIRAs, as well as versions affected and target versions. 8 | 9 | # Images and diagrams 10 | Pictures speak a thousand words. Use [Google Drive](http://drive.google.com) to draw and host your images, and link to them from here. -------------------------------------------------------------------------------- /Hot-cache-via-Debezium.asciidoc: -------------------------------------------------------------------------------- 1 | Use Debezium to keep the Infinispan cache data up to date compared to the database structure. 
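To make the embedded-mode idea concrete, below is a minimal sketch of a feeder that applies Debezium change events to a cache. It assumes Debezium's embedded engine API (`DebeziumEngine`, `ChangeEvent`, the JSON format class) and a hypothetical `customers` table-to-cache mapping; connector properties, key handling and offset storage are placeholders to be settled by the questions below.

[source,java]
----
import java.util.Properties;
import java.util.concurrent.Executors;

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class HotCacheFeeder {
   public static void main(String[] args) {
      Cache<String, String> cache = new DefaultCacheManager().getCache("customers");

      Properties props = new Properties();
      props.setProperty("name", "hot-cache-feeder");
      props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
      props.setProperty("offset.storage", "org.apache.kafka.connect.storage.MemoryOffsetBackingStore");
      // ...database coordinates (hostname, user, password, ...) omitted...

      // Apply each change event to the cache: tombstones/deletes remove the entry,
      // inserts and updates overwrite it with the latest row image.
      DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
            .using(props)
            .notifying(event -> {
               if (event.value() == null) {
                  cache.remove(event.key());
               } else {
                  cache.put(event.key(), event.value());
               }
            })
            .build();

      Executors.newSingleThreadExecutor().execute(engine);
   }
}
----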
2 | 3 | Design questions and choices to be debated: 4 | 5 | * no Kafka infrastructure, that means Debezium should be in embedded mode in JDG 6 | * decide of the Debezium topology: one instance on the coordinator (or a fixed position in the hash wheel) 7 | * deciding of a clean stop for Debezium and a fresh start on a new node when a topology change occurs. 8 | * decide how the Debezium internal states are stored (dedicated cache?, started automatically and transparently?) 9 | * decide what Infinispan makes from these Debezium events and how that is configured 10 | ** all via configuration with a table to cache transfert 11 | ** have it programmatically pluggable to add the conversion logic? 12 | * how should the end value look like: Protobuf autogenerated from the Table structure? From the DDL stream Debezium offers? What about plain Java? 13 | -------------------------------------------------------------------------------- /Remote-Admin-Client-Library.md: -------------------------------------------------------------------------------- 1 | We should provide a client library which is capable of performing remote administration operations using the DMR. 2 | This would be used in a number of places: as a general remote admin client, as a helper for remote JCache, to allow Teiid to perform alias switching in remote configurations. 3 | 4 | ## Functionality 5 | The client should provide a subset of management operations which are available within a server: 6 | 7 | * creation/removal of caches 8 | * creation/removal of cache configurations 9 | * cache lifecycle management (start/stop) 10 | * switching cache aliases 11 | * etc 12 | 13 | ## API 14 | If possible we should reuse the embedded configuration API. 15 | 16 | ## Security 17 | The client will use the security policies enabled in the management security domain of the server. 18 | 19 | ## Packaging 20 | The remote admin client should be packaged as a separate jar from the infinispan-remote jar: infinispan-remote-admin-$VERSION.jar. This uberjar would contain the necessary jboss client, xnio and other dependencies required to communicate with the admin endpoint. -------------------------------------------------------------------------------- /InfinispanAPIObject.adoc: -------------------------------------------------------------------------------- 1 | == Infinispan API 2 | 3 | === Motivation 4 | 5 | Being able to offer a simple API to create clustered caches and improve API usability 6 | 7 | 8 | == Infinispan 9 | 10 | This factory will help to create a cluster of Infinispan. 11 | 12 | Whenever a new member is added on the VM, the implicit configuration will handle the visibility between members. 13 | 14 | ```java 15 | 16 | public final class Infinispan { 17 | 18 | private Infinispan() {} 19 | 20 | //Constructs and starts a new instance of the CacheManager, using the system defaults. 
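    // Illustrative usage only (not part of the proposed API surface): two calls in the
    // same VM would implicitly form a cluster, and Infinispan.stopAll() tears both down.
    //   EmbeddedCacheManager first = Infinispan.createClusteredCache();
    //   EmbeddedCacheManager second = Infinispan.createClusteredCache();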
21 | public static EmbeddedCacheManager createClusteredCache() {} 22 | 23 | public static EmbeddedCacheManager createClusteredCache(Configuration configuration) {} 24 | 25 | public static EmbeddedCacheManager createClusteredCache(GlobalConfiguration globalConfig, Configuration configuration) {} 26 | 27 | // Return the running CacheManagers on the cluster in the calling VM 28 | public static Set getAllMembers() {} 29 | 30 | // Stop all the members on the VM 31 | public static void stopAll() {} 32 | } 33 | 34 | ``` 35 | 36 | 37 | -------------------------------------------------------------------------------- /Spring-5-features,-ideas-and-integration.md: -------------------------------------------------------------------------------- 1 | Spring 5 will have a couple of [interesting features](https://jira.spring.io/browse/SPR-13574?jql=project%20%3D%20SPR%20AND%20issuetype%20%3D%20%22New%20Feature%22%20AND%20fixVersion%20in%20(%225.0%20M2%22%2C%20%225.0%20M3%22%2C%20%225.x%20Backlog%22)%20ORDER%20BY%20issuetype%20DESC): 2 | 3 | * [CDI support (JSR 330)](https://jira.spring.io/browse/SPR-12211). We may enhance our CDI Extension to work correctly with Spring and propose this configuration for our users. Even though our extension has no hard dependency to Weld (apart from tests) I believe we will need to adjust it a little bit. 4 | * [JCache 2.0 support](https://jira.spring.io/browse/SPR-13574). Implementing this issue will probably require adding few bits into Spring module and verifying if everything works smoothly. 5 | 6 | A few ideas not necessarily related to Spring 5: 7 | * Implement a "bridge" between [Spring tasks](http://docs.spring.io/spring/docs/current/spring-framework-reference/htmlsingle/#scheduling) our [Infinispan distributed tasks](http://infinispan.org/docs/dev/user_guide/user_guide.html#DistributedExecutor). 8 | 9 | Scheduled bigger features: 10 | * [Spring Data-like use case](https://issues.jboss.org/browse/ISPN-1068) 11 | * [Spring starter](https://issues.jboss.org/browse/ISPN-6561) 12 | * [Define how to specify configuration](https://issues.jboss.org/browse/ISPN-3337) 13 | -------------------------------------------------------------------------------- /Cluster-Registry.md: -------------------------------------------------------------------------------- 1 | Currently the ClusterRegistry provides some syntactic sugar around a replicated cache which can be used as an internal repository for service-data. 2 | To differentiate between different scopes and uses, the ClusterRegistry uses a composite key composed of a scope and a key. 3 | While useful, the current design is very limited: 4 | * it doesn't allow configuration of the underlying replicated cache (i.e. persistence, security, eviction, etc) 5 | * some use cases might prefer a more efficient invalidation approach instead of a replicated one 6 | * the wrapping of the key adds an extra object and it makes it cumbersome to expose this cache to users (Remote query would need to) 7 | * many internals avoid using preferring custom solutions (e.g. Map/Reduce temporary caches, Remote Query schema cache) 8 | 9 | The new ClusterRegistry should behave as follows: 10 | * create a dedicated per-service cache instead of a single catch-all cache 11 | * the characteristics of a registry cache are specified by the requiring service 12 | * persistent/volatile 13 | * replication/invalidation 14 | * eviction/expiration 15 | * security 16 | * naming strategy (i.e. 
static name, derived name, etc) 17 | * dependencies (so that a registry cache's lifecycle is bound to another cache) 18 | * custom settings (Remote Query schema cache needs a custom interceptor) 19 | 20 | Potential users of registry caches 21 | * Remote Query schema cache 22 | * Query index caches 23 | * Map/Reduce temporary caches 24 | * Security ACL cache 25 | * ... 26 | 27 | -------------------------------------------------------------------------------- /Alias-Caches.md: -------------------------------------------------------------------------------- 1 | In order to support materialized views, Teiid uses the following technique with relational databases: 2 | - populates a "hidden" table with the new data 3 | - removes the old "visible" table and renames the new "hidden" table to the "visible" name 4 | 5 | To support this in Infinispan, instead of renaming caches, we should support Alias Caches. 6 | 7 | ## Definition 8 | An Alias Cache is a named cache which acts as a delegate to a concrete named cache. 9 | It is configured only with the name of a cache to which it will act as delegate. On top of the standard Cache API it also provides an additional void switchCache(String name) method with which it is possible to change the delegate. Switching is allowed only if the delegate caches mode is compatible, i.e. local <> local, clustered <> clustered. Switching between local and clustered caches is not allowed. Switching a clustered cache will be performed cluster-wide so that all aliases switch at the same time. 10 | 11 | ## Configuration 12 | The declarative configuration, common to both embedded and server, is as follows: 13 | 14 | `<alias-cache name="alias-cache-name" delegate-cache="delegate-cache-name" /> 15 | 16 | The programmatic configuration, for embedded mode, is as follows: 17 | 18 | `ConfigurationBuilder builder = new ConfigurationBuilder();` 19 | `builder.aliasCache("delegate-cache-name");` 20 | `cacheManager.defineConfiguration("alias-cache-name", builder.build());` 21 | 22 | ## Management 23 | In embedded mode the switch operation is exposed via JMX. In server mode, the switch operation is exposed via a management operation. 24 | 25 | -------------------------------------------------------------------------------- /XSite-Failover-for-Hot-Rod-clients.md: -------------------------------------------------------------------------------- 1 | This wiki describes the design of the cross-site failover for Hot Rod clients ([ISPN-5520](https://issues.jboss.org/browse/ISPN-5520)). 2 | 3 | Hot Rod clients currently support failover to nodes within the cluster to which the client is connected. To support that, Hot Rod servers send topology information to clients as part of the responses as topology changes happen. When an operation to a clustered node fails due to transport or cluster rebalancing issues, the client automatically retries the operation in a different cluster node. This node is elected using a configured load balance policy. 4 | 5 | To add basic cross-site failover support, the following changes are required: 6 | 7 | * Client configuration needs to be enhanced to have 0-N cross-site static configuration, where each cross-site configuration would have 1-N host information of nodes in that site. 8 | * In its simplest form, a Hot Rod client should failover to the nodes defined in cross-site failover section, if all nodes in the main cluster have failed (after a complete retry). 
The first site where nodes are available becomes the the client's main site, working as usual with the topology of this site. 9 | 10 | Cross-site failover could be improved further if clients would be able to failover when the site is offline while still being accessible remotely. To achieve this, there'd need to be a way for the server to tell the clients that the site is offline. The simplest way to do so would be to add a new response status that indicates that the site is offline, so next time the client sends an operation and gets offline status, it fails over to a different site. -------------------------------------------------------------------------------- /DataStore.adoc: -------------------------------------------------------------------------------- 1 | == Data Store 2 | 3 | In scenarios where Infinispan is used as a persistent data store, the elasticity provided by rebalancing on scaling down (either voluntarily or because of node failure) can lead to data loss, even with persistent caches if all the owners of a segment leave the cluster before rebalancing can be completed. The remaining cluster should prevent writes to the lost segments until the nodes that own them are restarted. 4 | 5 | It should be possible to configure Infinispan so that elasticity only applies when scaling up, i.e. adding a node will cause a rebalance. 6 | 7 | === Partition Handling 8 | 9 | Partition handling should be configured with `when-split=deny_read_writes` so that the nodes prevent reading/writing to the lost segments. 10 | 11 | === Topology 12 | 13 | Rebalancing should have the following configurations: 14 | 15 | * ALWAYS: rebalancing happens both on nodes joining and leaving the cluster. 16 | * SCALE_UP: rebalancing only happens when nodes join. 17 | * NEVER: rebalancing is disabled in all situations. 18 | 19 | Every time a stable topology is installed in the cluster, it should be persisted to the global state. Currently this only happens on shutdown. 20 | 21 | === Restarting nodes 22 | 23 | When a node is restarted it should read the topology stored in the global state and join only if it matches the stable topology of the existing cluster. 24 | 25 | === Open issues 26 | 27 | When the segment owners are missing, an application will not be able to read/write any data belonging to those segments even if it is "new" data (i.e. keys that were not in those segments when they disappeared). 28 | We could create temporary segments on the remaining nodes which would then be merged back into the real owners using the conflict resolution logic. 29 | -------------------------------------------------------------------------------- /Dynamic-JMX-exposer-for-Configuration.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | Currently configuration values are exposed in JMX as a summary list (see CacheImpl#getConfigurationAsProperties). This approach covers only read-only access and some of the Configuration's attributes could have been modified in runtime. This requires a new approach - scanning and exposing Configuration attributes dynamically. 4 | 5 | More information might be found in tickets: 6 | * https://issues.jboss.org/browse/ISPN-5343 7 | * https://issues.jboss.org/browse/ISPN-5340 8 | 9 | # User Perspective 10 | 11 | Configuration attributes will be exposed as a new Node in JMX MBean tree on the same level as Activation, Cache, LockManager etc. 
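From the client side, access goes through plain JMX. A minimal sketch of reading and updating one of these attributes (the `ObjectName` layout is an assumption for illustration; the dotted attribute path follows the example used in the implementation notes below):

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConfigurationJmxExample {
   public static void main(String[] args) throws Exception {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // Hypothetical ObjectName; the final domain/key layout is part of the implementation work.
      ObjectName cfg = new ObjectName("org.infinispan:type=Cache,name=\"myCache\",component=Configuration");

      // Read a configuration attribute by its path...
      Object current = server.getAttribute(cfg,
            "clusteringConfiguration.asyncConfiguration.replicationQueueInterval");

      // ...and, for attributes that are mutable at runtime, write it back.
      server.setAttribute(cfg, new Attribute(
            "clusteringConfiguration.asyncConfiguration.replicationQueueInterval", 100L));
   }
}
```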
12 | 13 | # Implementation details 14 | 15 | A new Dynamic MBean [1] will be created as a standard Infinispan component and will have a reference to Cache Configuration. During the startup it will scan the Configuration recursively and construct a `Map>`. The map's key will contain path to the attribute, for example: `clusteringConfiguration.asyncConfiguration.replicationQueueInterval` and the map's value will have all necessary information for making reflexive calls (a class and a field and possibly some additional metadata). Those attributes will used to provide MBean metadata and perform reflexive calls in the runtime. 16 | 17 | The newly created Dynamic MBean will be discovered by `ComponentsJmxRegistration` and will be registered in MBean Server. Note that currently this class uses `ResourceDMBean`s for this, so additional scanning for Dynamic MBeans would have to be performed there. 18 | 19 | # FAQ 20 | 21 | Q: Shouldn't we use `ResourceDMBean` directly? 22 | 23 | A: No. `ResourceDMBean` implementation is focused on invoking Managed Operation on a single class level. We need something more - we need to scan a bunch of configuration classes recursively. 24 | 25 | Q: Shouldn't we prepare metadata when parsing classes (using `ComponentMetadataPersister`)? 26 | 27 | A: We could but the parsing logic would have to fit into `ResourceDMBean` then and we would need to do a lot of code refactoring to support it. I'm not sure if it's worth it. 28 | 29 | [1] http://docs.oracle.com/cd/E19698-01/816-7609/6mdjrf83d/index.html -------------------------------------------------------------------------------- /Fine-grained-security-for-caches.md: -------------------------------------------------------------------------------- 1 | We would like to enable fine-grained security to data within caches. 2 | 3 | Two approaches: 4 | 5 | ## Authorization Callback 6 | Via some kind of user-provided callback which would intercept all calls. The callback would receive the key, value and the required permission (READ, WRITE, etc) and allow/deny the operation. 7 | 8 | ### Pros 9 | * no additional memory is required to store the ACL 10 | * can implement per-user checks. 11 | 12 | ### Cons 13 | * requires custom code. 14 | 15 | Proposed interface: 16 | 17 | ``` 18 | public interface AuthorizationCallback { 19 | void authorize(Subject subject, Cache cache, K key, V value, AuthorizationPermission permission) throws SecurityException; 20 | } 21 | ``` 22 | 23 | Global cache operations (i.e. execute, listen, etc) would still go through the existing security checks. 24 | 25 | ## Authorization metadata 26 | Store ACLs within each entry's metadata. This would require adding an additional field to the metadata to store the ACL information. The ACL would consist of: 27 | * the owner of the entry. This needs to be as compact as possible while preserving uniqueness, and we need some kind of global mapper which can extract this information from the source of user information (e.g. a UID, GUID, etc) 28 | * a set of authorization roles (as already defined for the container). Each role would imply specific permissions, so that a user would be allowed access/manipulation of an entry only if it is the owner, an administrator, or it belongs to one of the roles. 29 | 30 | Pros 31 | * does not require custom code (i.e. avoids having to deploy code to server) 32 | * for the basic use-case (only the owner and the admin can manipulate an entry) nothing else needs to be done aside from enabling. 
A special ownership manipulation API would need to be defined for other cases. 33 | 34 | Cons 35 | * requires additional memory per entry. This should be mitigated using a BitSet. 36 | * the granularity is per-role and not per-user 37 | * ownership manipulation over remote needs protocol changes 38 | 39 | Proposed API: 40 | 41 | ``` 42 | interface AuthorizationManager { 43 | ... 44 | Set getEntryRoles(); 45 | void setEntryRoles(K key, String... role); // this operation can be performed either by the owner or an ADMIN 46 | } 47 | ``` 48 | -------------------------------------------------------------------------------- /Graceful-shutdown-&-restore.md: -------------------------------------------------------------------------------- 1 | ## Purpose 2 | Implementing graceful shutdown and restore is essential for a datagrid which needs to be "cycled" for maintenance: 3 | * shutdown: stop all nodes in a cluster in a safe way by persisting state and data 4 | * restore: restart the whole cluster, apply previous state and preload the data from the storage (no data loss) 5 | 6 | ### State configuration 7 | State configuration should be done at the GlobalConfiguration level. It should include the following settings: 8 | * a state directory, where state will be persisted 9 | * the expected number of nodes in the cluster. State transfer will not initiate until all nodes have joined the cluster. The default is 0, which corresponds to the current behaviour. 10 | 11 | ### Persistent State 12 | The persistent state will be stored in the state directory. State should consist of multiple state files: 13 | * Global State File (___global.state) 14 | * localUUID 15 | * Per-cache State File (cachename.state) 16 | * hash.function = hash class name 17 | * hash.numOwners = number of owners in the hash 18 | * hash.numSegments = number of segments in the hash 19 | * hash.members = number of members in the hash 20 | * hash.member.n.uuid = the UUID of each member 21 | The state file will use the standard Java property file format for simplicity 22 | 23 | ### Startup sequence 24 | if statedir != null { 25 | check for state existence stored in statedir 26 | set UUID of local node to stored value 27 | start transport 28 | wait for expected view 29 | coordinator pushes configuration to all 30 | send consistent hash 31 | potentially stagger cache starts to avoid storm 32 | } else { 33 | start transport 34 | if !coordinator { 35 | send join request 36 | } else { 37 | wait for numInitialMembers to join 38 | send consistent hash 39 | } 40 | } 41 | 42 | ### Shutdown sequence 43 | for each running cache { 44 | put cache in shutting down state 45 | disallow new local ops/txs 46 | optionally wait for pending ops/txs 47 | send "ready to shutdown" to coord 48 | coord sends ack and does not process any more CH updates 49 | flush stores (passivation / async) 50 | } 51 | save state 52 | -------------------------------------------------------------------------------- /Conflict-resolution.md: -------------------------------------------------------------------------------- 1 | On-demand operations: 2 | ``` 3 | interface ConflictResolutionManager { 4 | Map getAllVersions(Object key); // can run from any node 5 | Stream> getConflicts(); // can run from any node, chunked? 
6 | void resolveConflict(Object key); // applies EntryMergePolicy and updates all owners using putIfAbsent/replace/(versioned replace), always runs from primary 7 | void resolveConflicts(); 8 | } 9 | 10 | // Not only for partition handling = ALLOW_ALL but also fixing failed writes 11 | @Experimental // don't set it in stone now 12 | interface EntryMergePolicy { 13 | CacheEntry merge(CacheEntry preferredEntry, CacheEntry otherEntry); 14 | } 15 | // OOTB implementations: 16 | // version-based: any version is better than null, possibility to store timestamp in version to implement last update policy? 17 | // preferred-always (return null if preferred is null) 18 | // preferred-non-null (return preferred if none is null) 19 | 20 | enum PartitionHandling { 21 | DENY_ALL, // if the partition does not have all owners, both reads and writes are allowed 22 | ALLOW_READS, // degraded but allows stale reads 23 | ALLOW_ALL // allow the values to diverge and resolve the conflicts later 24 | } 25 | ``` 26 | 27 | Selecting preferred partition for given segment: 28 | 29 | 1. (discutable) The partition that had all owners before split? 30 | 1. Overall size of partition 31 | 1. Order in members list (just make the choice deterministic) 32 | 33 | Selecting preferred entry when there's no split brain (failed operation kept the cluster inconsistent): preferred entry is the one from primary owner. 34 | 35 | Automatic merging after split-brain heal: 36 | 37 | * before the merged entry is written to all partitions, partition A should not see modifications from B and vice versa (merge happens before installing a merged topology) 38 | * merge uses EntryMergePolicy from configuration 39 | 40 | Configuration (): 41 | 42 | * deprecate `enabled="true|false"` 43 | * add `type="deny-all|allow-reads|allow-all"` 44 | * add `entry-merge-policy="version-based|preferred-always|preferred-non-null|custom class name"` 45 | 46 | Other thoughts: 47 | 48 | * for all-keys conflict resolution the primary needs to request whole segment, not just presently owned keys because it may miss some - we could reuse InboundTransferTask? 49 | -------------------------------------------------------------------------------- /TopologyId.asciidoc: -------------------------------------------------------------------------------- 1 | == Topology Id Rework 2 | 3 | The Infinispan Hot Rod client is able to know which node to direct requests to based on the key. 4 | This has to be updated every time the topology changes (new node leaves or joins the cluster). 5 | When the topology changes a counter is incremented that the server is aware of to signal this. 6 | This client then is able to tell the server which topology id it last knew about and in the case it is 7 | old the server will reply with the new id and the members. 8 | 9 | Unfortunately, this has an interesting problem if the entire server cluster is shut down but a client 10 | is still around. 11 | The client will of course not function when the cluster is gone, however when the server starts back up 12 | the server topology id will be reset back to 0. 13 | This means a client will have a _higher_ topology so the server will never send the cluster information 14 | and the client will, until the server toplogy catches up, no longer know what servers a key map to and 15 | also send to servers that may not even exist any longer. 16 | 17 | === Suggested Fix 18 | 19 | Fixing this will require a few different things. 
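As a rough illustration of the first two options listed below, the client-visible id could be derived from the topology itself rather than from a counter, so that a restarted cluster produces comparable ids. The sketch assumes `org.infinispan.topology.CacheTopology` and `ConsistentHash#getMembers()`; which hash (pending, read, or the whole topology) feeds the id is exactly the choice to be made:

[source,java]
----
// Sketch only: a reproducible id based on the hash membership instead of a monotonic counter.
int clientTopologyId(CacheTopology topology) {
   ConsistentHash ch = topology.getPendingCH() != null ? topology.getPendingCH() : topology.getCurrentCH();
   return ch.getMembers().hashCode();
}
----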
20 | 21 | * Change the toopology id to be a more unique number that can be recreated instead of an incrementing number 22 | ** HashCode of pending (if not null) or read consistent hash 23 | ** HashCode of the CacheTopology itself 24 | * Change so the server will send a new topology if the id does not match instead of being less than 25 | * Ensure the client applies the topology update even if the id does not match instead of being greater than 26 | 27 | Note that the server itself can still have an incrementing topology id, just the one it shares to the client 28 | will be a generated one. 29 | 30 | === Backwards Compatibility 31 | 32 | The changes suggested above are normally done as part of a new protocol, such as 3.2 or 4.0. 33 | But it may be desirable instead to apply retroactively to all versions of the protocol. 34 | 35 | However, some clients (13.0.0 & 9.4.24) had a change that made it so they only applied a topology update if the returned topology 36 | id was greater than their own. 37 | This means if a user upgrades the server the fix above will still not work as the client may or may not apply 38 | the update. 39 | 40 | This means we can either provide backwards compatibility to a subset of previous clients or to none, requiring a 41 | new protocol version to handle the new topology ids. 42 | It is unclear what approach should be done. 43 | Note, however, that for many users it is much simpler to update the server than it is for all clients. 44 | -------------------------------------------------------------------------------- /Clustered-cache-configuration-state.asciidoc: -------------------------------------------------------------------------------- 1 | Clustered cache configuration / state 2 | ===================================== 3 | 4 | Currently all members of a cluster manage their configuration, cache lifecycle and most of the state independently. 5 | This means that in order for a cache to be created across a cluster, each node must define its configuration locally and 6 | start it. There is also no validation that a cache configuration is compatible across nodes. Also, while some 7 | configuration attributes can be changed at runtime, this does not happen cluster-wide (although it might make sense for 8 | some configuration attributes to be asymmetric, e.g. capacity factor). Additionally, when new nodes join a running 9 | cluster, they should also start any clustered caches so that symmetry is maintained. 10 | 11 | Implementation 12 | -------------- 13 | 14 | * Change the current behaviour of the GlobalStateManager, enabling it internally by default. Persistence of state will however continue to be enabled only if requested by the user. 15 | * The GlobalStateManager adds a ___state cache handled by the InternalCacheRegistry. This cache will contain both cache configurations (stored as strings using the Infinispan XML schema) and a list of running caches (do we need to store more state ?). 16 | * Cache configuration and lifecycle performed via the usual DefaultCacheManager API will continue to behave as it does 17 | currently (i.e. each node is independent). 18 | * Add a new API for managing clustered configuration and state (see below) 19 | * Allow configuration Attributes to be either Global or Local. 20 | * Implement a variant of equals (equalsIgnoreLocal ?) for Configuration objects to validate congruency ignoring local 21 | attributes. 
22 | * Support the +start+ configuration attribute for caches (which can be either +LAZY+ or +EAGER+) so that EAGER caches 23 | are automatically started when the CacheManager is started. 24 | * When a node joins a cluster it retrieves the list of running caches from the state cache and starts them using 25 | the configuration from the state cache. The configuration coming from the state cache is validated against any local 26 | configuration which might be present in the DCM's ConfigurationManager. 27 | * Modifying a Global configuration Attribute at runtime will propagate the change to all nodes. 28 | 29 | API 30 | --- 31 | 32 | Add a cluster() method to the DefaultCacheManager which returns a cluster-affecting implementation of the EmbeddedCacheManager interface. Methods such as +defineConfiguration()+, +undefineConfiguration()+, +startCaches()+, will affect the entire cluster. All other methods will be delegated to the underlying local DefaultCacheManager. 33 | 34 | -------------------------------------------------------------------------------- /Kubernetes-CLI.adoc: -------------------------------------------------------------------------------- 1 | = Kubernetes CLI 2 | 3 | The Infinispan CLI currently has two modes: command-line and interactive. 4 | To improve the user experience when using the Infinispan Operator we can enhance the CLI to add Kubernetes capabilities. 5 | 6 | == Native 7 | 8 | Using Quarkus we can build the CLI as a native executable for the three most common platforms: Linux x86_64, OS X, Windows. 9 | 10 | == Kubernetes mode 11 | 12 | Installing the executable as `kubectl-infinispan` allows invocation as a `kubectl/oc` plugin, e.g.: 13 | 14 | `kubectl infinispan ...` 15 | 16 | Using the Fabric8 Kubernetes Client we can automatically detect the existence of a Kubernetes configuration (default to `~/.kube/config`) and automatically connect to the Kubernetes cluster. 17 | 18 | === Kubernetes commands 19 | 20 | The Infinispan Kubernetes CLI plugin uses the default namespace for all operations, unless overridden with the `--namespace` parameter. 21 | 22 | ==== Install the operator 23 | 24 | `kubectl infinispan install [-n namespace]` 25 | 26 | Install the Infinispan Operator in the specified namespace. 27 | This should also install the relevant OLM subscription and operator group so that the operator can be updated. 28 | 29 | ==== Create an Infinispan cluster 30 | 31 | `kubectl infinispan create cluster [-n namespace] [-r replicas] [--expose-type=type] [--expose-port=port] [--expose-host=hostname] [name]` 32 | 33 | Creates an Infinispan cluster: 34 | * `-n namespace` overrides the namespace. 35 | * `-r replicas` selects the initial number of replicas. Defaults to 1. 36 | * `--expose-type=LoadBalancer|NodePort|Route` selects how the service should be exposed. 37 | * `--expose-port=port` specifies the exposed port for LoadBalancer and NodePort types. 38 | * `--expose-host=hostname` specifies the exposed hostname for the Route type. 39 | * `--security-secret=secretname` specifies the secret to use for endpoint authentication. 40 | * `--security-cert-secret=secretname` the secret that contains a service certificate, tls.crt, and key, tls.key, in PEM format. 41 | * `--security-cert-service=servicename` the Red Hat OpenShift certificate service name. 
42 | 43 | ==== Scale an Infinispan service 44 | 45 | `kubectl infinispan scale -r replicas [servicename]` 46 | 47 | Scale an Infinispan cluster to the specified number of replicas 48 | 49 | ===== Interact with a running Infinispan service 50 | 51 | `kubectl infinispan shell [name]` 52 | 53 | Connects to one of the active pods of the specified cluster and launches the CLI in interactive mode. 54 | 55 | ==== Remove an Infinispan service 56 | 57 | `kubectl infinispan delete cluster [-n namespace] [servicename]` 58 | 59 | Stops and removes an Infinispan service. 60 | 61 | ==== Uninstalling the operator 62 | 63 | `kubectl infinispan uninstall [-n namespace]` 64 | 65 | Removes the Infinispan operator 66 | 67 | -------------------------------------------------------------------------------- /Task-Execution-Design.md: -------------------------------------------------------------------------------- 1 | Some of the most powerful features in Infinispan involve execution of code "close to the data": distributed executors and map/reduce tasks allow running user tasks using the inherent data parallelism and the computing power of all the nodes in a cluster. 2 | While it is interesting to allow running such tasks from HotRod, the concept can be extended even further to the execution of any kind of code, be it local to a single node or distributed among some/all the nodes in the cluster. 3 | In the spirit of HotRod language-agnostic nature, we should not limit execution just to Java, but leverage the JDK's scripting engine API so that tasks can be developed in a multitude of languages. 4 | 5 | # Stored scripts 6 | * Store scripts in a dedicated script cache (persistent, secure, etc) 7 | * If the scripting engine supports it, precompile to bytecode to take advantage of HotSpot 8 | * Each script can take multiple named parameters which will appear as bindings (i.e. variables) in the script 9 | * Script bindings include the cacheContainer, cache, scriptingManager, marshaller (if in HotRod) 10 | * A script can optionally return a result 11 | * Scripts can be of various types supported by JSR-223 (Java, Javascript, Scala, etc) 12 | 13 | # Remote execution over HotRod 14 | * Add EXEC op 15 | * Input parameters: (name: string, value: byte[])* < marshalled values 16 | * Returns: (byte[])? 17 | * Optional flags to specify execution type (local, distributed, map/reduce) 18 | 19 | # Remote execution over REST 20 | * Invoke a POST on the URL http://server/rest/!/scriptname/cachename?param1=value1¶m2=value2 21 | * Invoke a POST on the URL http://server/rest/!/scriptname/cachename and pass any parameters in the request body. 22 | * Manipulate sync/async execution using the 'performAsync' header used by other ops 23 | * The returned value will use the request variant to perform conversion appropriately to the requested type (text/plain, application/octet-stream, application/xml, application/json, application/x-java-serialized-object) 24 | 25 | # Task manager 26 | * The task manager will be the main entry point for executing tasks and for retrieving execution history 27 | * Task execution is delegated to specific engines 28 | ** Scripts will be handled by the existing ScriptManager 29 | ** Server-deployed tasks will be handled by a DeployedTaskManager 30 | * The executor within which tasks are executed is global to all engines. 31 | * Track start/finish/what/who/where for tasks that have been executed. Store these in an internal, persistent cache which can be queried/filtered appropriately. 
Possibly have a limited retention. 32 | * Support aborting running jobs (best-effort, since it requires handling of InterruptedException in appropriate places. Badly-behaved tasks will not be stoppable until JVM termination). 33 | * Throttling ? (via some executor / thread priority modification) 34 | 35 | -------------------------------------------------------------------------------- /Smoke-Testsuite.md: -------------------------------------------------------------------------------- 1 | As the Infinispan project grows, there's a need to expand the tests to run them under different set ups, e.g. with uber jars, with client modules running in Wildfly, with normal jars...etc. 2 | 3 | On top of different set ups, we are also finding that a subset of the tests we have also need to be run with different configurations, such as compatibility mode, with/without distribution, with/without replication, with/without transactions, with/without xsite...etc. 4 | 5 | The current test set up means that each module has its own set of tests, so duplication can easily happen, and on top of that, to be able to add set ups to test, tests need to be duplicated or extended in a way not originally intended. 6 | 7 | On top of the duplication of tests, there's a lot of waste of resources generated by the number of times caches/cachemanagers/clusters/sites are started and stopped in the entire testsuite, which leads to a general slow down of the testsuite. 8 | 9 | During a recently held meeting, it was agreed that the following improvements would need to be made: 10 | 11 | * Move the most relevant of all functional tests into a single `testsuite` project. We could try to use some tooling to determine the most relevant tests (student project?) 12 | * Define suites for which caches/cachemanagers/clusters/sites are started and then run a load of tests for that particular set up. The aim is to reduce time needed to run tests. 13 | * Running all tests in all configurations/set-ups would likely take too long, so we should consider randomising the number of configurations/set-ups run. Any randomising applied should allow to backtrack to figure out the exact configuration/set-up run to debug failures. Other OS projects such as Lucene are already using randomising of tests, so we should check what they do. 14 | * Randomising could be used as way to introduce failures and see how the tests behave in those conditions. Again, being able backtrack to the cause of failure would be necessary. 15 | * Incremental testing possibilities should be considered if available. In essence, incremental test would mean that when a change happens, only tests affected by those changes would be run. Not sure how it'd run in the presence of reflection? 16 | * Randomising test data would also be helpful to discover failures, e.g. primitive data, non-serializable data, serializable data, externalizer-marshallable data, nulls...etc. 17 | * It could be desirable to make all this work independent of the test framework used by creating our own DSL and defining tests in plain test. Sanne has done some similar work with ANTLR for one of Hibernate OGM parsers. 
18 | 19 | Some of the projects that could help with this are: 20 | * [JUnit Lambda](http://junit.org/junit-lambda.html) 21 | * [Ekstazi](http://www.ekstazi.org) - Lightweight Test Selection 22 | * [Pitest](http://pitest.org) 23 | * [Infinitest](http://infinitest.github.io) for continuous testing -------------------------------------------------------------------------------- /Cache-Store-Subsystems.md: -------------------------------------------------------------------------------- 1 | While the cache store configuration machinery in embedded is easily extensible, adding new cache stores to Infinispan Server is much more complex because of the "monolithic schema" approach that WildFly imposes on subsystems. This means that, in order to add a schema-enabled cache store to server, we have to modify the Infinispan subsystem itself. This applies to both "Infinispan-owned" cache stores which live outside the main Infinispan repo and to custom cache stores. 2 | 3 | The proposed solution is to create a dedicated subsystem for "external cachestores", akin to how datasources are currently configured. Cache stores would be defined in this subsystem and referenced from the main Infinispan subsystem using a naming convention (think JNDI). 4 | 5 | We would need to define an archetype subsystem which can be extended for the specific cache store configuration logic with little effort. 6 | 7 | Here is an example of the configuration for such a subsystem: 8 | 9 | 10 | 15 | 16 | 17 | 20 | 21 | 22 | 23 | The subsystem would be responsible for parsing the configuration and 24 | registering an appropriate StoreConfigurationBuilder under a named 25 | Service within the server. We need to be careful about class loader 26 | visibility, but the datasource subsystem does something similar, so it 27 | should be possible. 28 | 29 | The main Infinispan subsystem would need to be extended to be able to 30 | parse the following (simplified): 31 | 32 | 33 | 34 | 35 | 38 | 39 | 40 | 41 | 42 | It would also not be difficult, for symmetry, to support a compatible 43 | schema for the embedded use-case. 44 | 45 | The cachestores would be distributed both as a simple jar for embedded 46 | use-cases and as a zip containing the necessary modules for server. 47 | Note that this would not leverage the deployable cachestores machinery 48 | as subsystems need to be installed as modules. 49 | -------------------------------------------------------------------------------- /Create-Cache-over-HotRod.md: -------------------------------------------------------------------------------- 1 | NB. This page refers to Hot Rod, but the concepts can be applied to REST as well. 2 | 3 | Creating a cache from a remote client is an often requested feature: 4 | 5 | * JCache's API CacheManager.createCache(String cacheName, C configuration) 6 | * Hibernate OGM can optionally create caches (as part of schema generation) 7 | * Teiid can create tables 8 | 9 | Since cache configuration is very complex, remote cache creation will be limited to specifying an existing configuration template name. Additionally, caches created remotely might be temporary (i.e. they are not persisted back into the configuration, if applicable). 
The proposed API for the cache creation call (using the Java client as reference) would be: 10 | 11 | RemoteCache RemoteCacheManager.createCache(String cacheName, String configurationName, boolean persistInConfiguration) 12 | 13 | with the additional signature 14 | 15 | RemoteCache RemoteCacheManager.createCache(String cacheName, String configurationName) { 16 | return createCache(cacheName, configurationName, false); 17 | } 18 | 19 | For symmetry, a 20 | 21 | RemoteCacheManager.destroyCache(String cacheName) 22 | 23 | method would remove a cache. 24 | 25 | The protocol would be enhanced with the following opcodes: 26 | 27 | 0x37 = create cache request [header + config name length (vInt) + config name (String) + persist (1 byte)] 28 | 0x38 = create cache response 29 | 0x39 = destroy cache request [header] 30 | 0x40 = destroy cache response 31 | 32 | The usual status / error codes will be used in the response. 33 | 34 | Server-side, cache creation should adhere to the following rules: 35 | 36 | * the operation is performed on all nodes 37 | * the operation is complete only when it is completed by all nodes 38 | * an empty configurationName will use the configuration of the default cache 39 | * if the temporary flag is not set, then the cache configuration will be persisted so that it will be recreated on startup 40 | * cache creation would be disallowed on a degraded cluster 41 | * the operation would require the ADMIN permission if authorization is enabled 42 | * if authorization is disabled, the operation would be allowed only when invoked over a loopback interface 43 | 44 | New nodes joining an existing cluster with caches created by the above operation might not have the required cache in their configuration. Therefore the coordinator should send back a join response with the list of running caches which need to be created, including their configuration name as well as whether they are temporary. 45 | 46 | Persisting the configuration will only be possible if the server is in a mode which allows it: 47 | * embedded server: N/A 48 | * standalone mode server: the server applies the configuration to its model so that it can be persisted by the subsystem writer. If part of a cluster, the operation will need to be rolled back in case of failures on other nodes 49 | * domain mode server: the server delegates the configuration change to its domain controller so that it is then applied to all servers in the server group. (TBD) 50 | 51 | -------------------------------------------------------------------------------- /TransportSecurity.asciidoc: -------------------------------------------------------------------------------- 1 | == Server Transport Security 2 | 3 | JGroups offers various protocols to secure the transport, however they require complex setup which could be simplified by the server security configuration. 4 | 5 | == Encryption 6 | 7 | === SSL/TLS 8 | 9 | JGroups does not directly offer the use of TLS/SSL for the TCP transport, although it does allow setting a custom `SocketFactory`. 10 | 11 | By using the server identity of a security realm, we can supply JGroups with an SSL-enabled SocketFactory. The server identity must have both a keystore as well as a truststore so that certificates are mutually trusted. 
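In practice this means building an `SSLContext` from that keystore/truststore pair and exposing it to the transport as a socket factory. A minimal sketch using plain JDK APIs (store type, paths and the actual JGroups wiring are assumptions; the server would derive all of this from the realm configuration):

[source,java]
----
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class ClusterSslContextFactory {
   // Builds the SSLContext backing a "cluster" realm server identity, combining a
   // keystore (this node's certificate) with a truststore (the peers' certificates).
   public static SSLContext build(String keyStorePath, char[] keyStorePassword,
                                  String trustStorePath, char[] trustStorePassword) throws Exception {
      KeyStore keyStore = KeyStore.getInstance("PKCS12");
      try (FileInputStream in = new FileInputStream(keyStorePath)) {
         keyStore.load(in, keyStorePassword);
      }
      KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
      kmf.init(keyStore, keyStorePassword);

      KeyStore trustStore = KeyStore.getInstance("PKCS12");
      try (FileInputStream in = new FileInputStream(trustStorePath)) {
         trustStore.load(in, trustStorePassword);
      }
      TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
      tmf.init(trustStore);

      SSLContext ctx = SSLContext.getInstance("TLS");
      ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
      return ctx;
   }
}
----

The resulting context yields the `SSLSocketFactory`/`SSLServerSocketFactory` pair that would replace the default plain sockets of the JGroups TCP transport.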
12 | 13 | By using namespaces, we can extend the transport configuration as follows: 14 | 15 | [source,xml] 16 | ---- 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | ---- 25 | 26 | To simplify the out-of-the-box configuration, we can include a `cluster` security realm with a commented-out server identity: 27 | 28 | [source,xml] 29 | ---- 30 | 31 | 32 | 33 | 34 | 41 | 42 | 43 | 44 | ---- 45 | 46 | A big advantage of using SSL/TLS is that, if OpenSSL is available, we get native performance. 47 | 48 | === SYM_ENCRYPT (optional) 49 | 50 | The `SYM_ENCRYPT` protocol requires all nodes to use the same secret stored within a keystore. We can use the server's credential stores to supply the secret: 51 | 52 | [source,xml] 53 | ---- 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | ---- 62 | 63 | === ASYM_ENCRYPT (no) 64 | 65 | The complexity of configuring `ASYM_ENCRYPT` doesn't provide any additional value compared to the SSL/TLS approach described above. 66 | 67 | == Authentication (optional) 68 | 69 | When paired with a realm which can authenticate, depending on its features, we can set up JGroups authentication. 70 | 71 | === Trust realm 72 | 73 | If the security realm contains a trust realm, we can automatically enable SSL client authentication without any further configuration 74 | 75 | === Kerberos identity 76 | 77 | If the security realm has a Kerberos identity, we can automatically set up the `AUTH` JGroups protocol with the `KrbToken` 78 | 79 | === Others 80 | 81 | Supporting realms with other types of authentication providers (properties, LDAP), while possible, would require encryption to prevent credential sniffing. 82 | -------------------------------------------------------------------------------- /Lock-Reordering-For-Avoiding-Deadlocks.asciidoc: -------------------------------------------------------------------------------- 1 | === Context 2 | Considering a from-the-book deadlock scenario, with two transactions Tx1 and Tx2 writing on two keys "a" and "b" in different order: 3 | 4 | * Tx1: writes on “a” then “b” 5 | * Tx2: writes on “b” then “a” 6 | 7 | With some “right” timing => deadlock during prepare time. 8 | A common approach for avoiding the above deadlock is by forcing all transactions to write the keys in the same order. 9 | 10 | Considering the previous example, if both Tx1 and Tx2 write the keys in lexicographical order: 11 | 12 | * Tx1: writes on “a” then “b” 13 | * Tx2: writes on “a” then “b” 14 | 15 | Now both transactions access (and lock) the keys in the same order so there's no possibility for deadlock. 16 | 17 | Key reordering is not always possible though: 18 | . it is not always possible to know all the keys written within a transaction a priori 19 | . there is not always possible to define a link:http://en.wikipedia.org/wiki/Total_order[total order] relation over the set of keys. 20 | 21 | ==== Ordering keys for optimistic locking 22 | The above mentioned shortcomings can be overcome in a distributed grid based on link:Optimistic-Locking-In-Infinispan[optimistic locking]: 23 | 24 | . With optimistic locking, the lock acquisition takes place at prepare time. So before acquiring any lock, the set of keys for which locks are to be acquired is known 25 | . 
In order to infer a total order relation over the set of keys we can make use of the consistentHash function: 26 | ** during prepare for each transaction order the keys based on their consistent hash value 27 | *** this assures that all transactions deterministically order their keys, as consistent hash is a deterministic function 28 | ** acquire the locks in this sequence 29 | 30 | This means that locks are NOT acquired in the order in which the corresponding operation has happened. This doesn't have any impact for the user: lock acquisition takes place at prepare time and at this stage the user does not have any control over the transaction. 31 | 32 | There is a chance for two keys touched by the same transaction to have the same consistent hash value. 33 | 34 | In this case: 35 | 36 | * it is still possible to have deadlocks 37 | * the possibility of consistent hash collision to happen is rather small, assuming the consistent hash has a uniform distribution 38 | * even if this collision happens, the consistency guaranteed are not violated 39 | 40 | === Value 41 | This prepare-time ordering approach offers a simple and elegant way of reducing the number of deadlocks when the in memory-gird is configured with optimistic locking: 42 | 43 | * no effort is required from the user to define an total order relation over the keys placed in the grid 44 | * the computational effort for reordering the key is not significant 45 | * the user does not have to know all the keys used within a transaction a priori, that mens a very flexible programming model 46 | 47 | === Related 48 | * this optimization is tracked by link:https://issues.jboss.org/browse/ISPN-1132[ISPN-1132] which also contains the suggested design for implementing it. 49 | * Infinispan's link:Optimistic-Locking-In-Infinispan[optimistic locking] model -------------------------------------------------------------------------------- /Authorization.adoc: -------------------------------------------------------------------------------- 1 | = Authorization 2 | 3 | == Premise 4 | This document outlines a proposal for improving authorization in Infinispan. 5 | Currently, Infinispan authorization is disabled by default in both embedded and server. 6 | Authorization has two main scopes: global and per-cache. 7 | Since per-cache operations authorization has a significant impact on performance, it needs to be explicitly enabled. 8 | 9 | == Current limitations 10 | * Configuration of authorization is complex and there is no OOTB configuration 11 | * The internal roles for schema and script management are not friendly 12 | * The small number of permissions means that the scope of the ADMIN permission is quite broad 13 | 14 | == Proposal 15 | The following changes are proposed: 16 | * Introduce a new CREATE permission 17 | * Include a default set of roles when the user doesn't provide any 18 | * Enable authorization out-of-the box in the server 19 | * (Optional) Introduce a way to "scope" permissions to resource names 20 | 21 | === CREATE permission 22 | Add a CREATE permission to `org.infinispan.security.AuthorizationPermission`. 23 | CREATE allows creation/removal of caches, counters, schemas, scripts. 24 | CREATE supersedes and deprecates the existing `___schema_manager` and `___script_manager` internal roles. 
25 | 26 | === Default roles 27 | Enabling authorization without supplying any roles would define the following default roles: 28 | 29 | * *admin* superuser, allowed to do everything (*ALL*) 30 | * *application* allowed to perform all read/write ops, but not allowed to create/remove caches, schemas, scripts (*ALL_READ*, *ALL_WRITE*, *LISTEN*, *EXEC*) 31 | * *builder* allowed to create/remove caches, schemas, scripts (*ALL_READ*, *ALL_WRITE*, *LISTEN*, *EXEC*, *CREATE*) 32 | * *observer* a read-only role. Can use the CLI/console but all write ops are forbidden (*ALL_READ*) 33 | 34 | === Server authorization 35 | 36 | * The default server configurations should enable authorization OOTB for server/container ops. 37 | * The default role mapper should be the `cluster-role-mapper` which stores principal-to-role mappings in a replicated persistent cache across the cluster 38 | 39 | === Cluster Role Mapper 40 | 41 | * if empty, it behaves like the `identity-role-mapper` so that default behaviour doesn't change, i.e. the identity's principals are mapped 1:1 with roles. 42 | This means an "admin" user will act with the "admin" role. 43 | * CLI `grant` and `deny` commands control the association between principals and roles, e.g. `grant myuser builder` 44 | 45 | 46 | === Role Scoping (Optional) 47 | 48 | Because some permissions (ADMIN) are quite broad, it may make sense to introduce role scoping. 49 | This would allow granting access to one or more roles to specific resources while preserving the usual semantics everywhere else. 50 | For example, to allow users having the `backup` role perrmissions to invoke privileged backup ops: 51 | 52 | [source,xml] 53 | ---- 54 | 55 | 56 | 57 | 58 | 59 | ---- 60 | 61 | This implies that a role with no scopes would be equivalent to the wildcard scope: 62 | 63 | [source,xml] 64 | ---- 65 | 66 | 67 | 68 | 69 | 70 | ---- 71 | 72 | -------------------------------------------------------------------------------- /CacheMultimap.adoc: -------------------------------------------------------------------------------- 1 | = Native cache Mutimap 2 | 3 | The goal is to provide Infinispan Native CacheMultimap support. The proposal here is to start with the smallest interface implementation, in order to provide a proper API to the first multimap client, vert-x project. 4 | However, this API will evolve, and this first implementation must work on Embedded and Server mode. 5 | This first Multimap won't support duplicate values on the same key. 6 | 7 | [source, java] 8 | .CacheMultimap.java 9 | ---- 10 | public interface MultimapCache { 11 | void put(K key, V value); // <1> 12 | 13 | Collection get(K key); // <2> 14 | 15 | boolean remove(K key); //<3> 16 | 17 | boolean remove(K key, V value); 18 | 19 | void remove(Predicate p); //<4> 20 | } 21 | ---- 22 | <1> "Put" indicates that the value is being somehow replaced on a key if the key already exists. An alternative could be to call it "add" 23 | <2> We might want to return a Set instead of Collection interface, because duplicates on same key won't be supported yet. 24 | <3> Calling this method "reset" has been suggested 25 | <4> vert-x implementation needs a way to remove values depending on a given Predicate. To achieve this, we need to provide an API or CacheSet keySet(); method. 26 | 27 | An alternative is to support only Async API instead of the standard sync API. 
28 | 29 | [source, java] 30 | .MultimapCache.java 31 | ---- 32 | public interface MultimapCache { 33 | CompletableFuture put(K key, V value); 34 | 35 | CompletableFuture> get(K key); 36 | 37 | CompletableFuture remove(K key); 38 | 39 | CompletableFuture remove(K key, V value); 40 | 41 | CompletableFuture remove(Predicate p); 42 | } 43 | ---- 44 | 45 | 46 | The underlying implementation will wrap a normal Cache. Some of the suggestions : 47 | 48 | 49 | Apparently there is a problem with functional commands. 50 | They don't really work efficiently over Hot Rod (does get/replace in a loop). 51 | We would need to add some more handling in the protocol to allow for only partial replication 52 | of values and only 1 remote call. 53 | 54 | 55 | == Embedded Multimap 56 | 57 | A new maven module is created. 58 | 59 | To create a multimap cache 60 | 61 | ```java 62 | EmbeddedCacheManager cm = ... 63 | MultimapCacheManager multimapCacheManager = EmbeddedMultimapCacheManagerFactory.from(cm); 64 | multimapCache = multimapCacheManager.get("test"); 65 | 66 | multimapCache.put("k", "v"); 67 | 68 | ``` 69 | 70 | 71 | == Infinispan Server Multimap 72 | 73 | There are at least two options for the implementation 74 | 75 | === Option A 76 | 77 | We define a new interface called RemoteMultimapCache that won't support methods with lambdas, but the most 78 | common and useful methods will be supported. 79 | 80 | We implement it using the existing OperationsFactory. We enrich the header or we add a flag so the server will know that this 81 | call is meant to be over a multimap. So the internal implementation will grab the cache manager and the cache, create the 82 | EmbeddedMultimapCache and call the methods over that instance instead of over the regular cache methods. 83 | 84 | - + Code reuse 85 | 86 | - - Spaghetti code risk 87 | 88 | === Option B 89 | As for Counters, Locks and any other module built on top of the core, we will implement new operations for hotrod that are specific 90 | for multimaps. 91 | 92 | - + Clear separation of modules 93 | - - Duplication of code risk 94 | 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /Custom-Cache-stores-(deployable).md: -------------------------------------------------------------------------------- 1 | # Deployable Cache Stores high level design 2 | 3 | This document describes Deployable Cache Stores implementation details. 4 | 5 | ## Client perspective 6 | 7 | The client will be able to deploy a custom Cache Store jar into Hotrod server (put it into `$HOTROD_SERVER/standalone/deployments`). The jar will need to contain one of the following service entries: 8 | * /META-INF/services/org.infinispan.persistence.spi.AdvancedCacheWriter 9 | * /META-INF/services/org.infinispan.persistence.spi.AdvancedCacheLoader 10 | * /META-INF/services/org.infinispan.persistence.spi.CacheLoader 11 | * /META-INF/services/org.infinispan.persistence.spi.CacheWriter 12 | * /META-INF/services/org.infinispan.persistence.spi.ExternalStore 13 | * /META-INF/services/org.infinispan.persistence.spi.AdvancedLoadWriteStore 14 | 15 | Those services might used later used in the configuration. 16 | 17 | ## Implementation details 18 | 19 | _The implementation is based on Deployable Filters and Converters._ 20 | 21 | Currently all writers and loaders are instantiated in `PersistenceManagerImpl#createLoadersAndWriters`. This implementation will be modified to use `CacheStoreFactoryRegistry`, which will contain a list of `CacheStoreFactories`. 
One of the factories will be added by default - the local one (which will the same mechanism as we do now - `Util.getInstance(classAnnotation)`. Other `CacheStoreFactories` will be added after deployment scanning. 22 | 23 | `PersistenceManagerImpl` is attached to each Cache separately and has a reference to `Configuration` object. During the startup process `PersistenceManagerImpl` scans for configured stores (`PersistenceManagerImpl#createLoadersAndWriters`) and creates separate references for `CacheLoaders` and `CacheWriters`. After modifications, not only local instances will be available (more precisely from local Classloader), but also deployed. 24 | 25 | ## FAQ 26 | 27 | Q: What happens when you have both a cache loader and a cache writer deployed? 28 | A: `PersistenceManagerImpl` uses separate instances for `CacheLoaders` and `CacheWriters`. It is perfectly legal that a custom Cache Store class implements both of those interfaces. In that case the same instance will be used for loading and writing persistence data. Here is a sample block of code from, which illustrates this idea: 29 | ``` 30 | PersistenceManagerImpl#createLoadersAndWriters: 31 | Object instance = Util.getInstance(classAnnotation); 32 | CacheWriter writer = instance instanceof CacheWriter ? (CacheWriter) instance : null; 33 | CacheLoader loader = instance instanceof CacheLoader ? (CacheLoader) instance : null; 34 | ... 35 | loaders.add(loader); 36 | writers.add(writer); 37 | ``` 38 | 39 | Q: What happens when you have a deployed Cache Store and it is also accessible locally? 40 | A: The deployed one will be used. There are 3 options we could use here: 41 | * throw some `DuplicatedCacheStore` exception to indicate such situation (since this is not a total disaster I wouldn't do that) 42 | * Use the local one 43 | * Use the deployed one (my vote) 44 | I think using the deployed one might be useful for upgrades (for example you have a MyCustomStore v. 1.0 accessible locally and MyCustomStore v. 1.1 deployed - you would probably prefer using the newer one). 45 | 46 | Q: How to tie a deployed cache loader/writer with the configuration itself? 47 | A: It is already done via configuration. `PersistenceManagerImpl#createLoadersAndWriters` scans configuration and extracts Cache Store class name from: 48 | * `ConfigurationFor#value` 49 | * `CustomStoreConfiguration#customStoreClass` 50 | Later on this name is used to instantiate loader/writer. 51 | 52 | Q: Are deployed Cache Stores "active" by default? 53 | A: Yes and no. They are accessible and can be instantiated, but without any configuration - they won't be attached to any Cache. 
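To make the factory-based lookup described in the implementation section more concrete, here is a rough sketch of the `CacheStoreFactory`/`CacheStoreFactoryRegistry` pair. The type names come from this proposal, but the method signatures, the null-means-cannot-create convention, and the precedence handling are assumptions, not a final SPI:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch only: real SPI names and signatures may differ.
interface CacheStoreFactory {
   // Returns an instance of the configured store class, or null if this
   // factory cannot create it (e.g. the class is not visible to it).
   <T> T createInstance(String storeClassName);
}

class CacheStoreFactoryRegistry {

   private final List<CacheStoreFactory> factories = new CopyOnWriteArrayList<>();

   // The local (classpath-based) factory is registered by default;
   // deployment scanning registers additional factories later.
   void addCacheStoreFactory(CacheStoreFactory factory) {
      // Deployed factories are consulted first, so a deployed store wins
      // over a locally accessible one with the same class name.
      factories.add(0, factory);
   }

   <T> T createInstance(String storeClassName) {
      for (CacheStoreFactory factory : factories) {
         T instance = factory.createInstance(storeClassName);
         if (instance != null) {
            return instance;
         }
      }
      throw new IllegalArgumentException("No factory could instantiate " + storeClassName);
   }
}
```

Consulting deployed factories first matches the FAQ answer above: a deployed MyCustomStore v1.1 would take precedence over a locally accessible v1.0.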
54 | 55 | 56 | 57 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Infinispan Designs 2 | ================== 3 | 4 | This repository contains various design proposals related to Infinispan 5 | 6 | * [A continuum of data structure and query complexity](A-continuum-of-data-structure-and-query-complexity.asciidoc) 7 | * [Alias caches](Alias-Caches.md) 8 | * [Asymmetric caches and manual rehashing](Asymmetric-Caches-and-Manual-Rehashing-Design.asciidoc) 9 | * [Cache Store Subsystems](Cache-Store-Subsystems.md) 10 | * [Clustered Listeners](Clustered-listeners.md) 11 | * [Cluster Registry](Cluster-Registry.md) 12 | * [Compatibility 2.0](Compatibility-2.0.md) 13 | * [Conflict resolution](Conflict-resolution.md) 14 | * [Consistency guarantees in Infinispan](Consistency-guarantees-in-Infinispan.asciidoc) 15 | * [Continuous query design and indexless queries](Continuous-query-design-and-indexless-queries.asciidoc) 16 | * [Create Cache over HotRod](Create-Cache-over-HotRod.md) 17 | * [Custom Cache stores (deployable)](Custom-Cache-stores-(deployable).md) 18 | * [Deelog: direct integration with Debezium](Deelog%3A-direct-integration-with-Debezium.asciidoc) 19 | * [Design For Cross Site Replication](Design-For-Cross-Site-Replication.asciidoc) 20 | * [Design Wiki Rules](Design-Wiki-Rules.md) 21 | * [Distributed Stream Sorting](Distributed-Stream-Sorting.md) 22 | * [Distributed Stream Support](Distributed-Stream-Support.md) 23 | * [Dynamic JMX exposer for Configuration](Dynamic-JMX-exposer-for-Configuration.md) 24 | * [Fine-grained security for caches](Fine-grained-security-for-caches.md) 25 | * [Graceful shutdown & restore](Graceful-shutdown-&-restore.md) 26 | * [Handling cluster partitions](Handling-cluster-partitions.md) 27 | * [Health-check API](Health-check-API.asciidoc) 28 | * [Hot cache via Debezium](Hot-cache-via-Debezium.asciidoc) 29 | * [Incremental Optimistic Locking](Incremental-Optimistic-Locking.asciidoc) 30 | * [Index affinity proposal](Index-affinity-proposal.asciidoc) 31 | * [CLI](Infinispan-CLI.asciidoc) 32 | * [Hibernate Second-Level Cache improvements](Infinispan-Hibernate-Second-Level-Cache-improvements.md) 33 | * [Query - Design and Planning](Infinispan-Query---Design-and-Planning.asciidoc) 34 | * [Query language syntax and considerations](Infinispan-query-language-syntax-and-considerations.asciidoc) 35 | * [Java 8 API proposal](Java-8-API-proposal.md) 36 | * [Lock Reordering For Avoiding Deadlocks](Lock-Reordering-For-Avoiding-Deadlocks.asciidoc) 37 | * [Multimap as a first class data structure](Multimap-As-A-First-Class-Data-Structure.asciidoc) 38 | * [Multi-tenancy for Hotrod Server](Multi-tenancy-for-Hotrod-Server.asciidoc) 39 | * [Near-Caching](Near-Caching.md) 40 | * [Non-Blocking State Transfer](Non-Blocking-State-Transfer.asciidoc) 41 | * [Non-Blocking State Transfer V2](Non-Blocking-State-Transfer-V2.asciidoc) 42 | * [Off-Heap Data Container](Off-Heap-Data-Container.md) 43 | * [Off-Heap Implementation](Off-Heap-Implementation.md) 44 | * [Optimistic Locking In Infinispan](Optimistic-Locking-In-Infinispan.asciidoc) 45 | * [RAC: Reliable Asynchronous Clustering](RAC%3A-Reliable-Asynchronous-Clustering.asciidoc) 46 | * [RAC: Implementation Details](RAC-implementation.md) 47 | * [Remote Admin Client Library](Remote-Admin-Client-Library.md) 48 | * [Remote Command Handler](Remote-Command-Handler.md) 49 | * [Remote Hot Rod 
Events](Remote-Hot-Rod-Events.asciidoc) 50 | * [Remote Iterator](Remote-Iterator.md) 51 | * [Remote Listeners improvement proposal](Remote-Listeners-improvement-proposal.md) 52 | * [Scattered Cache design doc](Scattered-Cache-design-doc.md) 53 | * [Security](Security.asciidoc) 54 | * [Smoke Testsuite](Smoke-Testsuite.md) 55 | * [Spring 5 features, ideas and integration](Spring-5-features,-ideas-and-integration.md) 56 | * [Task Execution Design](Task-Execution-Design.md) 57 | * [Topology Id Rework](TopologyId.asciidoc) 58 | * [Total Order non Transactional Cache](Total-Order-non-Transactional-Cache.md) 59 | * [XSite Failover for Hot Rod clients](XSite-Failover-for-Hot-Rod-clients.md) 60 | -------------------------------------------------------------------------------- /Server Restructuring.md: -------------------------------------------------------------------------------- 1 | Infinispan Server Restructuring 2 | =============================== 3 | 4 | Status quo 5 | ---------- 6 | 7 | Infinispan Server is currently based on the full WildFly distribution. 8 | It is composed of the following additional components: 9 | 10 | * a JGroups subsystem which is mostly aligned with the WildFly with some additions (e.g. SASL integration) 11 | * an Infinispan subsystem, originally forked from the WildFly one, but much more complete in terms of feature coverage, with support for deployments 12 | * an Endpoint subsystem, unique to Infinispan Server 13 | * the Infinispan admin console 14 | * a CLI which extends the WildFly CLI with some additional convenience commands 15 | 16 | We use the WildFly feature pack and server provisioning tools to create the Infinispan Server package based on the above subsystem and on any additional modules we require (including new versions of modules already supplied by WildFly). 17 | We also remove some unnecessary modules based on the output of a Python script which computes module/subsystem dependencies. 18 | Unfortunately the "trimming process" is not as effective as it could be because some of the modules, while not needed by Infinispan Server, are hard dependencies of other server modules. In particular we rely on the Datasource and Transaction subsystem which have a lot of dependencies. 19 | 20 | Basing Infinispan Server on WildFly has a number of advantages and disadvantages. 
21 | 22 | Advantages 23 | 24 | * JBoss Modules combined with the service model provides module and service dependency tracking with concurrent loading 25 | * User deployment support 26 | * DMR with support for inspecting / modifying / persisting configuration at runtime, with propagation to all nodes in the domain, accessible via CLI, Console and remoting 27 | * Security realms with support for certificate stores, LDAP integration, Kerberos, etc 28 | * The subsystems can be packaged as a WildFly layer to provide newer features to WF deployments 29 | 30 | Disadvantages 31 | 32 | * Integrating with the server configuration model adds a lot of overhead 33 | * Larger than needed footprint 34 | * Much of the functionality supplied by WildFly is unnecessary in container deployments such as OpenShift 35 | 36 | Redesign proposals 37 | ------------------ 38 | 39 | For future development we want to pursue two paths: one which maintains compatibility with the current situation but resolves some of the above disadvantages and a more aggressive one tailored for containers 40 | 41 | Plan A: Base the server on WildFly Core 42 | 43 | * Make the dependency on the transactions and datasource subsystems optional by providing our own internal implementations by directly integrating with Narayana and HikariCP 44 | * Still work with the transactions/datasource subsystem if they are present (to support the WildFly layering) 45 | * Need to still support deployable JDBC drivers 46 | 47 | Plan B: Create an uber-jar slim server 48 | 49 | * Add some bootstrapping code 50 | * no need for WildFly Swarm or Spring Boot 51 | * Configuration 52 | * use the embedded Infinispan schema 53 | * configure endpoints and other components via the global modules API 54 | * persist configuration changes using the existing configuration serializer 55 | * add runtime clustered configuration capabilities (store configs in an internal cache) 56 | * Security 57 | * Depend on Elytron for LDAP, Properties, Certificates, Kerberos, KeyCloak, etc integration 58 | * Custom user "deployments" 59 | * Load custom code placed in an extension directory (use JBoss Modules?) 60 | * Scan only at startup 61 | * use META-INF/services/ discovery 62 | * Management 63 | * Runtime management performed via JMX 64 | * Revamp the Embedded CLI 65 | * The web console can be integrated with Jolokia which exposes JMX MBeans as RESTful endpoints 66 | 67 | -------------------------------------------------------------------------------- /Distributed-Stream-Sorting.md: -------------------------------------------------------------------------------- 1 | Infinispan recently added support for streams to be used in a performant way on a distributed cache. Unfortunately some methods such as sort were not implemented in a distributed way. Instead the entire contents of the cache was copied to the originating node and a final sort was done on all of the contents in memory. This while valid can be an extremely inefficient usage of memory and in some cases even run the node out of memory resulting in an OutOfMemoryException. 2 | 3 | This however will change when Distributed Sorting is introduced. The general idea is a [merge sort](https://en.wikipedia.org/wiki/Merge_sort) where each node contains a sub set of data that is independently sorted first. Then the data is finally sorted at the originator. This way we can also utilize the CPUs across the cluster to do initial sorting on each node so the coordinator only has to sort sets of already sorted data which is faster. 
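For illustration only, the final step on the originator boils down to a k-way merge of the already-sorted streams returned by each node, something along these lines. This sketch ignores the batching concern discussed next and simply materializes the result; it is not the actual implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of pre-sorted per-node streams.
final class SortedMerge {

   static <T> List<T> merge(List<Iterator<T>> sortedPerNode, Comparator<? super T> comparator) {
      // The queue holds the current head of each node's sorted stream.
      PriorityQueue<Head<T>> heads = new PriorityQueue<>(
            Comparator.comparing((Head<T> h) -> h.value, comparator));
      for (Iterator<T> it : sortedPerNode) {
         if (it.hasNext()) {
            heads.add(new Head<>(it.next(), it));
         }
      }
      List<T> result = new ArrayList<>();
      while (!heads.isEmpty()) {
         // Take the smallest head, then refill the queue from that node's stream.
         Head<T> smallest = heads.poll();
         result.add(smallest.value);
         if (smallest.rest.hasNext()) {
            heads.add(new Head<>(smallest.rest.next(), smallest.rest));
         }
      }
      return result;
   }

   private static final class Head<T> {
      final T value;
      final Iterator<T> rest;

      Head(T value, Iterator<T> rest) {
         this.value = value;
         this.rest = rest;
      }
   }
}
```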
4 | 5 | Even doing just a merge sort will not help with memory. In actually it can be a bit worse because now each node could possibly run into OutOfMemory errors if they have cache stores that have more data than they could hold in memory. Thus we need some way of using batched sizes to prevent this as well. The distributedBatchSize is used to control how many sorted elements are retained at a time on each node before sending those batches to the originator. This then requires each node to iterated upon a number of times equal to N / b times where N is the number of primary owned elements on that node and b is the batch size. 6 | 7 | ## Non Rehash 8 | 9 | Below is the pseudo code for the remote and coordinator side sorting for use when doing a sort when rehash is not supported. 10 | 11 | ### Remote Node Sort 12 | 13 | array = new Array[batchSize * 2] 14 | offset = 0 15 | foreach element in cache { 16 | if offset < batchSize || array[batchSize -1] < element 17 | array[offset++] = element 18 | if offset == array.length 19 | sort array 20 | offset = batchSize 21 | } 22 | 23 | Then when the entries are used for a stream the array is sorted and only the elements up to batchSize are returned. Subsequent iterations only add an element to the array if the element is larger than the last element from the last batch. 24 | 25 | The local coordinator holds all the responses from the various nodes including itself with a BoundedQueue which blocks that thread until it has sent all of its responses. 26 | 27 | ### Coordinator Sort 28 | 29 | array = new Array[nodeCount] 30 | nodeReturnedLastValue = null 31 | completedNodes = 0 32 | 33 | method getNext 34 | if nodeReturnedLastValue == null 35 | offset = 0 36 | foreach node 37 | array[offset++] = node.queue.next 38 | sort array 39 | else 40 | value = nodeReturnedLastValue.queue.next 41 | if value != null 42 | array[completedNodes] = nodeReturnedLastValue.queue.next 43 | sort array 44 | else 45 | completedNodes++ 46 | 47 | nodeReturnedLastValue = node where value originated 48 | return array[completedNodes] 49 | 50 | ## Rehash sort 51 | 52 | Rehash sort is quite a bit similar to the non rehash. If a node finds that a rehash occurs while iterating upon the cache entries it will send a message to the originator to say what the new view id is. The originator upon receiving said message will wait for all nodes to send the same message or a completion message (in the case the node completed iteration without the rehash). The originator after this state has occurred will only allow elements received until 1 node has run out but wasn't completed. At this point any additional elements retrieved must be ignored. The originator will then send a new batch of requests but will also send along the highest value it has seen to filter out any values that are smaller than it in the iteration process. -------------------------------------------------------------------------------- /Infinispan-CLI.asciidoc: -------------------------------------------------------------------------------- 1 | This document illustrates the requirements that an Infinispan CLI should implement in order to be useful and flexible. The CLI should be able to expose most (all) of the features of the Core API and of any modules which may be present on a running instance. Currently the only two methods which allow remote interaction with an Infinispan instance are JMX and Hot Rod, but both have limitations on the type of operations that can be performed. 
2 | 3 | === Operations 4 | The CLI should implement the following operations: 5 | 6 | * Selecting the default cache on which operations will be applied when an explicit cache is not specified 7 | * CRUD ops such as GET, PUT, REPLACE and REMOVE (with attributes for expiration) 8 | * Bulk ops such as CLEAR, PUTALL, possibly reading input from an external file (CSV, XML, JSON) 9 | * Set specific Flags 10 | * Batching START and END 11 | * Transaction BEGIN, COMMIT and ROLLBACK 12 | * Manually control otherwise automatic processes (eviction, rehashing, etc) 13 | * Query indexed caches 14 | * Start / Stop containers and caches 15 | * Modifying configuration at runtime 16 | * Obtaining statistics 17 | * Integration with some form of scripting (no need to invent something, just reuse JSR-223) to be used with Distributed Executors, the Map Reduce API and possibly Listeners 18 | 19 | === Syntax 20 | * The syntax should be concise but comprehensible, using simple verb commands like GET, PUT, etc. 21 | 22 | put key value 23 | get key 24 | 25 | * Expiration, max idle times and flags: 26 | 27 | put key value expires 1h maxidle 10m 28 | put key value flags (force_sync) 29 | 30 | * Quoting of values should be optional where unambiguous. 31 | 32 | put key 'my value' 33 | 34 | * Type for primitive values should be inferred when possible, using the usual Java notation 35 | 36 | put key 1 # 1 is parsed as an int, and therefore boxed as an Integer when inserted in the cache 37 | put key '1' # 1 is parsed as a string 38 | 39 | * Non-primitive values should be specified using JSON (Jackson) or XML (XStream) notation 40 | 41 | put key { 'a', 'b' 'c' } # the value is an ArrayList 42 | put key 'Widget' 43 | 44 | * Multiple values (e.g. for putall) could be as follows: 45 | 46 | put { key1: value1, key2: value2, key3: value3 } 47 | 48 | * Data should be displayed using different formatters 49 | 50 | get key # display data using the value’s toString() method 51 | get key as xml # display data in XML format (possibly using xstream) 52 | get key as json # display data in JSON format (possibly using jackson) 53 | 54 | * The target cache for operations can either be a default cache or defined explicitly 55 | 56 | cache myCache # selects the default cache 57 | put key value # puts the entry in myCache 58 | put otherCache.key value # puts the entry in otherCache 59 | 60 | * Scripting functions are declared inline as follows: 61 | 62 | exec myCache callable:function(cache, keys) {} 63 | mapreduce myCache mapper:function() {} reducer:function() {} collator:function() {} 64 | 65 | === Processing 66 | The command interpreter would reside entirely within the server: the client is merely responsible for performing input and output and possibly "finger-candy" such as tab completion of commands. The interpreter also needs to handle statefulness of the CLI session (and therefore session management). 67 | 68 | === Connectivity 69 | Most databases (SQL and noSQL alike) use their remote client port also for admin ops. We have the following possibilities: 70 | 71 | * Use Hot Rod (pro: our remote protocol, cons: requires the server to be running, we need to implement security) 72 | * Use JMX (pro: ubiquitous, security as part of the protocol, nothing to develop, cons: ?) 
73 | * Use a custom port (possibly using the simple com.sun.net.httpserver as transport) (pro: dedicated for admin ops, cons: more implementation overhead) 74 | 75 | === Security 76 | In case the transport we select does not provide its own security already, we need to consider at least authentication and possibly encryption. -------------------------------------------------------------------------------- /Near-Caching.md: -------------------------------------------------------------------------------- 1 | The scope of Infinispan Near Caches is to add optional L1 caching to Hot Rod client implementations in order to keep recently accessed data close to the user. This L1 cache is essentially a local Hot Rod client cache that gets updated whenever an a remote entry is retrieved via `get` or `getVersioned` operations. By default, Hot Rod clients have near caching disabled. 2 | 3 | Near Cache consistency is achieved with the help of remote events, which send notifications to clients when entries are modified or removed. At the client level, near caching can be configured to either be Lazy or Eager. When enabling near caching, the user must decide between these two modes. 4 | 5 | # Lazy Near Caches 6 | 7 | Entries are only added to the lazy near caches when they are retrieved remotely via `get` or `getVersioned`. If the cache entry is modified or removed server side, the Hot Rod client receive events which in turn invalidate the near cache entries by removing them from the near cache. This way of keeping near cache consistent is very efficient because the events sent back to the client only contain key information. The downside is that if a cache entry is retrieved after it's been modified, the Hot Rod client will have to fetch it from the remote server. 8 | 9 | # Eager Near Caches 10 | 11 | Eager near caches are eagerly populated as entries are created in the server, and when entries are modified, the latest value is sent along with the notification to the client and it stores it in the near cache. Eager caches, same as lazy near caches, are also populated when an entry is retrieved remotely (if not already present). The advantage of eager near caches is that it can reduce the number of trips to the server by having newly created entries already present in the near cache before any requests to retrieve it have been received. Also, if modified entries are re-queried by the client, these can be fetched directly from the near cache. The disadvantage of eager near caching is that events received from the server bigger in size due to the need to ship back value information, and entries could be sent to client that will never be queried. 12 | 13 | To limit the cost the eager near caching, created events are only received for those entries that are created after the client has connected to the Hot Rod servers. Existing cached entries will only be added to the near cache if the user queries them 14 | 15 | ## Filtering 16 | 17 | Some of the downsides of eager near caches could be mitigated by enabling users to add filtering of near cache entries. In other words, the users could potentially in the future provide a filter implementation that defines which keys to fetch eagerly. This filter could be used to enable existing cached entries to be sent back to the client when the client connects to the Hot Rod servers, and could also be used to filter which created events to send back once it's already connected. 
**This feature is not currently planned for inclusion** 18 | 19 | # Eviction 20 | 21 | Being able to keep control the size of near caches is important. Even though not enabled by default, eviction of near caches can be configured by defining the maximum number of elements to keep in the near cache. When eviction is enabled, an LRU LinkedHashMap is used (protected by a `ReentrantReadWrite` lock to deal with concurrent updates). A better solution based around a BoundedCHMv8 is planned. 22 | 23 | # Cluster 24 | 25 | Near caches are implemented using Hot Rod remote events, which underneath use cluster listeners as technology for receiving events across the cluster. A key characteristic of cluster listeners is that within a cluster they are installed in a single node, with the rest of nodes sending events to the this node. So, it's possible for node that runs the near cache backing cluster listener to go down. In such situation, another node takes over running the cluster listener. When such thing happens, a client failover event callback can be defined to be invoked. For near caches, such callback will be implemented and its implementation will consist of clearing the near cache. Clearing is the simplest and most efficient thing to do, since during the failover events might have been missed. -------------------------------------------------------------------------------- /Off-Heap-Data-Container.md: -------------------------------------------------------------------------------- 1 | Currently the data container that holds all of the entries in memory stores them in the Java heap. This area is subject to garbage collection and others. Unfortunately the default garbage collector can cause stop the world pauses where it can halt the entire JVM from processing. 2 | 3 | The easiest way to avoid these performance issues is to not store these entries in the Java Heap. Easier said than done! Below you will find a few ways that you could do this 4 | 5 | 1. Use a third party tool to implement DataContainer (ie. MapDB, ~~ChronicleMap~~ etc.) - http://stackoverflow.com/questions/7705895/is-there-a-open-source-off-heap-cache-solution-for-java 6 | 2. Allow for data container to store nothing and only ever go to Store. Then we could utilize something like SIFS using a filesystem pointing to ramcache 7 | 3. Implement our own using netty as a an example of how to do the memory allocations. 8 | 9 | # Things that are incompatible 10 | 11 | ### Write skew check 12 | Currently this relies on object instance equality 13 | ### Atomic Map 14 | This updates instance directly, thus ICE factory update has to make a new instance. Also proxy requires some additional changes. This doesn't seem to work properly also because of Read Committed below... 15 | ### Class Resolver 16 | Can't figure out quite yet how to get the class loader to work properly with new DBContainer 17 | ### Read Committed 18 | Read committed cannot be used since everything is serialized when stored. Only repeatable read may be used. This is besides the fact that read committed has issues in a DIST cache anyways. 19 | 20 | 21 | ## Map DB array perf 22 | Result: 251528.669 ±(99.9%) 3849.842 ops/s [Average] 23 | Statistics: (min, avg, max) = (204159.349, 251528.669, 271807.636), stdev = 16300.471 24 | Confidence interval (99.9%): [247678.827, 255378.510] 25 | 26 | 27 | Run complete. 
Total time: 00:13:36 28 | 29 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 30 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 31 | IspnBenchmark.testGet| nonTxCache| thrpt| 200| 1657296.817| ± 26651.726| ops/s| 32 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 251528.669| ± 3849.842| ops/s| 33 | 34 | ## MapDB Direct perf 35 | 36 | Result: 106203.466 ±(99.9%) 1703.474 ops/s [Average] 37 | Statistics: (min, avg, max) = (86628.725, 106203.466, 117797.812), stdev = 7212.614 38 | Confidence interval (99.9%): [104499.992, 107906.939] 39 | 40 | 41 | Run complete. Total time: 00:13:37 42 | 43 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 44 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 45 | IspnBenchmark.testGet| nonTxCache| thrpt| 200| 1608557.747| ± 21212.189| ops/s| 46 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 106203.466| ± 1703.474| ops/s| 47 | 48 | 49 | ## Master perf 50 | Result: 1318669.580 ±(99.9%) 17467.592 ops/s [Average] 51 | Statistics: (min, avg, max) = (1043144.741, 1318669.580, 1453850.008), stdev = 73958.877 52 | Confidence interval (99.9%): [1301201.989, 1336137.172] 53 | 54 | 55 | Run complete. Total time: 00:13:35 56 | 57 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 58 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 59 | |IspnBenchmark.testGet| nonTxCache| thrpt| 200| 20555714.561| ± 274213.492| ops/s| 60 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 1318669.580| ± 17467.592| ops/s| 61 | 62 | ## SFS perf 63 | Result: 74038.493 ±(99.9%) 1512.844 ops/s [Average] 64 | Statistics: (min, avg, max) = (42949.523, 74038.493, 81948.143), stdev = 6405.476 65 | Confidence interval (99.9%): [72525.649, 75551.337] 66 | 67 | 68 | Run complete. Total time: 00:13:36 69 | 70 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 71 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 72 | IspnBenchmark.testGet| nonTxCache| thrpt| 200| 774529.207| ± 12164.148| ops/s| 73 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 74038.493| ± 1512.844| ops/s| 74 | 75 | -------------------------------------------------------------------------------- /Total-Order-non-Transactional-Cache.md: -------------------------------------------------------------------------------- 1 | #Summary 2 | 3 | Use the total order protocol to handle commands in non transactional cache. 4 | 5 | #Changes in configuration: 6 | 7 | * `transaction-protocol` attribute will be removed from `` tag. 8 | * `TOTAL_ORDER` option will be added `locking-mode`. 9 | 10 | #Description (for a normal scenario): 11 | 12 | A client invoked a write operation in node `N`: 13 | * `N` sends a total order message to all the owners with the write operation; 14 | * All owners deliver the write operation and perform it; 15 | * `N` waits for all the replies and returns the result back to the client; 16 | 17 | **Why it works?** 18 | 19 | Since the write operations are deliver in total order by all the owners of the key(s), then when the write operation is deliver, it sees the same state and performs the same steps, generation the same new state. 20 | 21 | **Note:** 22 | In the initial version, the commands are processed directly in the JGroups thread. 
If it justifies, a new non-transactional total order manager may be created and the operations start to be processed in the remote executor service. 23 | 24 | **Optimizations:** 25 | The originator does not need to wait for all the replies. The reply can be marked as `STABLE` or `UNSTABLE` (or other name; the goal is to mark as unstable the replies from commands processed during the state transfer). Then, the originator waits: 26 | 27 | 1. for the first reply is the reply is marked as `STABLE`, or it is an exception, or `IGNORE_RETURN_VALUE` is set and the operation is non-conditional; 28 | 2. for the first non-null reply when the reply is marked as `UNSTABLE`. 29 | 30 | #State Transfer 31 | 32 | As in total order transactional caches, the cache topology commands which modifies the topology ID are sent in total order. Having this in mind, we have: 33 | 34 | * **when the command topology ID is different from the current topology ID** 35 | 36 | Exception is throw. Originator re-sends the operation. 37 | 38 | **Why?** 39 | 40 | If the operation is delivered in a different topology ID, then the new owners will not see it, breaking the data consistency. 41 | 42 | **Can't it forward the operation to the new owners?** 43 | 44 | No. No order is guarantee when forward operations. Imagine the following scenario: 45 | 46 | 1. Node `N` sends two operations `C1` and `C2` for the same key. `C1` is sent in topology `i` and `C2` in topology `i+1`. 47 | 2. `C1` and `C2` are delivered in topology `i+1` and `C1` is delivered before `C2`. 48 | 3. If `C1` is forward, the new owner could delivered the operation `C1` following by `C2` or `C2` following `C1`. If the later happens, the data consistency is broken and the algorithm too. 49 | 50 | **optimization:** 51 | In replicated caches, if only nodes are leaving, the operation can be processed normally. 52 | 53 | * **when a conditional operation is delivered and state transfer is in progress** 54 | 55 | Exception is throw. Originator should block the operation until the state transfer is finished. 56 | 57 | **Why?** 58 | 59 | When the operation is delivered in the new owners, they don't have the current value to check the condition. 60 | 61 | **Can't it fetch the current value from old owners?** 62 | 63 | No. The remote get is not ordered with the operation, so it can return a value after or before the operation. If a value before the operation is returned, the consistency is broken. 64 | 65 | **Can't the command wait in the new owner until the state is received?** 66 | 67 | No. It is going to block the total order deliver thread and it will never deliver the new topology ID. 68 | 69 | **optimization:** 70 | if the key is not affected by the state transfer (i.e. the ownership didn't change), then the conditional operation can be processed normally. However, it causes the following: 71 | 72 | 1. in replicated caches, a node joining will block all the conditional operations; 73 | 2. in replicated caches, a node leaving does not block any operations. 74 | 75 | * **when a non-conditional operation is delivered and state transfer is in progress** 76 | 77 | The command is processed normally and the originator takes the replies and return the value different from `null` (if any). 78 | 79 | **Why?** 80 | 81 | First, the operation does not dependent on the state of the data. Second, the old owners will return the current value when the operation is deliver and the new owner can return `null` or the current value (if it was already received from state transfer). 
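For reference, the configuration change described at the top of this document might look roughly as follows. Everything except the `TOTAL_ORDER` value for `locking-mode` is an assumption about the final schema; the element the attribute ends up on has not been decided here:

```xml
<!-- Hypothetical sketch only: element and attribute placement may differ in the final schema. -->
<distributed-cache name="totalOrderCache" mode="SYNC">
   <!-- This proposal adds TOTAL_ORDER as a new locking-mode option for non-transactional caches. -->
   <transaction locking-mode="TOTAL_ORDER"/>
</distributed-cache>
```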
-------------------------------------------------------------------------------- /Multimap-As-A-First-Class-Data-Structure.asciidoc: -------------------------------------------------------------------------------- 1 | = Dynamic RBAC 2 | 3 | == Tracking Issue 4 | 5 | https://issues.redhat.com/browse/ISPN-13853 6 | 7 | == Description 8 | 9 | Multimap is a data structure that has gotten the attention from Infinispan users. 10 | The structure today lacks evolutions and support of features and new API of Infinispan: 11 | 12 | * REST API support 13 | * Supporting duplicates 14 | * Reactive API integration and development 15 | * Integration with Quarkus through the extension 16 | * Search and query support 17 | 18 | == Compatibility impact 19 | 20 | Ensure that the old Remote API and configuration still works. 21 | 22 | Under the hood, we will still keep using distributed caches, but we will know with the configuration that those 23 | caches are meant to behave as multimaps. This way we will be able to change the implementation details in 24 | the future if the structure becomes even more popular. 25 | 26 | Indexing and Query should evolve to detect that the content to be indexed is inside the list/set of the key/value. 27 | Query will be supported *only* if indexing is enabled. 28 | 29 | == Security impact 30 | 31 | N/A 32 | 33 | == Configuration Schema 34 | 35 | [source,xml] 36 | ---- 37 | 38 | 39 | 40 | 41 | 42 | Determines if the multimap supports duplicates in the values. True or false (default). 43 | 44 | 45 | 46 | 47 | 48 | ---- 49 | 50 | 51 | == Public API 52 | 53 | The work will be done for the new API only. 54 | 55 | == Deprecations 56 | 57 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCache` 58 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCacheManager` 59 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCacheManagerFactory` 60 | * `org.infinispan.client.hotrod.multimap.MultimapCacheManager` 61 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCache` 62 | * `org.infinispan.client.hotrod.multimap.MetadataCollection` 63 | 64 | == Hot Rod API 65 | 66 | The new API will be developed. Provide implementations for the available APIs: 67 | 68 | * org.infinispan.api.async.SyncMultiMap 69 | * org.infinispan.api.async.SyncMultiMaps 70 | * org.infinispan.api.async.MutinyMultiMap 71 | * org.infinispan.api.async.MutinyMultiMaps 72 | * org.infinispan.api.async.AsyncMultiMap 73 | * org.infinispan.api.async.AsyncMultiMaps 74 | 75 | == REST API 76 | 77 | Multimap REST API should be the almost the same as the Cache one. 78 | The difference is in the endpoint. 79 | 80 | Examples: 81 | [source] 82 | ---- 83 | GET /rest/v2/multimaps/ 84 | 85 | GET /rest/v2/multimaps/{multimapName}?action=??? 86 | 87 | DELETE /rest/v2/multimaps/{multimapName}/{multimapKey} 88 | ---- 89 | 90 | * CRUD multimap (providing a configuration to create a multimap, templates not supported) 91 | * Display Multimap content entries 92 | * Get values by key 93 | * In Indexing is enabled, multimaps support query 94 | * CRUD Key/Value. If a key exists we add the value. To remove a key/value we need to send the value as well. 95 | * CRUD Key/AllValues. We remove the entry completely. 
96 | * Clearing Multimap operation 97 | * Enable/Disable rebalancing 98 | * Availability get/set 99 | * Replace not supported 100 | * Listeners support to enable interactions with near-caching 101 | 102 | 103 | == CLI 104 | 105 | * `*create multimap* --file=mymultimap.xml myMultimap` Create a new multimap 106 | * `*ls multimap* --file=mymultimap.xml myMultimap` List all multimaps 107 | * `*describe multimap/name* View the configuration 108 | 109 | etc ... 110 | 111 | Provide the same CLI methods but targeting multimap instead of caches. 112 | 113 | == Console 114 | 115 | The console should use the new REST API and provide a new interface separate from the Caches tab 116 | where we will be able to CRUD operations over multimaps and display correctly the key/values of 117 | the multimap. 118 | The UX team should be in the loop to check out the design. 119 | 120 | == Operator 121 | 122 | Provide a MultimapCR 123 | 124 | == Quarkus Integration 125 | Integrate in Quarkus developing a new annotation `@Multimap` that will allow the use of dependency inyection 126 | to work with Multimaps in Quarkus. -------------------------------------------------------------------------------- /Off-Heap-Implementation.md: -------------------------------------------------------------------------------- 1 | Infinispan currently only stores entries on the Java heap. This can cause issues if the cache is used in such a way that a large amount of stop the world garbage collections are performed. One such fix is to write these objects to a cache store, however this can cause performance issues. Instead Infinispan would like to instead of storing entries on the Java heap but rather to store them in off heap memory managed by Infinispan itself. This would allow for more memory to be used and should reduce significantly the number of garbage collections required. 2 | 3 | 4 | This requires quite of a bit of work to be done in the core module of Infinispan. Before I go into the details of how the overall design for core will go I will list some limitations 5 | 6 | ## Limitations 7 | 8 | 1. Write skew check 9 | Currently this relies on object instance equality, which will not work since objects are required to be deserialized/serialized 10 | 2. Atomic Map 11 | This updates instance directly, thus ICE factory update has to make a new instance. Also proxy requires some additional changes. This doesn't seem to work properly also because of Read Committed below... 12 | 3. Read Committed 13 | Read committed cannot be used since everything is serialized when stored. Only repeatable read may be used. This is besides the fact that read committed has issues in a DIST cache anyways. 14 | 4. DeltaAware 15 | DeltaAware would technically work, but it will take a significant performance hit due to the fact that objects would have to be fully deserialized then apply delta and then serialized again. 16 | 5. Store as Binary 17 | Both this and off heap together do not make sense 18 | 19 | ## Configuration 20 | 21 | Off heap memory would be configured under a new element ("STORAGE"?, "MEMORY"?, under data container?) along with storeAsBinary and wrapByteArray. The wrapByteArray is a new configuration to replace equivalence that allows for byte[] to be used as key or values. This element is mutually exclusive as it wouldn't make sense to use them together. 
Compatibility would fall under this but we are planning on removing Compatibility in Infinispan 9 as well 22 | 23 | ## Put operation 24 | 25 | The general idea for core is as following for a general put: 26 | 27 | 1. Originator immediately deserializes the object as a wrapped byte[] (Do we use hashCode for byte[] or object?) 28 | 2. Originator sends wrapped byte[] to the primary owner and normal forwarding occurs 29 | 3. The primary and backup owners have an interceptor installed that converrts the byte[] to an off heap object before writing this instance to the data container 30 | 4. The previous value (if needed) is sent back as a byte[] and the originator finally deserializes it before returning it to the originator 31 | 32 | ## Get operation 33 | 34 | A get will work in a similar fashion: 35 | 36 | 1. Originator immediately deserializes the object as a wrapped byte[] (Do we use hashCode for byte[] or object?) 37 | 2. Originator sends wrapped byte[] to the owner(s) to retrieve the value 38 | 3. Hacky but the wrapped byte[] can compare equivalence to the off heap object without having to allocate a byte\[\] (indexed gets) - (TODO: check bulk get byte[] - requires read only copy though) 39 | 4. Originator deserializes byte[] to an object before returning to caller 40 | 41 | ## Bulk operations 42 | 43 | Bulk methods such as distributed streams will require deserialization of entries on each node. If we store the hash along with the off heap instance we do not need to deserialize non owned keys as the hash code would determine which segment it is in. 44 | 45 | ## Statistics 46 | 47 | We need to make statistics available that show how much direct memory is in use (blocks? available? used?). Unfortunately this may have an issue with separating information by cache level and may only provide per node usage. 48 | 49 | # Server considerations 50 | 51 | The server doesn't require as many changes as in the core, but there are a few things to implement and investigate 52 | 53 | ## Reuse Netty buffer 54 | 55 | Netty can utilize direct byte buffers for its data read from the socket. This can be useful to reuse these to prevent an additional byte[] allocation. It is unclear if this will work as intended, but if possible this will be added. Even if this works it may not save any performance however for writes as JGroups currently only allows for byte[] to be written and not off heap memory. 56 | 57 | ## Compatibility 58 | 59 | Compatibility cannot be enabled with off heap and will throw an exception preventing the cache from starting. -------------------------------------------------------------------------------- /Deelog:-direct-integration-with-Debezium.asciidoc: -------------------------------------------------------------------------------- 1 | == Contributors 2 | 3 | Emmanuel Bernard, Randall Hauch 4 | 5 | Initial proposal 2016-12-09 6 | 7 | == Goal 8 | 9 | Make Infinispan a Debezium source connector. 10 | Improve as much as possible change event capture minimizing the risk of change loss. 11 | The state in Kafka must be eventually consistent with the Infinispan state. 12 | 13 | This proposal also integrates lightly with Infinispan: it does not require a native event log system in Infinispan. 14 | 15 | == Proposal 16 | 17 | The total order would not be global across the system but per key. 18 | The accepted order will be the one eventually captured in Kafka. 19 | 20 | Integrate Debezium as a library and Infinispan interceptor in each node. 
21 | That Debezium library will collect changes and write them in a Kafka queue. 22 | 23 | Each node has a Debezium connector instance embedded that listens to the 24 | operations happening (primary and replicas alike). 25 | This can be done via an Infinispan interceptor sending the events to a queue. 26 | That queue is listened by a Debezium thread. 27 | All of this process is happening async compared to the operation. 28 | 29 | Per key, a log of operations is kept in memory (it contains the key, the 30 | operation, the operation unique id and a ack status. 31 | 32 | If on the key owner, the operation is written by the Debezium connector 33 | to Kafka when it has been acked (whatever that means is where I'm less 34 | knowledgable and needs to be clarified). 35 | 36 | On a replica, the kafka partition is read regularly to clear the 37 | in-memory log from operations effectively stored in Kafka. 38 | If the replica becomes the owner, it reads the kafka partition to see 39 | what operations are already in and writes the missing ones. 40 | 41 | There are a few cool things: 42 | 43 | * few to no change in what Infinispan does 44 | * no global ordering simplifies things and frankly is fine for most 45 | Debezium cases. In the end a global order could be defined after the 46 | fact (by not partitioning for example). But that's a pure downstream 47 | concern. 48 | * everything is async compared to the Infinispan ops 49 | * the in-memory log can remain in memory as it is protected by replicas 50 | * the in-memory log is self cleaning thanks to the state in Kafka 51 | 52 | This model works as long as we don't lose two owners consecutively before enough replicas have caught up (queue change wise). 53 | 54 | == When two owners die too fast 55 | 56 | We're in trouble. 57 | 58 | When two owners die too fast consecutively, we lose the event queue. 59 | We might also lose the state if the state transfer has not finished. 60 | 61 | In both situations, we are left with a state in Kafka which is different than the state in Infinispan and for which we cannot guarantee the eventual consistency. 62 | So we need to send the full state of the lost segments into Kafka as a state _reset_ and make sure that any entry present in Kafka but not in Infinispan are tombstoned. 63 | 64 | There are a few possible solutions: 65 | 66 | 1. from the embedded Debezium, read the Kafka history (could be very long) and compare it from the state transfered state (either clean if we lost state or read back from the cache store); based on the comparison, send the appropriate key create/update and delete. This can be a slow and memory intensive process 67 | 2. from the embedded Debezium, send a global tombstone for the whole segment or even the whole state and send the state afresh to Kafka. This avoids reading Kafka's history (slowness and memory consumption) but the full retransfer of state can be quite long 68 | 3. use a two staged process: send the fact that we might have lost changes as a tombstone to Kafka plus the full segment state. A Debezium consumer would read that information and based on a kept local state build the diff and send less drastic stream of change to actual Debezium end consumers. 69 | 70 | ATM option 2 or 3 feel the most appropriate as building the diff on the Infinispan side would be costly and memory consuming. 71 | 72 | Would it help to do a many to one between segments and kafka partitions? 73 | 74 | == On initial cluster start 75 | 76 | We are also somewhat in trouble. 
77 | How do we know that we had a clean stop with full queue flushed? 78 | 79 | If the queues have not been flushed, then we are back to the problem of the two owners dying too fast (see above). 80 | If the queues have been flushed properly, Kafka is in a correct state and we can carry on. 81 | 82 | == Additional cleaning 83 | 84 | When a replica is elected as new owner, we need to differentiate two statuses: 85 | 86 | * the new owner has no queue and thus we have lost change events and eventual consistency 87 | * the new owner has a queue which is either non empty (catch up to do) or empty because it was already synced. In this situation we are good on our eventual consistency promise. 88 | 89 | == Opened questions 90 | 91 | How to capture transaction 92 | 93 | == Alternative 94 | 95 | Rely on a full fledge log handled by Infinispan itself: see [Gustavo's proposal](https://github.com/infinispan/infinispan/wiki/Remote-Listeners-improvement-proposal). 96 | On that front, it looks like a full fledge log is more complex to get right and we can start with proposal Deelog before exploring the full-fledge log. -------------------------------------------------------------------------------- /Incremental-Optimistic-Locking.asciidoc: -------------------------------------------------------------------------------- 1 | === Context 2 | When using link:Optimistic-Locking-In-Infinispan[optimistic locking] we still have a deadlock situation which is not covered by the link:Lock-Reordering-For-Avoiding-Deadlocks[lock reordering]: 3 | 4 | * Tx1 and Tx2 transactions executing in parallel on two nodes `N1` and `N2` and writing the keys `{a,b}` 5 | * `consistentHash(a) = {N3}` and `consistentHash(b) = {N4}` 6 | * with some right timing, during prepare time it is possible for these two transactions to deadlock: 7 | ** Tx1 lock acquired on "a" @ N3 8 | ** Tx2 lock acquired on "b" @ N4 9 | ** Tx1 cannot progress acquiring lock on "b" @ N4, that lock is acquired by Tx2 10 | ** Tx2 cannot acquire lock on "a" @ N3 as that lock is held by Tx1 11 | ** Tx1 and Tx2 are waiting for each other => deadlock 12 | 13 | NOTE: this problem stands disregarding the number of owners for a key. 14 | 15 | === Solution 16 | The suggested solution for solving the above described deadlock is by enforcing transactions to acquire locks on remote nodes in the same order. Transactions, at prepare time, order the nodes it "touches" based on some criteria, and issues lock acquisition incrementally: it doesn't acquire lock on a node the the order sequence until it has a lock on the previous node in the sequence (and consequently on all the nodes before it). 17 | 18 | Node ordering can be defined based on cluster's view. An alternative ordering approach is `consistentHash(nodeAddress)`, but that might lead to conflicts when node that have addresses mapping to the same value. 19 | 20 | In the example above, considering the view = `{N1, N2, N3, N4}` and incremental lock acquisition: 21 | 22 | * Tx1 acquires lock on a@ N3 23 | * Tx2 tries to acquire lock on a@ N3. It cannot acquire and waits for Tx1 to release it 24 | * Tx1 acquires locks on b@ N4, completes and releases locks 25 | * Tx2 acquires lock on a@ N3 26 | * Tx2 acquires lock on b@ N4 and completes as well 27 | 28 | Following sections discusses two approaches for acquiring locks incrementally. 29 | 30 | ==== Direct incremental locking 31 | In this approach, N1 does all the RPCs needed for acquiring locks in sequence. 
The diagram below illustrates how this happens, measuring performance based on link:http://en.wikipedia.org/wiki/Latency_(engineering)[one-way network latency]. 32 | 33 | image::https://community.jboss.org/servlet/JiveServlet/downloadImage/16657/direct.png[] 34 | 35 | The numbers on the arrows identify one-way network calls: N1 first locks a@N2 (1) and then waits for acknowledgement (2), then b@N3 (RPC 2) etc. With this approach, the cost of an transactions, estimated as one-way network calls is: 2 x number of nodes touched by a transaction. The next section describes a less costly approach. 36 | 37 | ==== Async incremental locking 38 | In this approach, the transaction coordinator multicasts the lock acquisition request to all the transaction participants. At the next step, the node having the lowest index in the sequence of nodes touched by the transactions (ordering given by cluster view), acquires lock and then sends an async lock acquisition request to the next node in the sequence. This is depicted in the following diagram: 39 | 40 | image::https://community.jboss.org/servlet/JiveServlet/downloadImage/16657/async.png[] 41 | 42 | 1a, 1b and 1c happen in parallel (multicast). When the 1a RPC thread arrives at N2 it also tries to acquire lock on "a". On the other hand 1b and 1c don't try lock acquisition, but wait a confirmation from N2 and N3 respectively. After 1a acquired lock it sends an async message to N3, informing it that it can move on and acquire lock on "b". Then lock is acquired on b@N3, then N3 issues an async call to N4 etc. 43 | The cost of this approach is equal to the number of nodes touched by a transaction, which is, in theory, twice as good as the direct incremental locking approach. 44 | 45 | ===== What if the async ack call fails? 46 | The async call needs to be send in a pseudo-reliable way i.e. guaranteed to happen if the node is still alive. If the node crashes the transaction can retry the entire lock acquisition process based on the new owner of the data. 47 | 48 | === Hybrid incremental approach 49 | This is a variation of the async incremental approach in which, during the initial multicast, all nodes try to acquire locks with a very small timeout (potentially 0). If all succeed then transaction proceeds to the 2nd phase of 2PC. If at least one node fails, the locks are being released on all the nodes touched by the transaction which follow its sequence in the view. This is depicted in the following diagram: 50 | 51 | image::https://community.jboss.org/servlet/JiveServlet/downloadImage/16657/hybrid.png[] 52 | 53 | ==== Notes on the diagram 54 | * 1a, 1b and 1c happen in parallel as 2-way rpcs 55 | * 1b fails to acquire lock on b 56 | * when multicast 1 finishes, N1 is aware that lock acquistion failed on N3 but succeeded on N2 and N4 57 | * N1 tells N2 to start the incremental lock acquisition (2a). At the same time tells N4 to release it's lock on "c" (2c) 58 | * 2c is needed in order to avoid potential deadlocks 59 | 60 | This hybrid approach is a good fit for scenarios where there is some contention and deadlocks, but not a lot of it: the initial multicast is optimistic in the sense that it doesn't expect deadlocks to happen - but if there are any then it reacts quickly. 
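To make the ordering rule shared by all three variants concrete, a minimal sketch of view-ordered lock acquisition is shown below. `Address` and `LockRpc` are stand-ins for the real transport types; only the strict, view-based ordering of lock acquisitions is the point here:

[source,java]
----
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: Address and LockRpc stand in for the real transport types.
final class IncrementalLockAcquirer {

   interface Address {
   }

   interface LockRpc {
      // Blocks until the locks for this transaction are held on the given node.
      void acquireLocks(Address node, Object txId) throws InterruptedException;
   }

   private final List<Address> clusterView; // the current view defines the global order
   private final LockRpc rpc;

   IncrementalLockAcquirer(List<Address> clusterView, LockRpc rpc) {
      this.clusterView = clusterView;
      this.rpc = rpc;
   }

   // Direct incremental locking: lock node i+1 only after node i is locked,
   // so two transactions touching the same nodes cannot deadlock each other.
   void acquireInViewOrder(Object txId, List<Address> touchedNodes) throws InterruptedException {
      List<Address> ordered = new ArrayList<>(touchedNodes);
      ordered.sort(Comparator.comparingInt(clusterView::indexOf));
      for (Address node : ordered) {
         rpc.acquireLocks(node, txId);
      }
   }
}
----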
61 | 62 | === Related 63 | * This optimization is tracked by link:https://issues.jboss.org/browse/ISPN-1219[ISPN-1132] 64 | * link:Optimistic-Locking-In-Infinispan[Infinispan's optimistic locking] 65 | * link:Lock-Reordering-For-Avoiding-Deadlocks[Lock reordering] -------------------------------------------------------------------------------- /Remote-Iterator.md: -------------------------------------------------------------------------------- 1 | ## Intro 2 | (by Tristan) 3 | 4 | Currently the HotRod protocol implements a few bulk operations (BulkGetRequest, BulkGetKeysRequest, QueryRequest) which return arrays of data. The problem with these operations is that they are very inefficient since they build the entire payload on the server, thus potentially requiring a lot of transient memory, and then send it to the client in a single response, which again might be to large to handle. It is also not possible to pre-filter the data on the server according to some criteria to avoid sending to the client unneeded entries. 5 | 6 | For symmetry with the embedded distributed iterator, HotRod should implement an equivalent remote distributed iterator which would: 7 | 8 | - allow a client to specify a filter/scope (local, global, segment, custom, etc) 9 | - allow the client and the server to iterate through the retrieved entries using an iterator paradigm appropriate to the environment (java.lang.Iterator in Java, in C++, IEnumerator in C#) in a memory efficient fashion 10 | - allow a server to efficiently batch entry in multiple responses 11 | - leverage the already existing KeyValueFilter and Converters, including deployment of custom ones into the server 12 | 13 | ## Proposal 14 | 15 | ### API changes 16 | 17 | A new method on `org.infinispan.client.hotrod.RemoteCache`: 18 | 19 | ```java 20 | /** 21 | * Retrieve entries from the server 22 | * 23 | * @param filterConverterFactory Factory name for {@link KeyValueFilterConverter} or null 24 | * @param segments The segments to iterate on or null if all segments should be iterated 25 | * @param batchSize The number of entries transferred from the server at a time 26 | */ 27 | CloseableIterator retrieveEntries(String filterConverterFactory, Set segments, int batchSize); 28 | ``` 29 |
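For example, a caller could iterate a remote cache in batches of 100 entries roughly as follows. This is a hedged usage sketch of the proposed method: the `Map.Entry` element type of the iterator and the exact `CloseableIterator` import are assumptions of this proposal, not a final signature.

```java
import java.util.Map;
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.commons.util.CloseableIterator;

public class RetrieveEntriesExample {
   public static void main(String[] args) {
      RemoteCacheManager cacheManager = new RemoteCacheManager();
      RemoteCache<String, String> cache = cacheManager.getCache("myCache");

      // null filter factory and null segments: iterate everything, 100 entries per batch.
      try (CloseableIterator<Map.Entry<Object, Object>> it = cache.retrieveEntries(null, null, 100)) {
         while (it.hasNext()) {
            Map.Entry<Object, Object> entry = it.next();
            System.out.println(entry.getKey() + " -> " + entry.getValue());
         }
      }
      cacheManager.stop();
   }
}
```

The iterator hides the ITERATION_START / ITERATION_NEXT / ITERATION_END exchange described below.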
30 | ### HotRod Protocol 31 | 32 | ##### Operation: ITERATION_START 33 | 34 | Request: 35 | 36 | | Field name | Type | Value | 37 | | ----------------| --------| ------ 38 | | SEGMENTS | byte[ ] | segments requested 39 | | FILTER_CONVERT | String | Factory name for FilterConverter 40 | | BATCH_SIZE | vInt | number of entries to transfer from the server at a time 41 | 42 | >About segments 43 | 44 | > One way to encode the set of requested segment ids is to use 45 | > N bits, one per segment, where a set bit marks a requested segment id. With the default of 60 segments this takes only 60 bits, but a very large number of segments could make the bitmap unwieldy. 46 | 47 | Response: 48 | 49 | | Field name | Type | Value | 50 | | ----------------| --------| ------ 51 | | ITERATION_ID | String | UUID of the iteration 52 | 53 |
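The bit-per-segment encoding suggested in the note above could be produced with a plain `java.util.BitSet`. This is an illustrative sketch, not part of the protocol definition:

```java
import java.util.BitSet;
import java.util.Set;

public final class SegmentEncoding {

   // One bit per segment: bit i is set when segment i is requested.
   public static byte[] encode(Set<Integer> segments, int numSegments) {
      BitSet bits = new BitSet(numSegments);
      for (int segment : segments) {
         bits.set(segment);
      }
      return bits.toByteArray();
   }

   public static void main(String[] args) {
      // With the default 60 segments this needs at most 8 bytes on the wire.
      byte[] encoded = encode(Set.of(0, 3, 59), 60);
      System.out.println(encoded.length + " bytes"); // prints "8 bytes"
   }
}
```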
54 | 55 | ##### Operation: ITERATION_NEXT 56 | 57 | Request: 58 | 59 | | Field name | Type | Value | 60 | | ----------------| --------| ------ 61 | | ITERATION_ID | String | UUID of the iteration 62 | 63 | Response: 64 | 65 | | Field name | Type | Value | 66 | | ----------------| --------| ------ 67 | | ITERATION_ID | String | UUID of the iteration 68 | | FINISHED_SEGMENTS | byte[ ] | segments that finished iteration 69 | | ENTRIES_SIZE | vInt | number of entries transferred 70 | | KEY 1 | byte | entry 1 key 71 | | VALUE 1 | byte | entry 1 value 72 | | ... | ... | ... 73 | | KEY N | byte | entry N key 74 | | VALUE N | byte | entry N value 75 | 76 |
77 | ##### Operation: ITERATION_END 78 | 79 | Request: 80 | 81 | | Field name | Type | Value | 82 | | ----------------| --------| ------ 83 | | ITERATION_ID | String | UUID of the iteration 84 | 85 | Response: 86 | 87 | | Field name | Type | Value | 88 | | ----------------| --------| ------ 89 | | STATUS | byte | ACK of the operation 90 | 91 |
92 | ### Client side
93 | 
94 | Upon calling ```retrieveEntries``` an ITERATION_START op is issued and the client will store the tuple ```[Address, IterationId]``` so that all subsequent operations will go to the same server.
95 | 
96 | ```CloseableIterator.next()``` will receive a batch of entries and will keep the keys internally on a per-segment basis.
97 | Whenever a segment has finished iterating (as reported in the FINISHED_SEGMENTS field of the ITERATION_NEXT response), those keys are discarded.
98 | 
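A rough sketch of this bookkeeping, including the duplicate filtering used after a failover (see the failover section below); the class and method names are hypothetical, and in practice this state would live inside the client's iterator implementation:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IterationState<K> {
    private final Map<Integer, Set<K>> keysPerSegment = new HashMap<>();
    private final Set<Integer> finishedSegments = new HashSet<>();

    // Called for every ITERATION_NEXT response.
    void onBatch(Map<Integer, List<K>> keysBySegment, Set<Integer> newlyFinished) {
        keysBySegment.forEach((segment, keys) ->
              keysPerSegment.computeIfAbsent(segment, s -> new HashSet<>()).addAll(keys));
        for (Integer segment : newlyFinished) {
            finishedSegments.add(segment);
            keysPerSegment.remove(segment); // keys of finished segments are no longer needed
        }
    }

    // After a failover, skip segments that already completed and keys already seen.
    boolean shouldSkip(int segment, K key) {
        return finishedSegments.contains(segment)
              || keysPerSegment.getOrDefault(segment, Collections.emptySet()).contains(key);
    }
}
```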
99 | ### Server Side 100 | A ITERATION_START request will create and obtain a ```CloseableIterator``` from a ```EntryRetriever``` with the optional ```KeyValueFilterConverter```, batch size, and set of segments required. The ```CloseableIterator``` will be associated with a UUID and will be disposed upon receiving a ITERATION_CLOSE request. The ```CloseableIterator``` will also have a ```SegmentListener``` associated so that it is notified of finished segments and send them back the client on the ITERATION_NEXT responses. 101 | 102 | An ITERATION_CLOSE will dispose the ```CloseableIterator``` 103 | 104 | ### Failover 105 | If the server backing a ```CloseableIterator``` dies, the client will restart the iteration in other node, filtering out already finished segments. 106 | 107 | One drawback of the per-segment iteration control is that if a particular segment was not entirely iterated before the server goes down, it will have to be iterated again, leading to duplicate values. 108 | 109 | One way to avoid duplicates values to the caller of ```CloseableIterator.next()``` is to have the client to store the keys of the unfinished segments and use them to filter out duplicates after the failover happens. 110 | 111 | ### References 112 | 113 | JIRA: [ISPN-5219] (https://issues.jboss.org/browse/ISPN-5219) 114 | 115 | -------------------------------------------------------------------------------- /scaling-without-state-transfer.asciidoc: -------------------------------------------------------------------------------- 1 | = Scaling up without state transfer 2 | 3 | The goal is to be able to add nodes to the cluster and make them own new entries, 4 | without also owning any of the old entries. 5 | 6 | Simplifying assumptions: 7 | 8 | * Only one node is being added at a time. 9 | * Single owner 10 | * No transactions 11 | * When scaling down data is just lost 12 | 13 | The basic idea is that all keys are written on the newest member. 14 | The location of all inserted keys is kept in a replicated cache 15 | (the **anchor cache**) and used for further reads/updates/removals. 16 | 17 | It's important to note that clients don't have access to the anchor cache, 18 | so clients access servers in a round-robin fashion. 19 | 20 | 21 | == Implementation 22 | 23 | The implementation will live in a module named `anchored-keys`. 24 | 25 | Since we can't deterministically map keys to segments, we don't need a distributed cache. 26 | Instead we can use an invalidation cache and the `ModuleLifecycle` implementation 27 | can replace the invalidation interceptor with a custom interceptor. 28 | 29 | The interceptor will first look up affected keys in the anchor cache. 30 | If the key doesn't exist in the anchor cache, reads can assume the value is missing. 31 | Writes must insert the current writer's location in the anchor cache, 32 | then forward to the current writer for the actual write. 33 | 34 | When a key is removed, the interceptor must also remove the key from the anchor cache. 35 | When a node leaves, the interceptor must remove all the keys mapped to that node 36 | from the anchor cache. 37 | 38 | 39 | == Configuration 40 | 41 | Configuration means a custom element `anchored-keys` 42 | with a single attribute `enabled`. 43 | 44 | ``` 45 | 46 | 47 | 48 | ``` 49 | 50 | 51 | == Performance considerations 52 | 53 | === Latency / RPCs 54 | 55 | The client doesn't know the owner, so any read or write has a 56 | `(N-1)/N` probability of requiring a unicast RPC from the processing server to the owner. 
57 | 58 | In addition to this, the first write to a key also requires a replicated cache write, 59 | which means a multicast RPC plus a `(N-1)/N` probability of another unicast RPC. 60 | 61 | If `FailoverRequestBalancingStrategy` knew whether the next request 62 | was a read or a write, we could make it always send write requests 63 | to the last server. 64 | However, that would only work if updates and removals are minimal, 65 | otherwise the last server could be overloaded. 66 | 67 | === Memory overhead 68 | 69 | The anchor cache contains copies of all the keys and their locations, 70 | plus the overhead of the cache itself. 71 | 72 | The overhead is lowest for off-heap storage: 73 | 21 bytes in the entry itself plus 8 bytes in the table, 74 | assuming no eviction or expiration. 75 | The location is another 20 bytes, assuming we keep the serialized owner's address. 76 | 77 | Note: We could reduce the location size to <= 8 bytes 78 | by encoding the location as an integer. 79 | 80 | Addresses are interned, so an address already uses only 4 bytes 81 | with OBJECT storage and `-XX:+UseCompressedOops`. 82 | But the overhead of the ConcurrentHashMap-based on-heap cache is much bigger, 83 | at least 32 bytes from CHM.Node, 24 bytes from ImmortalCacheEntry, 84 | and 4 bytes in the table. 85 | 86 | Note: because replicated writes are sent to the primary owner, 87 | which forwards them to all the other members, the keys cannot be de-duplicated. 88 | 89 | === State transfer 90 | 91 | There will be no state transfer for the main cache, but the anchor cache still needs 92 | to transfer all the keys and the location information. 93 | 94 | Assuming that the values are much bigger compared to the keys and to the cache overhead per entry, 95 | the anchor cache's state transfer should also be much faster compared to the state transfer for the main cache. 96 | 97 | The initial state transfer should not block a joiner from starting, 98 | because the joiner can ask an existing node for the location. 99 | 100 | == Alternative implementations 101 | 102 | === Key generator 103 | If the keys could be generated by Infinispan, e.g. in a server task, 104 | we could generate the keys to map to a segment owned by the current writer 105 | and we wouldn't need to keep track of key locations. 106 | 107 | 108 | === Cluster loader 109 | Instead of keeping track of the key locations in a replicated cache, 110 | an anchored cache could send a broadcast request to all the members. 111 | 112 | Read hits would be slowed down a bit, because they would send out 113 | additional get requests, but the difference would be small, because 114 | they would only have to wait for the first non-null response. 115 | Read misses, on the other hand, would be much slower, because 116 | the originator would have to wait for responses from all the members 117 | before telling the client that the key does not exist in the cache. 118 | 119 | Writes would be faster, because they would only need a unicast RPC 120 | to the current writer, without the replicated cache write. 121 | 122 | The RPCs might be reduced by maintaining on each node a set of bloom filters, 123 | one for each member. 124 | 125 | === Single cache 126 | Store the data and the location information in a single `REPLICATED_SYNC` cache. 127 | There is a single segment, and a custom `ConsistentHashFactory` makes the newest member 128 | the primary owner of that segment. 
129 | 130 | The distribution/triangle interceptor would be replaced with a custom interceptor 131 | that replaces the value in backup write commands with a reference to the primary owner. 132 | For reads it checks if the value in the context is a reference to another node, 133 | and makes a remote get request to that node. 134 | 135 | The state provider would also be replaced to send the location information 136 | to the joiners instead of the actual values. 137 | -------------------------------------------------------------------------------- /cluster-backup-tool.md: -------------------------------------------------------------------------------- 1 | Cluster Backup Tool 2 | ==================== 3 | # Requirements 4 | ## Hard 5 | - Ability to: 6 | - Backup data to a file(s) at a given point in time 7 | - Recreate data from archive 8 | - Must be backwards Compatible across x major versions 9 | - Supported in client/server mode 10 | 11 | ## Optional/Future 12 | - Topology aware restoration 13 | - Disabled by default, focus is on data 14 | 15 | - Import failure modes: 16 | - Clusterwide: Rollback all on first error 17 | - Per Cache: Only rollback importing of caches which encounter errors 18 | 19 | - Cache specific imports 20 | - Greater flexibility 21 | - Ability to retry failed cache imports 22 | - Convert from one cache type to another 23 | 24 | - Configuration only backup/import 25 | 26 | - Enable "online" backups where the cluster creates a snapshot from a live cluster 27 | - Utilise `CommitManager` as used by state transfer and conflict resolution 28 | 29 | 30 | # Architecture 31 | The backup tool should be implemented on the server and be accessible via the `/v2/cluster` REST endpoint, allowing the various 32 | Infinispan clients (CLI, Console, Operator) to utilise it. Backups can be initiated via a GET call and restored by 33 | uploading an archive via a POST. 34 | 35 | Backup: 36 | ``` 37 | GET /v2/cluster?action=backup 38 | ``` 39 | 40 | Restore: 41 | ``` 42 | POST /v2/cluster?action=restore 43 | Content-Disposition: form-data; name=`"file`"; filename=`"backup.zip`" 44 | Content-Type: application/zip 45 | Content-Transfer-Encoding: binary 46 | ``` 47 | 48 | # Archive Format 49 | A cluster backup consists of a global manifest which contains clusterwide metadata, such as the version. As well as a 50 | directory structure that contains all backed up cache-containers. Each cache-container then contains it's configured 51 | cache templates, cache instances and counters. This directory structure is then distributed 52 | as a archive format, which can be passed to the server in order to initiate a import. 53 | 54 | ``` 55 | containers/ 56 | containers/ 57 | containers//cache-container.xml 58 | containers//container.properties 59 | containers//cache-configs/ 60 | containers//cache-configs/some-template.xml 61 | containers//cache-configs/another-template.xml 62 | containers//caches/ 63 | containers//caches/example-user-cache.dat 64 | containers//caches/example-user-cache.xml 65 | containers//counters/ 66 | containers//counters/counters.dat 67 | 68 | manifest.properties 69 | ``` 70 | 71 | The above files will be packaged as a single `.zip` distribution to aid backup/import; this provides compression as well 72 | as being compatible with both Linux and Windows. 73 | 74 | > Utilising the above directory structure simplifies future features such as per container or per cache imports. 
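As a concrete illustration of the archive layout, a restore operation could inspect the uploaded zip before applying it; this is only a sketch using standard JDK APIs, not the actual server code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class BackupInspector {

    // Reads the global manifest from a backup archive laid out as described above.
    static Properties readManifest(String archivePath) throws IOException {
        try (ZipFile zip = new ZipFile(archivePath)) {
            ZipEntry manifest = zip.getEntry("manifest.properties");
            if (manifest == null) {
                throw new IOException("Not a valid backup archive: missing manifest.properties");
            }
            Properties properties = new Properties();
            try (InputStream in = zip.getInputStream(manifest)) {
                properties.load(in);
            }
            return properties; // e.g. properties.getProperty("version")
        }
    }
}
```

A real implementation would also validate the `version` property against the supported migration window described in the next section.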
75 | 76 | ## Manifest Properties 77 | Basic java properties file with key/values which contains cluster wide information. 78 | 79 | ```java 80 | version=11.0.0.Final 81 | cache-containers="default" 82 | ``` 83 | 84 | The Infinispan version can be used by the Server to quickly determine if a migration to the new server version is possible 85 | based upon the number of major versions we support migrations across. 86 | 87 | ## Container Files 88 | Each container is identified in the directory structure by it's name attribute and contains several sub directories. 89 | 90 | ### Container Properties 91 | `container.properties` file that lists all of the configured templates, caches and counters. 92 | 93 | ```java 94 | cache-configs="some-template, another-template" 95 | caches="org.infinispan.CONFIG", "example-user-cache" 96 | counters="example-counter" 97 | ``` 98 | 99 | ### Container XML 100 | The container XML can be used to determine if the backup depends on any additional user classes, e.g. serialization marshallers, 101 | and can check that the server contains these resources on it's classpath, failing fast if they are not present. 102 | 103 | ### Template files 104 | Each defined template is represented by a `.xml` file. 105 | 106 | ### Cache Files 107 | A cache backup consists of a `.xml` and a `.dat` file. The `.xml` file contains 108 | the cache configuration and is used to create the initial empty cache. The `.dat` file contains the cache 109 | content and is read by the server in order to restore entries to the cache. 110 | 111 | Entries in the `.dat` file are stored as a stream of Protobuf messages: 112 | 113 | ```protobuf 114 | message CacheBackupEntry { 115 | bytes key = 1; 116 | bytes value = 2; 117 | bytes metadata = 3; 118 | PrivateMetadata internalMetadata = 4; 119 | int64 created = 5; 120 | int64 lastUsed = 6; 121 | } 122 | ``` 123 | We store Metadata implementations as generic bytes to account for custom implementations. 124 | 125 | > If the cache is empty during the backup, then no `.dat` file is created. 126 | 127 | #### Storage MediaTypes 128 | Currently all supported MediaTypes are stored as `byte[]`, with the exception of `application/x-java-object`, so it's 129 | possible to store them as is. If a cache has a key and/or value with `application/x-java-object`, it's necessary to 130 | first convert the objec to a byte[] via the `PersistenceMarshaller`. 131 | 132 | #### Internal Caches 133 | Volatile internal caches should not be included in a backup. Internal caches can be excluded during the creation of the 134 | backup archive, with no `.xml` or `.dat` files created. 135 | 136 | The counter cache should also be omitted in favour of a dedicated `counters.dat` file. 137 | 138 | ### Counters File 139 | The `counters.dat` file that contains the informated required in order to recreate all counters and their values 140 | at backup time. If no counters exist, this file can be omitted from the archive. 
141 | 142 | ```protobuf 143 | message CounterBackupEntry { 144 | string name = 1; 145 | CounterConfiguration configuration = 2; 146 | int64 currentValue = 3; 147 | } 148 | ``` 149 | 150 | # CLI Integration 151 | The CLI will be the first client to expose the backup/restore capabilities, proposed syntax: 152 | 153 | Backup: 154 | ```bash 155 | bin/cli.sh -c localhost:11222 --backup 156 | ``` 157 | 158 | Restore: 159 | ```bash 160 | bin/cli.sh -c localhost:11222 --restore 161 | ``` 162 | 163 | # Missing Capabilities 164 | List of missing features that are a prerequisite for the backup/restore feature. 165 | 166 | ## Endpoints 167 | * Ability to reject all requests in backup mode? 168 | -------------------------------------------------------------------------------- /Asymmetric-Caches-and-Manual-Rehashing-Design.asciidoc: -------------------------------------------------------------------------------- 1 | Manual/delayed rehashing and asymmetric caches have very similar requirements and so they will be implemented together. 2 | 3 | === Virtual Cache Views 4 | A big part of the implementation will be supporting a separate view for each cache with a global component called CacheViewsManager. 5 | 6 | * The CacheViewsManager on the coordinator will receive REQUEST_JOIN(cachename) and REQUEST_LEAVE(cacheName) commands whenever a cache is started/stopped on one of the nodes. 7 | ** A node leaving the JGroups cluster will also be interpreted as a leave. 8 | * The coordinator coalesces join requests, so if several nodes join in rapid succession it will be installed in a single step. Leaves need to be treated differently, in the first phase there will be no coalescing for leaves. 9 | ** Might need a pluggable policy for handling this, or maybe just a configurable policy where the user can configure a "cooldown" interval from the last topology change together with a maximum numbers of joiners, maximum number of leavers and maximum timeout since the first uncommitted topology change. 10 | ** The user should be also able to dynamically disable automatic installation of views and only install new views manually (e.g. when the entire cluster is shutting down). 11 | *** Any member should be able to send a REQUEST_VIEW_UPDATE command to the coordinator in order to trigger a new view with the current members. 12 | *** Any member should be able to send a DISABLE_VIEW_UPDATES command to the coordinator in order to suspend further view updates. 13 | *** Both of these should be accessible through JMX or the AS console. 14 | * The coordinator sends a PREPARE_VIEW(cacheName,viewId,oldMembers,newMembers) command to all the nodes that have the cache started (including the one that sent the join request). Each node can do its state transfer, lock transfer etc while handling the PREPARE_VIEW command and any exception will be propagated back to the coordinator. 15 | * After all the nodes responded to the PREPARE_VIEW command successfully, the coordinator sends a COMMIT_VIEW(viewId) to all the nodes and they install the new view. 16 | * If any of the node fails, the coordinator sends a ROLLBACK_VIEW(viewId) command to all the nodes and they return to the old view. The coordinator should retry to install the view after a "cooldown" amount of time, just like it would do with a join request. 
17 | * If the coordinator changes or in a merge, the new coordinator will have its own copy of the last committed view but it will have to send a RECOVER_VIEW command to all the nodes in the cluster in order to rebuild the pending requests to join or to leave the cluster. 18 | ** The coordinator was the one tallying PREPARE_VIEW responses, so the view should be automatically rolled back by all the members when the coordinator dies. 19 | *** There is a slight possibility of a race condition here, if only some of the nodes got the COMMIT_VIEW command before the old coordinator failed - unless JGroups ensures that either all or none of the recipients will receive the multicast and we really do use JGroups multicast. 20 | ** For a given cache, if the old coordinator didn't have the cache running, the new coordinator will retry to install the view; otherwise there will be a new view without the old coordinator. 21 | * The CacheViewsManager will not decide a "winning partition" or help in any other way with conflict resolution during a merge. 22 | ** Nodes will have different "last committed" views, so each node may need to use its own "old members" list instead of the coordinator's in order to determine what state to transfer. 23 | 24 | NOTE: There is a plan to simplify classloading issues in AS by creating a separate CacheManager for each deployment unit and multiplexing them on a single JGroups channel. That will not work well with our approach, since we are requiring that the CacheManager (and particularly the CacheViewsManager contained within) does exist on the JGroups coordinator even if none of its caches are started. 25 | A possible workaround would be to change CacheViewsManager into a JGroups protocol. That way we know it will always be started and it can keep state for more all the CacheManagers that are sharing that JGroups channel. 26 | 27 | === Blocking State Transfer 28 | The StateTransferManager component will use the view callbacks provided by the CacheViewsManager instead of the JGroups channel in order to trigger state transfer. This will ensure that the "old" consistent hash and "new" consistent hash are the same on all the nodes. 29 | 30 | Delayed view callbacks will mean that at any given time some of the owners of a key may be stopped, so the DistributionInterceptor/ReplicationInterceptor will need to complete a write operation successfully in that case. 31 | 32 | With the current state transfer algorithm, all write operations are blocked during state transfer. Incoming write operations will reply with a StateTransferInProgressException and the originator will have to retry the operation after the state transfer has finished. 33 | 34 | === Non-blocking state transfer / lock transfer 35 | See link:Non-Blocking-State-Transfer[non-blocking state transfer designs] for more details. 36 | 37 | == Comments 38 | === Manik Surtani 39 | * Is this pluggable policy (cooldown, max joiners and leavers, etc) scheduled for 5.1? And if so, for BETA1? Or is it better to split this into a separate JIRA for later? 40 | * Manual rehashing should also be a separate JIRA so we can split it out/defer if necessary. 41 | * Manual rehash control should be JMX only. JBoss AS admin console hooks into JMX. 42 | * Class loading and cache managers per JBoss AS application: the problem you mentioned can be solved by injecting the same CacheViewsManager instance into each of the CacheManagers created. The same way the same JChannel instance is injected. 
This will mean the CacheViewsManager logic can remain in Infinispan and still work in this setup. 43 | * Non-blocking state transfer should be a separate JIRA as well. 44 | * Locking: you say each cache has a view change lock. Should this be each cache, or cache manager? Or cache view manager? 45 | 46 | === Dan Berindei (in response to Manik Surtani) 47 | * Pluggable policy: it's not scheduled for 5.1, I'll create a separate JIRA. 48 | * Manual rehashing: there's already ISPN-1394 49 | * I might be wrong, but I think AS7 is no longer starting the JMX subsystem by default. 50 | * Injecting the same CacheViewsManager in all application: this sounds much simpler than what I had in mind. 51 | * I'll create the JIRA. 52 | * I'd say each cache, because I don't want to block anything on an running cache while starting up a new one. -------------------------------------------------------------------------------- /Schemas-API.adoc: -------------------------------------------------------------------------------- 1 | # Proposal: Schema Manipulation in Hotrod for Protostream Infinispan 2 | 3 | ## Goal 4 | To enhance schema evolution and management in Infinispan's Protostream serialization framework, we propose replacing the current API with a new schema manipulation mechanism. 5 | 6 | ## Hotrod client side marshaller 7 | 8 | ### Client side api today 9 | 10 | [source, java] 11 | ---- 12 | Schema schema = // programmatic schema 13 | MagazineMarshaller magazineMarshaller = new MagazineMarshaller(); // custom marshaller 1 14 | BookMarshaller bookMarshaller = new BookMarshaller(); // custom marshaller 2 15 | 16 | ProtoStreamMarshaller protostreamMarshaller = new ProtoStreamMarshaller(); 17 | SerializationContext serializationContext = marshaller.getSerializationContext(); 18 | FileDescriptorSource fds = FileDescriptorSource.fromString(schema.getName(), schema.toString()); 19 | serializationContext.registerProtoFiles(fds); 20 | serializationContext.registerMarshaller(magazineMarshaller); 21 | serializationContext.registerMarshaller(bookMarshaller); 22 | builder.marshaller(protostreamMarshaller); 23 | builder.remoteCache("cacheName").marshaller(protostreamMarshaller); 24 | ---- 25 | 26 | ### Client side api proposal 27 | 28 | [source, java] 29 | ---- 30 | Schema schema = // programmatic schema 31 | MagazineMarshaller magazineMarshaller = new MagazineMarshaller(); // custom marshaller 1 32 | BookMarshaller bookMarshaller = new BookMarshaller(); // custom marshaller 2 33 | 34 | ProtoStreamMarshaller protostreamMarshaller = new ProtoStreamMarshaller(); 35 | protostreamMarshaller.register(schema, magazineMarshaller, bookMarshaller); 36 | 37 | builder.remoteCache("cacheName").marshaller(marshaller); 38 | ---- 39 | 40 | ### Administration api today 41 | Hotrod specific api does not exist. However, an existing REST API exists. 
42 | 43 | Example of what clients are doing today to manage errors: 44 | 45 | [source, java] 46 | ---- 47 | public static void uploadAndReindexCaches(RemoteCacheManager remoteCacheManager, GeneratedSchema schema, List indexedEntities) { 48 | var key = schema.getProtoFileName(); 49 | var current = schema.getProtoFile(); 50 | 51 | var protostreamMetadataCache = remoteCacheManager.getCache(InternalCacheNames.PROTOBUF_METADATA_CACHE_NAME); 52 | var stored = protostreamMetadataCache.getWithMetadata(key); 53 | if (stored == null) { 54 | if (protostreamMetadataCache.putIfAbsent(key, current) == null) { 55 | logger.info("Infinispan ProtoStream schema uploaded for the first time."); 56 | } else { 57 | logger.info("Failed to update Infinispan ProtoStream schema. Assumed it was updated by other Keycloak server."); 58 | } 59 | checkForProtoSchemaErrors(protostreamMetadataCache); 60 | return; 61 | } 62 | if (Objects.equals(stored.getValue(), current)) { 63 | logger.info("Infinispan ProtoStream schema is up to date!"); 64 | return; 65 | } 66 | if (protostreamMetadataCache.replaceWithVersion(key, current, stored.getVersion())) { 67 | logger.info("Infinispan ProtoStream schema successful updated."); 68 | reindexCaches(remoteCacheManager, stored.getValue(), current, indexedEntities); 69 | } else { 70 | logger.info("Failed to update Infinispan ProtoStream schema. Assumed it was updated by other Keycloak server."); 71 | } 72 | checkForProtoSchemaErrors(protostreamMetadataCache); 73 | } 74 | 75 | private static void reindexCaches(RemoteCacheManager remoteCacheManager, String oldSchema, String newSchema, List indexedEntities) { 76 | if (indexedEntities == null || indexedEntities.isEmpty()) { 77 | return; 78 | } 79 | var oldPS = KeycloakModelSchema.parseProtoSchema(oldSchema); 80 | var newPS = KeycloakModelSchema.parseProtoSchema(newSchema); 81 | var admin = remoteCacheManager.administration(); 82 | 83 | indexedEntities.stream() 84 | .filter(Objects::nonNull) 85 | .filter(indexedEntity -> isEntityChanged(oldPS, newPS, indexedEntity.entity())) 86 | .map(IndexedEntity::cache) 87 | .distinct() 88 | .forEach(cacheName -> updateSchemaAndReIndexCache(admin, cacheName)); 89 | } 90 | ---- 91 | 92 | Example of what clients are doing today to retrieve errors: 93 | 94 | [source, java] 95 | ---- 96 | // For errors 97 | private static void checkForProtoSchemaErrors(RemoteCache protostreamMetadataCache) { 98 | var errors = protostreamMetadataCache.get(ProtobufMetadataManagerConstants.ERRORS_KEY_SUFFIX); 99 | if (errors == null) { 100 | return; 101 | } 102 | for (String errorFile : errors.split("\n")) { 103 | logger.errorf("%nThere was an error in proto file: %s%nError message: %s%nCurrent proto schema: %s%n", 104 | errorFile, 105 | protostreamMetadataCache.get(errorFile + ProtobufMetadataManagerConstants.ERRORS_KEY_SUFFIX), 106 | protostreamMetadataCache.get(errorFile)); 107 | } 108 | } 109 | ---- 110 | 111 | https://github.com/keycloak/keycloak/blob/636fffe0bc37a63bd5a2578b6bbcd815364c41d8/model/infinispan/src/main/java/org/keycloak/marshalling/KeycloakIndexSchemaUtil.java#L71-L97 112 | 113 | ### Administration API 114 | 115 | [source, java] 116 | ---- 117 | public interface RemoteCacheManagerAdmin { 118 | SchemasAdministration schemas(); 119 | } 120 | ---- 121 | 122 | [source, java] 123 | ---- 124 | public interface SchemasAdministration { 125 | CompletionStage createAsync(Schema schema); 126 | SchemaOpResult create(Schema schema); 127 | 128 | CompletionStage updateAsync(Schema schema); 129 | SchemaOpResult update(Schema 
schema); 130 | 131 | CompletionStage createOrUpdateAsync(Schema schema); 132 | SchemaOpResult createOrUpdate(Schema schema); 133 | 134 | CompletionStage deleteAsync(String name); 135 | SchemaOpResult delete(String name); 136 | 137 | CompletionStage retrieveSchemaErrors(); 138 | 139 | SchemaErrors retrieveSchemaErrors(Schema schema); 140 | // # Optional. Check if this is needed or possible. Not creating a schema and validating it only. 141 | CompletionStage validateAsync(Schema schema); 142 | SchemaOpResult validate(Schema schema); 143 | // # end of optional 144 | } 145 | ---- 146 | 147 | SchemaOpResult: Specify the content and return values when implementing it 148 | based on real code examples around (like keycloak or what quarkus or spring users do) 149 | 150 | -------------------------------------------------------------------------------- /Optimistic-Locking-In-Infinispan.asciidoc: -------------------------------------------------------------------------------- 1 | === Context 2 | At the moment (Infinispan 5.0) two locking schemes are supported: 3 | 4 | * *pessimistic*, in which locks are being acquired remotely on each transactional write. This is named eager locking and is detailed here. 5 | * a *hybrid* optimistic-pessimistic locking approach, in which local locks are being acquired as transaction progresses and remote locks are being acquired at prepare time. This is the default locking scheme. 6 | 7 | This document describes a replacement for the hybrid locking scheme with an optimistic scheme. 8 | 9 | === Shortcomings of the hybrid approach 10 | In the current hybrid approach local locks are acquired at write time and remote write locks are acquired at prepare time as well. 11 | E.g. considering the following code that runs on node A and consistentHash("a") = {B}. 12 | 13 | transactionManager.begin(); 14 | cache.put("a", "aVal"); // this acquires a WL on A. "a" is not locked on B 15 | Object result = cache.get("b"); //this doesn't acquire any lock 16 | transactionManager.commit(); // this acquires a WL on B as well, then it release it after applying the modification 17 | 18 | This locking model has some shortcomings, especially in distributed mode: 19 | 20 | ==== Less throughput 21 | The overall throughput of the system is reduced in the following scenario: 22 | On node A two transactions are running: 23 | 24 | * Tx1: writes to keys {a, b, c} in that order 25 | * Tx2: writes to keys {a, c, d} in that order 26 | 27 | Let's assume that consistentHash(a) = {B}. In other nodes node B is the main data owner of key "a". 28 | 29 | These two transactions execute in sequence: after Tx1 locks “a”, Tx2 is not able to make any progress until Tx1 is finished. Making Tx2 wait for "a" 's lock doesn't guarantee Tx1 the chances to complete the transaction: another transaction running on node C might still be able to acquire lock on "a" before Tx1. 30 | 31 | ==== Different locking semantic based on transaction locality 32 | With the hybrid approach, two transactions competing for the same key will be serialized if run on the same node, but would execute in parallel if run on two different nodes. Even more, if a key locked by a transaction maps(consistent hash) to the same node where the transaction runs, an "eager lock" is practically acquired - so again the locking semantic is influenced by where the transaction runs. Whilst this is not necessarily incorrect it certainly brings a degree of unneeded complexity in using and understanding Infinispan's transactions. 
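For illustration, the throughput scenario above could look as follows under the hybrid scheme; both transactions run on node A in separate threads, and the values are arbitrary:

[source,java]
----
// Thread 1 - Tx1 writes {a, b, c}
transactionManager.begin();
cache.put("a", "tx1"); // acquires the local write lock on "a" immediately
cache.put("b", "tx1");
cache.put("c", "tx1");
transactionManager.commit(); // remote locks on the owners are only acquired here

// Thread 2 - Tx2 writes {a, c, d}, concurrently on the same node
transactionManager.begin();
cache.put("a", "tx2"); // blocks here until Tx1 releases the local lock on "a"
cache.put("c", "tx2");
cache.put("d", "tx2");
transactionManager.commit();
----

A third transaction running on another node (e.g. C) is not subject to this local serialization and may still acquire the remote lock on "a" before Tx1, which is the fairness issue noted above.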
33 | 34 | == Optimistic locking 35 | In order to overcome the above shortcomings, an optimistic locking scheme will replace the hybrid one in Infinispan 5.1. With the optimistic approach no locks are being acquire until prepare time. 36 | 37 | === How does it work 38 | Infinispan associates a transaction context (TC) with each Transaction, on the node where the transaction runs. This is a map-like structure used to store all the data touched by the transaction, as follows: 39 | 40 | * On a get(key) operation 41 | .. if key is in the TC then the value associated with it is returned. If not: 42 | .. if key maps, according to the consistent hash, to a remote node then value is fetched from the remote node (rpc). 43 | ... If writeSkewCheck is enabled, then key's version at the moment of the read is also fetched. Then (key,value, version) is placed in the TC. 44 | ... If writeSkewCheck is disabled the (key,value) pair is then placed in TC. 45 | ... Note that in both above cases value can be null. 46 | .. if key maps to the local node the value is obtained from the local data container. The (key,value) and potentially version (see 1.2.1) is then placed in TC 47 | .. value (potentially null) is then returned to the caller. 48 | * On a put(key,value) operation 49 | .. if the TC contains an entry for key 50 | ... existing value associated with is cached to be returned to the caller 51 | ... TC updated to reflect new (key, value) pair 52 | ... value, as read as 2.1.1 is returned to the caller 53 | .. if the TC doesn't contain an entry for key 54 | ... If unreliableReturnValues is enabled then (key, value) is added to the TC and null is returned 55 | ... if unreliableReturnValues is not enabled (default) then 56 | .... a get(key) is executed first, as described in 1. The return of the get is cached to be returned to the caller 57 | .... (key, value) is added to the TC 58 | .... The value cached at 2.2.2.1 is then returned to the user 59 | * During transaction completion: 60 | .. At prepare time, 61 | ... a prepare message is multicasted to all the nodes owning keys written by the transaction. The prepare message contains the keys that should be locked together with the new values and potentially the versions read at 1.2.1 or 1.3. 62 | ... Locks are acquired remotely, on each one of the keys written by the transaction. No locks are acquired for keys that were only read. 63 | ... If a remote lock cannot be acquired within lockAcquistionTimeout milliseconds then an exception is thrown and prepare fails. 64 | ... If all remote locks are successfully acquired 65 | .... If writeSkewCheck is enabled: 66 | ..... for each remotely locked key, if its current version is different than the version read at 1.2.1 or 1.3 then an exception is thrown and transaction is rolledback 67 | ..... these check does not require a new RPC, but executes in the scope of the RPC sent for acquiring the lock 68 | ... If writeSkewCheck is disabled the above check does not take place 69 | .. At commit time: 70 | ... A commit message is sent from the node where transaction originated to all nodes where locks were acquired 71 | ... On each participating node 72 | .... if writeSkewCheck is enabled then the version of the entry is increased 73 | .... the old values are overwritten by new ones 74 | .... 
locks are released 75 | * The atomic operations, as defined by the ConcurrentHashMap, don't fit well within the scope of optimistic transaction: this is because there is a discrepancy between the value returned by the operation and the value and the fact that the operation is applied or not: 76 | ** E.g. putIfAbsent(key, value) might return true as there's no entry for key in TC, but in fact there might be a value committed by another transaction. 77 | ** Later on, at prepare time when the same operation is applied on the node that actually holds key, it might not succeed as another transaction has updated key in between, but the return value of the method was already evaluated long before this point. 78 | ** In order to solve this problem, if an atomic operations happens within the scope of a transaction, Infinispan forces a writeSkewCheck at commit time. The writeSkewCheck, optional otherwise, makes sure that the decision made at prepare time still stands at commit time. 79 | 80 | === Related 81 | * The JIRA tracking the implementation for this is link:https://issues.jboss.org/browse/ISPN-1131[ISPN-1131] -------------------------------------------------------------------------------- /ServerNG.md: -------------------------------------------------------------------------------- 1 | Infinispan ServerNG 2 | ==================== 3 | 4 | Infinispan ServerNG is a reboot of Infinispan's server which addresses the following: 5 | 6 | * small codebase with as little duplication of already existing functionality (e.g. configuration) 7 | * embeddable: the server should allow for easy testability in both single and clustered configurations 8 | * RESTful admin capabilities 9 | * Logging using [JBoss Logging logmanager](https://github.com/jboss-logging/jboss-logmanager) 10 | * Security using [Elytron](https://github.com/wildfly-security/wildfly-elytron) 11 | 12 | # Layout 13 | 14 | The server is laid out as follows: 15 | 16 | * `/bin` scripts 17 | * `server.sh` server startup shell script for Unix/Linux 18 | * `server.ps1` server startup script for Windows Powershell 19 | * `/lib` server jars 20 | * `infinispan-server.jar` uber-jar of all dependencies required to run the server. 21 | * `/server` default server instance folder 22 | * `/server/log` log files 23 | * `/server/configuration` configuration files 24 | * `infinispan.xml` 25 | * keystores 26 | * `logging.properties` for configuring logging 27 | * User/groups property files (e.g. `mgmt-users.properties`, `mgmt-groups.properties`) 28 | * `/server/data` data files organized by container name 29 | * `default` 30 | * `caches.xml` runtime cache configuration 31 | * `___global.state` global state 32 | * `mycache` cachestore data 33 | * `/server/lib` extension jars (custom filter, listeners, etc) 34 | 35 | # Paths 36 | 37 | The following is a list of _paths_ which matter to the server: 38 | 39 | * `infinispan.server.home` defaults to the directory which contains the server files. 40 | * `infinispan.server.root` defaults to the `server` directory under the `infinispan.server.home` 41 | * `infinispan.server.configuration` defaults to `infinispan.xml` and is located in the `configuration` folder under the `infinispan.server.root` 42 | 43 | # Command-line 44 | 45 | The server supports the following command-line arguments: 46 | 47 | * `-b`, `--bind-address=
` 48 | * `-c`, `--server-config=` 49 | * `-o`, `--port-offset=` 50 | * `-s`, `--server-root=` 51 | * `-v`, `--version` 52 | 53 | # Configuration 54 | 55 | The server configuration extends the standard Infinispan configuration adding server-specific elements: 56 | 57 | * `security` configures the available security realms which can be used by the endpoints. 58 | * `cache-container` multiple containers may be configured, distinguished by name. 59 | * `endpoints` lists the enabled endpoint connectors (hotrod, rest, ...). 60 | * `socket-bindings` lists the socket bindings. 61 | 62 | An example skeleton configuration file looks as follows: 63 | 64 | ``` 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | ``` 118 | 119 | # Logging 120 | 121 | Logging is handled by JBoss Logging's LogManager. This is configured through a `logging.properties` file in the 122 | `server/configuration` directory. The following is an example: 123 | 124 | ``` 125 | loggers=org.jboss.logmanager 126 | 127 | # Root logger 128 | logger.level=INFO 129 | logger.handlers=CONSOLE 130 | 131 | logger.org.jboss.logmanager.useParentHandlers=true 132 | logger.org.jboss.logmanager.level=INFO 133 | 134 | handler.CONSOLE=org.jboss.logmanager.handlers.ConsoleHandler 135 | handler.CONSOLE.formatter=PATTERN 136 | handler.CONSOLE.properties=autoFlush,target 137 | handler.CONSOLE.autoFlush=true 138 | handler.CONSOLE.target=SYSTEM_OUT 139 | 140 | formatter.PATTERN=org.jboss.logmanager.formatters.PatternFormatter 141 | formatter.PATTERN.properties=pattern 142 | formatter.PATTERN.pattern=%d{HH:mm:ss,SSS} %-5p [%c] (%t) %s%E%n 143 | ``` 144 | # Internals 145 | 146 | The following is a dump of various internal facts about the server, in no particular order: 147 | 148 | * All containers handled by the same server share the same thread pools and transport. 149 | * When a server starts it locks the `infinispan.server.root` so that it cannot be used by another server concurrently. 150 | * The `admin-connector` endpoint is a special type of `rest-connector` with additional ops 151 | * The CLI connects to the `admin-connector` using either 152 | * the `local-user` SASL mech provided by Eltryon when running on the same host/user 153 | * any HTTP auth supported by the rest endpoint 154 | 155 | 156 | -------------------------------------------------------------------------------- /Single_port.adoc: -------------------------------------------------------------------------------- 1 | = Single port support for Infinispan Server 2 | 3 | The main goal of a single port support for Infinispan Server is to expose some (if not all) endpoints with a single port. 4 | The functionality is targeted mainly (but not limited to) for Cloud environments (where exposing everything as a single port is very convenient). 5 | 6 | == Single port from technical perspective 7 | 8 | The single port implementation requires a mechanism for switching communication protocols. Such mechanism has already been invented 9 | and implemented in HTTP/1.1 and HTTP/2+TLS/ALPN. Other protocols such as Hot Rod, Web Sockets or Memcached don't support it. In other words 10 | the client will have only one chance to negotiate protocol (switch from HTTP to a custom protocol). 
11 | 12 | === Switching from HTTP/1.0 13 | 14 | https://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7230.html#header.upgrade[Not supported]: 15 | 16 | > A server must ignore an Upgrade header field that is received in an HTTP/1.0 request. 17 | 18 | === Switching from HTTP/1.1 19 | 20 | HTTP/1.1 supports so called https://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7230.html#header.upgrade[Upgrade Header], which allows to re-negotiate 21 | protocol for the same HTTP connection. The re-negotiate request might be sent either by the client or the server. 22 | 23 | .Shortened description of the switching procedure 24 | ................................... 25 | The client sends the Upgrade header with a list of supported protocols. The server responds with `101` Switching protocols and its own list of protocols. 26 | After choosing destination protocol, the client starts sending messages using negotiated connection details. 27 | ................................... 28 | 29 | More information: https://issues.jboss.org/browse/ISPN-6676 30 | 31 | === Switching from HTTP/2.0 32 | 33 | HTTP/2.0 supports only backwards-compatible HTTP/1.1 Upgrade procedure. With HTTP/2, negotiating destination protocol happens during connection initialization - 34 | during https://http2.github.io/http2-spec/#rfc.section.3.2[TLS handshake]. The procedure requires ALPN support. 35 | 36 | .Shortened TLS+ALPN handshake procedure 37 | ................................... 38 | Client Server 39 | 40 | ClientHello --------> ServerHello 41 | (ALPN extension & (ALPN extension & 42 | list of protocols) selected protocol) 43 | Certificate* 44 | ServerKeyExchange* 45 | CertificateRequest* 46 | <-------- ServerHelloDone 47 | Certificate* 48 | ClientKeyExchange 49 | CertificateVerify* 50 | [ChangeCipherSpec] 51 | Finished --------> 52 | [ChangeCipherSpec] 53 | <-------- Finished 54 | Application Data <-------> Application Data 55 | ................................... 56 | 57 | Unfortunately ALPN support has been scheduled for JDK9 (http://openjdk.java.net/jeps/244[JEP244]). Luckily, 58 | https://github.com/undertow-io/undertow/blob/master/core/src/main/java/io/undertow/protocols/ssl/ALPNHackSSLEngine.java[Undertow] as well as 59 | http://netty.io/wiki/requirements-for-4.x.html#tls-with-jdk-jetty-alpnnpn[Netty] implemented some hacks to support it. However with Netty 60 | we still need to modify the boot classpath, which is pretty bad. We might consider using Undertow hack for JDK8. 61 | 62 | More information: https://issues.jboss.org/browse/ISPN-6899 63 | 64 | == Rewriting REST Server 65 | 66 | Our REST implementation is based on https://github.com/resteasy/Resteasy/tree/master/server-adapters/resteasy-netty[RestEasy Netty]. 67 | Unfortunately this needs to be changed in the near future because there are plans for intelligent HTTP/2 clients (with topology 68 | information). An early prototype might be found https://github.com/AntonGabov/infinispan/blob/topologyId/server/rest/src/main/scala/org/infinispan/rest/http2/Http2Handler.java[here]. 69 | 70 | Once the REST Server is rewritten into pure Netty, we could start implementing the single port support. 71 | 72 | == Single port support - the implementation 73 | 74 | The single port implementation will be based on the https://github.com/infinispan/infinispan/tree/master/server/router[multi-tenant router]. 
75 | 76 | As a reminder, the multi-tenant router (or shorter, the router) allows to redirect requests from a single endpoint (might be called frontend) 77 | into multiple `CacheContainer`s and `Cache`s (might be called backends). 78 | 79 | The implementation will be slightly altered to support both https://netty.io/4.1/api/io/netty/handler/codec/http/HttpServerUpgradeHandler.html[HTTP/1.1 Upgrade] 80 | and https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/ssl/ApplicationProtocolConfig.java[HTTP/2+TLS/ALPN]. 81 | 82 | Once the implementation is ready, the server endpoint will need to be altered to support new configuration elements. An exemplary 83 | configuration will look like the following: 84 | 85 | .Router configuration 86 | ---- 87 | 88 | 89 | ... existing multi-tenancy support ... 90 | 91 | 92 | 93 | 94 | 95 | 96 | ---- 97 | 98 | Note that it is perfectly possible to use the single port functionality next to with multi-tenancy (but the same port 99 | which will be used for single-port won't support multi-tenancy). Another interesting 100 | aspect is that REST connector is not necessary for the switching. Since the switching logic will be included 101 | inside the Router, it will be possible to have a single port only with Hot Rod endpoint. In that case a client will need 102 | to connect using HTTP/1.1 or TLS/ALPN and switch to Hot Rod client. 103 | 104 | == Limitations 105 | 106 | There are number of features which won't be available after initial implementation: 107 | 108 | * Embedding a Router endpoint inside a single port. Even though recursion might seem like a good idea, it may lead to deadlock 109 | during server startup (Router `A` depends on `B` whereas Router `B` depends on `A`). As a side effect, the single port endpoints 110 | will not support multi-tenancy at the same time. 111 | * Authentication won't be implemented in the Router. Since the authentication mechanisms are slightly different for 112 | Hot Rod and REST, it would hard to implement all of them in the Router. 113 | * In order to make the implementation simple, switching from HTTP/1.1 to HTTP/2 in the REST interface will also be 114 | implemented by the Router. 115 | 116 | == TODO list 117 | 118 | * [ ] Refactor REST interface to Netty 119 | * [ ] (Optional) Implement missing authentication mechanisms for REST 120 | * [ ] Implement switching logic in the Router 121 | * [ ] Implement multi-protocol Hot Rod client -------------------------------------------------------------------------------- /Scattered-Cache-design-doc.md: -------------------------------------------------------------------------------- 1 | This design doc is out of date - please refer to [package documentation](https://github.com/rvansa/infinispan/blob/ISPN-6645/core/src/main/java/org/infinispan/scattered/package-info.java) 2 | that's part of [preview PR #4458](https://github.com/infinispan/infinispan/pull/4458). 3 | 4 | ## Idea 5 | Distributed caches have fixed owners for each key. Operations where originator is one of the owners require less messages (and possibly less roundtrips), so let's make the originator always the owner. In case of node crash, we can retrieve the data by inspecting all nodes. 6 | 7 | To find the last-written entry in case of crashed primary owner, entries will keep write timestamps (counter value rather than wall-clock time) in metadata. These timestamps also determine order of writes; we don't have to use locking anymore. 
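A minimal sketch of the timestamp idea; as explained below, the counter is kept per segment rather than per entry, and all names here are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

final class SegmentTimestamps {
    private final AtomicLong[] counters;

    SegmentTimestamps(int numSegments) {
        counters = IntStream.range(0, numSegments)
              .mapToObj(i -> new AtomicLong())
              .toArray(AtomicLong[]::new);
    }

    // Monotonically increasing value recorded in the entry's metadata on every write.
    long nextTimestamp(int segment) {
        return counters[segment].incrementAndGet();
    }
}
```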
8 | 9 | For the time being, let's implement only the case resilient against one node failure (equivalent to 2 owners in distributed cache). 10 | 11 | * + faster writes 12 | * + no locking contention 13 | * - reads always have to go to the primary (slower writes in small clusters) 14 | * - complex reconciliation during rebalance 15 | 16 | ## Name 17 | I believe that this behaviour definitely desires its own cache mode, not just a configuration attribute. As much as I'd like to call it Radim's Cache as Bela proposed, for practical reasons let's use Scattered Cache. If anyone has better name, please suggest it. 18 | 19 | ## Operations 20 | I'll assume non-transactional mode below. Transactional mode is quite similar, though (but non-tx will be POCed/implemented first). 21 | 22 | We need to keep tombstones with timestamp after entry removal. Luckily, these tombstones have limited lifespan - we keep them around only until the invalidations are applied on all nodes. 23 | 24 | The timestamps have to grow monotonically; therefore the timestamp counter won't be per-entry but per segment (as tombstone will be eventually removed, per-entry timestamp would be lost). 25 | 26 | ### Single entry write (put, getAndPut, putIfAbsent, replace...) 27 | * originator == primary owner: 28 | 29 | 1. Primary increments timestamp for segment 30 | 2. Primary commits entry + timestamp 31 | 3. Primary picks one node (let's say the next member) and sends backup RPC 32 | 4. Backup commits entry 33 | 5. Backup sends RPC response 34 | 6. Primary returns after it gets the response 35 | 7. Primary schedules invalidation of entry with lower timestamps 36 | * This requires just one sync RPC 37 | * Selection of backup could be random, but having it ~fixed probably reduces overall memory consumption 38 | * Updating value on primary before backup finishes does not change data consistency - if backup RPC fails, we can't know whether backup has committed the entry and so it can be published anyway. 39 | 40 | * originator != primary owner 41 | 42 | 1. Origin sends sync RPC to primary owner 43 | 2. Primary checks op if conditional 44 | 3. Primary increments timestamp for segment 45 | 4. Primary commits entry + timestamp 46 | 5. Primary returns response with timestamp (+ return value if appropriate) 47 | 6. Origin commits entry + timestamp 48 | 7. Origin schedules invalidation of entry with lower timestamps 49 | * Invalidation must be scheduled by origin, because primary does not know if backup committed 50 | 51 | ### Single entry read 52 | * originator == primary owner: Just local read 53 | * originator != primary owner: 54 | 55 | 1. Origin locally loads entry + timestamp with SKIP_CACHE_LOAD 56 | 2. Origin sends sync RPC including the timestamp to primary 57 | 3. Primary compares timestamp with it's own 58 | * If timestamp matches, value is not sent back 59 | * If timestamp does not match, value + timestamp has to be sent 60 | 61 | * Optional configuration options: 62 | * Return value after 1) if present (risk of stale reads) 63 | * Store read value locally with expiration (L1 enabled) - as invalidations are broadcast anyway, there's not much overhead with that. This will still require RPC on read (unless stale reads are allowed) but not marshalling the value. 64 | 65 | ### Multiple entries writes 66 | 67 | Entries for which this node is the primary owner won't be backed up just to the next member, but to a node that is primary owner of another entries in the multiwrite. 
That way we can spare some messages by merging the primary -> backup and origin -> backup requests. 68 | 69 | ### Multiple entries reads 70 | 71 | Same as single entry reads, just merge RPCs for the same primary owners. 72 | 73 | ## Invalidations 74 | 75 | It would be inefficient to send invalidations (key + timestamp) one-by-one, so these will be merged and sent in batches. Invalidations from other nodes can remove scheduled 'outdated' invalidations, but it requires additional complexity and the overall gain is debatable. 76 | 77 | ## State Transfer 78 | 79 | ### Node leave/crash 80 | 81 | Operations after node crash are always driven by the new primary owner of given segment 82 | 83 | * If it was the primary owner in previous topology: 84 | * Replicate all data from this segment to the next node 85 | * Send invalidate request with all keys + timestamps to all nodes but the next one 86 | * The above is parallel and processed in chunks. 87 | * Write to entry can proceed in parallel with this process; ST-invalidation cannot overwrite newer entry, though write-invalidation can arrive before the ST-replication (then we would have 3 copies of each entry). 88 | * After chunk is replicated (synchronously), primary owner compares actual timestamps with those replicated and sends invalidations to the backup with the modified timestamps 89 | * If the primary owner has just become a new primary owner: 90 | * Synchronously retrieve highest timestamp for this segment from all nodes 91 | * All writes are blocked until the timestamp is retrieved 92 | * Request all nodes to send you all keys + timestamps from this segment (and do that locally as well) 93 | * Compute max timestamps for each key 94 | * Retrieve values from nodes with highest timestamp, send invalidation to others 95 | * If there is a concurrent write during the retrieval, primary owner has to retrieve key + timestamp (+ value if needed) from all nodes, write can proceed only after all nodes respond 96 | * Second write during the ST to the same entry can be processed without the all-nodes-read. This can be achieved by storing the timestamp retrieved at the start of ST - if local entry timestamp is higher than this value, the write can safely proceed. 97 | 98 | ### Clean rebalance (after join, no data is lost) 99 | 100 | If node has become primary owner: 101 | * Retrieve segment timestamp counter value + old owner starts rejecting further reads & writes 102 | * Send request for all keys + timestamp + value to primary owner 103 | * Update local values as these come 104 | * Write is similar to the crash case, but the value is retrieved just from primary owner 105 | * Request old owner to remove/L1-ize all entries 106 | 107 | ### Crash during clean rebalance 108 | 109 | This shouldn't be different from regular crash, but there might be technical difficulties to cancel the ongoing ST. Node is always considered a new primary owner for a segment that had unfinished ST, since the previous new owner did not have all entries but had some new modifications. 110 | 111 | ## Open questions 112 | 113 | * Do we need remote-thread pool when we don't use locks? 114 | 115 | -------------------------------------------------------------------------------- /Remote Command Handler.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | In clustered caches, the data is always replicated to other nodes. When a node receives a command, it is processed immediately. 
However, this approach has some issues: the commands are not independent of each other (they can depend on or conflict with other commands), which can make the processing thread enter a waiting state. The worst scenario occurs when all threads are waiting for an event and none is available to process the notification. In this scenario, the cluster blocks and exceptions start to be thrown (usually `TimeoutException`).
4 | 
5 | # Goal
6 | 
7 | Implement a "smarter" algorithm to handle remote commands. The algorithm will order the commands in order to minimize conflicts and waiting between them.
8 | 
9 | # Algorithm
10 | 
11 | The following explains how remote commands are handled. A command is considered "blocking" if it needs to wait for an external event, such as a lock release, a reply from another node, etc.
12 | 
13 | ## Introducing the `RemoteLockCommand`, the `RemoteLockOrderManager` and the `LockLatch`
14 | 
15 | The `RemoteLockCommand` is the interface that the `RpcCommands` must implement to be handled by this algorithm. This interface has a single method that returns the key(s) to lock, `Collection getKeysToLock()`.
16 | 
17 | The `RemoteLockOrderManager`'s main goal is to order the commands. Two or more commands conflict if they update one or more common keys. Since the lock can only be acquired by a single command, the `RemoteLockOrderManager` will define which command advances while the others wait. The interface only has a single method, `LockLatch order(RemoteLockCommand)`.
18 | 
19 | The `LockLatch` is a latch that notifies when the command can be processed. Also, it notifies when the command is done with its work and unblocks other commands. It has two methods:
20 | 
21 | * `boolean isReady()`: returns `true` when it is probable that the `Lock` is released.
22 | * `void release()`: must be invoked after the command's work is done and the `Lock` is released. This will notify waiting commands and may unblock other commands.
23 | 
24 | Finally, when a command is ready, it is sent to the remote executor service.
25 | 
26 | ## How is the handling done?
27 | 
28 | ### Read Command
29 | 
30 | No changes are made. When a read command is delivered, it is processed directly in the JGroups executor service.
31 | 
32 | ### Non-Transactional Cache
33 | 
34 | We have two types of write commands: the ones delivered on the primary owner, which acquire the lock and send the update to the backup owners; and the ones delivered on the backup owners. Since the latter do not block, they are processed immediately in the JGroups executor service. The former are managed by the `RemoteLockOrderManager` and processed in the remote executor service.
35 | 
36 | **State Transfer**
37 | No changes needed. If the command is processed in the wrong topology id, an exception is thrown and the command is retried. The `LockLatch` and the `Lock` are released before the retry.
38 | 
39 | **Update: 16/09**
40 | Needs to be implemented for the `PutMapCommand` and the `ClearCommand`. They will be similar to the description above, but the `LockLatch` implementation will be a collection of single-key `LockLatch`es.
41 | 
42 | ### Transactional Cache
43 | 
44 | **Update: 16/09**
45 | Not implemented yet...
46 | 
47 | #### Pessimistic Caches
48 | 
49 | In pessimistic caches, the algorithm handles the following commands:
50 | 
51 | * `LockControlCommand` is managed by the `RemoteLockOrderManager` and the `LockLatch` is associated to the transaction. Later the command is processed in the remote executor service.
52 | * `PrepareCommand` is processed directly in JGroups executor service if L1 is **disabled**. In this case, it will not acquire any locks neither wait for any other events. But if L1 is **enabled**, it is processed in remote executor service because it needs to invalidate the L1 (synchronous operation). Note that it isn't managed by the `RemoteLockOrderManager` and it releases the `LockLatch`es associated to the transaction. 53 | * `RollbackCommand` processed directly in JGroups executor service andit releases the `LockLatch`esassociated to the transaction. 54 | 55 | **State Transfer** 56 | It needs some mechanism to create `LockLatch` when the transactions transferred. Note that the `LockLatch` does not need to send to the other node, but the transaction needs to create open `LockLatch` in the new primary owner. _(to think better about it)_ 57 | 58 | #### Optimistic Caches 59 | 60 | In optimistic caches, the algorithm handles the following commands: 61 | 62 | * `PrepareCommand` it is managed by `RemoteLockOrderManager` and are processed in the remote executor service. The `LockLatch` is associated to the transaction. 63 | * `CommitCommand` processed directly in JGroups executor service when L1 is **disabled**. When L1 is **enabled**, it is processed in remote executor service since it needs to invalidate the L1 (synchronous operation). Also, it releases the `LockLatch`es associated to the transaction. 64 | * `RollbackCommand` processed directly in JGroups executor service and it releases the `LockLatch`es associated to the transaction. 65 | 66 | **State Transfer** 67 | The same as in pessimistic caches _(think...)_. 68 | 69 | # Changes in default configuration 70 | 71 | The default configuration uses a non-queued executor service with a large number of thread. With this new algorithm, it would be better to have a large queue and a reasonable number of threads. 72 | 73 | Also, it would be good to merge the total order executor service with the remote executor service. In the end, they both have the same goal and it is not needed to configure two executor services. 74 | 75 | # Known Issues 76 | 77 | If the remote executor service queue is not large enough to process all the "blocking" commands, it can hit the same problem again. The algorithm assumes that "blocking" commands are processed in remote executor service and the remaining in JGroups executor service. 78 | 79 | An idea to solve this, is the executor service to expose their internal state (running threads and queue occupation). This way, it would make it possible to the algorithm to queue a bit longer the commands in order to prevent an overload of the executor service. 80 | 81 | ## L1 invalidation deadlock 82 | 83 | When L1 is enabled, each update (or transaction) will generate an invalidation message to invalidate the L1 cache in non-owners. 84 | 85 | **non transactional caches** 86 | It has at least 3 messages to be processed in the same pool. The message sent to the primary owner to lock generates an invalidation message and a message to the backup owners. At the same time, the backup owners generate another invalidation message. when the invalidation is synchronous, the executor service may be full with threads awaiting the ACK from invalidation (or from the backup owners) and no threads are available to process other requests. 87 | 88 | **transactional caches** 89 | The commit generates an invalidation message. If the executor service is full, the same problem as described above can occur. 
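
The failure mode in both cases is the classic same-pool request/response deadlock: every thread in the pool is blocked waiting for a reply that can only be processed by that same pool. The toy example below reproduces the pattern with plain `java.util.concurrent` primitives; it only illustrates the shape of the problem and is not Infinispan code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/**
 * Illustration only: every pool thread synchronously waits for an "ACK" task that can
 * only run on the same, already-full pool, so nothing makes progress until the waits
 * time out - the same shape as the L1 invalidation deadlock described above.
 */
public class SamePoolDeadlockDemo {
   public static void main(String[] args) throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(2);
      for (int i = 0; i < 2; i++) {
         pool.submit(() -> {
            // Simulates a remote command that sends a synchronous invalidation and then
            // blocks for the ACK, which is queued behind it on the same pool.
            Future<String> ack = pool.submit(() -> "ACK");
            return ack.get(5, TimeUnit.SECONDS); // times out: no free thread left to produce the ACK
         });
      }
      pool.shutdown();
      pool.awaitTermination(10, TimeUnit.SECONDS);
   }
}
```

Increasing the pool size only postpones the problem, which is why the algorithm keeps non-blocking commands on the JGroups pool and routes potentially blocking ones to a separate, queued remote executor.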
90 | 91 | ## State Transfer Forwarding 92 | 93 | This problem happens only with transactional caches. When a topology changes, the nodes involved in the transaction will forward the command (prepare and commit) for the new owners. Since the forward commands are processed in the same executor service and the "non-forward" commands, it can happen the executor service to be full. -------------------------------------------------------------------------------- /Clustered-listeners.md: -------------------------------------------------------------------------------- 1 | #Context 2 | The current(as of Infinispan 6.0) eventing support in Infinispan has the following limitation: the registered listeners are local to the node where the event has produced. E.g. considering an distributed cache with _numOwners=2_ which runs on nodes {A,B,C,D,E} and _K_ being a key in the cache, with owners(K)={A,B}. A [@CacheListener](http://docs.jboss.org/infinispan/6.0/apidocs/org/infinispan/server/websocket/CacheListener.html) instance registered on node C does not receive notifications on any events related to the key _K_ such as the entry being created, removed or updated. It is only the **local** listeners in A and B which receive this notifications. 3 | 4 | This wiki page describes an design approach for implementing **cluster** listeners: listeners that are to be notified on certain events disregarding where the events actually occurred. Resuming the previous example, a **cluster** listener registered on C can subscribe to notifications of events originated by adding, removing or updating _K_. 5 | 6 | ## Befits 7 | There are several problems that **clustered** listeners help solving: 8 | * [materialized views] (http://en.wikipedia.org/wiki/Materialized_view): for queries that are run very frequently against the grid (e.g. few k times a second) instead of repeating the query on every request, one can register a cluster listener that defines the query and keep track of the query result. This fits very nicely for situations where the query is run frequently but the result of the query is amended rarely. 9 | * [Complex Event Process (CEP)] (http://en.wikipedia.org/wiki/Complex_event_processing) because the listeners allow keeping state between invocations, certain basic CEP conditions can be implemented within the listeners. For more complex state processing (or in order be allowed to define CEP logic dynamically) an proper CEP engine can be plugged in, such as [Drools](http://www.jboss.org/drools/) 10 | 11 | #Suggested solution 12 | For consistency and simplicity we plan to build the cluster listener logic on top of the existing [@CacheListener](http://docs.jboss.org/infinispan/6.0/apidocs/org/infinispan/server/websocket/CacheListener.html) API. 13 | 14 | ##API changes 15 | 16 | The existing `Cache` API adds an overloaded `addListener` method with the following signature: 17 | ```java 18 | /** 19 | * Registers a listener that will be notified on events that pass the filter condition. 20 | * @param filter decides which events should be feed to the listener. Particularly relevant for cluster 21 | * listeners as it reduces the network traffic by only propagating the relevant events to the node where the 22 | * listener is registered. 23 | * @param converter creates a projection on the filtered entry's value which is then returned by the 24 | * CacheEntryCreatedEvent#getValue(). 
Particularly relevant for clustered listeners in order to reduce the 25 | * size of the data sent by the event originating node to the event consuming node (where the listener is 26 | * located). 27 | */ 28 | Cache.addListener(Object listener, Filter filter, Converter converter); 29 | ``` 30 | 31 | The listener annotation is enhanced with two new attributes: 32 | ```java 33 | public @interface Listener { 34 | /** 35 | * Defines whether the annotated listener is clustered or not. 36 | * Important: Clustered listener can only be notified for @CacheEntryRemoved, @CacheEntryCreated and 37 | * CacheEntryModified events. 38 | */ 39 | boolean clustered() default false; 40 | 41 | /** 42 | * For clustered listeners only: if set to true then the entire existing state within the cluster is 43 | * evaluated. For existing matches of the value, an entryAdded events is triggered against the listener 44 | * during registration. 45 | **/ 46 | boolean includeCurrentState() default false; 47 | } 48 | ``` 49 | 50 | ```java 51 | /** 52 | * Accepts or rejects an entry based on key, value and metadata. In the case of entries being removed, the 53 | * value and the metadata are null. 54 | */ 55 | interface Filter { 56 | boolean accept(K key, V value, Metadata metadata) 57 | } 58 | ``` 59 | 60 | ```java 61 | /** 62 | * Builds a projection of V which is send to the node where the listener is registered. 63 | */ 64 | interface Convertor { 65 | C convert(K,V, Metadata metadata); 66 | } 67 | ``` 68 | 69 | ##Lifecycle 70 | The (cluster) listener can be registered with the `Cache.addListener` method described above and is active till one of the following two events happen: 71 | * it is explicitly unregistered by invoking `Cache.removeListener` 72 | * the node on which the listener was registered crashes 73 | 74 | ##Guarantees 75 | ###Ordering 76 | For cluster listeners, the the order in which the cache is updates is reflected in the sequence of notifications received. 77 | ###Singularity 78 | The cluster listener does not guarantee that an event is sent once and only once. Implementors should guard agains a situation in which the same event is send twice or multiple times, i.e. the listener implementation should be idempotent. Implementors can expect singularity to be honored for stable clusters and outside of the time span in which synthetic events are generated as a result of _includeCurrentState_. 79 | 80 | ##Internals 81 | The sequence diagram below describes how we plan to implement listeners. 82 | ![Generated with Astah community](https://lh5.googleusercontent.com/na8Tm1WbHyhyQm0tnydnyFXBSo5mFHigg2mWdirFgcoF5JlkibOpbYE_7ru_Np3mNrb0cNguVUM=w1256-h898) 83 | * 1.1: the filter attached to the listener is broadcasted to all the nodes in the cluster. 
Each node holds a structure Filter -> {Node1, NodeX} mapping the set of nodes interested in events matching the given filter 84 | * 1.1.1.checkFilterAlreadyRegistered is a potential optimization for the case in which the same filter is used by multiple listeners within the cluster 85 | * 2: for each write request, the the registered filters are invoked and the corresponding remote listener(s) are notified 86 | 87 | ###Handling includeCurrentState=true 88 | * in parallel with 1.1.1 a Map/Reduce command is run against the cluster that would feed @CacheEntryCreated events into the listener instance 89 | * after 1.1.1 completes, for every operation matching the filter, instead of 2.1.1:notifyRemoteFilterOnMatch, the node keeps a (bounded) queue of the modifications for future processing 90 | * after the Map/Reduce task is finalized the node where the listener is installed broadcasts a message to the cluster allowing the queue messages to be flushed 91 | * this approach guarantees ordering 92 | 93 | ###Filter housekeeping 94 | If a node crashes, then the filters originated from that node are removed from the list (assuming the filter is not in use by listeners registered on other nodes). This can be implemented using a ViewChangeListener. 95 | 96 | ###Handling topology changes 97 | Cluster registration information should be propagated as part of the state transfer, for new joining nodes. This would be an additional step in the state transfer process, together with the memory state and the persistent state. Other functionality might need to transfer state as well (e.g. remote listeners to transfer the remote listener information) so we might want to make this a pluggable mechanism. 98 | 99 | #Related 100 | JIRA: [ISPN-3355] (https://issues.jboss.org/browse/ISPN-3355) 101 | A relevant [discussion](http://markmail.org/thread/qanjofgmpyvdjnmg) on the mailing list is (even though interlaced with the partitioning one). -------------------------------------------------------------------------------- /A-continuum-of-data-structure-and-query-complexity.asciidoc: -------------------------------------------------------------------------------- 1 | :linkcss!: 2 | :source-highlighter: highlightjs 3 | :toc: 4 | 5 | This essay is a discussion on the various data structure complexities, 6 | and how to implement various queries on top of them. 7 | 8 | ## A continuum of data structure complexity 9 | 10 | Data as *blob* is the simplest form. 11 | Being opaque, the datastore cannot do much about it. 12 | 13 | Data as *map* (set of property/values) offers the ability for the datastore to expose each "property". 14 | 15 | Data as a *self contained object graph* is more flexible than map. 16 | *Embedded objects* and collection of embedded objects are expressible. 17 | A per entry or shared *schema* can be imposed on entries and offer validation. 18 | Note that data is not duplicated between different keys here. 19 | 20 | Data as a set of *connected objects* offer the most flexibility. 21 | While each entity can contain (collections of) embedded objects, 22 | they can also connect to other entities. 23 | 24 | [NOTE] 25 | .Critical definitions 26 | ==== 27 | An embedded object lifecycle is entirely dependent of its owning object. 28 | An entity has an independent lifecycle compared to other objects. 29 | 30 | * Entities connected to each other are connected objects. 31 | * Entities only pointing to embedded objects are self contained object graphs. 
32 | ==== 33 | 34 | For reference, document stores tend to offer the self contained object graph approach. 35 | Connected objects have to be built on top by the application. 36 | 37 | ### OO 38 | 39 | OO implies polymorphism, so when someone looks up or query an +Animal+, 40 | it can be a +Cat+, +Dog+, +Human+ etc. 41 | Some queries are addressed to a specific subtype. Some are to the higher type. 42 | 43 | ## A continuum of query 44 | 45 | This section describes the possible approaches to implement query on your data set. 46 | This realistically imposes restrictions on how query-able data can be stored on a given system. 47 | 48 | I am not trying to list all possible techniques and optimizations. 49 | Enough to get a cost and understanding of what is at stake. 50 | In particular and for brievety, I focus on the filtering part of a query, not the aggregation: 51 | +from ... where ...+ as opposed to +select ... group by ... having+. 52 | 53 | ### Query by primary key 54 | 55 | When data is a blob, the efficient way to query is by its *primary identifier* (key in the data grid). 56 | All other queries require one or more passes of *full scan*. 57 | 58 | ### Query by property value 59 | 60 | When data is a map and to most extend a self contained object graph, 61 | you can additionally query data by their *property values*. 62 | To do that efficiently, an index (property-value->primary key) is maintained. 63 | For a relatively modest cost at CUD time, we avoid the full scan of the data set. 64 | An index can also offer additional features like full-text search. 65 | 66 | An index can be maintained manually by the application or delegated to the datastore. 67 | 68 | IMPORTANT: 69 | Full scan are to be avoided if possible, especially when part of the data set is passivated. 70 | Even when fully in memory, a full scan can keep a thread busy for a while compared to an index. 71 | 72 | An alternative is to physically store data ordered by the property value we are looking for. 73 | This is a trick often used for time series. 74 | This is a single-shot pistol though as you can only select one property. 75 | 76 | For the curious, systems like Google Dremel store data completly differently 77 | and pay the cost at different time. 78 | 79 | ### Query between entities 80 | 81 | Queries between related entities can be done in several ways. 82 | 83 | Materialized view:: 84 | A structure that physically represents the results of a query. 85 | It has to be maintained. 86 | When a CUD happens, all related materialized views need to be updated. 87 | This is essentially a denormalization. 88 | 89 | Data denormalization:: 90 | Store the entity A and its associated entities B (and C and D...) under one key. 91 | When you denormalize, something must keep the duplicated copies synchronized. 92 | The benefit is that these queries are becoming "as simple as" query by property value. 93 | 94 | Index denormalization:: 95 | Index the necessary object graph and store that information in the index. 96 | For a given entity A, the list of Bs (and Cs and Ds) associated will be stored in the index. 97 | To achieve that, something must know that A is associated to these Bs and Cs and must maintain the index. 98 | The benefit is that these queries are becoming "as simple as" query by property value using indexes. 99 | 100 | Computed via full scan (on the fly join):: 101 | This scenario is orders of magnitude worse than the full-scan for a given query by property value. 102 | Assuming a join between A and B. 
103 | For each matching A, you need to find and load the matching Bs. 104 | If the join is applied on the non primary keys, we are looking at best at 2 full scans. 105 | And in a distributed world, the full scan must encompass all primary nodes. 106 | Multiple joins leads to an exponential explosion. 107 | 108 | All of the above:: 109 | In practice, a datastore needs to implement a combination of these techniques. 110 | A query planner is then necessary to decide which technique to use for a given part of a query and data set. 111 | This planner must be fed with data statistics and with the relations between these entities and properties. 112 | Pretty much like a RDBMS. 113 | 114 | Note that when one needs to load a set of data (entities A) to find the related set of data (entities B), 115 | one often end up with something called the n+1 problem. n+1 lookups must be performed. 116 | 117 | ## A continuum of use cases (kind of the conclusion) 118 | 119 | If the system only accesses data by its primary key, life is easy. 120 | 121 | If the system only has maps or self contained object graphs, 122 | the data for a given entity type is all contained in the same cache 123 | and indexes can be built to speed common queries. 124 | Full scan can cover the rest with a higher price. 125 | 126 | If the system stores connected entities (whether in the same cache or not), 127 | life becomes complicated: 128 | 129 | * you need to write a pretty *smart query planner / executor* or be shit slow 130 | * more importantly, *something needs to know about the relations between the entities* and 131 | maintain the denormalization structures (index, actual data duplication etc) 132 | ** to maintain an index with denormalized data 133 | ** to maintain physically denormalized data 134 | ** because multiple full scans for each query should be a non starter 135 | 136 | That something can be: 137 | 138 | 1. the user application (manually) 139 | 2. a framework or paradigm which deals with interconnected entities and store them in the datastore 140 | 3. the datastore itself 141 | 142 | Infinispan is not in the business of 3. 143 | It is not a relational database. 144 | And the currently exposed API are a long way from achieving this. 145 | 146 | The user would be hard pressed to implement and maintain the data denormalizations we are talking about 147 | unless they are very limited and don't evolve. 148 | Same when it comes to use this denormalized data to write efficient queries. 149 | I suspect that in practice, this is a nightmare. 150 | 151 | That leaves a framework or paradigm dealing in interconnected entities. 152 | 153 | I believe a *cache API and Hot Rod are well suited to address up to the self contained object graph* use case 154 | with a couple of relations maintained manually by the application but that cannot be queried. 155 | 156 | For the connected entities use case, only a high level paradigm is suited like JPA. -------------------------------------------------------------------------------- /Conflict-resolution-perf-improvements.md: -------------------------------------------------------------------------------- 1 | Conflict Resolution Performance Improvements 2 | ==================== 3 | Currently conflict resolution (CR) performance is dog slow, due to the following: 4 | 5 | 1. All segments and their cache entities are checked 6 | - Performance degrades rapidly as the size of the cluster increases 7 | 8 | 2. 
Centralised CR 9 | - The merge coordinator is responsible for comparing all cache entities 10 | - In a distributed cache, this requires the coordinator to retrieve all segments from both partitions in order 11 | to compare them 12 | - User triggered CR occurs on the node the request is made 13 | 14 | 3. No parallelism 15 | - A single segment is processed at a given time in order to avoid the centralised CR coordinator suffering a OOM exception 16 | 17 | # Proposals 18 | The following proposals are listed in the order of the anticipated performance benefit, with those with greatest impact 19 | presented first. 20 | 21 | ## Maintaining a Merkle tree 22 | To minimise the number of entries that need to be checked during CR, we should maintain a Merkle tree of cache entry hashes 23 | on a per segment basis, with the root of the tree being a hash of all the entry hashes in the segment. Given the root hash 24 | of a segment from three different nodes, we can determine that no conflict exists simply by checking if all three values 25 | are equal. If any of the three hashes differ, then it's then necessary to compare the segment's entries and perform CR. 26 | Amazon Dynamo DB utilises this technique https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf. 27 | 28 | The simplest solution for producing such a tree, is to simple have a tree of depth 2. Where each cache entry is a leaf node 29 | and the root is a hash of all leaf hash values. This can be easily implemented by simply iterating over the segment container 30 | in order to calculate the hash of individual entries and the combined hash. 31 | 32 | TRACKED BY: [ISPN-8412](https://issues.redhat.com/browse/ISPN-8412) 33 | 34 | ### Hash to use 35 | MurmurHash3? 36 | 37 | ### Calculating the entry hash 38 | Simply hash the `.hashcode()` returned by the `InternalCacheEntry` implementation. 39 | 40 | The `equals` and `hashcode` methods of our `InternalCacheEntry` implementations should be updated to take into account 41 | the `EntryVersion` stored in the Metadata. Currently this is ignored by CR, however including the check would allow the 42 | following conflicts to be resolved deterministically via `LATEST_VERSION` resolution strategy: 43 | 44 | ```java 45 | put(k, v, 1) // 1. put k/v with version 1 46 | put(k, v', 2) // 2. 47 | put(k, v, 3) // 3. Missed by partition 1 48 | ``` 49 | By comparing the entry versions during CR, it's possible to ascertain that the original value missed by a partition in 50 | step 3 is in fact the latest version. 51 | 52 | Even if a different merge strategy was used, maintaining an entries `EntryVersion` value is necessary in order for 53 | HotRod conditional operations to work as expected for the winning value post CR. 54 | 55 | ### Calculating the tree 56 | The computation of non-leaf nodes should only occur at the start of CR, as the additional iteration of entries would 57 | adversely affect cache write operation performance. Furthermore, it's necessary to ensure that both the in-memory and 58 | store entries are included in the creation of the tree. The cost of calculating invidual leaf node hashes should be minimal, 59 | depending on the hash used, so this could be computed actively or lazily. 60 | 61 | Cache operations that occur during the CR phase should be treated as the latest value and should overwrite any writes 62 | that occurr as part of CR. Therefore, once the tree has being created at the start of CR it's state should be immutable 63 | until CR is resolved/aborted. 
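
To make the depth-2 tree concrete, a minimal sketch of the segment root computation is shown below. The class and the `mix` helper are illustrative stand-ins (MurmurHash3 would be used in practice); they are not actual Infinispan APIs.

```java
import java.util.Map;

/**
 * Sketch only: a depth-2 Merkle "tree" for one segment. Each leaf is the hash of an
 * entry (which should incorporate the EntryVersion, as noted above) and the root is
 * a combination of all leaf hashes.
 */
final class SegmentRootHash {

   /** Computes the segment root from the in-memory and store entries of that segment. */
   static int rootHash(Iterable<? extends Map.Entry<?, ?>> segmentEntries) {
      int root = 0;
      for (Map.Entry<?, ?> entry : segmentEntries) {
         // Leaf hash: in the design this would be the InternalCacheEntry.hashCode().
         int leaf = mix(entry.hashCode());
         // Addition keeps the root independent of the container's iteration order.
         root += leaf;
      }
      return mix(root);
   }

   // Stand-in for MurmurHash3 finalization; any well-distributed int mixer works here.
   private static int mix(int h) {
      h ^= h >>> 16;
      h *= 0x85ebca6b;
      h ^= h >>> 13;
      h *= 0xc2b2ae35;
      h ^= h >>> 16;
      return h;
   }
}
```

With this in place, each node only ships one `int` per segment at the start of CR, and a segment's entries are exchanged and compared only when the root hashes differ.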
64 | 65 | ### Increasing Merkle Tree Depth 66 | A segment root with all entries being it's children is not very efficient when a segment contains a large number of values, 67 | as when an inconsistency is discovered, it's still necessary to send all entries within the segment over the wire. Increasing 68 | the depth of the tree means that it's possible to perform more fine-grained consistency checks, with a larger tree depth 69 | and smaller amount of leaf nodes resulting in less entries having to be sent over the wire when inconsistencies do occur. 70 | 71 | #### Constant lookup/removal tree - Depth 3 72 | This is a simple implementation which adds an additional layer of depth to the tree in order to increase CR granularity: 73 | 74 | * A segment is further divided into `X` nodes, which are the root's childen 75 | - These nodes represent a range of a cache's `key` hashcodes, e.g. 0..5, 6..10 etc 76 | - `(hash(key) & Integer.MAX_VALUE) \ X` 77 | - MurmurHash3 can also be used here 78 | - Where `X` could be configurable, but should probably just be an internal constant 79 | 80 | * Each node contains: 81 | - A Map to store leaf nodes, which enables amortised constant lookup and to accomodate `.hashcode()` conflicts on keys 82 | - A node hash field containing a hash of all the leaf node's hashes combined 83 | 84 | * The tree, excluding leafs, can be represented as a int and int[X] (Root+X hashes) in a RPC 85 | 86 | * During CR if two tree's root values don't match, it's then possible to compare all `X` hashes and only request the indexes 87 | of the array which have conflicting hashes. At which point the participating nodes only return the InternalCacheEntries 88 | associated with those nodes. 89 | 90 | ### Offline Backup Consistency Check 91 | The Merkle tree hashes can also be utilised to ensure the consistency of data being dumped to an offline backup. When a 92 | user initiates a backup: 93 | 94 | 1. The cluster initiates CR to ensure that no inconsistencies exist before the backup 95 | 2. A new Merkle tree is created, with the root node being the hash of all primary replica segment hashes 96 | 3. The root hash is stored as part of the backup metadata 97 | 4. When a cluster is restored from a backup, the clusterwide Merkle tree can be recreated and the new root hash can be 98 | compared to the value stored in the metadata. 99 | 100 | > Utilising Merkle trees in this manner would mean that it's not possible to make changes to how entries are hashed 101 | and how the tree is created without an appropriate migration strategy between Infinispan versions. 102 | 103 | ## Distributed CR processing 104 | Instead of CR being coordinated by a single node, the coordinator should instruct the primary owner of each segment to 105 | initiate CR. As it's likely that a single node will be the primary for multiple segments in a small cluster, we should 106 | still limit each node to executing CR for a single segment at a time. 107 | 108 | Distributing CR also benefits from the use of Merkle trees, as it means that the primary owner in the preferred partition 109 | who is coordinating the CR would not have to send their tree over the wire. 110 | 111 | TRACKED BY: [ISPN-9084](https://issues.redhat.com/browse/ISPN-9084) 112 | 113 | ## Prioritise segments based upon requests during merge 114 | During a partition merge, if CR is in progress and a request is made on a specific key before it's segment has been processed, 115 | then we should perform CR on the key in place. 
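
In concrete terms, the read path during a merge could look roughly like the sketch below; `MergeConflictManager`, `isResolutionPending` and `resolveKey` are hypothetical names used for illustration, not the existing conflict-resolution API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Function;
import java.util.function.ToIntFunction;

/**
 * Sketch of "CR on the key in place": if the key's segment has not yet been processed
 * by the merge-wide conflict resolution, resolve just that key before serving the
 * request. All names here are illustrative.
 */
final class InPlaceConflictResolution<K, V> {

   interface MergeConflictManager<K> {
      /** true while the segment is still queued for merge-time conflict resolution */
      boolean isResolutionPending(int segment);

      /** fetch the key's versions from both partitions, apply the merge policy, write the winner */
      CompletionStage<Void> resolveKey(K key);
   }

   private final MergeConflictManager<K> conflictManager;
   private final ToIntFunction<K> segmentOf;                  // e.g. the cache's key partitioner
   private final Function<K, CompletableFuture<V>> readPath;  // the normal read invocation

   InPlaceConflictResolution(MergeConflictManager<K> conflictManager, ToIntFunction<K> segmentOf,
                             Function<K, CompletableFuture<V>> readPath) {
      this.conflictManager = conflictManager;
      this.segmentOf = segmentOf;
      this.readPath = readPath;
   }

   CompletionStage<V> get(K key) {
      if (conflictManager.isResolutionPending(segmentOf.applyAsInt(key))) {
         // Resolve this key ahead of its segment, then serve the read as usual.
         return conflictManager.resolveKey(key).thenCompose(ignored -> readPath.apply(key));
      }
      return readPath.apply(key);
   }
}
```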
116 | 117 | The advantage is that conflicts on an active entitiy are resolved sooner, furthemore, if (Merkle Trees)[Maintaining-a-Merkle-tree>] 118 | are implemented, it potentially could result in a conflict(s) being resolved before the segment is checked, meaning all entries 119 | in the segment do not have to be retried. 120 | 121 | The disadvantages are the increased complexity, which will only increase if (#Distributed CR processing) is implemented. 122 | 123 | TRACKED BY: [ISPN-9079](https://issues.redhat.com/browse/ISPN-9079) 124 | -------------------------------------------------------------------------------- /RAC-implementation.md: -------------------------------------------------------------------------------- 1 | Infinispan Reliable Asynchronous Clustering 2 | =========================================== 3 | 4 | Original Document: [RAC: Reliable Asynchronous Clustering](RAC%3A-Reliable-Asynchronous-Clustering.asciidoc) 5 | Implementation: Pedro Ruivo, 2019 6 | 7 | # Changes from the original design 8 | 9 | * The `Updaters` are the owners of the `key` and the `primary-owner` is responsible to send 10 | the updates to the remote site. 11 | 12 | * The *Conflict Resolution* was re-done, and it uses versioning (`vector-clock`) to detect conflicts and duplicates. 13 | It will be described later in this document. 14 | 15 | # Overview 16 | 17 | This section describes a short overview of the algorithm. 18 | 19 | When a `key` is updated (doesn't matter if it is a transaction or not), all the owners keeps track the `key` by 20 | storing it in a queue. 21 | Periodically, the `primary-owner` iterates over all `keys` queued and sends its value to the remote site. 22 | To fetch the values, it peeks the `DataContainer` and `CacheWriter`. 23 | If the value isn't found, it sends a `remove` request to the remote site, 24 | otherwise an update request with the `key` and its value. 25 | 26 | After a successful replication (and `ack` is sent back from the remote site), the `primary-owner` sends a `cleanup` 27 | request to the `backup-owners`. 28 | They drop the `key` from the queue. 29 | 30 | Note that, the `lock` for the `key` isn't acquired during this process. 31 | 32 | # Conflict Resolution 33 | 34 | The conflict happens when two or more sites update the same `key` concurrently. 35 | It uses `vector-clocks` to detect such scenarios and perform the conflict resolution to decide which 36 | update stays and which update is discarded. 37 | 38 | The vector clock has the following structure: 39 | 40 | ``` 41 | { 42 | site-1: [topology, version] 43 | site-2: [topology, version] 44 | ... 45 | } 46 | ``` 47 | 48 | The `site-1` and `site-2` are the site's names as configured in `RELAY2`. 49 | 50 | The `topology` is a number associated to the topology changes for that site. 51 | When a node joins/leaves/crashes, the `topology` increases and the `version` resets to `0`. 52 | 53 | The `version` is a number that increments on each update for a segment. 54 | To improve the concurrency and reduce the memory footprint, the `version` is associated to a segment, 55 | instead of having one per `key`. 56 | 57 | *Note:* The `clear` operation is destructive, and it doesn't generate any new `version` neither checks for conflicts! 58 | 59 | ### Version comparator 60 | 61 | To define the order for a `[topology, version]` pair, it compares the elements. 62 | The `topology` is compared first and, if `p1.topology` < `p2.topology`, than pair `p1` is less than pair `p2`. 
63 | If the `topology` is the same, then `version` with the same logic as above to determine the ordering. 64 | Examples: 65 | 66 | ``` 67 | [1,10] is less than [2,0] 68 | [1,11] is higher than [1,10] 69 | ``` 70 | 71 | Finally, to compare `vector-clocks`, it compares the pairs for each site. 72 | If for all sites, the pair is lower, then the `vector-clock` is less than the other. 73 | On other hand, if for some sites the pairs are less and for other the pair is higher, then we have a conflict. 74 | Example: 75 | 76 | .vector-clock 1 77 | ``` 78 | { 79 | "LON": [1,1] 80 | "NYC": [0,0] 81 | } 82 | ``` 83 | 84 | .vector-clock 2 85 | ``` 86 | { 87 | "LON": [0,0] 88 | "NYC": [1,1] 89 | } 90 | ``` 91 | 92 | 93 | ### Resolving conflicts 94 | 95 | The current conflict resolution algorithm uses the site names to resolve the conflict. 96 | It uses the lexicographic order to determine which update wins. 97 | In the example above, the update from site `LON` will be applied, and the update from site `NYC` will be discarded. 98 | 99 | The future plan includes a SPI so users can do their own conflict resolution. 100 | See [ISPN-11802](https://issues.redhat.com/browse/ISPN-11802). 101 | 102 | ### Tombstones 103 | 104 | When a `key` is removed, it stores a `tombstone` with the `vector-clock` corresponding to that update operation. 105 | The `primary-owner` sends the `tombstone` to the remote site, together with the `remove` request, 106 | in order the remote site to perform a correct conflict resolution. 107 | 108 | # Implementation details 109 | 110 | ## Classes 111 | 112 | ### `IracManager` 113 | 114 | It is responsible to keep track the `keys` updated and to send them to the `remote site`. 115 | 116 | ### `IracVersionGenerator` 117 | 118 | It is responsible to generate the `vector-clock` and to store the `tombstones`. 119 | 120 | ### `*BackupInterceptor` 121 | 122 | It intercepts the `WriteCommand` and notified the `IracManager` with the `keys` modified by them. 123 | 124 | ### `*IracLocalInterceptor` 125 | 126 | It intercepts the local updates, and it does the following: 127 | 128 | * In the `primary-owner` generates a new `vector-clock` and stores it in the `WriteCommand` for the `backup-owners`. 129 | * In the `backup-owners` just fetches the `vector-clock` from the `WriteCommand` and store it. 130 | 131 | It interacts with the `IracVersionGenerator`. 132 | 133 | ### `*IracRemoteInterceptor` 134 | 135 | It intercepts the updates from the remote site and the `primary-owners`performs the conflict resolution. 136 | 137 | ## Transactions 138 | 139 | The transactional caches need some special handling. 140 | They have the following issues: 141 | 142 | * The `vector-clock` must be generated during the `prepare` phase and not when the `WriteCommand` is executed. 143 | * The conflicts must be detected before the transaction reaches it final outcome. 144 | * The transaction originator may not be the `primary-onwer` for a particular `key` and 145 | only the `primary-onwer` must generate the `vector-clock`. 146 | 147 | ### Optimistic Transactions 148 | 149 | The Optimistic Transactions commits in 2 phases, make it easer to target the issues above. 150 | During the prepare phase the `primary-onwer` generates the `vector-clock` and sends it back to the originator. 151 | 152 | The new `vector-clocks` is sent together with the `CommitCommand` to all the `backup-onwers`. 153 | 154 | When an update from the remote site is received, it is wrapper in a transaction. 
155 | This transaction only contains a single `WriteCommand`, and the `primary-owner` checks for conflict 156 | during the prepare phase. 157 | If a conflict is detected and the update needs to be discarded, it aborts the transaction. 158 | 159 | ### Pessimistic Transactions 160 | 161 | The Pessimistic Transactions commits in one phase and, it requires a bit of work to handle IRAC properly. 162 | 163 | The `primary-owner` acquires lock for a `key` during the transaction execution. 164 | So, after the lock request, the originator sends a request to obtain a new `vector-clock` for that `WriteCommand`. 165 | It sends the request asynchronously and, it only waits for the reply, just before committing the transaction. 166 | 167 | Conflict resolution requires a little more work. 168 | To avoid complicate the code dramatically, when an update from remote site is received 169 | it must be forwarded to the `primary owner`. 170 | It is wrapped in the transaction and the originator (`primary-owner`) has all the data needed to check for conflicts. 171 | 172 | ## Topology Changes 173 | 174 | The `topology` element in the `[topology, version]` pair is here to provide consistency in case of topology changes, 175 | and it changes every time a topology change. 176 | 177 | When the `primary-owner` crashes, the "new" `primary-owner` will start sending the updates to the remote site. 178 | It may send duplicated updates for `keys` sent but yet not confirmed by the previous `primary-owner`. 179 | With the `vector-clock`, the duplicated are detected and ignored. 180 | 181 | Finally, the state transfer sends the pending `keys` and `tombstone` for new nodes that become owners for a segment. 182 | 183 | -------------------------------------------------------------------------------- /Handling-cluster-partitions.md: -------------------------------------------------------------------------------- 1 | # Problem 2 | It is possible and should be expected for a cluster to form multiple partitions (aka. [split brain](http://en.wikipedia.org/wiki/Split-brain_(computing)). E.g. if the cluster has the initial topology {A,B,C,D,E}, because of a network issue (e.g. router crash) it is possible for the cluster to split into two partitions {A,B,C} and {D,E}. Assuming DIST mode with _numOwners=2_, both partitions end up holding a subset of data and can individually be in inconsistent state. 3 | Partitions cause inconsistencies: if in the original cluster the key _K_ is owned by nodes D and E, after the split brain the {A,B,C} partition considers _K_ as null. 4 | At the moment Infinispan allows the users to be notified in the eventuality of a partition by registering [ViewChanged](http://docs.jboss.org/infinispan/6.0/apidocs/index.html?org/infinispan/notifications/class-use/Listener.html) listers but it doesn't offer any support for the user to react to the partition e.g. by making the cluster as inactive in order to avoid inconsistent reads. 5 | This wiki page documents a solution we consider for better handling of cluster partitions. 6 | 7 | # Possible approaches 8 | There are several approaches that can be taken for either mitigating or (eventually) solving the partitioning problem: 9 | 10 | 1. Redundant infrastructure 11 | * using two (or more) physical networks infrastructures the cache for partitions to happen can be reduced significantly. E.g. if the cluster is connected by two networks, each network having an availability rate of 99%, then the overall availability of the system is 99.99%. 
This redundancy can be configured at OS level through [IP bonding](http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces-nic-into-single-interface.html) and doesn't require any additional Infinispan/JGroups configuration 12 | * note that this approach, whilst feasible for many situations doesn't entirely avoid the possibility for the partition to happen 13 | 2. Partition merging 14 | * in this approach the partitions can progress individually accepting read/writes from the user application (might cause inconsistencies as described above). When the two partition discover each other as a result of the network healing, the state of the two partitions is merged. There are several approaches to merge the state: e.g. automatic (would require each entry to be versioned) or pass the merging logic to the application. 15 | * note that when the two partition run in parallel the data is inconsistent(AP from the [CAP](http://en.wikipedia.org/wiki/CAP_theorem) theorem) 16 | 3. Primary partition 17 | * both partitions make a deterministic decision on which partition to stay active and which one to go in read-only mode (or even stop serving users entirely). The decision can be made based on quorum, e.g. the partition having _numMembers/2 + 1_ nodes to win (or both to loose if a deterministic decision cannot be made, e.g. for even clusters). 18 | * when the network heals, the "loosing" partition can merge into the active partition (which might have been modified in between) by wiping out its state and re-fetching it (state transfer) 19 | 20 | # Our approach 21 | In the first iteration we plan to enhance the support for partition detection and allow the user to react to a partition happening (custom policy) by making a partition as UNAVAILABLE (stop answering users' request), READ_ONLY or AVAILABLE. This is along the lines of item 3 (Primary partition) as described in the previous section. 22 | 23 | # Design 24 | 25 | ## New API and Config 26 | The partition handling policy is configured in the _availability_ section of the global configuration: 27 | ```xml 28 | 29 | 30 | 31 | 35 | 36 | 40 | 41 | 42 | ... 43 | 44 | ``` 45 | 46 | ```java 47 | interface PartitionContext { 48 | /** 49 | * returns the list of members before the partition happened. 50 | */ 51 | View getPreviousView(); 52 | 53 | /** 54 | * returns the list of members as seen within this partition. 55 | */ 56 | View getCurrentView(); 57 | 58 | /** 59 | * Returns true if this partition might not contain all the data present in the cluster before 60 | * partitioning happened. E.g. if numOwners=5 and only 3 nodes left in the other partition, then 61 | * this method returns false. If 6 nodes left this method returns true. 62 | * Note: in future release for distributed caches, this method might do some smart computing based on 63 | * segment allocations, so even if > numOwners left, this method might still return true. 64 | */ 65 | boolean isDataLost(); 66 | 67 | /** 68 | * Marks the current partition as read only (writes are rejected with an AvailabilityException). 69 | **/ 70 | void currentPartitionReadOnly(); 71 | 72 | /** 73 | * Marks the current partition as available or not (writes are rejected with a 74 | * AvailabilityException). 
75 | **/ 76 | void currentPartitionAvailable(boolean available); 77 | } 78 | ``` 79 | 80 | ```java 81 | interface PartitionHandlingStrategy { 82 | /** 83 | * Implementations might query the PartitionContext in order to determine if this is the primary 84 | * partition, based on quorum and mark the partition unavailable/readonly. 85 | **/ 86 | void handlePartition(PartitionContext pc); 87 | } 88 | ``` 89 | 90 | ## Implementation details 91 | * a new `AvailabilityInterceptor` is added, having 3 states: available, readOnly, unavailable. Based on its state the interceptor might allow, reject the writes or respectively reject all operations to the cache 92 | * when an operation is rejected a custom exception exception is thrown to the user indicating the fact that the partition is not available (AvailabilityException) 93 | * the `PartitionContext.currentPartitionAvailable` and `PartitionContext.currentPartitionReadOnly` methods set the state of the `AvailabilityInterceptor` and are invoked by configured `PartitionHandlingStrategy` implementation 94 | * the status of `AvailabilityInterceptor` is to be exposed through JMX operations as well (read/write) 95 | * we might also provide an @Merge listener implementation to automatically merge a primary partition with an secondary (unavailable) partition by making the later wipe out it state and re-fetch it from the former. This is a useful auto-healing tool for situations where the partitioning doesn't happen because of an network error but because of e.g. a long GC on an isolated node. 96 | 97 | ## To be further considered 98 | The partition handling strategy is intended for the whole cache manager. Wouldn't it make more sense to have a per cache strategy? E.g. a certain cache might not even be affected by a partition (e.g. if asymmetric clusters are used). 99 | 100 | #Related 101 | * JIRA: [ISPN-263] (https://issues.jboss.org/browse/ISPN-263) 102 | * An interesting [mail discussion](http://infinispan.markmail.org/search/#query:+page:3+mid:qanjofgmpyvdjnmg+state:results) around the subject -------------------------------------------------------------------------------- /Multi-tenancy-for-Hotrod-Server.asciidoc: -------------------------------------------------------------------------------- 1 | Context 2 | ~~~~~~~ 3 | 4 | In cloud enabled environments it would be beneficial to run Infinispan as a Service. Unfortunately current implement lacks multi tenancy and requires spinning separate Hot Rod server per tenant. 5 | 6 | This design doc addresses those concerns and the implementation should allow running Infinispan in the Cloud environment and serving data for multiple tenants. 7 | 8 | Current implementation 9 | ~~~~~~~~~~~~~~~~~~~~~~ 10 | 11 | Currently in Infinispan 8/9 we have some sort of multi tenancy implementation - multiple `cache-containers`s in https://github.com/infinispan/infinispan/blob/278597ce4864e9e857ef5ab2650af5c08badae9d/server/integration/infinispan/src/main/resources/schema/jboss-infinispan-core_8_2.xsd#L39-L39[Infinispan Subsystem]. The problematic part is Infinispan Endpoint, which expects https://github.com/infinispan/infinispan/blob/614e35f3927f2c73b4d24703ef1d9ba0dd40fb39/server/integration/endpoint/src/main/resources/schema/jboss-infinispan-endpoint_8_0.xsd#L26-L26[a single cache-container per server]. 12 | 13 | In other words - it is possible to have a multi tenant Hot Rod server now, but each `cache-container` would have to be served on separate port. 
14 | 15 | Core Infinispan doesn't support multi tenancy (because of symmetry, it parses configuration file which looks like server side xml and it really can't handle multiple `cache-container` elements). 16 | 17 | Finally the Hot Rod protocol also assumes that it connects to a single `cache-container`. 18 | 19 | Scope of changes 20 | ~~~~~~~~~~~~~~~~ 21 | 22 | Our standard use cases don't need to be extended with multi tenancy (library mode). It is only a matter of implementing this functionality on the server side. Having said that, the changes should be limited to: 23 | 24 | * Hot Rod Server - it needs to serve data for multiple tenants on a single endpoint 25 | * Hot Rod Clients - they need to be able to connect to different tenant 26 | * Rest and Hot Rod protocol. Memcached server and Web Socket servers are out of the scope. 27 | 28 | The implementation idea - Multi Tenant Router 29 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 30 | 31 | The main idea for the implementation is to create another layer on the top of Protocol Servers (REST, Hot Rod), which is responsible for routing requests to proper server. The router will be attached to endpoint (instead of a Protocol Server instance) and will accept incoming requests and forward them to the proper server. 32 | 33 | Implementing the Router as a new service attached to the endpoint has the advantage of backwards compatibility in terms of configuration. In other words, the administrators will still be able to attach Protocol Server to the endpoint as they did prior to Infinispan 9 but additionally they could use the router. 34 | 35 | Configuration 36 | ~~~~~~~~~~~~~ 37 | 38 | Server side configuration will look like this: 39 | 40 | ``` 41 | 42 | 43 | 44 | ... 45 | 46 | 47 | ... 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | ``` 68 | 69 | The Hot Rod client will use TLS/SSL SNI mechanism to inform the server about chosen tenant. Here is an example: 70 | 71 | ``` 72 | builder 73 | .security() 74 | .ssl() 75 | .sslContext(cont) 76 | .sniHostName("sni") 77 | .enable(); 78 | ``` 79 | 80 | Implementation details 81 | ~~~~~~~~~~~~~~~~~~~~~~ 82 | 83 | The router will be bootstrapped the same way as all other Protocol Servers - using Netty. It will use the following pipeline: 84 | 85 | ``` 86 | +--------------------+ 87 | | Request | 88 | +--------------------+ 89 | | 90 | V 91 | +--------------------+ 92 | | SNI | 93 | +--------------------+ 94 | | From this handler down we can read the content. 95 | V Up to this point it can be encrypted. 96 | +--------------------+ 97 | | Router | 98 | +--------------------+ 99 | | After the router decided where to send incoming message 100 | V it attachs Protocol Server's handlers below. 101 | +--------------------+ 102 | | Chosen server | 103 | | handler stack | 104 | +--------------------+ 105 | | 106 | V 107 | +--------------------+ 108 | | Exception | 109 | +--------------------+ 110 | ``` 111 | 112 | For HotRod, the router will determine target tenant based on SNI host name. Use cases without SSL are currently out of scope and if implemented in the future - will require additional operation in the Hot Rod Protocol. 113 | 114 | The router implementation for REST protocol will use Path prefixes. There is also possibility to use `host` header but currently it's out of the scope. 115 | 116 | Memcached and WebSocket are out of the scope. 
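
To make the routing step concrete, below is a rough sketch of the router handler using plain Netty. It assumes an earlier handler in the pipeline (for example Netty's `SniHandler`, or a REST path parser) has already stored the tenant key in a channel attribute; the attribute key and the route table are hypothetical, not the actual implementation.

```
import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.AttributeKey;
import io.netty.util.ReferenceCountUtil;

import java.util.Map;

/**
 * Illustrative sketch of the router: once the tenant is known, install the chosen
 * protocol server's handler stack and remove the router from the pipeline.
 * The attribute key and route table are hypothetical names.
 */
public class TenantRouterHandler extends ChannelInboundHandlerAdapter {

   static final AttributeKey<String> TENANT_KEY = AttributeKey.valueOf("tenant"); // e.g. the SNI host name

   private final Map<String, ChannelHandler[]> routeTable; // tenant -> protocol server handlers

   public TenantRouterHandler(Map<String, ChannelHandler[]> routeTable) {
      this.routeTable = routeTable;
   }

   @Override
   public void channelRead(ChannelHandlerContext ctx, Object msg) {
      String tenant = ctx.channel().attr(TENANT_KEY).get();
      ChannelHandler[] serverHandlers = tenant == null ? null : routeTable.get(tenant);
      if (serverHandlers == null) {
         // Unknown tenant: drop the message and reject the connection.
         ReferenceCountUtil.release(msg);
         ctx.close();
         return;
      }
      // Attach the chosen server's handlers after this one, step aside, and let the
      // first request flow into the newly installed stack.
      ctx.pipeline().addLast(serverHandlers);
      ctx.pipeline().remove(this);
      ctx.fireChannelRead(msg);
   }
}
```

The same shape works for both protocols: for Hot Rod the lookup key is the SNI host name, while for REST it would be the path prefix.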
117 | 118 | Note that existing Protocol Servers would need to be adjusted to support the router - they should not start the Netty server if there is no host/port binding. They should start Cache Managers instead. 119 | 120 | Note that this effectively means that each type of Protocol Server will spin up its own Router. 121 | 122 | Alternative approach 123 | ~~~~~~~~~~~~~~~~~~~~ 124 | 125 | Embed routing logic into the Hot Rod, REST and Memcached (to be considered if it's worth it) servers. 126 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 127 | 128 | This approach would require modifying `ProtocolServer` and its configuration (`ProtocolServerConfiguration`) and putting multiple `EmbeddedCacheManager`s into it. We would also need to put additional `RoutingHandler` into the Netty configuration which would select proper `CacheManager` for the client. Each Multi Tenant `CacheManager` would possibly share the same transport (this is the easiest way for implementing it). 129 | 130 | What do we gain by this? 131 | 132 | * We won't introduce another component on server side 133 | * The implementation will be simpler 134 | 135 | What are the pain points? 136 | 137 | * We will have `if(isMultiTenant())` statements all over the place 138 | * Since we host multiple `CacheContainer`s inside a single `ProtocolServer`, using different configuration (e.g. container 1 uses SSL whereas container 2 does not) might be problematic. 139 | 140 | Questions to confirm 141 | ~~~~~~~~~~~~~~~~~~~~ 142 | 143 | * Perhaps we can enforce all our clients to use SSL+SNI? If so - the routing protocol could be removed. 144 | ** Update: Yes, we would like to enforce all clients to use SSL+SNI for now (possibly we adjust this in the future) 145 | * How to dynamically add a new Protocol Server to existing configuration? Does current implementation support this? 146 | ** Yes, DMR supports it 147 | * Should we allow switching tenant once the client was started? How does this should work together with Near Caching (I'm assuming we should flush everything)? 148 | ** Currently out of the scope 149 | * How to protect against scanning for non-protected tenants in Cloud Environment? This could be potentially used as an attack vector. 150 | ** Since we use HTTPS, we should be good here 151 | 152 | 153 | -------------------------------------------------------------------------------- /Infinispan-query-language-syntax-and-considerations.asciidoc: -------------------------------------------------------------------------------- 1 | Authors: Adrian Nistor, Emmanuel Bernard 2 | 3 | == Syntax 4 | 5 | The basic syntax is a cleanup JP-QL + the embedding of the Lucene query syntax. 6 | 7 | TODO: describe the syntax 8 | 9 | The following sections enhance the syntax with more advanced full-text query support. 10 | 11 | === Sort 12 | 13 | Full-text search allow sorting by score, by order in the index and by distance. 14 | 15 | [SOURCE] 16 | ---- 17 | select u from User u where u.firstname : "Emmanuel" 18 | order by u.lastname, score(), index() 19 | ---- 20 | 21 | We use a function approach to differentiate properties from the full-text search operators. 22 | 23 | One can also order by distance from a given point. 
24 | 25 | [SOURCE] 26 | ---- 27 | # when a single spatial predicate is present 28 | select u from User u 29 | where u within (5 km) of (:latitude, :longitude) 30 | order by distance() 31 | 32 | # when no or several spatial predicates are present, we need to express the distance and point to distance from 33 | # option 1 34 | select u from User u 35 | order by distance(u, :latitude, :longitude) 36 | # option 2 37 | select u from User u 38 | order by distance(u) from(:latitude, :longitude) desc 39 | ---- 40 | 41 | === Spatial 42 | 43 | [SOURCE] 44 | ---- 45 | # here we use the default @Spatial index (Hibernate Search style 46 | # the latitude/longitude fields an driven by the annotation metadata 47 | select u from User u 48 | where u within (5 km) of (:latitude, :longitude) 49 | 50 | # here we use an explicit spatial name (either the coordinates property 51 | # or a synthetic property representing the tuple latitude, longitude 52 | select u from User u 53 | where u.location within (5 km) of (:latitude, :longitude) 54 | 55 | # here we use the actual latitude and longitude properties explicitly 56 | # this is not the Hibernate Search style, so one would need to find the 57 | # spatial index from these properties. 58 | # or a synthetic property representing the tuple latitude, longitude 59 | select u from User u 60 | where (u.address,latitude, u.address.longitude) within (5 km) of (:latitude, :longitude) 61 | 62 | ---- 63 | 64 | The third option does not have my preference but felt more natural initially. 65 | Both variation 1 and 2 are the one targeted. 66 | 67 | [NOTE] 68 | ==== 69 | I'm not happy about the confusion of latitude vs longitude (ordering). 70 | Which one comes first, could we have a syntax making this explicit. 71 | Apparently, latitude, then longitude is the standard. 72 | Maybe that's goo enough? 73 | ==== 74 | 75 | === All except 76 | 77 | This is a way to negate a (list of) full-text predicate meaning get all entries except the ones matching the sub predicates. 78 | 79 | [source] 80 | ---- 81 | # Preferred as it's more natural in the language 82 | select t from Transaction t 83 | where not ( t.status : +"failed" or t.status : +"blocked" ) 84 | 85 | # Alternative closer to the Hibernate Search Query DSL 86 | select t from Transaction t 87 | where all except ( t.status : +"failed" or t.status : +"blocked" ) 88 | ---- 89 | 90 | === Boosting and constant score 91 | 92 | One can boost per term or per field. 93 | One can also force a score to be ignored or made constant for a subtree of the full-text queries. 
94 | 95 | [source] 96 | ---- 97 | # term boosting 98 | # TODO check term boosting with Gustavo, does Hibernate Search supports term boosting 99 | select u from User u 100 | where u.firstname : ("Adrian"^3 "Emmanuel") 101 | 102 | # field boosting option 1 (DO NOT IMPLEMENT) 103 | select u from User u 104 | where u.firstname^3: ("Adrian" "Emmanuel") 105 | # field boosting option 2 106 | select u from User u 107 | where u.firstname: ("Adrian" "Emmanuel")^3 108 | 109 | # sub query boosting based on option2 110 | select u from User u 111 | where (u.firstname : "Adrian")^3 OR (u.firstname : "Emmanuel") 112 | ---- 113 | 114 | Now let's tackle constant score 115 | 116 | 117 | [source] 118 | ---- 119 | select u from User u 120 | where ( (u.firstname : "Adrian")^3 OR (u.firstname : "Emmanuel") ) 121 | and (u.lastname : ("Nistor" "Bernard")) )^[constant=10] 122 | ---- 123 | 124 | === Analyzer 125 | 126 | It is sometimes necessary to force a different analyzer between query time and index time. 127 | 128 | 129 | [source] 130 | ---- 131 | select u from User u 132 | where u.firstname : "Emmanuel" with analyzer "ngram" 133 | and u.lastname_3gram : ("ber" "rna" "nar" "ard")^6 with no analyzer 134 | ---- 135 | 136 | this syntax `with (no) analyzer` can only be present after a predicate and not on a composed query. 137 | 138 | === More like this 139 | 140 | This compares an entity and expect to find similar entities. 141 | 142 | We will pass the entity or its key as parameter. 143 | This means Hot Rod clients need to extract the entity type to know the protobuf schema to then properly serialize the parameter payload. 144 | 145 | [NOTE] 146 | .TODO: General question on parameters 147 | ==== 148 | Adrian, do you plan on *guessing* the parameter type from the protobuf of the targeted entity? 149 | If not, how do you plan on serializing parameter values with the query? 150 | ==== 151 | 152 | [source] 153 | ---- 154 | # Compare to a user instance (not necessarily persisted) 155 | select u from User u 156 | where u like :user 157 | comparing (u.lastname^3, u.firstname) 158 | with options (favorSignificantTermsWithFactor=3, excludeEntityUsedForComparison=true) 159 | 160 | # Compare to a user instance stored on a given key in the grid 161 | select u from User u 162 | where u likeByKey :key 163 | # we compare all fields since we did not give the compare keyword 164 | with options (favorSignificantTermsWithFactor=3, excludeEntityUsedForComparison=true) 165 | ---- 166 | 167 | [NOTE] 168 | .Many options and workaround for it 169 | ==== 170 | Some full-text options require a bunch of fine-tuning options which would be hard to embed int he syntax unless we offer a generic system. 171 | more like this offers a possible solution 172 | 173 | [source] 174 | ---- 175 | where u.property someMagicFullTextSearchOperator [some values or parameters] with options (option1=value1, option2=value2) 176 | ---- 177 | 178 | In this model, options are generic key/values deemed less important and not requiring a keyword. 179 | This could be useful for things like boolean query options like `minimumNumberShouldMatch`, dismax query etc. 180 | ==== 181 | 182 | === Explore DisMax 183 | 184 | Dismax is like a boolean query except the score of matching document is computed differently, it takes the best of the score of the document from all of the subqueries. 
185 | https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-dis-max-query.html 186 | https://lucidworks.com/blog/2010/05/23/whats-a-dismax/ 187 | 188 | Elasticsearch and Solr expose different approach to use DisMaxQuery and exposing it differently to the user. 189 | 190 | [source] 191 | ---- 192 | select u from User u 193 | where u.fistname: " 194 | ---- 195 | 196 | === Meta thinking 197 | 198 | Express all full-text things as function calls that can be nested: 199 | 200 | * AND(query1, queryn), OR 201 | * within(,,,) 202 | * morelikethis 203 | * fuzzy(field, fuzzyFactor) 204 | * etc 205 | 206 | And have a flat syntactic sugar to replace them (or a subset of their usage) 207 | 208 | === Remaining syntax TODOs 209 | 210 | * discuss generic options (see above) 211 | * how to do the additional fuzzy options (since fuzzy is not a keyword but is embedded in the Lucene syntax 212 | 213 | == Query features around the syntax 214 | 215 | Both are operations atop a query that return additional and specific informations 216 | 217 | * explain 218 | * faceting 219 | 220 | 221 | == Security 222 | 223 | At the implementation detail, we need to ensure we are not susceptible to DoS attack from a rogue client query. 224 | Possible ideas: 225 | 226 | * stop if the query payload looks too big and could lead to a huge memory consumption upon parsing 227 | * TODO: what else --------------------------------------------------------------------------------