├── .gitignore ├── SecurityRolesPermissions.png ├── Design-Wiki-Rules.md ├── Hot-cache-via-Debezium.asciidoc ├── Remote-Admin-Client-Library.md ├── InfinispanAPIObject.adoc ├── Spring-5-features,-ideas-and-integration.md ├── Cluster-Registry.md ├── Alias-Caches.md ├── XSite-Failover-for-Hot-Rod-clients.md ├── DataStore.adoc ├── Dynamic-JMX-exposer-for-Configuration.md ├── Fine-grained-security-for-caches.md ├── Graceful-shutdown-&-restore.md ├── Conflict-resolution.md ├── TopologyId.asciidoc ├── Clustered-cache-configuration-state.asciidoc ├── Kubernetes-CLI.adoc ├── Task-Execution-Design.md ├── Smoke-Testsuite.md ├── Cache-Store-Subsystems.md ├── Create-Cache-over-HotRod.md ├── TransportSecurity.asciidoc ├── Lock-Reordering-For-Avoiding-Deadlocks.asciidoc ├── Authorization.adoc ├── CacheMultimap.adoc ├── Custom-Cache-stores-(deployable).md ├── README.md ├── Server Restructuring.md ├── Distributed-Stream-Sorting.md ├── Infinispan-CLI.asciidoc ├── Near-Caching.md ├── Off-Heap-Data-Container.md ├── Total-Order-non-Transactional-Cache.md ├── Multimap-As-A-First-Class-Data-Structure.asciidoc ├── Off-Heap-Implementation.md ├── Deelog:-direct-integration-with-Debezium.asciidoc ├── Incremental-Optimistic-Locking.asciidoc ├── Remote-Iterator.md ├── scaling-without-state-transfer.asciidoc ├── cluster-backup-tool.md ├── Asymmetric-Caches-and-Manual-Rehashing-Design.asciidoc ├── Schemas-API.adoc ├── Optimistic-Locking-In-Infinispan.asciidoc ├── ServerNG.md ├── Single_port.adoc ├── Scattered-Cache-design-doc.md ├── Remote Command Handler.md ├── Clustered-listeners.md ├── A-continuum-of-data-structure-and-query-complexity.asciidoc ├── Conflict-resolution-perf-improvements.md ├── RAC-implementation.md ├── Handling-cluster-partitions.md ├── Multi-tenancy-for-Hotrod-Server.asciidoc └── Infinispan-query-language-syntax-and-considerations.asciidoc /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | *.iml 3 | .eclipse 4 | .project 5 | 6 | -------------------------------------------------------------------------------- /SecurityRolesPermissions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/infinispan/infinispan-designs/HEAD/SecurityRolesPermissions.png -------------------------------------------------------------------------------- /Design-Wiki-Rules.md: -------------------------------------------------------------------------------- 1 | The design wiki is meant to hold design documentation. 2 | 3 | # Writing effective designs 4 | Designs should be written in Markdown (supports syntax highlight) or AsciiDoc (use external syntax highlighter such as [gist] (https://gist.github.com) until github [fixes](https://github.com/gollum/gollum/issues/280) the highlighting). 5 | 6 | # What to include 7 | Try and make sure your design docs have references to JIRAs, as well as versions affected and target versions. 8 | 9 | # Images and diagrams 10 | Pictures speak a thousand words. Use [Google Drive](http://drive.google.com) to draw and host your images, and link to them from here. -------------------------------------------------------------------------------- /Hot-cache-via-Debezium.asciidoc: -------------------------------------------------------------------------------- 1 | Use Debezium to keep the Infinispan cache data up to date compared to the database structure. 
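To make the embedded-mode idea concrete, below is a minimal sketch of a feeder that applies Debezium change events to a cache. It assumes Debezium's embedded engine API (`DebeziumEngine`, `ChangeEvent`, the JSON format class) and a hypothetical `customers` table-to-cache mapping; connector properties, key handling and offset storage are placeholders to be settled by the questions below.

[source,java]
----
import java.util.Properties;
import java.util.concurrent.Executors;

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class HotCacheFeeder {
   public static void main(String[] args) {
      Cache<String, String> cache = new DefaultCacheManager().getCache("customers");

      Properties props = new Properties();
      props.setProperty("name", "hot-cache-feeder");
      props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
      props.setProperty("offset.storage", "org.apache.kafka.connect.storage.MemoryOffsetBackingStore");
      // ...database coordinates (hostname, user, password, ...) omitted...

      // Apply each change event to the cache: tombstones/deletes remove the entry,
      // inserts and updates overwrite it with the latest row image.
      DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
            .using(props)
            .notifying(event -> {
               if (event.value() == null) {
                  cache.remove(event.key());
               } else {
                  cache.put(event.key(), event.value());
               }
            })
            .build();

      Executors.newSingleThreadExecutor().execute(engine);
   }
}
----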
2 | 3 | Design questions and choices to be debated: 4 | 5 | * no Kafka infrastructure, that means Debezium should be in embedded mode in JDG 6 | * decide of the Debezium topology: one instance on the coordinator (or a fixed position in the hash wheel) 7 | * deciding of a clean stop for Debezium and a fresh start on a new node when a topology change occurs. 8 | * decide how the Debezium internal states are stored (dedicated cache?, started automatically and transparently?) 9 | * decide what Infinispan makes from these Debezium events and how that is configured 10 | ** all via configuration with a table to cache transfert 11 | ** have it programmatically pluggable to add the conversion logic? 12 | * how should the end value look like: Protobuf autogenerated from the Table structure? From the DDL stream Debezium offers? What about plain Java? 13 | -------------------------------------------------------------------------------- /Remote-Admin-Client-Library.md: -------------------------------------------------------------------------------- 1 | We should provide a client library which is capable of performing remote administration operations using the DMR. 2 | This would be used in a number of places: as a general remote admin client, as a helper for remote JCache, to allow Teiid to perform alias switching in remote configurations. 3 | 4 | ## Functionality 5 | The client should provide a subset of management operations which are available within a server: 6 | 7 | * creation/removal of caches 8 | * creation/removal of cache configurations 9 | * cache lifecycle management (start/stop) 10 | * switching cache aliases 11 | * etc 12 | 13 | ## API 14 | If possible we should reuse the embedded configuration API. 15 | 16 | ## Security 17 | The client will use the security policies enabled in the management security domain of the server. 18 | 19 | ## Packaging 20 | The remote admin client should be packaged as a separate jar from the infinispan-remote jar: infinispan-remote-admin-$VERSION.jar. This uberjar would contain the necessary jboss client, xnio and other dependencies required to communicate with the admin endpoint. -------------------------------------------------------------------------------- /InfinispanAPIObject.adoc: -------------------------------------------------------------------------------- 1 | == Infinispan API 2 | 3 | === Motivation 4 | 5 | Being able to offer a simple API to create clustered caches and improve API usability 6 | 7 | 8 | == Infinispan 9 | 10 | This factory will help to create a cluster of Infinispan. 11 | 12 | Whenever a new member is added on the VM, the implicit configuration will handle the visibility between members. 13 | 14 | ```java 15 | 16 | public final class Infinispan { 17 | 18 | private Infinispan() {} 19 | 20 | //Constructs and starts a new instance of the CacheManager, using the system defaults. 
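    // Illustrative usage only (not part of the proposed API surface): two calls in the
    // same VM would implicitly form a cluster, and Infinispan.stopAll() tears both down.
    //   EmbeddedCacheManager first = Infinispan.createClusteredCache();
    //   EmbeddedCacheManager second = Infinispan.createClusteredCache();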
21 | public static EmbeddedCacheManager createClusteredCache() {} 22 | 23 | public static EmbeddedCacheManager createClusteredCache(Configuration configuration) {} 24 | 25 | public static EmbeddedCacheManager createClusteredCache(GlobalConfiguration globalConfig, Configuration configuration) {} 26 | 27 | // Return the running CacheManagers on the cluster in the calling VM 28 | public static Set getAllMembers() {} 29 | 30 | // Stop all the members on the VM 31 | public static void stopAll() {} 32 | } 33 | 34 | ``` 35 | 36 | 37 | -------------------------------------------------------------------------------- /Spring-5-features,-ideas-and-integration.md: -------------------------------------------------------------------------------- 1 | Spring 5 will have a couple of [interesting features](https://jira.spring.io/browse/SPR-13574?jql=project%20%3D%20SPR%20AND%20issuetype%20%3D%20%22New%20Feature%22%20AND%20fixVersion%20in%20(%225.0%20M2%22%2C%20%225.0%20M3%22%2C%20%225.x%20Backlog%22)%20ORDER%20BY%20issuetype%20DESC): 2 | 3 | * [CDI support (JSR 330)](https://jira.spring.io/browse/SPR-12211). We may enhance our CDI Extension to work correctly with Spring and propose this configuration for our users. Even though our extension has no hard dependency to Weld (apart from tests) I believe we will need to adjust it a little bit. 4 | * [JCache 2.0 support](https://jira.spring.io/browse/SPR-13574). Implementing this issue will probably require adding few bits into Spring module and verifying if everything works smoothly. 5 | 6 | A few ideas not necessarily related to Spring 5: 7 | * Implement a "bridge" between [Spring tasks](http://docs.spring.io/spring/docs/current/spring-framework-reference/htmlsingle/#scheduling) our [Infinispan distributed tasks](http://infinispan.org/docs/dev/user_guide/user_guide.html#DistributedExecutor). 8 | 9 | Scheduled bigger features: 10 | * [Spring Data-like use case](https://issues.jboss.org/browse/ISPN-1068) 11 | * [Spring starter](https://issues.jboss.org/browse/ISPN-6561) 12 | * [Define how to specify configuration](https://issues.jboss.org/browse/ISPN-3337) 13 | -------------------------------------------------------------------------------- /Cluster-Registry.md: -------------------------------------------------------------------------------- 1 | Currently the ClusterRegistry provides some syntactic sugar around a replicated cache which can be used as an internal repository for service-data. 2 | To differentiate between different scopes and uses, the ClusterRegistry uses a composite key composed of a scope and a key. 3 | While useful, the current design is very limited: 4 | * it doesn't allow configuration of the underlying replicated cache (i.e. persistence, security, eviction, etc) 5 | * some use cases might prefer a more efficient invalidation approach instead of a replicated one 6 | * the wrapping of the key adds an extra object and it makes it cumbersome to expose this cache to users (Remote query would need to) 7 | * many internals avoid using preferring custom solutions (e.g. Map/Reduce temporary caches, Remote Query schema cache) 8 | 9 | The new ClusterRegistry should behave as follows: 10 | * create a dedicated per-service cache instead of a single catch-all cache 11 | * the characteristics of a registry cache are specified by the requiring service 12 | * persistent/volatile 13 | * replication/invalidation 14 | * eviction/expiration 15 | * security 16 | * naming strategy (i.e. 
static name, derived name, etc) 17 | * dependencies (so that a registry cache's lifecycle is bound to another cache) 18 | * custom settings (Remote Query schema cache needs a custom interceptor) 19 | 20 | Potential users of registry caches 21 | * Remote Query schema cache 22 | * Query index caches 23 | * Map/Reduce temporary caches 24 | * Security ACL cache 25 | * ... 26 | 27 | -------------------------------------------------------------------------------- /Alias-Caches.md: -------------------------------------------------------------------------------- 1 | In order to support materialized views, Teiid uses the following technique with relational databases: 2 | - populates a "hidden" table with the new data 3 | - removes the old "visible" table and renames the new "hidden" table to the "visible" name 4 | 5 | To support this in Infinispan, instead of renaming caches, we should support Alias Caches. 6 | 7 | ## Definition 8 | An Alias Cache is a named cache which acts as a delegate to a concrete named cache. 9 | It is configured only with the name of a cache to which it will act as delegate. On top of the standard Cache API it also provides an additional void switchCache(String name) method with which it is possible to change the delegate. Switching is allowed only if the delegate caches mode is compatible, i.e. local <> local, clustered <> clustered. Switching between local and clustered caches is not allowed. Switching a clustered cache will be performed cluster-wide so that all aliases switch at the same time. 10 | 11 | ## Configuration 12 | The declarative configuration, common to both embedded and server, is as follows: 13 | 14 | `<alias-cache name="alias-cache-name" delegate-cache="delegate-cache-name" /> 15 | 16 | The programmatic configuration, for embedded mode, is as follows: 17 | 18 | `ConfigurationBuilder builder = new ConfigurationBuilder();` 19 | `builder.aliasCache("delegate-cache-name");` 20 | `cacheManager.defineConfiguration("alias-cache-name", builder.build());` 21 | 22 | ## Management 23 | In embedded mode the switch operation is exposed via JMX. In server mode, the switch operation is exposed via a management operation. 24 | 25 | -------------------------------------------------------------------------------- /XSite-Failover-for-Hot-Rod-clients.md: -------------------------------------------------------------------------------- 1 | This wiki describes the design of the cross-site failover for Hot Rod clients ([ISPN-5520](https://issues.jboss.org/browse/ISPN-5520)). 2 | 3 | Hot Rod clients currently support failover to nodes within the cluster to which the client is connected. To support that, Hot Rod servers send topology information to clients as part of the responses as topology changes happen. When an operation to a clustered node fails due to transport or cluster rebalancing issues, the client automatically retries the operation in a different cluster node. This node is elected using a configured load balance policy. 4 | 5 | To add basic cross-site failover support, the following changes are required: 6 | 7 | * Client configuration needs to be enhanced to have 0-N cross-site static configuration, where each cross-site configuration would have 1-N host information of nodes in that site. 8 | * In its simplest form, a Hot Rod client should failover to the nodes defined in cross-site failover section, if all nodes in the main cluster have failed (after a complete retry). 
The first site where nodes are available becomes the the client's main site, working as usual with the topology of this site. 9 | 10 | Cross-site failover could be improved further if clients would be able to failover when the site is offline while still being accessible remotely. To achieve this, there'd need to be a way for the server to tell the clients that the site is offline. The simplest way to do so would be to add a new response status that indicates that the site is offline, so next time the client sends an operation and gets offline status, it fails over to a different site. -------------------------------------------------------------------------------- /DataStore.adoc: -------------------------------------------------------------------------------- 1 | == Data Store 2 | 3 | In scenarios where Infinispan is used as a persistent data store, the elasticity provided by rebalancing on scaling down (either voluntarily or because of node failure) can lead to data loss, even with persistent caches if all the owners of a segment leave the cluster before rebalancing can be completed. The remaining cluster should prevent writes to the lost segments until the nodes that own them are restarted. 4 | 5 | It should be possible to configure Infinispan so that elasticity only applies when scaling up, i.e. adding a node will cause a rebalance. 6 | 7 | === Partition Handling 8 | 9 | Partition handling should be configured with `when-split=deny_read_writes` so that the nodes prevent reading/writing to the lost segments. 10 | 11 | === Topology 12 | 13 | Rebalancing should have the following configurations: 14 | 15 | * ALWAYS: rebalancing happens both on nodes joining and leaving the cluster. 16 | * SCALE_UP: rebalancing only happens when nodes join. 17 | * NEVER: rebalancing is disabled in all situations. 18 | 19 | Every time a stable topology is installed in the cluster, it should be persisted to the global state. Currently this only happens on shutdown. 20 | 21 | === Restarting nodes 22 | 23 | When a node is restarted it should read the topology stored in the global state and join only if it matches the stable topology of the existing cluster. 24 | 25 | === Open issues 26 | 27 | When the segment owners are missing, an application will not be able to read/write any data belonging to those segments even if it is "new" data (i.e. keys that were not in those segments when they disappeared). 28 | We could create temporary segments on the remaining nodes which would then be merged back into the real owners using the conflict resolution logic. 29 | -------------------------------------------------------------------------------- /Dynamic-JMX-exposer-for-Configuration.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | Currently configuration values are exposed in JMX as a summary list (see CacheImpl#getConfigurationAsProperties). This approach covers only read-only access and some of the Configuration's attributes could have been modified in runtime. This requires a new approach - scanning and exposing Configuration attributes dynamically. 4 | 5 | More information might be found in tickets: 6 | * https://issues.jboss.org/browse/ISPN-5343 7 | * https://issues.jboss.org/browse/ISPN-5340 8 | 9 | # User Perspective 10 | 11 | Configuration attributes will be exposed as a new Node in JMX MBean tree on the same level as Activation, Cache, LockManager etc. 
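From the client side, access goes through plain JMX. A minimal sketch of reading and updating one of these attributes (the `ObjectName` layout is an assumption for illustration; the dotted attribute path follows the example used in the implementation notes below):

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ConfigurationJmxExample {
   public static void main(String[] args) throws Exception {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // Hypothetical ObjectName; the final domain/key layout is part of the implementation work.
      ObjectName cfg = new ObjectName("org.infinispan:type=Cache,name=\"myCache\",component=Configuration");

      // Read a configuration attribute by its path...
      Object current = server.getAttribute(cfg,
            "clusteringConfiguration.asyncConfiguration.replicationQueueInterval");

      // ...and, for attributes that are mutable at runtime, write it back.
      server.setAttribute(cfg, new Attribute(
            "clusteringConfiguration.asyncConfiguration.replicationQueueInterval", 100L));
   }
}
```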
12 | 13 | # Implementation details 14 | 15 | A new Dynamic MBean [1] will be created as a standard Infinispan component and will have a reference to Cache Configuration. During the startup it will scan the Configuration recursively and construct a `Map>`. The map's key will contain path to the attribute, for example: `clusteringConfiguration.asyncConfiguration.replicationQueueInterval` and the map's value will have all necessary information for making reflexive calls (a class and a field and possibly some additional metadata). Those attributes will used to provide MBean metadata and perform reflexive calls in the runtime. 16 | 17 | The newly created Dynamic MBean will be discovered by `ComponentsJmxRegistration` and will be registered in MBean Server. Note that currently this class uses `ResourceDMBean`s for this, so additional scanning for Dynamic MBeans would have to be performed there. 18 | 19 | # FAQ 20 | 21 | Q: Shouldn't we use `ResourceDMBean` directly? 22 | 23 | A: No. `ResourceDMBean` implementation is focused on invoking Managed Operation on a single class level. We need something more - we need to scan a bunch of configuration classes recursively. 24 | 25 | Q: Shouldn't we prepare metadata when parsing classes (using `ComponentMetadataPersister`)? 26 | 27 | A: We could but the parsing logic would have to fit into `ResourceDMBean` then and we would need to do a lot of code refactoring to support it. I'm not sure if it's worth it. 28 | 29 | [1] http://docs.oracle.com/cd/E19698-01/816-7609/6mdjrf83d/index.html -------------------------------------------------------------------------------- /Fine-grained-security-for-caches.md: -------------------------------------------------------------------------------- 1 | We would like to enable fine-grained security to data within caches. 2 | 3 | Two approaches: 4 | 5 | ## Authorization Callback 6 | Via some kind of user-provided callback which would intercept all calls. The callback would receive the key, value and the required permission (READ, WRITE, etc) and allow/deny the operation. 7 | 8 | ### Pros 9 | * no additional memory is required to store the ACL 10 | * can implement per-user checks. 11 | 12 | ### Cons 13 | * requires custom code. 14 | 15 | Proposed interface: 16 | 17 | ``` 18 | public interface AuthorizationCallback { 19 | void authorize(Subject subject, Cache cache, K key, V value, AuthorizationPermission permission) throws SecurityException; 20 | } 21 | ``` 22 | 23 | Global cache operations (i.e. execute, listen, etc) would still go through the existing security checks. 24 | 25 | ## Authorization metadata 26 | Store ACLs within each entry's metadata. This would require adding an additional field to the metadata to store the ACL information. The ACL would consist of: 27 | * the owner of the entry. This needs to be as compact as possible while preserving uniqueness, and we need some kind of global mapper which can extract this information from the source of user information (e.g. a UID, GUID, etc) 28 | * a set of authorization roles (as already defined for the container). Each role would imply specific permissions, so that a user would be allowed access/manipulation of an entry only if it is the owner, an administrator, or it belongs to one of the roles. 29 | 30 | Pros 31 | * does not require custom code (i.e. avoids having to deploy code to server) 32 | * for the basic use-case (only the owner and the admin can manipulate an entry) nothing else needs to be done aside from enabling. 
A special ownership manipulation API would need to be defined for other cases. 33 | 34 | Cons 35 | * requires additional memory per entry. This should be mitigated using a BitSet. 36 | * the granularity is per-role and not per-user 37 | * ownership manipulation over remote needs protocol changes 38 | 39 | Proposed API: 40 | 41 | ``` 42 | interface AuthorizationManager { 43 | ... 44 | Set getEntryRoles(); 45 | void setEntryRoles(K key, String... role); // this operation can be performed either by the owner or an ADMIN 46 | } 47 | ``` 48 | -------------------------------------------------------------------------------- /Graceful-shutdown-&-restore.md: -------------------------------------------------------------------------------- 1 | ## Purpose 2 | Implementing graceful shutdown and restore is essential for a datagrid which needs to be "cycled" for maintenance: 3 | * shutdown: stop all nodes in a cluster in a safe way by persisting state and data 4 | * restore: restart the whole cluster, apply previous state and preload the data from the storage (no data loss) 5 | 6 | ### State configuration 7 | State configuration should be done at the GlobalConfiguration level. It should include the following settings: 8 | * a state directory, where state will be persisted 9 | * the expected number of nodes in the cluster. State transfer will not initiate until all nodes have joined the cluster. The default is 0, which corresponds to the current behaviour. 10 | 11 | ### Persistent State 12 | The persistent state will be stored in the state directory. State should consist of multiple state files: 13 | * Global State File (___global.state) 14 | * localUUID 15 | * Per-cache State File (cachename.state) 16 | * hash.function = hash class name 17 | * hash.numOwners = number of owners in the hash 18 | * hash.numSegments = number of segments in the hash 19 | * hash.members = number of members in the hash 20 | * hash.member.n.uuid = the UUID of each member 21 | The state file will use the standard Java property file format for simplicity 22 | 23 | ### Startup sequence 24 | if statedir != null { 25 | check for state existence stored in statedir 26 | set UUID of local node to stored value 27 | start transport 28 | wait for expected view 29 | coordinator pushes configuration to all 30 | send consistent hash 31 | potentially stagger cache starts to avoid storm 32 | } else { 33 | start transport 34 | if !coordinator { 35 | send join request 36 | } else { 37 | wait for numInitialMembers to join 38 | send consistent hash 39 | } 40 | } 41 | 42 | ### Shutdown sequence 43 | for each running cache { 44 | put cache in shutting down state 45 | disallow new local ops/txs 46 | optionally wait for pending ops/txs 47 | send "ready to shutdown" to coord 48 | coord sends ack and does not process any more CH updates 49 | flush stores (passivation / async) 50 | } 51 | save state 52 | -------------------------------------------------------------------------------- /Conflict-resolution.md: -------------------------------------------------------------------------------- 1 | On-demand operations: 2 | ``` 3 | interface ConflictResolutionManager { 4 | Map getAllVersions(Object key); // can run from any node 5 | Stream> getConflicts(); // can run from any node, chunked? 
6 | void resolveConflict(Object key); // applies EntryMergePolicy and updates all owners using putIfAbsent/replace/(versioned replace), always runs from primary 7 | void resolveConflicts(); 8 | } 9 | 10 | // Not only for partition handling = ALLOW_ALL but also fixing failed writes 11 | @Experimental // don't set it in stone now 12 | interface EntryMergePolicy { 13 | CacheEntry merge(CacheEntry preferredEntry, CacheEntry otherEntry); 14 | } 15 | // OOTB implementations: 16 | // version-based: any version is better than null, possibility to store timestamp in version to implement last update policy? 17 | // preferred-always (return null if preferred is null) 18 | // preferred-non-null (return preferred if none is null) 19 | 20 | enum PartitionHandling { 21 | DENY_ALL, // if the partition does not have all owners, both reads and writes are allowed 22 | ALLOW_READS, // degraded but allows stale reads 23 | ALLOW_ALL // allow the values to diverge and resolve the conflicts later 24 | } 25 | ``` 26 | 27 | Selecting preferred partition for given segment: 28 | 29 | 1. (discutable) The partition that had all owners before split? 30 | 1. Overall size of partition 31 | 1. Order in members list (just make the choice deterministic) 32 | 33 | Selecting preferred entry when there's no split brain (failed operation kept the cluster inconsistent): preferred entry is the one from primary owner. 34 | 35 | Automatic merging after split-brain heal: 36 | 37 | * before the merged entry is written to all partitions, partition A should not see modifications from B and vice versa (merge happens before installing a merged topology) 38 | * merge uses EntryMergePolicy from configuration 39 | 40 | Configuration (): 41 | 42 | * deprecate `enabled="true|false"` 43 | * add `type="deny-all|allow-reads|allow-all"` 44 | * add `entry-merge-policy="version-based|preferred-always|preferred-non-null|custom class name"` 45 | 46 | Other thoughts: 47 | 48 | * for all-keys conflict resolution the primary needs to request whole segment, not just presently owned keys because it may miss some - we could reuse InboundTransferTask? 49 | -------------------------------------------------------------------------------- /TopologyId.asciidoc: -------------------------------------------------------------------------------- 1 | == Topology Id Rework 2 | 3 | The Infinispan Hot Rod client is able to know which node to direct requests to based on the key. 4 | This has to be updated every time the topology changes (new node leaves or joins the cluster). 5 | When the topology changes a counter is incremented that the server is aware of to signal this. 6 | This client then is able to tell the server which topology id it last knew about and in the case it is 7 | old the server will reply with the new id and the members. 8 | 9 | Unfortunately, this has an interesting problem if the entire server cluster is shut down but a client 10 | is still around. 11 | The client will of course not function when the cluster is gone, however when the server starts back up 12 | the server topology id will be reset back to 0. 13 | This means a client will have a _higher_ topology so the server will never send the cluster information 14 | and the client will, until the server toplogy catches up, no longer know what servers a key map to and 15 | also send to servers that may not even exist any longer. 16 | 17 | === Suggested Fix 18 | 19 | Fixing this will require a few different things. 
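As a rough illustration of the first two options listed below, the client-visible id could be derived from the topology itself rather than from a counter, so that a restarted cluster produces comparable ids. The sketch assumes `org.infinispan.topology.CacheTopology` and `ConsistentHash#getMembers()`; which hash (pending, read, or the whole topology) feeds the id is exactly the choice to be made:

[source,java]
----
// Sketch only: a reproducible id based on the hash membership instead of a monotonic counter.
int clientTopologyId(CacheTopology topology) {
   ConsistentHash ch = topology.getPendingCH() != null ? topology.getPendingCH() : topology.getCurrentCH();
   return ch.getMembers().hashCode();
}
----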
20 | 21 | * Change the toopology id to be a more unique number that can be recreated instead of an incrementing number 22 | ** HashCode of pending (if not null) or read consistent hash 23 | ** HashCode of the CacheTopology itself 24 | * Change so the server will send a new topology if the id does not match instead of being less than 25 | * Ensure the client applies the topology update even if the id does not match instead of being greater than 26 | 27 | Note that the server itself can still have an incrementing topology id, just the one it shares to the client 28 | will be a generated one. 29 | 30 | === Backwards Compatibility 31 | 32 | The changes suggested above are normally done as part of a new protocol, such as 3.2 or 4.0. 33 | But it may be desirable instead to apply retroactively to all versions of the protocol. 34 | 35 | However, some clients (13.0.0 & 9.4.24) had a change that made it so they only applied a topology update if the returned topology 36 | id was greater than their own. 37 | This means if a user upgrades the server the fix above will still not work as the client may or may not apply 38 | the update. 39 | 40 | This means we can either provide backwards compatibility to a subset of previous clients or to none, requiring a 41 | new protocol version to handle the new topology ids. 42 | It is unclear what approach should be done. 43 | Note, however, that for many users it is much simpler to update the server than it is for all clients. 44 | -------------------------------------------------------------------------------- /Clustered-cache-configuration-state.asciidoc: -------------------------------------------------------------------------------- 1 | Clustered cache configuration / state 2 | ===================================== 3 | 4 | Currently all members of a cluster manage their configuration, cache lifecycle and most of the state independently. 5 | This means that in order for a cache to be created across a cluster, each node must define its configuration locally and 6 | start it. There is also no validation that a cache configuration is compatible across nodes. Also, while some 7 | configuration attributes can be changed at runtime, this does not happen cluster-wide (although it might make sense for 8 | some configuration attributes to be asymmetric, e.g. capacity factor). Additionally, when new nodes join a running 9 | cluster, they should also start any clustered caches so that symmetry is maintained. 10 | 11 | Implementation 12 | -------------- 13 | 14 | * Change the current behaviour of the GlobalStateManager, enabling it internally by default. Persistence of state will however continue to be enabled only if requested by the user. 15 | * The GlobalStateManager adds a ___state cache handled by the InternalCacheRegistry. This cache will contain both cache configurations (stored as strings using the Infinispan XML schema) and a list of running caches (do we need to store more state ?). 16 | * Cache configuration and lifecycle performed via the usual DefaultCacheManager API will continue to behave as it does 17 | currently (i.e. each node is independent). 18 | * Add a new API for managing clustered configuration and state (see below) 19 | * Allow configuration Attributes to be either Global or Local. 20 | * Implement a variant of equals (equalsIgnoreLocal ?) for Configuration objects to validate congruency ignoring local 21 | attributes. 
22 | * Support the +start+ configuration attribute for caches (which can be either +LAZY+ or +EAGER+) so that EAGER caches 23 | are automatically started when the CacheManager is started. 24 | * When a node joins a cluster it retrieves the list of running caches from the state cache and starts them using 25 | the configuration from the state cache. The configuration coming from the state cache is validated against any local 26 | configuration which might be present in the DCM's ConfigurationManager. 27 | * Modifying a Global configuration Attribute at runtime will propagate the change to all nodes. 28 | 29 | API 30 | --- 31 | 32 | Add a cluster() method to the DefaultCacheManager which returns a cluster-affecting implementation of the EmbeddedCacheManager interface. Methods such as +defineConfiguration()+, +undefineConfiguration()+, +startCaches()+, will affect the entire cluster. All other methods will be delegated to the underlying local DefaultCacheManager. 33 | 34 | -------------------------------------------------------------------------------- /Kubernetes-CLI.adoc: -------------------------------------------------------------------------------- 1 | = Kubernetes CLI 2 | 3 | The Infinispan CLI currently has two modes: command-line and interactive. 4 | To improve the user experience when using the Infinispan Operator we can enhance the CLI to add Kubernetes capabilities. 5 | 6 | == Native 7 | 8 | Using Quarkus we can build the CLI as a native executable for the three most common platforms: Linux x86_64, OS X, Windows. 9 | 10 | == Kubernetes mode 11 | 12 | Installing the executable as `kubectl-infinispan` allows invocation as a `kubectl/oc` plugin, e.g.: 13 | 14 | `kubectl infinispan ...` 15 | 16 | Using the Fabric8 Kubernetes Client we can automatically detect the existence of a Kubernetes configuration (default to `~/.kube/config`) and automatically connect to the Kubernetes cluster. 17 | 18 | === Kubernetes commands 19 | 20 | The Infinispan Kubernetes CLI plugin uses the default namespace for all operations, unless overridden with the `--namespace` parameter. 21 | 22 | ==== Install the operator 23 | 24 | `kubectl infinispan install [-n namespace]` 25 | 26 | Install the Infinispan Operator in the specified namespace. 27 | This should also install the relevant OLM subscription and operator group so that the operator can be updated. 28 | 29 | ==== Create an Infinispan cluster 30 | 31 | `kubectl infinispan create cluster [-n namespace] [-r replicas] [--expose-type=type] [--expose-port=port] [--expose-host=hostname] [name]` 32 | 33 | Creates an Infinispan cluster: 34 | * `-n namespace` overrides the namespace. 35 | * `-r replicas` selects the initial number of replicas. Defaults to 1. 36 | * `--expose-type=LoadBalancer|NodePort|Route` selects how the service should be exposed. 37 | * `--expose-port=port` specifies the exposed port for LoadBalancer and NodePort types. 38 | * `--expose-host=hostname` specifies the exposed hostname for the Route type. 39 | * `--security-secret=secretname` specifies the secret to use for endpoint authentication. 40 | * `--security-cert-secret=secretname` the secret that contains a service certificate, tls.crt, and key, tls.key, in PEM format. 41 | * `--security-cert-service=servicename` the Red Hat OpenShift certificate service name. 
42 | 43 | ==== Scale an Infinispan service 44 | 45 | `kubectl infinispan scale -r replicas [servicename]` 46 | 47 | Scale an Infinispan cluster to the specified number of replicas 48 | 49 | ===== Interact with a running Infinispan service 50 | 51 | `kubectl infinispan shell [name]` 52 | 53 | Connects to one of the active pods of the specified cluster and launches the CLI in interactive mode. 54 | 55 | ==== Remove an Infinispan service 56 | 57 | `kubectl infinispan delete cluster [-n namespace] [servicename]` 58 | 59 | Stops and removes an Infinispan service. 60 | 61 | ==== Uninstalling the operator 62 | 63 | `kubectl infinispan uninstall [-n namespace]` 64 | 65 | Removes the Infinispan operator 66 | 67 | -------------------------------------------------------------------------------- /Task-Execution-Design.md: -------------------------------------------------------------------------------- 1 | Some of the most powerful features in Infinispan involve execution of code "close to the data": distributed executors and map/reduce tasks allow running user tasks using the inherent data parallelism and the computing power of all the nodes in a cluster. 2 | While it is interesting to allow running such tasks from HotRod, the concept can be extended even further to the execution of any kind of code, be it local to a single node or distributed among some/all the nodes in the cluster. 3 | In the spirit of HotRod language-agnostic nature, we should not limit execution just to Java, but leverage the JDK's scripting engine API so that tasks can be developed in a multitude of languages. 4 | 5 | # Stored scripts 6 | * Store scripts in a dedicated script cache (persistent, secure, etc) 7 | * If the scripting engine supports it, precompile to bytecode to take advantage of HotSpot 8 | * Each script can take multiple named parameters which will appear as bindings (i.e. variables) in the script 9 | * Script bindings include the cacheContainer, cache, scriptingManager, marshaller (if in HotRod) 10 | * A script can optionally return a result 11 | * Scripts can be of various types supported by JSR-223 (Java, Javascript, Scala, etc) 12 | 13 | # Remote execution over HotRod 14 | * Add EXEC op 15 | * Input parameters: (name: string, value: byte[])* < marshalled values 16 | * Returns: (byte[])? 17 | * Optional flags to specify execution type (local, distributed, map/reduce) 18 | 19 | # Remote execution over REST 20 | * Invoke a POST on the URL http://server/rest/!/scriptname/cachename?param1=value1¶m2=value2 21 | * Invoke a POST on the URL http://server/rest/!/scriptname/cachename and pass any parameters in the request body. 22 | * Manipulate sync/async execution using the 'performAsync' header used by other ops 23 | * The returned value will use the request variant to perform conversion appropriately to the requested type (text/plain, application/octet-stream, application/xml, application/json, application/x-java-serialized-object) 24 | 25 | # Task manager 26 | * The task manager will be the main entry point for executing tasks and for retrieving execution history 27 | * Task execution is delegated to specific engines 28 | ** Scripts will be handled by the existing ScriptManager 29 | ** Server-deployed tasks will be handled by a DeployedTaskManager 30 | * The executor within which tasks are executed is global to all engines. 31 | * Track start/finish/what/who/where for tasks that have been executed. Store these in an internal, persistent cache which can be queried/filtered appropriately. 
Possibly have a limited retention. 32 | * Support aborting running jobs (best-effort, since it requires handling of InterruptedException in appropriate places. Badly-behaved tasks will not be stoppable until JVM termination). 33 | * Throttling ? (via some executor / thread priority modification) 34 | 35 | -------------------------------------------------------------------------------- /Smoke-Testsuite.md: -------------------------------------------------------------------------------- 1 | As the Infinispan project grows, there's a need to expand the tests to run them under different set ups, e.g. with uber jars, with client modules running in Wildfly, with normal jars...etc. 2 | 3 | On top of different set ups, we are also finding that a subset of the tests we have also need to be run with different configurations, such as compatibility mode, with/without distribution, with/without replication, with/without transactions, with/without xsite...etc. 4 | 5 | The current test set up means that each module has its own set of tests, so duplication can easily happen, and on top of that, to be able to add set ups to test, tests need to be duplicated or extended in a way not originally intended. 6 | 7 | On top of the duplication of tests, there's a lot of waste of resources generated by the number of times caches/cachemanagers/clusters/sites are started and stopped in the entire testsuite, which leads to a general slow down of the testsuite. 8 | 9 | During a recently held meeting, it was agreed that the following improvements would need to be made: 10 | 11 | * Move the most relevant of all functional tests into a single `testsuite` project. We could try to use some tooling to determine the most relevant tests (student project?) 12 | * Define suites for which caches/cachemanagers/clusters/sites are started and then run a load of tests for that particular set up. The aim is to reduce time needed to run tests. 13 | * Running all tests in all configurations/set-ups would likely take too long, so we should consider randomising the number of configurations/set-ups run. Any randomising applied should allow to backtrack to figure out the exact configuration/set-up run to debug failures. Other OS projects such as Lucene are already using randomising of tests, so we should check what they do. 14 | * Randomising could be used as way to introduce failures and see how the tests behave in those conditions. Again, being able backtrack to the cause of failure would be necessary. 15 | * Incremental testing possibilities should be considered if available. In essence, incremental test would mean that when a change happens, only tests affected by those changes would be run. Not sure how it'd run in the presence of reflection? 16 | * Randomising test data would also be helpful to discover failures, e.g. primitive data, non-serializable data, serializable data, externalizer-marshallable data, nulls...etc. 17 | * It could be desirable to make all this work independent of the test framework used by creating our own DSL and defining tests in plain test. Sanne has done some similar work with ANTLR for one of Hibernate OGM parsers. 
18 | 19 | Some of the projects that could help with this are: 20 | * [JUnit Lambda](http://junit.org/junit-lambda.html) 21 | * [Ekstazi](http://www.ekstazi.org) - Lightweight Test Selection 22 | * [Pitest](http://pitest.org) 23 | * [Infinitest](http://infinitest.github.io) for continuous testing -------------------------------------------------------------------------------- /Cache-Store-Subsystems.md: -------------------------------------------------------------------------------- 1 | While the cache store configuration machinery in embedded is easily extensible, adding new cache stores to Infinispan Server is much more complex because of the "monolithic schema" approach that WildFly imposes on subsystems. This means that, in order to add a schema-enabled cache store to server, we have to modify the Infinispan subsystem itself. This applies to both "Infinispan-owned" cache stores which live outside the main Infinispan repo and to custom cache stores. 2 | 3 | The proposed solution is to create a dedicated subsystem for "external cachestores", akin to how datasources are currently configured. Cache stores would be defined in this subsystem and referenced from the main Infinispan subsystem using a naming convention (think JNDI). 4 | 5 | We would need to define an archetype subsystem which can be extended for the specific cache store configuration logic with little effort. 6 | 7 | Here is an example of the configuration for such a subsystem: 8 | 9 | 10 | 15 | 16 | 17 | 20 | 21 | 22 | 23 | The subsystem would be responsible for parsing the configuration and 24 | registering an appropriate StoreConfigurationBuilder under a named 25 | Service within the server. We need to be careful about class loader 26 | visibility, but the datasource subsystem does something similar, so it 27 | should be possible. 28 | 29 | The main Infinispan subsystem would need to be extended to be able to 30 | parse the following (simplified): 31 | 32 | 33 | 34 | 35 | 38 | 39 | 40 | 41 | 42 | It would also not be difficult, for symmetry, to support a compatible 43 | schema for the embedded use-case. 44 | 45 | The cachestores would be distributed both as a simple jar for embedded 46 | use-cases and as a zip containing the necessary modules for server. 47 | Note that this would not leverage the deployable cachestores machinery 48 | as subsystems need to be installed as modules. 49 | -------------------------------------------------------------------------------- /Create-Cache-over-HotRod.md: -------------------------------------------------------------------------------- 1 | NB. This page refers to Hot Rod, but the concepts can be applied to REST as well. 2 | 3 | Creating a cache from a remote client is an often requested feature: 4 | 5 | * JCache's API CacheManager.createCache(String cacheName, C configuration) 6 | * Hibernate OGM can optionally create caches (as part of schema generation) 7 | * Teiid can create tables 8 | 9 | Since cache configuration is very complex, remote cache creation will be limited to specifying an existing configuration template name. Additionally, caches created remotely might be temporary (i.e. they are not persisted back into the configuration, if applicable). 
The proposed API for the cache creation call (using the Java client as reference) would be: 10 | 11 | RemoteCache RemoteCacheManager.createCache(String cacheName, String configurationName, boolean persistInConfiguration) 12 | 13 | with the additional signature 14 | 15 | RemoteCache RemoteCacheManager.createCache(String cacheName, String configurationName) { 16 | return createCache(cacheName, configurationName, false); 17 | } 18 | 19 | For symmetry, a 20 | 21 | RemoteCacheManager.destroyCache(String cacheName) 22 | 23 | method would remove a cache. 24 | 25 | The protocol would be enhanced with the following opcodes: 26 | 27 | 0x37 = create cache request [header + config name length (vInt) + config name (String) + persist (1 byte)] 28 | 0x38 = create cache response 29 | 0x39 = destroy cache request [header] 30 | 0x40 = destroy cache response 31 | 32 | The usual status / error codes will be used in the response. 33 | 34 | Server-side, cache creation should adhere to the following rules: 35 | 36 | * the operation is performed on all nodes 37 | * the operation is complete only when it is completed by all nodes 38 | * an empty configurationName will use the configuration of the default cache 39 | * if the temporary flag is not set, then the cache configuration will be persisted so that it will be recreated on startup 40 | * cache creation would be disallowed on a degraded cluster 41 | * the operation would require the ADMIN permission if authorization is enabled 42 | * if authorization is disabled, the operation would be allowed only when invoked over a loopback interface 43 | 44 | New nodes joining an existing cluster with caches created by the above operation might not have the required cache in their configuration. Therefore the coordinator should send back a join response with the list of running caches which need to be created, including their configuration name as well as whether they are temporary. 45 | 46 | Persisting the configuration will only be possible if the server is in a mode which allows it: 47 | * embedded server: N/A 48 | * standalone mode server: the server applies the configuration to its model so that it can be persisted by the subsystem writer. If part of a cluster, the operation will need to be rolled back in case of failures on other nodes 49 | * domain mode server: the server delegates the configuration change to its domain controller so that it is then applied to all servers in the server group. (TBD) 50 | 51 | -------------------------------------------------------------------------------- /TransportSecurity.asciidoc: -------------------------------------------------------------------------------- 1 | == Server Transport Security 2 | 3 | JGroups offers various protocols to secure the transport, however they require complex setup which could be simplified by the server security configuration. 4 | 5 | == Encryption 6 | 7 | === SSL/TLS 8 | 9 | JGroups does not directly offer the use of TLS/SSL for the TCP transport, although it does allow setting a custom `SocketFactory`. 10 | 11 | By using the server identity of a security realm, we can supply JGroups with an SSL-enabled SocketFactory. The server identity must have both a keystore as well as a truststore so that certificates are mutually trusted. 
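In practice this means building an `SSLContext` from that keystore/truststore pair and exposing it to the transport as a socket factory. A minimal sketch using plain JDK APIs (store type, paths and the actual JGroups wiring are assumptions; the server would derive all of this from the realm configuration):

[source,java]
----
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class ClusterSslContextFactory {
   // Builds the SSLContext backing a "cluster" realm server identity, combining a
   // keystore (this node's certificate) with a truststore (the peers' certificates).
   public static SSLContext build(String keyStorePath, char[] keyStorePassword,
                                  String trustStorePath, char[] trustStorePassword) throws Exception {
      KeyStore keyStore = KeyStore.getInstance("PKCS12");
      try (FileInputStream in = new FileInputStream(keyStorePath)) {
         keyStore.load(in, keyStorePassword);
      }
      KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
      kmf.init(keyStore, keyStorePassword);

      KeyStore trustStore = KeyStore.getInstance("PKCS12");
      try (FileInputStream in = new FileInputStream(trustStorePath)) {
         trustStore.load(in, trustStorePassword);
      }
      TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
      tmf.init(trustStore);

      SSLContext ctx = SSLContext.getInstance("TLS");
      ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
      return ctx;
   }
}
----

The resulting context yields the `SSLSocketFactory`/`SSLServerSocketFactory` pair that would replace the default plain sockets of the JGroups TCP transport.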
12 | 13 | By using namespaces, we can extend the transport configuration as follows: 14 | 15 | [source,xml] 16 | ---- 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | ---- 25 | 26 | To simplify the out-of-the-box configuration, we can include a `cluster` security realm with a commented-out server identity: 27 | 28 | [source,xml] 29 | ---- 30 | 31 | 32 | 33 | 34 | 41 | 42 | 43 | 44 | ---- 45 | 46 | A big advantage of using SSL/TLS is that, if OpenSSL is available, we get native performance. 47 | 48 | === SYM_ENCRYPT (optional) 49 | 50 | The `SYM_ENCRYPT` protocol requires all nodes to use the same secret stored within a keystore. We can use the server's credential stores to supply the secret: 51 | 52 | [source,xml] 53 | ---- 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | ---- 62 | 63 | === ASYM_ENCRYPT (no) 64 | 65 | The complexity of configuring `ASYM_ENCRYPT` doesn't provide any additional value compared to the SSL/TLS approach described above. 66 | 67 | == Authentication (optional) 68 | 69 | When paired with a realm which can authenticate, depending on its features, we can set up JGroups authentication. 70 | 71 | === Trust realm 72 | 73 | If the security realm contains a trust realm, we can automatically enable SSL client authentication without any further configuration 74 | 75 | === Kerberos identity 76 | 77 | If the security realm has a Kerberos identity, we can automatically set up the `AUTH` JGroups protocol with the `KrbToken` 78 | 79 | === Others 80 | 81 | Supporting realms with other types of authentication providers (properties, LDAP), while possible, would require encryption to prevent credential sniffing. 82 | -------------------------------------------------------------------------------- /Lock-Reordering-For-Avoiding-Deadlocks.asciidoc: -------------------------------------------------------------------------------- 1 | === Context 2 | Considering a from-the-book deadlock scenario, with two transactions Tx1 and Tx2 writing on two keys "a" and "b" in different order: 3 | 4 | * Tx1: writes on “a” then “b” 5 | * Tx2: writes on “b” then “a” 6 | 7 | With some “right” timing => deadlock during prepare time. 8 | A common approach for avoiding the above deadlock is by forcing all transactions to write the keys in the same order. 9 | 10 | Considering the previous example, if both Tx1 and Tx2 write the keys in lexicographical order: 11 | 12 | * Tx1: writes on “a” then “b” 13 | * Tx2: writes on “a” then “b” 14 | 15 | Now both transactions access (and lock) the keys in the same order so there's no possibility for deadlock. 16 | 17 | Key reordering is not always possible though: 18 | . it is not always possible to know all the keys written within a transaction a priori 19 | . there is not always possible to define a link:http://en.wikipedia.org/wiki/Total_order[total order] relation over the set of keys. 20 | 21 | ==== Ordering keys for optimistic locking 22 | The above mentioned shortcomings can be overcome in a distributed grid based on link:Optimistic-Locking-In-Infinispan[optimistic locking]: 23 | 24 | . With optimistic locking, the lock acquisition takes place at prepare time. So before acquiring any lock, the set of keys for which locks are to be acquired is known 25 | . 
In order to infer a total order relation over the set of keys we can make use of the consistentHash function: 26 | ** during prepare for each transaction order the keys based on their consistent hash value 27 | *** this assures that all transactions deterministically order their keys, as consistent hash is a deterministic function 28 | ** acquire the locks in this sequence 29 | 30 | This means that locks are NOT acquired in the order in which the corresponding operation has happened. This doesn't have any impact for the user: lock acquisition takes place at prepare time and at this stage the user does not have any control over the transaction. 31 | 32 | There is a chance for two keys touched by the same transaction to have the same consistent hash value. 33 | 34 | In this case: 35 | 36 | * it is still possible to have deadlocks 37 | * the possibility of consistent hash collision to happen is rather small, assuming the consistent hash has a uniform distribution 38 | * even if this collision happens, the consistency guaranteed are not violated 39 | 40 | === Value 41 | This prepare-time ordering approach offers a simple and elegant way of reducing the number of deadlocks when the in memory-gird is configured with optimistic locking: 42 | 43 | * no effort is required from the user to define an total order relation over the keys placed in the grid 44 | * the computational effort for reordering the key is not significant 45 | * the user does not have to know all the keys used within a transaction a priori, that mens a very flexible programming model 46 | 47 | === Related 48 | * this optimization is tracked by link:https://issues.jboss.org/browse/ISPN-1132[ISPN-1132] which also contains the suggested design for implementing it. 49 | * Infinispan's link:Optimistic-Locking-In-Infinispan[optimistic locking] model -------------------------------------------------------------------------------- /Authorization.adoc: -------------------------------------------------------------------------------- 1 | = Authorization 2 | 3 | == Premise 4 | This document outlines a proposal for improving authorization in Infinispan. 5 | Currently, Infinispan authorization is disabled by default in both embedded and server. 6 | Authorization has two main scopes: global and per-cache. 7 | Since per-cache operations authorization has a significant impact on performance, it needs to be explicitly enabled. 8 | 9 | == Current limitations 10 | * Configuration of authorization is complex and there is no OOTB configuration 11 | * The internal roles for schema and script management are not friendly 12 | * The small number of permissions means that the scope of the ADMIN permission is quite broad 13 | 14 | == Proposal 15 | The following changes are proposed: 16 | * Introduce a new CREATE permission 17 | * Include a default set of roles when the user doesn't provide any 18 | * Enable authorization out-of-the box in the server 19 | * (Optional) Introduce a way to "scope" permissions to resource names 20 | 21 | === CREATE permission 22 | Add a CREATE permission to `org.infinispan.security.AuthorizationPermission`. 23 | CREATE allows creation/removal of caches, counters, schemas, scripts. 24 | CREATE supersedes and deprecates the existing `___schema_manager` and `___script_manager` internal roles. 
25 | 26 | === Default roles 27 | Enabling authorization without supplying any roles would define the following default roles: 28 | 29 | * *admin* superuser, allowed to do everything (*ALL*) 30 | * *application* allowed to perform all read/write ops, but not allowed to create/remove caches, schemas, scripts (*ALL_READ*, *ALL_WRITE*, *LISTEN*, *EXEC*) 31 | * *builder* allowed to create/remove caches, schemas, scripts (*ALL_READ*, *ALL_WRITE*, *LISTEN*, *EXEC*, *CREATE*) 32 | * *observer* a read-only role. Can use the CLI/console but all write ops are forbidden (*ALL_READ*) 33 | 34 | === Server authorization 35 | 36 | * The default server configurations should enable authorization OOTB for server/container ops. 37 | * The default role mapper should be the `cluster-role-mapper` which stores principal-to-role mappings in a replicated persistent cache across the cluster 38 | 39 | === Cluster Role Mapper 40 | 41 | * if empty, it behaves like the `identity-role-mapper` so that default behaviour doesn't change, i.e. the identity's principals are mapped 1:1 with roles. 42 | This means an "admin" user will act with the "admin" role. 43 | * CLI `grant` and `deny` commands control the association between principals and roles, e.g. `grant myuser builder` 44 | 45 | 46 | === Role Scoping (Optional) 47 | 48 | Because some permissions (ADMIN) are quite broad, it may make sense to introduce role scoping. 49 | This would allow granting access to one or more roles to specific resources while preserving the usual semantics everywhere else. 50 | For example, to allow users having the `backup` role perrmissions to invoke privileged backup ops: 51 | 52 | [source,xml] 53 | ---- 54 | 55 | 56 | 57 | 58 | 59 | ---- 60 | 61 | This implies that a role with no scopes would be equivalent to the wildcard scope: 62 | 63 | [source,xml] 64 | ---- 65 | 66 | 67 | 68 | 69 | 70 | ---- 71 | 72 | -------------------------------------------------------------------------------- /CacheMultimap.adoc: -------------------------------------------------------------------------------- 1 | = Native cache Mutimap 2 | 3 | The goal is to provide Infinispan Native CacheMultimap support. The proposal here is to start with the smallest interface implementation, in order to provide a proper API to the first multimap client, vert-x project. 4 | However, this API will evolve, and this first implementation must work on Embedded and Server mode. 5 | This first Multimap won't support duplicate values on the same key. 6 | 7 | [source, java] 8 | .CacheMultimap.java 9 | ---- 10 | public interface MultimapCache { 11 | void put(K key, V value); // <1> 12 | 13 | Collection get(K key); // <2> 14 | 15 | boolean remove(K key); //<3> 16 | 17 | boolean remove(K key, V value); 18 | 19 | void remove(Predicate p); //<4> 20 | } 21 | ---- 22 | <1> "Put" indicates that the value is being somehow replaced on a key if the key already exists. An alternative could be to call it "add" 23 | <2> We might want to return a Set instead of Collection interface, because duplicates on same key won't be supported yet. 24 | <3> Calling this method "reset" has been suggested 25 | <4> vert-x implementation needs a way to remove values depending on a given Predicate. To achieve this, we need to provide an API or CacheSet keySet(); method. 26 | 27 | An alternative is to support only Async API instead of the standard sync API. 
28 | 29 | [source, java] 30 | .MultimapCache.java 31 | ---- 32 | public interface MultimapCache { 33 | CompletableFuture put(K key, V value); 34 | 35 | CompletableFuture> get(K key); 36 | 37 | CompletableFuture remove(K key); 38 | 39 | CompletableFuture remove(K key, V value); 40 | 41 | CompletableFuture remove(Predicate p); 42 | } 43 | ---- 44 | 45 | 46 | The underlying implementation will wrap a normal Cache. Some of the suggestions : 47 | 48 | 49 | Apparently there is a problem with functional commands. 50 | They don't really work efficiently over Hot Rod (does get/replace in a loop). 51 | We would need to add some more handling in the protocol to allow for only partial replication 52 | of values and only 1 remote call. 53 | 54 | 55 | == Embedded Multimap 56 | 57 | A new maven module is created. 58 | 59 | To create a multimap cache 60 | 61 | ```java 62 | EmbeddedCacheManager cm = ... 63 | MultimapCacheManager multimapCacheManager = EmbeddedMultimapCacheManagerFactory.from(cm); 64 | multimapCache = multimapCacheManager.get("test"); 65 | 66 | multimapCache.put("k", "v"); 67 | 68 | ``` 69 | 70 | 71 | == Infinispan Server Multimap 72 | 73 | There are at least two options for the implementation 74 | 75 | === Option A 76 | 77 | We define a new interface called RemoteMultimapCache that won't support methods with lambdas, but the most 78 | common and useful methods will be supported. 79 | 80 | We implement it using the existing OperationsFactory. We enrich the header or we add a flag so the server will know that this 81 | call is meant to be over a multimap. So the internal implementation will grab the cache manager and the cache, create the 82 | EmbeddedMultimapCache and call the methods over that instance instead of over the regular cache methods. 83 | 84 | - + Code reuse 85 | 86 | - - Spaghetti code risk 87 | 88 | === Option B 89 | As for Counters, Locks and any other module built on top of the core, we will implement new operations for hotrod that are specific 90 | for multimaps. 91 | 92 | - + Clear separation of modules 93 | - - Duplication of code risk 94 | 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /Custom-Cache-stores-(deployable).md: -------------------------------------------------------------------------------- 1 | # Deployable Cache Stores high level design 2 | 3 | This document describes Deployable Cache Stores implementation details. 4 | 5 | ## Client perspective 6 | 7 | The client will be able to deploy a custom Cache Store jar into Hotrod server (put it into `$HOTROD_SERVER/standalone/deployments`). The jar will need to contain one of the following service entries: 8 | * /META-INF/services/org.infinispan.persistence.spi.AdvancedCacheWriter 9 | * /META-INF/services/org.infinispan.persistence.spi.AdvancedCacheLoader 10 | * /META-INF/services/org.infinispan.persistence.spi.CacheLoader 11 | * /META-INF/services/org.infinispan.persistence.spi.CacheWriter 12 | * /META-INF/services/org.infinispan.persistence.spi.ExternalStore 13 | * /META-INF/services/org.infinispan.persistence.spi.AdvancedLoadWriteStore 14 | 15 | Those services might used later used in the configuration. 16 | 17 | ## Implementation details 18 | 19 | _The implementation is based on Deployable Filters and Converters._ 20 | 21 | Currently all writers and loaders are instantiated in `PersistenceManagerImpl#createLoadersAndWriters`. This implementation will be modified to use `CacheStoreFactoryRegistry`, which will contain a list of `CacheStoreFactories`. 
One of the factories will be added by default - the local one (which will the same mechanism as we do now - `Util.getInstance(classAnnotation)`. Other `CacheStoreFactories` will be added after deployment scanning. 22 | 23 | `PersistenceManagerImpl` is attached to each Cache separately and has a reference to `Configuration` object. During the startup process `PersistenceManagerImpl` scans for configured stores (`PersistenceManagerImpl#createLoadersAndWriters`) and creates separate references for `CacheLoaders` and `CacheWriters`. After modifications, not only local instances will be available (more precisely from local Classloader), but also deployed. 24 | 25 | ## FAQ 26 | 27 | Q: What happens when you have both a cache loader and a cache writer deployed? 28 | A: `PersistenceManagerImpl` uses separate instances for `CacheLoaders` and `CacheWriters`. It is perfectly legal that a custom Cache Store class implements both of those interfaces. In that case the same instance will be used for loading and writing persistence data. Here is a sample block of code from, which illustrates this idea: 29 | ``` 30 | PersistenceManagerImpl#createLoadersAndWriters: 31 | Object instance = Util.getInstance(classAnnotation); 32 | CacheWriter writer = instance instanceof CacheWriter ? (CacheWriter) instance : null; 33 | CacheLoader loader = instance instanceof CacheLoader ? (CacheLoader) instance : null; 34 | ... 35 | loaders.add(loader); 36 | writers.add(writer); 37 | ``` 38 | 39 | Q: What happens when you have a deployed Cache Store and it is also accessible locally? 40 | A: The deployed one will be used. There are 3 options we could use here: 41 | * throw some `DuplicatedCacheStore` exception to indicate such situation (since this is not a total disaster I wouldn't do that) 42 | * Use the local one 43 | * Use the deployed one (my vote) 44 | I think using the deployed one might be useful for upgrades (for example you have a MyCustomStore v. 1.0 accessible locally and MyCustomStore v. 1.1 deployed - you would probably prefer using the newer one). 45 | 46 | Q: How to tie a deployed cache loader/writer with the configuration itself? 47 | A: It is already done via configuration. `PersistenceManagerImpl#createLoadersAndWriters` scans configuration and extracts Cache Store class name from: 48 | * `ConfigurationFor#value` 49 | * `CustomStoreConfiguration#customStoreClass` 50 | Later on this name is used to instantiate loader/writer. 51 | 52 | Q: Are deployed Cache Stores "active" by default? 53 | A: Yes and no. They are accessible and can be instantiated, but without any configuration - they won't be attached to any Cache. 
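To make the factory-based lookup described in the implementation section more concrete, here is a rough sketch of the `CacheStoreFactory`/`CacheStoreFactoryRegistry` pair. The type names come from this proposal, but the method signatures, the null-means-cannot-create convention, and the precedence handling are assumptions, not a final SPI:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch only: real SPI names and signatures may differ.
interface CacheStoreFactory {
   // Returns an instance of the configured store class, or null if this
   // factory cannot create it (e.g. the class is not visible to it).
   <T> T createInstance(String storeClassName);
}

class CacheStoreFactoryRegistry {

   private final List<CacheStoreFactory> factories = new CopyOnWriteArrayList<>();

   // The local (classpath-based) factory is registered by default;
   // deployment scanning registers additional factories later.
   void addCacheStoreFactory(CacheStoreFactory factory) {
      // Deployed factories are consulted first, so a deployed store wins
      // over a locally accessible one with the same class name.
      factories.add(0, factory);
   }

   <T> T createInstance(String storeClassName) {
      for (CacheStoreFactory factory : factories) {
         T instance = factory.createInstance(storeClassName);
         if (instance != null) {
            return instance;
         }
      }
      throw new IllegalArgumentException("No factory could instantiate " + storeClassName);
   }
}
```

Consulting deployed factories first matches the FAQ answer above: a deployed MyCustomStore v1.1 would take precedence over a locally accessible v1.0.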
54 | 55 | 56 | 57 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Infinispan Designs 2 | ================== 3 | 4 | This repository contains various design proposals related to Infinispan 5 | 6 | * [A continuum of data structure and query complexity](A-continuum-of-data-structure-and-query-complexity.asciidoc) 7 | * [Alias caches](Alias-Caches.md) 8 | * [Asymmetric caches and manual rehashing](Asymmetric-Caches-and-Manual-Rehashing-Design.asciidoc) 9 | * [Cache Store Subsystems](Cache-Store-Subsystems.md) 10 | * [Clustered Listeners](Clustered-listeners.md) 11 | * [Cluster Registry](Cluster-Registry.md) 12 | * [Compatibility 2.0](Compatibility-2.0.md) 13 | * [Conflict resolution](Conflict-resolution.md) 14 | * [Consistency guarantees in Infinispan](Consistency-guarantees-in-Infinispan.asciidoc) 15 | * [Continuous query design and indexless queries](Continuous-query-design-and-indexless-queries.asciidoc) 16 | * [Create Cache over HotRod](Create-Cache-over-HotRod.md) 17 | * [Custom Cache stores (deployable)](Custom-Cache-stores-(deployable).md) 18 | * [Deelog: direct integration with Debezium](Deelog%3A-direct-integration-with-Debezium.asciidoc) 19 | * [Design For Cross Site Replication](Design-For-Cross-Site-Replication.asciidoc) 20 | * [Design Wiki Rules](Design-Wiki-Rules.md) 21 | * [Distributed Stream Sorting](Distributed-Stream-Sorting.md) 22 | * [Distributed Stream Support](Distributed-Stream-Support.md) 23 | * [Dynamic JMX exposer for Configuration](Dynamic-JMX-exposer-for-Configuration.md) 24 | * [Fine-grained security for caches](Fine-grained-security-for-caches.md) 25 | * [Graceful shutdown & restore](Graceful-shutdown-&-restore.md) 26 | * [Handling cluster partitions](Handling-cluster-partitions.md) 27 | * [Health-check API](Health-check-API.asciidoc) 28 | * [Hot cache via Debezium](Hot-cache-via-Debezium.asciidoc) 29 | * [Incremental Optimistic Locking](Incremental-Optimistic-Locking.asciidoc) 30 | * [Index affinity proposal](Index-affinity-proposal.asciidoc) 31 | * [CLI](Infinispan-CLI.asciidoc) 32 | * [Hibernate Second-Level Cache improvements](Infinispan-Hibernate-Second-Level-Cache-improvements.md) 33 | * [Query - Design and Planning](Infinispan-Query---Design-and-Planning.asciidoc) 34 | * [Query language syntax and considerations](Infinispan-query-language-syntax-and-considerations.asciidoc) 35 | * [Java 8 API proposal](Java-8-API-proposal.md) 36 | * [Lock Reordering For Avoiding Deadlocks](Lock-Reordering-For-Avoiding-Deadlocks.asciidoc) 37 | * [Multimap as a first class data structure](Multimap-As-A-First-Class-Data-Structure.asciidoc) 38 | * [Multi-tenancy for Hotrod Server](Multi-tenancy-for-Hotrod-Server.asciidoc) 39 | * [Near-Caching](Near-Caching.md) 40 | * [Non-Blocking State Transfer](Non-Blocking-State-Transfer.asciidoc) 41 | * [Non-Blocking State Transfer V2](Non-Blocking-State-Transfer-V2.asciidoc) 42 | * [Off-Heap Data Container](Off-Heap-Data-Container.md) 43 | * [Off-Heap Implementation](Off-Heap-Implementation.md) 44 | * [Optimistic Locking In Infinispan](Optimistic-Locking-In-Infinispan.asciidoc) 45 | * [RAC: Reliable Asynchronous Clustering](RAC%3A-Reliable-Asynchronous-Clustering.asciidoc) 46 | * [RAC: Implementation Details](RAC-implementation.md) 47 | * [Remote Admin Client Library](Remote-Admin-Client-Library.md) 48 | * [Remote Command Handler](Remote-Command-Handler.md) 49 | * [Remote Hot Rod 
Events](Remote-Hot-Rod-Events.asciidoc) 50 | * [Remote Iterator](Remote-Iterator.md) 51 | * [Remote Listeners improvement proposal](Remote-Listeners-improvement-proposal.md) 52 | * [Scattered Cache design doc](Scattered-Cache-design-doc.md) 53 | * [Security](Security.asciidoc) 54 | * [Smoke Testsuite](Smoke-Testsuite.md) 55 | * [Spring 5 features, ideas and integration](Spring-5-features,-ideas-and-integration.md) 56 | * [Task Execution Design](Task-Execution-Design.md) 57 | * [Topology Id Rework](TopologyId.asciidoc) 58 | * [Total Order non Transactional Cache](Total-Order-non-Transactional-Cache.md) 59 | * [XSite Failover for Hot Rod clients](XSite-Failover-for-Hot-Rod-clients.md) 60 | -------------------------------------------------------------------------------- /Server Restructuring.md: -------------------------------------------------------------------------------- 1 | Infinispan Server Restructuring 2 | =============================== 3 | 4 | Status quo 5 | ---------- 6 | 7 | Infinispan Server is currently based on the full WildFly distribution. 8 | It is composed of the following additional components: 9 | 10 | * a JGroups subsystem which is mostly aligned with the WildFly with some additions (e.g. SASL integration) 11 | * an Infinispan subsystem, originally forked from the WildFly one, but much more complete in terms of feature coverage, with support for deployments 12 | * an Endpoint subsystem, unique to Infinispan Server 13 | * the Infinispan admin console 14 | * a CLI which extends the WildFly CLI with some additional convenience commands 15 | 16 | We use the WildFly feature pack and server provisioning tools to create the Infinispan Server package based on the above subsystem and on any additional modules we require (including new versions of modules already supplied by WildFly). 17 | We also remove some unnecessary modules based on the output of a Python script which computes module/subsystem dependencies. 18 | Unfortunately the "trimming process" is not as effective as it could be because some of the modules, while not needed by Infinispan Server, are hard dependencies of other server modules. In particular we rely on the Datasource and Transaction subsystem which have a lot of dependencies. 19 | 20 | Basing Infinispan Server on WildFly has a number of advantages and disadvantages. 
21 | 22 | Advantages 23 | 24 | * JBoss Modules combined with the service model provides module and service dependency tracking with concurrent loading 25 | * User deployment support 26 | * DMR with support for inspecting / modifying / persisting configuration at runtime, with propagation to all nodes in the domain, accessible via CLI, Console and remoting 27 | * Security realms with support for certificate stores, LDAP integration, Kerberos, etc 28 | * The subsystems can be packaged as a WildFly layer to provide newer features to WF deployments 29 | 30 | Disadvantages 31 | 32 | * Integrating with the server configuration model adds a lot of overhead 33 | * Larger than needed footprint 34 | * Much of the functionality supplied by WildFly is unnecessary in container deployments such as OpenShift 35 | 36 | Redesign proposals 37 | ------------------ 38 | 39 | For future development we want to pursue two paths: one which maintains compatibility with the current situation but resolves some of the above disadvantages and a more aggressive one tailored for containers 40 | 41 | Plan A: Base the server on WildFly Core 42 | 43 | * Make the dependency on the transactions and datasource subsystems optional by providing our own internal implementations by directly integrating with Narayana and HikariCP 44 | * Still work with the transactions/datasource subsystem if they are present (to support the WildFly layering) 45 | * Need to still support deployable JDBC drivers 46 | 47 | Plan B: Create an uber-jar slim server 48 | 49 | * Add some bootstrapping code 50 | * no need for WildFly Swarm or Spring Boot 51 | * Configuration 52 | * use the embedded Infinispan schema 53 | * configure endpoints and other components via the global modules API 54 | * persist configuration changes using the existing configuration serializer 55 | * add runtime clustered configuration capabilities (store configs in an internal cache) 56 | * Security 57 | * Depend on Elytron for LDAP, Properties, Certificates, Kerberos, KeyCloak, etc integration 58 | * Custom user "deployments" 59 | * Load custom code placed in an extension directory (use JBoss Modules?) 60 | * Scan only at startup 61 | * use META-INF/services/ discovery 62 | * Management 63 | * Runtime management performed via JMX 64 | * Revamp the Embedded CLI 65 | * The web console can be integrated with Jolokia which exposes JMX MBeans as RESTful endpoints 66 | 67 | -------------------------------------------------------------------------------- /Distributed-Stream-Sorting.md: -------------------------------------------------------------------------------- 1 | Infinispan recently added support for streams to be used in a performant way on a distributed cache. Unfortunately some methods such as sort were not implemented in a distributed way. Instead the entire contents of the cache was copied to the originating node and a final sort was done on all of the contents in memory. This while valid can be an extremely inefficient usage of memory and in some cases even run the node out of memory resulting in an OutOfMemoryException. 2 | 3 | This however will change when Distributed Sorting is introduced. The general idea is a [merge sort](https://en.wikipedia.org/wiki/Merge_sort) where each node contains a sub set of data that is independently sorted first. Then the data is finally sorted at the originator. This way we can also utilize the CPUs across the cluster to do initial sorting on each node so the coordinator only has to sort sets of already sorted data which is faster. 
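For illustration only, the final step on the originator boils down to a k-way merge of the already-sorted streams returned by each node, something along these lines. This sketch ignores the batching concern discussed next and simply materializes the result; it is not the actual implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of pre-sorted per-node streams.
final class SortedMerge {

   static <T> List<T> merge(List<Iterator<T>> sortedPerNode, Comparator<? super T> comparator) {
      // The queue holds the current head of each node's sorted stream.
      PriorityQueue<Head<T>> heads = new PriorityQueue<>(
            Comparator.comparing((Head<T> h) -> h.value, comparator));
      for (Iterator<T> it : sortedPerNode) {
         if (it.hasNext()) {
            heads.add(new Head<>(it.next(), it));
         }
      }
      List<T> result = new ArrayList<>();
      while (!heads.isEmpty()) {
         // Take the smallest head, then refill the queue from that node's stream.
         Head<T> smallest = heads.poll();
         result.add(smallest.value);
         if (smallest.rest.hasNext()) {
            heads.add(new Head<>(smallest.rest.next(), smallest.rest));
         }
      }
      return result;
   }

   private static final class Head<T> {
      final T value;
      final Iterator<T> rest;

      Head(T value, Iterator<T> rest) {
         this.value = value;
         this.rest = rest;
      }
   }
}
```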
4 | 5 | Even doing just a merge sort will not help with memory. In actually it can be a bit worse because now each node could possibly run into OutOfMemory errors if they have cache stores that have more data than they could hold in memory. Thus we need some way of using batched sizes to prevent this as well. The distributedBatchSize is used to control how many sorted elements are retained at a time on each node before sending those batches to the originator. This then requires each node to iterated upon a number of times equal to N / b times where N is the number of primary owned elements on that node and b is the batch size. 6 | 7 | ## Non Rehash 8 | 9 | Below is the pseudo code for the remote and coordinator side sorting for use when doing a sort when rehash is not supported. 10 | 11 | ### Remote Node Sort 12 | 13 | array = new Array[batchSize * 2] 14 | offset = 0 15 | foreach element in cache { 16 | if offset < batchSize || array[batchSize -1] < element 17 | array[offset++] = element 18 | if offset == array.length 19 | sort array 20 | offset = batchSize 21 | } 22 | 23 | Then when the entries are used for a stream the array is sorted and only the elements up to batchSize are returned. Subsequent iterations only add an element to the array if the element is larger than the last element from the last batch. 24 | 25 | The local coordinator holds all the responses from the various nodes including itself with a BoundedQueue which blocks that thread until it has sent all of its responses. 26 | 27 | ### Coordinator Sort 28 | 29 | array = new Array[nodeCount] 30 | nodeReturnedLastValue = null 31 | completedNodes = 0 32 | 33 | method getNext 34 | if nodeReturnedLastValue == null 35 | offset = 0 36 | foreach node 37 | array[offset++] = node.queue.next 38 | sort array 39 | else 40 | value = nodeReturnedLastValue.queue.next 41 | if value != null 42 | array[completedNodes] = nodeReturnedLastValue.queue.next 43 | sort array 44 | else 45 | completedNodes++ 46 | 47 | nodeReturnedLastValue = node where value originated 48 | return array[completedNodes] 49 | 50 | ## Rehash sort 51 | 52 | Rehash sort is quite a bit similar to the non rehash. If a node finds that a rehash occurs while iterating upon the cache entries it will send a message to the originator to say what the new view id is. The originator upon receiving said message will wait for all nodes to send the same message or a completion message (in the case the node completed iteration without the rehash). The originator after this state has occurred will only allow elements received until 1 node has run out but wasn't completed. At this point any additional elements retrieved must be ignored. The originator will then send a new batch of requests but will also send along the highest value it has seen to filter out any values that are smaller than it in the iteration process. -------------------------------------------------------------------------------- /Infinispan-CLI.asciidoc: -------------------------------------------------------------------------------- 1 | This document illustrates the requirements that an Infinispan CLI should implement in order to be useful and flexible. The CLI should be able to expose most (all) of the features of the Core API and of any modules which may be present on a running instance. Currently the only two methods which allow remote interaction with an Infinispan instance are JMX and Hot Rod, but both have limitations on the type of operations that can be performed. 
2 | 3 | === Operations 4 | The CLI should implement the following operations: 5 | 6 | * Selecting the default cache on which operations will be applied when an explicit cache is not specified 7 | * CRUD ops such as GET, PUT, REPLACE and REMOVE (with attributes for expiration) 8 | * Bulk ops such as CLEAR, PUTALL, possibly reading input from an external file (CSV, XML, JSON) 9 | * Set specific Flags 10 | * Batching START and END 11 | * Transaction BEGIN, COMMIT and ROLLBACK 12 | * Manually control otherwise automatic processes (eviction, rehashing, etc) 13 | * Query indexed caches 14 | * Start / Stop containers and caches 15 | * Modifying configuration at runtime 16 | * Obtaining statistics 17 | * Integration with some form of scripting (no need to invent something, just reuse JSR-223) to be used with Distributed Executors, the Map Reduce API and possibly Listeners 18 | 19 | === Syntax 20 | * The syntax should be concise but comprehensible, using simple verb commands like GET, PUT, etc. 21 | 22 | put key value 23 | get key 24 | 25 | * Expiration, max idle times and flags: 26 | 27 | put key value expires 1h maxidle 10m 28 | put key value flags (force_sync) 29 | 30 | * Quoting of values should be optional where unambiguous. 31 | 32 | put key 'my value' 33 | 34 | * Type for primitive values should be inferred when possible, using the usual Java notation 35 | 36 | put key 1 # 1 is parsed as an int, and therefore boxed as an Integer when inserted in the cache 37 | put key '1' # 1 is parsed as a string 38 | 39 | * Non-primitive values should be specified using JSON (Jackson) or XML (XStream) notation 40 | 41 | put key { 'a', 'b' 'c' } # the value is an ArrayList 42 | put key 'Widget' 43 | 44 | * Multiple values (e.g. for putall) could be as follows: 45 | 46 | put { key1: value1, key2: value2, key3: value3 } 47 | 48 | * Data should be displayed using different formatters 49 | 50 | get key # display data using the value’s toString() method 51 | get key as xml # display data in XML format (possibly using xstream) 52 | get key as json # display data in JSON format (possibly using jackson) 53 | 54 | * The target cache for operations can either be a default cache or defined explicitly 55 | 56 | cache myCache # selects the default cache 57 | put key value # puts the entry in myCache 58 | put otherCache.key value # puts the entry in otherCache 59 | 60 | * Scripting functions are declared inline as follows: 61 | 62 | exec myCache callable:function(cache, keys) {} 63 | mapreduce myCache mapper:function() {} reducer:function() {} collator:function() {} 64 | 65 | === Processing 66 | The command interpreter would reside entirely within the server: the client is merely responsible for performing input and output and possibly "finger-candy" such as tab completion of commands. The interpreter also needs to handle statefulness of the CLI session (and therefore session management). 67 | 68 | === Connectivity 69 | Most databases (SQL and noSQL alike) use their remote client port also for admin ops. We have the following possibilities: 70 | 71 | * Use Hot Rod (pro: our remote protocol, cons: requires the server to be running, we need to implement security) 72 | * Use JMX (pro: ubiquitous, security as part of the protocol, nothing to develop, cons: ?) 
73 | * Use a custom port (possibly using the simple com.sun.net.httpserver as transport) (pro: dedicated for admin ops, cons: more implementation overhead) 74 | 75 | === Security 76 | In case the transport we select does not provide its own security already, we need to consider at least authentication and possibly encryption. -------------------------------------------------------------------------------- /Near-Caching.md: -------------------------------------------------------------------------------- 1 | The scope of Infinispan Near Caches is to add optional L1 caching to Hot Rod client implementations in order to keep recently accessed data close to the user. This L1 cache is essentially a local Hot Rod client cache that gets updated whenever an a remote entry is retrieved via `get` or `getVersioned` operations. By default, Hot Rod clients have near caching disabled. 2 | 3 | Near Cache consistency is achieved with the help of remote events, which send notifications to clients when entries are modified or removed. At the client level, near caching can be configured to either be Lazy or Eager. When enabling near caching, the user must decide between these two modes. 4 | 5 | # Lazy Near Caches 6 | 7 | Entries are only added to the lazy near caches when they are retrieved remotely via `get` or `getVersioned`. If the cache entry is modified or removed server side, the Hot Rod client receive events which in turn invalidate the near cache entries by removing them from the near cache. This way of keeping near cache consistent is very efficient because the events sent back to the client only contain key information. The downside is that if a cache entry is retrieved after it's been modified, the Hot Rod client will have to fetch it from the remote server. 8 | 9 | # Eager Near Caches 10 | 11 | Eager near caches are eagerly populated as entries are created in the server, and when entries are modified, the latest value is sent along with the notification to the client and it stores it in the near cache. Eager caches, same as lazy near caches, are also populated when an entry is retrieved remotely (if not already present). The advantage of eager near caches is that it can reduce the number of trips to the server by having newly created entries already present in the near cache before any requests to retrieve it have been received. Also, if modified entries are re-queried by the client, these can be fetched directly from the near cache. The disadvantage of eager near caching is that events received from the server bigger in size due to the need to ship back value information, and entries could be sent to client that will never be queried. 12 | 13 | To limit the cost the eager near caching, created events are only received for those entries that are created after the client has connected to the Hot Rod servers. Existing cached entries will only be added to the near cache if the user queries them 14 | 15 | ## Filtering 16 | 17 | Some of the downsides of eager near caches could be mitigated by enabling users to add filtering of near cache entries. In other words, the users could potentially in the future provide a filter implementation that defines which keys to fetch eagerly. This filter could be used to enable existing cached entries to be sent back to the client when the client connects to the Hot Rod servers, and could also be used to filter which created events to send back once it's already connected. 
**This feature is not currently planned for inclusion** 18 | 19 | # Eviction 20 | 21 | Being able to keep control the size of near caches is important. Even though not enabled by default, eviction of near caches can be configured by defining the maximum number of elements to keep in the near cache. When eviction is enabled, an LRU LinkedHashMap is used (protected by a `ReentrantReadWrite` lock to deal with concurrent updates). A better solution based around a BoundedCHMv8 is planned. 22 | 23 | # Cluster 24 | 25 | Near caches are implemented using Hot Rod remote events, which underneath use cluster listeners as technology for receiving events across the cluster. A key characteristic of cluster listeners is that within a cluster they are installed in a single node, with the rest of nodes sending events to the this node. So, it's possible for node that runs the near cache backing cluster listener to go down. In such situation, another node takes over running the cluster listener. When such thing happens, a client failover event callback can be defined to be invoked. For near caches, such callback will be implemented and its implementation will consist of clearing the near cache. Clearing is the simplest and most efficient thing to do, since during the failover events might have been missed. -------------------------------------------------------------------------------- /Off-Heap-Data-Container.md: -------------------------------------------------------------------------------- 1 | Currently the data container that holds all of the entries in memory stores them in the Java heap. This area is subject to garbage collection and others. Unfortunately the default garbage collector can cause stop the world pauses where it can halt the entire JVM from processing. 2 | 3 | The easiest way to avoid these performance issues is to not store these entries in the Java Heap. Easier said than done! Below you will find a few ways that you could do this 4 | 5 | 1. Use a third party tool to implement DataContainer (ie. MapDB, ~~ChronicleMap~~ etc.) - http://stackoverflow.com/questions/7705895/is-there-a-open-source-off-heap-cache-solution-for-java 6 | 2. Allow for data container to store nothing and only ever go to Store. Then we could utilize something like SIFS using a filesystem pointing to ramcache 7 | 3. Implement our own using netty as a an example of how to do the memory allocations. 8 | 9 | # Things that are incompatible 10 | 11 | ### Write skew check 12 | Currently this relies on object instance equality 13 | ### Atomic Map 14 | This updates instance directly, thus ICE factory update has to make a new instance. Also proxy requires some additional changes. This doesn't seem to work properly also because of Read Committed below... 15 | ### Class Resolver 16 | Can't figure out quite yet how to get the class loader to work properly with new DBContainer 17 | ### Read Committed 18 | Read committed cannot be used since everything is serialized when stored. Only repeatable read may be used. This is besides the fact that read committed has issues in a DIST cache anyways. 19 | 20 | 21 | ## Map DB array perf 22 | Result: 251528.669 ±(99.9%) 3849.842 ops/s [Average] 23 | Statistics: (min, avg, max) = (204159.349, 251528.669, 271807.636), stdev = 16300.471 24 | Confidence interval (99.9%): [247678.827, 255378.510] 25 | 26 | 27 | Run complete. 
Total time: 00:13:36 28 | 29 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 30 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 31 | IspnBenchmark.testGet| nonTxCache| thrpt| 200| 1657296.817| ± 26651.726| ops/s| 32 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 251528.669| ± 3849.842| ops/s| 33 | 34 | ## MapDB Direct perf 35 | 36 | Result: 106203.466 ±(99.9%) 1703.474 ops/s [Average] 37 | Statistics: (min, avg, max) = (86628.725, 106203.466, 117797.812), stdev = 7212.614 38 | Confidence interval (99.9%): [104499.992, 107906.939] 39 | 40 | 41 | Run complete. Total time: 00:13:37 42 | 43 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 44 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 45 | IspnBenchmark.testGet| nonTxCache| thrpt| 200| 1608557.747| ± 21212.189| ops/s| 46 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 106203.466| ± 1703.474| ops/s| 47 | 48 | 49 | ## Master perf 50 | Result: 1318669.580 ±(99.9%) 17467.592 ops/s [Average] 51 | Statistics: (min, avg, max) = (1043144.741, 1318669.580, 1453850.008), stdev = 73958.877 52 | Confidence interval (99.9%): [1301201.989, 1336137.172] 53 | 54 | 55 | Run complete. Total time: 00:13:35 56 | 57 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 58 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 59 | |IspnBenchmark.testGet| nonTxCache| thrpt| 200| 20555714.561| ± 274213.492| ops/s| 60 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 1318669.580| ± 17467.592| ops/s| 61 | 62 | ## SFS perf 63 | Result: 74038.493 ±(99.9%) 1512.844 ops/s [Average] 64 | Statistics: (min, avg, max) = (42949.523, 74038.493, 81948.143), stdev = 6405.476 65 | Confidence interval (99.9%): [72525.649, 75551.337] 66 | 67 | 68 | Run complete. Total time: 00:13:36 69 | 70 | Benchmark| (cacheName)| Mode| Cnt| Score| Error| Units| 71 | ---------|---------------------------------|--------|----|--------------|-----------|--------|-- 72 | IspnBenchmark.testGet| nonTxCache| thrpt| 200| 774529.207| ± 12164.148| ops/s| 73 | IspnBenchmark.testPutImplicit| nonTxCache| thrpt| 200| 74038.493| ± 1512.844| ops/s| 74 | 75 | -------------------------------------------------------------------------------- /Total-Order-non-Transactional-Cache.md: -------------------------------------------------------------------------------- 1 | #Summary 2 | 3 | Use the total order protocol to handle commands in non transactional cache. 4 | 5 | #Changes in configuration: 6 | 7 | * `transaction-protocol` attribute will be removed from `` tag. 8 | * `TOTAL_ORDER` option will be added `locking-mode`. 9 | 10 | #Description (for a normal scenario): 11 | 12 | A client invoked a write operation in node `N`: 13 | * `N` sends a total order message to all the owners with the write operation; 14 | * All owners deliver the write operation and perform it; 15 | * `N` waits for all the replies and returns the result back to the client; 16 | 17 | **Why it works?** 18 | 19 | Since the write operations are deliver in total order by all the owners of the key(s), then when the write operation is deliver, it sees the same state and performs the same steps, generation the same new state. 20 | 21 | **Note:** 22 | In the initial version, the commands are processed directly in the JGroups thread. 
If it justifies, a new non-transactional total order manager may be created and the operations start to be processed in the remote executor service. 23 | 24 | **Optimizations:** 25 | The originator does not need to wait for all the replies. The reply can be marked as `STABLE` or `UNSTABLE` (or other name; the goal is to mark as unstable the replies from commands processed during the state transfer). Then, the originator waits: 26 | 27 | 1. for the first reply is the reply is marked as `STABLE`, or it is an exception, or `IGNORE_RETURN_VALUE` is set and the operation is non-conditional; 28 | 2. for the first non-null reply when the reply is marked as `UNSTABLE`. 29 | 30 | #State Transfer 31 | 32 | As in total order transactional caches, the cache topology commands which modifies the topology ID are sent in total order. Having this in mind, we have: 33 | 34 | * **when the command topology ID is different from the current topology ID** 35 | 36 | Exception is throw. Originator re-sends the operation. 37 | 38 | **Why?** 39 | 40 | If the operation is delivered in a different topology ID, then the new owners will not see it, breaking the data consistency. 41 | 42 | **Can't it forward the operation to the new owners?** 43 | 44 | No. No order is guarantee when forward operations. Imagine the following scenario: 45 | 46 | 1. Node `N` sends two operations `C1` and `C2` for the same key. `C1` is sent in topology `i` and `C2` in topology `i+1`. 47 | 2. `C1` and `C2` are delivered in topology `i+1` and `C1` is delivered before `C2`. 48 | 3. If `C1` is forward, the new owner could delivered the operation `C1` following by `C2` or `C2` following `C1`. If the later happens, the data consistency is broken and the algorithm too. 49 | 50 | **optimization:** 51 | In replicated caches, if only nodes are leaving, the operation can be processed normally. 52 | 53 | * **when a conditional operation is delivered and state transfer is in progress** 54 | 55 | Exception is throw. Originator should block the operation until the state transfer is finished. 56 | 57 | **Why?** 58 | 59 | When the operation is delivered in the new owners, they don't have the current value to check the condition. 60 | 61 | **Can't it fetch the current value from old owners?** 62 | 63 | No. The remote get is not ordered with the operation, so it can return a value after or before the operation. If a value before the operation is returned, the consistency is broken. 64 | 65 | **Can't the command wait in the new owner until the state is received?** 66 | 67 | No. It is going to block the total order deliver thread and it will never deliver the new topology ID. 68 | 69 | **optimization:** 70 | if the key is not affected by the state transfer (i.e. the ownership didn't change), then the conditional operation can be processed normally. However, it causes the following: 71 | 72 | 1. in replicated caches, a node joining will block all the conditional operations; 73 | 2. in replicated caches, a node leaving does not block any operations. 74 | 75 | * **when a non-conditional operation is delivered and state transfer is in progress** 76 | 77 | The command is processed normally and the originator takes the replies and return the value different from `null` (if any). 78 | 79 | **Why?** 80 | 81 | First, the operation does not dependent on the state of the data. Second, the old owners will return the current value when the operation is deliver and the new owner can return `null` or the current value (if it was already received from state transfer). 
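For reference, the configuration change described at the top of this document might look roughly as follows. Everything except the `TOTAL_ORDER` value for `locking-mode` is an assumption about the final schema; the element the attribute ends up on has not been decided here:

```xml
<!-- Hypothetical sketch only: element and attribute placement may differ in the final schema. -->
<distributed-cache name="totalOrderCache" mode="SYNC">
   <!-- This proposal adds TOTAL_ORDER as a new locking-mode option for non-transactional caches. -->
   <transaction locking-mode="TOTAL_ORDER"/>
</distributed-cache>
```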
-------------------------------------------------------------------------------- /Multimap-As-A-First-Class-Data-Structure.asciidoc: -------------------------------------------------------------------------------- 1 | = Dynamic RBAC 2 | 3 | == Tracking Issue 4 | 5 | https://issues.redhat.com/browse/ISPN-13853 6 | 7 | == Description 8 | 9 | Multimap is a data structure that has gotten the attention from Infinispan users. 10 | The structure today lacks evolutions and support of features and new API of Infinispan: 11 | 12 | * REST API support 13 | * Supporting duplicates 14 | * Reactive API integration and development 15 | * Integration with Quarkus through the extension 16 | * Search and query support 17 | 18 | == Compatibility impact 19 | 20 | Ensure that the old Remote API and configuration still works. 21 | 22 | Under the hood, we will still keep using distributed caches, but we will know with the configuration that those 23 | caches are meant to behave as multimaps. This way we will be able to change the implementation details in 24 | the future if the structure becomes even more popular. 25 | 26 | Indexing and Query should evolve to detect that the content to be indexed is inside the list/set of the key/value. 27 | Query will be supported *only* if indexing is enabled. 28 | 29 | == Security impact 30 | 31 | N/A 32 | 33 | == Configuration Schema 34 | 35 | [source,xml] 36 | ---- 37 | 38 | 39 | 40 | 41 | 42 | Determines if the multimap supports duplicates in the values. True or false (default). 43 | 44 | 45 | 46 | 47 | 48 | ---- 49 | 50 | 51 | == Public API 52 | 53 | The work will be done for the new API only. 54 | 55 | == Deprecations 56 | 57 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCache` 58 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCacheManager` 59 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCacheManagerFactory` 60 | * `org.infinispan.client.hotrod.multimap.MultimapCacheManager` 61 | * `org.infinispan.client.hotrod.multimap.RemoteMultimapCache` 62 | * `org.infinispan.client.hotrod.multimap.MetadataCollection` 63 | 64 | == Hot Rod API 65 | 66 | The new API will be developed. Provide implementations for the available APIs: 67 | 68 | * org.infinispan.api.async.SyncMultiMap 69 | * org.infinispan.api.async.SyncMultiMaps 70 | * org.infinispan.api.async.MutinyMultiMap 71 | * org.infinispan.api.async.MutinyMultiMaps 72 | * org.infinispan.api.async.AsyncMultiMap 73 | * org.infinispan.api.async.AsyncMultiMaps 74 | 75 | == REST API 76 | 77 | Multimap REST API should be the almost the same as the Cache one. 78 | The difference is in the endpoint. 79 | 80 | Examples: 81 | [source] 82 | ---- 83 | GET /rest/v2/multimaps/ 84 | 85 | GET /rest/v2/multimaps/{multimapName}?action=??? 86 | 87 | DELETE /rest/v2/multimaps/{multimapName}/{multimapKey} 88 | ---- 89 | 90 | * CRUD multimap (providing a configuration to create a multimap, templates not supported) 91 | * Display Multimap content entries 92 | * Get values by key 93 | * In Indexing is enabled, multimaps support query 94 | * CRUD Key/Value. If a key exists we add the value. To remove a key/value we need to send the value as well. 95 | * CRUD Key/AllValues. We remove the entry completely. 
96 | * Clearing Multimap operation 97 | * Enable/Disable rebalancing 98 | * Availability get/set 99 | * Replace not supported 100 | * Listeners support to enable interactions with near-caching 101 | 102 | 103 | == CLI 104 | 105 | * `*create multimap* --file=mymultimap.xml myMultimap` Create a new multimap 106 | * `*ls multimap* --file=mymultimap.xml myMultimap` List all multimaps 107 | * `*describe multimap/name* View the configuration 108 | 109 | etc ... 110 | 111 | Provide the same CLI methods but targeting multimap instead of caches. 112 | 113 | == Console 114 | 115 | The console should use the new REST API and provide a new interface separate from the Caches tab 116 | where we will be able to CRUD operations over multimaps and display correctly the key/values of 117 | the multimap. 118 | The UX team should be in the loop to check out the design. 119 | 120 | == Operator 121 | 122 | Provide a MultimapCR 123 | 124 | == Quarkus Integration 125 | Integrate in Quarkus developing a new annotation `@Multimap` that will allow the use of dependency inyection 126 | to work with Multimaps in Quarkus. -------------------------------------------------------------------------------- /Off-Heap-Implementation.md: -------------------------------------------------------------------------------- 1 | Infinispan currently only stores entries on the Java heap. This can cause issues if the cache is used in such a way that a large amount of stop the world garbage collections are performed. One such fix is to write these objects to a cache store, however this can cause performance issues. Instead Infinispan would like to instead of storing entries on the Java heap but rather to store them in off heap memory managed by Infinispan itself. This would allow for more memory to be used and should reduce significantly the number of garbage collections required. 2 | 3 | 4 | This requires quite of a bit of work to be done in the core module of Infinispan. Before I go into the details of how the overall design for core will go I will list some limitations 5 | 6 | ## Limitations 7 | 8 | 1. Write skew check 9 | Currently this relies on object instance equality, which will not work since objects are required to be deserialized/serialized 10 | 2. Atomic Map 11 | This updates instance directly, thus ICE factory update has to make a new instance. Also proxy requires some additional changes. This doesn't seem to work properly also because of Read Committed below... 12 | 3. Read Committed 13 | Read committed cannot be used since everything is serialized when stored. Only repeatable read may be used. This is besides the fact that read committed has issues in a DIST cache anyways. 14 | 4. DeltaAware 15 | DeltaAware would technically work, but it will take a significant performance hit due to the fact that objects would have to be fully deserialized then apply delta and then serialized again. 16 | 5. Store as Binary 17 | Both this and off heap together do not make sense 18 | 19 | ## Configuration 20 | 21 | Off heap memory would be configured under a new element ("STORAGE"?, "MEMORY"?, under data container?) along with storeAsBinary and wrapByteArray. The wrapByteArray is a new configuration to replace equivalence that allows for byte[] to be used as key or values. This element is mutually exclusive as it wouldn't make sense to use them together. 
Compatibility would fall under this but we are planning on removing Compatibility in Infinispan 9 as well 22 | 23 | ## Put operation 24 | 25 | The general idea for core is as following for a general put: 26 | 27 | 1. Originator immediately deserializes the object as a wrapped byte[] (Do we use hashCode for byte[] or object?) 28 | 2. Originator sends wrapped byte[] to the primary owner and normal forwarding occurs 29 | 3. The primary and backup owners have an interceptor installed that converrts the byte[] to an off heap object before writing this instance to the data container 30 | 4. The previous value (if needed) is sent back as a byte[] and the originator finally deserializes it before returning it to the originator 31 | 32 | ## Get operation 33 | 34 | A get will work in a similar fashion: 35 | 36 | 1. Originator immediately deserializes the object as a wrapped byte[] (Do we use hashCode for byte[] or object?) 37 | 2. Originator sends wrapped byte[] to the owner(s) to retrieve the value 38 | 3. Hacky but the wrapped byte[] can compare equivalence to the off heap object without having to allocate a byte\[\] (indexed gets) - (TODO: check bulk get byte[] - requires read only copy though) 39 | 4. Originator deserializes byte[] to an object before returning to caller 40 | 41 | ## Bulk operations 42 | 43 | Bulk methods such as distributed streams will require deserialization of entries on each node. If we store the hash along with the off heap instance we do not need to deserialize non owned keys as the hash code would determine which segment it is in. 44 | 45 | ## Statistics 46 | 47 | We need to make statistics available that show how much direct memory is in use (blocks? available? used?). Unfortunately this may have an issue with separating information by cache level and may only provide per node usage. 48 | 49 | # Server considerations 50 | 51 | The server doesn't require as many changes as in the core, but there are a few things to implement and investigate 52 | 53 | ## Reuse Netty buffer 54 | 55 | Netty can utilize direct byte buffers for its data read from the socket. This can be useful to reuse these to prevent an additional byte[] allocation. It is unclear if this will work as intended, but if possible this will be added. Even if this works it may not save any performance however for writes as JGroups currently only allows for byte[] to be written and not off heap memory. 56 | 57 | ## Compatibility 58 | 59 | Compatibility cannot be enabled with off heap and will throw an exception preventing the cache from starting. -------------------------------------------------------------------------------- /Deelog:-direct-integration-with-Debezium.asciidoc: -------------------------------------------------------------------------------- 1 | == Contributors 2 | 3 | Emmanuel Bernard, Randall Hauch 4 | 5 | Initial proposal 2016-12-09 6 | 7 | == Goal 8 | 9 | Make Infinispan a Debezium source connector. 10 | Improve as much as possible change event capture minimizing the risk of change loss. 11 | The state in Kafka must be eventually consistent with the Infinispan state. 12 | 13 | This proposal also integrates lightly with Infinispan: it does not require a native event log system in Infinispan. 14 | 15 | == Proposal 16 | 17 | The total order would not be global across the system but per key. 18 | The accepted order will be the one eventually captured in Kafka. 19 | 20 | Integrate Debezium as a library and Infinispan interceptor in each node. 
21 | That Debezium library will collect changes and write them in a Kafka queue. 22 | 23 | Each node has a Debezium connector instance embedded that listens to the 24 | operations happening (primary and replicas alike). 25 | This can be done via an Infinispan interceptor sending the events to a queue. 26 | That queue is listened by a Debezium thread. 27 | All of this process is happening async compared to the operation. 28 | 29 | Per key, a log of operations is kept in memory (it contains the key, the 30 | operation, the operation unique id and a ack status. 31 | 32 | If on the key owner, the operation is written by the Debezium connector 33 | to Kafka when it has been acked (whatever that means is where I'm less 34 | knowledgable and needs to be clarified). 35 | 36 | On a replica, the kafka partition is read regularly to clear the 37 | in-memory log from operations effectively stored in Kafka. 38 | If the replica becomes the owner, it reads the kafka partition to see 39 | what operations are already in and writes the missing ones. 40 | 41 | There are a few cool things: 42 | 43 | * few to no change in what Infinispan does 44 | * no global ordering simplifies things and frankly is fine for most 45 | Debezium cases. In the end a global order could be defined after the 46 | fact (by not partitioning for example). But that's a pure downstream 47 | concern. 48 | * everything is async compared to the Infinispan ops 49 | * the in-memory log can remain in memory as it is protected by replicas 50 | * the in-memory log is self cleaning thanks to the state in Kafka 51 | 52 | This model works as long as we don't lose two owners consecutively before enough replicas have caught up (queue change wise). 53 | 54 | == When two owners die too fast 55 | 56 | We're in trouble. 57 | 58 | When two owners die too fast consecutively, we lose the event queue. 59 | We might also lose the state if the state transfer has not finished. 60 | 61 | In both situations, we are left with a state in Kafka which is different than the state in Infinispan and for which we cannot guarantee the eventual consistency. 62 | So we need to send the full state of the lost segments into Kafka as a state _reset_ and make sure that any entry present in Kafka but not in Infinispan are tombstoned. 63 | 64 | There are a few possible solutions: 65 | 66 | 1. from the embedded Debezium, read the Kafka history (could be very long) and compare it from the state transfered state (either clean if we lost state or read back from the cache store); based on the comparison, send the appropriate key create/update and delete. This can be a slow and memory intensive process 67 | 2. from the embedded Debezium, send a global tombstone for the whole segment or even the whole state and send the state afresh to Kafka. This avoids reading Kafka's history (slowness and memory consumption) but the full retransfer of state can be quite long 68 | 3. use a two staged process: send the fact that we might have lost changes as a tombstone to Kafka plus the full segment state. A Debezium consumer would read that information and based on a kept local state build the diff and send less drastic stream of change to actual Debezium end consumers. 69 | 70 | ATM option 2 or 3 feel the most appropriate as building the diff on the Infinispan side would be costly and memory consuming. 71 | 72 | Would it help to do a many to one between segments and kafka partitions? 73 | 74 | == On initial cluster start 75 | 76 | We are also somewhat in trouble. 
77 | How do we know that we had a clean stop with full queue flushed? 78 | 79 | If the queues have not been flushed, then we are back to the problem of the two owners dying too fast (see above). 80 | If the queues have been flushed properly, Kafka is in a correct state and we can carry on. 81 | 82 | == Additional cleaning 83 | 84 | When a replica is elected as new owner, we need to differentiate two statuses: 85 | 86 | * the new owner has no queue and thus we have lost change events and eventual consistency 87 | * the new owner has a queue which is either non empty (catch up to do) or empty because it was already synced. In this situation we are good on our eventual consistency promise. 88 | 89 | == Opened questions 90 | 91 | How to capture transaction 92 | 93 | == Alternative 94 | 95 | Rely on a full fledge log handled by Infinispan itself: see [Gustavo's proposal](https://github.com/infinispan/infinispan/wiki/Remote-Listeners-improvement-proposal). 96 | On that front, it looks like a full fledge log is more complex to get right and we can start with proposal Deelog before exploring the full-fledge log. -------------------------------------------------------------------------------- /Incremental-Optimistic-Locking.asciidoc: -------------------------------------------------------------------------------- 1 | === Context 2 | When using link:Optimistic-Locking-In-Infinispan[optimistic locking] we still have a deadlock situation which is not covered by the link:Lock-Reordering-For-Avoiding-Deadlocks[lock reordering]: 3 | 4 | * Tx1 and Tx2 transactions executing in parallel on two nodes `N1` and `N2` and writing the keys `{a,b}` 5 | * `consistentHash(a) = {N3}` and `consistentHash(b) = {N4}` 6 | * with some right timing, during prepare time it is possible for these two transactions to deadlock: 7 | ** Tx1 lock acquired on "a" @ N3 8 | ** Tx2 lock acquired on "b" @ N4 9 | ** Tx1 cannot progress acquiring lock on "b" @ N4, that lock is acquired by Tx2 10 | ** Tx2 cannot acquire lock on "a" @ N3 as that lock is held by Tx1 11 | ** Tx1 and Tx2 are waiting for each other => deadlock 12 | 13 | NOTE: this problem stands disregarding the number of owners for a key. 14 | 15 | === Solution 16 | The suggested solution for solving the above described deadlock is by enforcing transactions to acquire locks on remote nodes in the same order. Transactions, at prepare time, order the nodes it "touches" based on some criteria, and issues lock acquisition incrementally: it doesn't acquire lock on a node the the order sequence until it has a lock on the previous node in the sequence (and consequently on all the nodes before it). 17 | 18 | Node ordering can be defined based on cluster's view. An alternative ordering approach is `consistentHash(nodeAddress)`, but that might lead to conflicts when node that have addresses mapping to the same value. 19 | 20 | In the example above, considering the view = `{N1, N2, N3, N4}` and incremental lock acquisition: 21 | 22 | * Tx1 acquires lock on a@ N3 23 | * Tx2 tries to acquire lock on a@ N3. It cannot acquire and waits for Tx1 to release it 24 | * Tx1 acquires locks on b@ N4, completes and releases locks 25 | * Tx2 acquires lock on a@ N3 26 | * Tx2 acquires lock on b@ N4 and completes as well 27 | 28 | Following sections discusses two approaches for acquiring locks incrementally. 29 | 30 | ==== Direct incremental locking 31 | In this approach, N1 does all the RPCs needed for acquiring locks in sequence. 
The diagram below illustrates how this happens, measuring performance based on link:http://en.wikipedia.org/wiki/Latency_(engineering)[one-way network latency]. 32 | 33 | image::https://community.jboss.org/servlet/JiveServlet/downloadImage/16657/direct.png[] 34 | 35 | The numbers on the arrows identify one-way network calls: N1 first locks a@N2 (1) and then waits for acknowledgement (2), then b@N3 (RPC 2) etc. With this approach, the cost of an transactions, estimated as one-way network calls is: 2 x number of nodes touched by a transaction. The next section describes a less costly approach. 36 | 37 | ==== Async incremental locking 38 | In this approach, the transaction coordinator multicasts the lock acquisition request to all the transaction participants. At the next step, the node having the lowest index in the sequence of nodes touched by the transactions (ordering given by cluster view), acquires lock and then sends an async lock acquisition request to the next node in the sequence. This is depicted in the following diagram: 39 | 40 | image::https://community.jboss.org/servlet/JiveServlet/downloadImage/16657/async.png[] 41 | 42 | 1a, 1b and 1c happen in parallel (multicast). When the 1a RPC thread arrives at N2 it also tries to acquire lock on "a". On the other hand 1b and 1c don't try lock acquisition, but wait a confirmation from N2 and N3 respectively. After 1a acquired lock it sends an async message to N3, informing it that it can move on and acquire lock on "b". Then lock is acquired on b@N3, then N3 issues an async call to N4 etc. 43 | The cost of this approach is equal to the number of nodes touched by a transaction, which is, in theory, twice as good as the direct incremental locking approach. 44 | 45 | ===== What if the async ack call fails? 46 | The async call needs to be send in a pseudo-reliable way i.e. guaranteed to happen if the node is still alive. If the node crashes the transaction can retry the entire lock acquisition process based on the new owner of the data. 47 | 48 | === Hybrid incremental approach 49 | This is a variation of the async incremental approach in which, during the initial multicast, all nodes try to acquire locks with a very small timeout (potentially 0). If all succeed then transaction proceeds to the 2nd phase of 2PC. If at least one node fails, the locks are being released on all the nodes touched by the transaction which follow its sequence in the view. This is depicted in the following diagram: 50 | 51 | image::https://community.jboss.org/servlet/JiveServlet/downloadImage/16657/hybrid.png[] 52 | 53 | ==== Notes on the diagram 54 | * 1a, 1b and 1c happen in parallel as 2-way rpcs 55 | * 1b fails to acquire lock on b 56 | * when multicast 1 finishes, N1 is aware that lock acquistion failed on N3 but succeeded on N2 and N4 57 | * N1 tells N2 to start the incremental lock acquisition (2a). At the same time tells N4 to release it's lock on "c" (2c) 58 | * 2c is needed in order to avoid potential deadlocks 59 | 60 | This hybrid approach is a good fit for scenarios where there is some contention and deadlocks, but not a lot of it: the initial multicast is optimistic in the sense that it doesn't expect deadlocks to happen - but if there are any then it reacts quickly. 
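To make the ordering rule shared by all three variants concrete, a minimal sketch of view-ordered lock acquisition is shown below. `Address` and `LockRpc` are stand-ins for the real transport types; only the strict, view-based ordering of lock acquisitions is the point here:

[source,java]
----
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: Address and LockRpc stand in for the real transport types.
final class IncrementalLockAcquirer {

   interface Address {
   }

   interface LockRpc {
      // Blocks until the locks for this transaction are held on the given node.
      void acquireLocks(Address node, Object txId) throws InterruptedException;
   }

   private final List<Address> clusterView; // the current view defines the global order
   private final LockRpc rpc;

   IncrementalLockAcquirer(List<Address> clusterView, LockRpc rpc) {
      this.clusterView = clusterView;
      this.rpc = rpc;
   }

   // Direct incremental locking: lock node i+1 only after node i is locked,
   // so two transactions touching the same nodes cannot deadlock each other.
   void acquireInViewOrder(Object txId, List<Address> touchedNodes) throws InterruptedException {
      List<Address> ordered = new ArrayList<>(touchedNodes);
      ordered.sort(Comparator.comparingInt(clusterView::indexOf));
      for (Address node : ordered) {
         rpc.acquireLocks(node, txId);
      }
   }
}
----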
61 | 62 | === Related 63 | * This optimization is tracked by link:https://issues.jboss.org/browse/ISPN-1219[ISPN-1132] 64 | * link:Optimistic-Locking-In-Infinispan[Infinispan's optimistic locking] 65 | * link:Lock-Reordering-For-Avoiding-Deadlocks[Lock reordering] -------------------------------------------------------------------------------- /Remote-Iterator.md: -------------------------------------------------------------------------------- 1 | ## Intro 2 | (by Tristan) 3 | 4 | Currently the HotRod protocol implements a few bulk operations (BulkGetRequest, BulkGetKeysRequest, QueryRequest) which return arrays of data. The problem with these operations is that they are very inefficient since they build the entire payload on the server, thus potentially requiring a lot of transient memory, and then send it to the client in a single response, which again might be to large to handle. It is also not possible to pre-filter the data on the server according to some criteria to avoid sending to the client unneeded entries. 5 | 6 | For symmetry with the embedded distributed iterator, HotRod should implement an equivalent remote distributed iterator which would: 7 | 8 | - allow a client to specify a filter/scope (local, global, segment, custom, etc) 9 | - allow the client and the server to iterate through the retrieved entries using an iterator paradigm appropriate to the environment (java.lang.Iterator in Java, in C++, IEnumerator in C#) in a memory efficient fashion 10 | - allow a server to efficiently batch entry in multiple responses 11 | - leverage the already existing KeyValueFilter and Converters, including deployment of custom ones into the server 12 | 13 | ## Proposal 14 | 15 | ### API changes 16 | 17 | A new method on `org.infinispan.client.hotrod.RemoteCache`: 18 | 19 | ```java 20 | /** 21 | * Retrieve entries from the server 22 | * 23 | * @param filterConverterFactory Factory name for {@link KeyValueFilterConverter} or null 24 | * @param segments The segments to iterate on or null if all segments should be iterated 25 | * @param batchSize The number of entries transferred from the server at a time 26 | */ 27 | CloseableIterator retrieveEntries(String filterConverterFactory, Set segments, int batchSize); 28 | ``` 29 |
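For example, a caller could iterate a remote cache in batches of 100 entries roughly as follows. This is a hedged usage sketch of the proposed method: the `Map.Entry` element type of the iterator and the exact `CloseableIterator` import are assumptions of this proposal, not a final signature.

```java
import java.util.Map;
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.commons.util.CloseableIterator;

public class RetrieveEntriesExample {
   public static void main(String[] args) {
      RemoteCacheManager cacheManager = new RemoteCacheManager();
      RemoteCache<String, String> cache = cacheManager.getCache("myCache");

      // null filter factory and null segments: iterate everything, 100 entries per batch.
      try (CloseableIterator<Map.Entry<Object, Object>> it = cache.retrieveEntries(null, null, 100)) {
         while (it.hasNext()) {
            Map.Entry<Object, Object> entry = it.next();
            System.out.println(entry.getKey() + " -> " + entry.getValue());
         }
      }
      cacheManager.stop();
   }
}
```

The iterator hides the ITERATION_START / ITERATION_NEXT / ITERATION_END exchange described below.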
30 | ### HotRod Protocol 31 | 32 | ##### Operation: ITERATION_START 33 | 34 | Request: 35 | 36 | | Field name | Type | Value | 37 | | ----------------| --------| ------ 38 | | SEGMENTS | byte[ ] | segments requested 39 | | FILTER_CONVERT | String | Factory name for FilterConverter 40 | | BATCH_SIZE | vInt | number of entries to transfer from the server at a time 41 | 42 | >About segments 43 | 44 | > One way to encode the set of requested segment ids is to use 45 | > N bits, one per segment, where a set bit marks a requested segment id. With the default of 60 segments this takes only 60 bits, but a very large number of segments could make the bitmap unwieldy. 46 | 47 | Response: 48 | 49 | | Field name | Type | Value | 50 | | ----------------| --------| ------ 51 | | ITERATION_ID | String | UUID of the iteration 52 | 53 |
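The bit-per-segment encoding suggested in the note above could be produced with a plain `java.util.BitSet`. This is an illustrative sketch, not part of the protocol definition:

```java
import java.util.BitSet;
import java.util.Set;

public final class SegmentEncoding {

   // One bit per segment: bit i is set when segment i is requested.
   public static byte[] encode(Set<Integer> segments, int numSegments) {
      BitSet bits = new BitSet(numSegments);
      for (int segment : segments) {
         bits.set(segment);
      }
      return bits.toByteArray();
   }

   public static void main(String[] args) {
      // With the default 60 segments this needs at most 8 bytes on the wire.
      byte[] encoded = encode(Set.of(0, 3, 59), 60);
      System.out.println(encoded.length + " bytes"); // prints "8 bytes"
   }
}
```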
54 | 55 | ##### Operation: ITERATION_NEXT 56 | 57 | Request: 58 | 59 | | Field name | Type | Value | 60 | | ----------------| --------| ------ 61 | | ITERATION_ID | String | UUID of the iteration 62 | 63 | Response: 64 | 65 | | Field name | Type | Value | 66 | | ----------------| --------| ------ 67 | | ITERATION_ID | String | UUID of the iteration 68 | | FINISHED_SEGMENTS | byte[ ] | segments that finished iteration 69 | | ENTRIES_SIZE | vInt | number of entries transferred 70 | | KEY 1 | byte | entry 1 key 71 | | VALUE 1 | byte | entry 1 value 72 | | ... | ... | ... 73 | | KEY N | byte | entry N key 74 | | VALUE N | byte | entry N value 75 | 76 |
77 | ##### Operation: ITERATION_END 78 | 79 | Request: 80 | 81 | | Field name | Type | Value | 82 | | ----------------| --------| ------ 83 | | ITERATION_ID | String | UUID of the iteration 84 | 85 | Response: 86 | 87 | | Field name | Type | Value | 88 | | ----------------| --------| ------ 89 | | STATUS | byte | ACK of the operation 90 | 91 |
92 | ### Client side
93 | 
94 | Upon calling ```retrieveEntries``` an ITERATION_START op is issued and the client will store the tuple ```[Address, IterationId]``` so that all subsequent operations will go to the same server.
95 | 
96 | ```CloseableIterator.next()``` will receive a batch of entries and will keep the keys internally on a per-segment basis.
97 | Whenever a segment has finished iterating (as reported in the FINISHED_SEGMENTS field of the ITERATION_NEXT response), those keys are discarded.
98 | 
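A rough sketch of this bookkeeping, including the duplicate filtering used after a failover (see the failover section below); the class and method names are hypothetical, and in practice this state would live inside the client's iterator implementation:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IterationState<K> {
    private final Map<Integer, Set<K>> keysPerSegment = new HashMap<>();
    private final Set<Integer> finishedSegments = new HashSet<>();

    // Called for every ITERATION_NEXT response.
    void onBatch(Map<Integer, List<K>> keysBySegment, Set<Integer> newlyFinished) {
        keysBySegment.forEach((segment, keys) ->
              keysPerSegment.computeIfAbsent(segment, s -> new HashSet<>()).addAll(keys));
        for (Integer segment : newlyFinished) {
            finishedSegments.add(segment);
            keysPerSegment.remove(segment); // keys of finished segments are no longer needed
        }
    }

    // After a failover, skip segments that already completed and keys already seen.
    boolean shouldSkip(int segment, K key) {
        return finishedSegments.contains(segment)
              || keysPerSegment.getOrDefault(segment, Collections.emptySet()).contains(key);
    }
}
```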
99 | ### Server Side 100 | A ITERATION_START request will create and obtain a ```CloseableIterator``` from a ```EntryRetriever``` with the optional ```KeyValueFilterConverter```, batch size, and set of segments required. The ```CloseableIterator``` will be associated with a UUID and will be disposed upon receiving a ITERATION_CLOSE request. The ```CloseableIterator``` will also have a ```SegmentListener``` associated so that it is notified of finished segments and send them back the client on the ITERATION_NEXT responses. 101 | 102 | An ITERATION_CLOSE will dispose the ```CloseableIterator``` 103 | 104 | ### Failover 105 | If the server backing a ```CloseableIterator``` dies, the client will restart the iteration in other node, filtering out already finished segments. 106 | 107 | One drawback of the per-segment iteration control is that if a particular segment was not entirely iterated before the server goes down, it will have to be iterated again, leading to duplicate values. 108 | 109 | One way to avoid duplicates values to the caller of ```CloseableIterator.next()``` is to have the client to store the keys of the unfinished segments and use them to filter out duplicates after the failover happens. 110 | 111 | ### References 112 | 113 | JIRA: [ISPN-5219] (https://issues.jboss.org/browse/ISPN-5219) 114 | 115 | -------------------------------------------------------------------------------- /scaling-without-state-transfer.asciidoc: -------------------------------------------------------------------------------- 1 | = Scaling up without state transfer 2 | 3 | The goal is to be able to add nodes to the cluster and make them own new entries, 4 | without also owning any of the old entries. 5 | 6 | Simplifying assumptions: 7 | 8 | * Only one node is being added at a time. 9 | * Single owner 10 | * No transactions 11 | * When scaling down data is just lost 12 | 13 | The basic idea is that all keys are written on the newest member. 14 | The location of all inserted keys is kept in a replicated cache 15 | (the **anchor cache**) and used for further reads/updates/removals. 16 | 17 | It's important to note that clients don't have access to the anchor cache, 18 | so clients access servers in a round-robin fashion. 19 | 20 | 21 | == Implementation 22 | 23 | The implementation will live in a module named `anchored-keys`. 24 | 25 | Since we can't deterministically map keys to segments, we don't need a distributed cache. 26 | Instead we can use an invalidation cache and the `ModuleLifecycle` implementation 27 | can replace the invalidation interceptor with a custom interceptor. 28 | 29 | The interceptor will first look up affected keys in the anchor cache. 30 | If the key doesn't exist in the anchor cache, reads can assume the value is missing. 31 | Writes must insert the current writer's location in the anchor cache, 32 | then forward to the current writer for the actual write. 33 | 34 | When a key is removed, the interceptor must also remove the key from the anchor cache. 35 | When a node leaves, the interceptor must remove all the keys mapped to that node 36 | from the anchor cache. 37 | 38 | 39 | == Configuration 40 | 41 | Configuration means a custom element `anchored-keys` 42 | with a single attribute `enabled`. 43 | 44 | ``` 45 | 46 | 47 | 48 | ``` 49 | 50 | 51 | == Performance considerations 52 | 53 | === Latency / RPCs 54 | 55 | The client doesn't know the owner, so any read or write has a 56 | `(N-1)/N` probability of requiring a unicast RPC from the processing server to the owner. 
57 | 58 | In addition to this, the first write to a key also requires a replicated cache write, 59 | which means a multicast RPC plus a `(N-1)/N` probability of another unicast RPC. 60 | 61 | If `FailoverRequestBalancingStrategy` knew whether the next request 62 | was a read or a write, we could make it always send write requests 63 | to the last server. 64 | However, that would only work if updates and removals are minimal, 65 | otherwise the last server could be overloaded. 66 | 67 | === Memory overhead 68 | 69 | The anchor cache contains copies of all the keys and their locations, 70 | plus the overhead of the cache itself. 71 | 72 | The overhead is lowest for off-heap storage: 73 | 21 bytes in the entry itself plus 8 bytes in the table, 74 | assuming no eviction or expiration. 75 | The location is another 20 bytes, assuming we keep the serialized owner's address. 76 | 77 | Note: We could reduce the location size to <= 8 bytes 78 | by encoding the location as an integer. 79 | 80 | Addresses are interned, so an address already uses only 4 bytes 81 | with OBJECT storage and `-XX:+UseCompressedOops`. 82 | But the overhead of the ConcurrentHashMap-based on-heap cache is much bigger, 83 | at least 32 bytes from CHM.Node, 24 bytes from ImmortalCacheEntry, 84 | and 4 bytes in the table. 85 | 86 | Note: because replicated writes are sent to the primary owner, 87 | which forwards them to all the other members, the keys cannot be de-duplicated. 88 | 89 | === State transfer 90 | 91 | There will be no state transfer for the main cache, but the anchor cache still needs 92 | to transfer all the keys and the location information. 93 | 94 | Assuming that the values are much bigger compared to the keys and to the cache overhead per entry, 95 | the anchor cache's state transfer should also be much faster compared to the state transfer for the main cache. 96 | 97 | The initial state transfer should not block a joiner from starting, 98 | because the joiner can ask an existing node for the location. 99 | 100 | == Alternative implementations 101 | 102 | === Key generator 103 | If the keys could be generated by Infinispan, e.g. in a server task, 104 | we could generate the keys to map to a segment owned by the current writer 105 | and we wouldn't need to keep track of key locations. 106 | 107 | 108 | === Cluster loader 109 | Instead of keeping track of the key locations in a replicated cache, 110 | an anchored cache could send a broadcast request to all the members. 111 | 112 | Read hits would be slowed down a bit, because they would send out 113 | additional get requests, but the difference would be small, because 114 | they would only have to wait for the first non-null response. 115 | Read misses, on the other hand, would be much slower, because 116 | the originator would have to wait for responses from all the members 117 | before telling the client that the key does not exist in the cache. 118 | 119 | Writes would be faster, because they would only need a unicast RPC 120 | to the current writer, without the replicated cache write. 121 | 122 | The RPCs might be reduced by maintaining on each node a set of bloom filters, 123 | one for each member. 124 | 125 | === Single cache 126 | Store the data and the location information in a single `REPLICATED_SYNC` cache. 127 | There is a single segment, and a custom `ConsistentHashFactory` makes the newest member 128 | the primary owner of that segment. 
129 | 130 | The distribution/triangle interceptor would be replaced with a custom interceptor 131 | that replaces the value in backup write commands with a reference to the primary owner. 132 | For reads it checks if the value in the context is a reference to another node, 133 | and makes a remote get request to that node. 134 | 135 | The state provider would also be replaced to send the location information 136 | to the joiners instead of the actual values. 137 | -------------------------------------------------------------------------------- /cluster-backup-tool.md: -------------------------------------------------------------------------------- 1 | Cluster Backup Tool 2 | ==================== 3 | # Requirements 4 | ## Hard 5 | - Ability to: 6 | - Backup data to a file(s) at a given point in time 7 | - Recreate data from archive 8 | - Must be backwards Compatible across x major versions 9 | - Supported in client/server mode 10 | 11 | ## Optional/Future 12 | - Topology aware restoration 13 | - Disabled by default, focus is on data 14 | 15 | - Import failure modes: 16 | - Clusterwide: Rollback all on first error 17 | - Per Cache: Only rollback importing of caches which encounter errors 18 | 19 | - Cache specific imports 20 | - Greater flexibility 21 | - Ability to retry failed cache imports 22 | - Convert from one cache type to another 23 | 24 | - Configuration only backup/import 25 | 26 | - Enable "online" backups where the cluster creates a snapshot from a live cluster 27 | - Utilise `CommitManager` as used by state transfer and conflict resolution 28 | 29 | 30 | # Architecture 31 | The backup tool should be implemented on the server and be accessible via the `/v2/cluster` REST endpoint, allowing the various 32 | Infinispan clients (CLI, Console, Operator) to utilise it. Backups can be initiated via a GET call and restored by 33 | uploading an archive via a POST. 34 | 35 | Backup: 36 | ``` 37 | GET /v2/cluster?action=backup 38 | ``` 39 | 40 | Restore: 41 | ``` 42 | POST /v2/cluster?action=restore 43 | Content-Disposition: form-data; name=`"file`"; filename=`"backup.zip`" 44 | Content-Type: application/zip 45 | Content-Transfer-Encoding: binary 46 | ``` 47 | 48 | # Archive Format 49 | A cluster backup consists of a global manifest which contains clusterwide metadata, such as the version. As well as a 50 | directory structure that contains all backed up cache-containers. Each cache-container then contains it's configured 51 | cache templates, cache instances and counters. This directory structure is then distributed 52 | as a archive format, which can be passed to the server in order to initiate a import. 53 | 54 | ``` 55 | containers/ 56 | containers/ 57 | containers//cache-container.xml 58 | containers//container.properties 59 | containers//cache-configs/ 60 | containers//cache-configs/some-template.xml 61 | containers//cache-configs/another-template.xml 62 | containers//caches/ 63 | containers//caches/example-user-cache.dat 64 | containers//caches/example-user-cache.xml 65 | containers//counters/ 66 | containers//counters/counters.dat 67 | 68 | manifest.properties 69 | ``` 70 | 71 | The above files will be packaged as a single `.zip` distribution to aid backup/import; this provides compression as well 72 | as being compatible with both Linux and Windows. 73 | 74 | > Utilising the above directory structure simplifies future features such as per container or per cache imports. 
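As a concrete illustration of the archive layout, a restore operation could inspect the uploaded zip before applying it; this is only a sketch using standard JDK APIs, not the actual server code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class BackupInspector {

    // Reads the global manifest from a backup archive laid out as described above.
    static Properties readManifest(String archivePath) throws IOException {
        try (ZipFile zip = new ZipFile(archivePath)) {
            ZipEntry manifest = zip.getEntry("manifest.properties");
            if (manifest == null) {
                throw new IOException("Not a valid backup archive: missing manifest.properties");
            }
            Properties properties = new Properties();
            try (InputStream in = zip.getInputStream(manifest)) {
                properties.load(in);
            }
            return properties; // e.g. properties.getProperty("version")
        }
    }
}
```

A real implementation would also validate the `version` property against the supported migration window described in the next section.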
75 | 76 | ## Manifest Properties 77 | Basic java properties file with key/values which contains cluster wide information. 78 | 79 | ```java 80 | version=11.0.0.Final 81 | cache-containers="default" 82 | ``` 83 | 84 | The Infinispan version can be used by the Server to quickly determine if a migration to the new server version is possible 85 | based upon the number of major versions we support migrations across. 86 | 87 | ## Container Files 88 | Each container is identified in the directory structure by it's name attribute and contains several sub directories. 89 | 90 | ### Container Properties 91 | `container.properties` file that lists all of the configured templates, caches and counters. 92 | 93 | ```java 94 | cache-configs="some-template, another-template" 95 | caches="org.infinispan.CONFIG", "example-user-cache" 96 | counters="example-counter" 97 | ``` 98 | 99 | ### Container XML 100 | The container XML can be used to determine if the backup depends on any additional user classes, e.g. serialization marshallers, 101 | and can check that the server contains these resources on it's classpath, failing fast if they are not present. 102 | 103 | ### Template files 104 | Each defined template is represented by a `.xml` file. 105 | 106 | ### Cache Files 107 | A cache backup consists of a `.xml` and a `.dat` file. The `.xml` file contains 108 | the cache configuration and is used to create the initial empty cache. The `.dat` file contains the cache 109 | content and is read by the server in order to restore entries to the cache. 110 | 111 | Entries in the `.dat` file are stored as a stream of Protobuf messages: 112 | 113 | ```protobuf 114 | message CacheBackupEntry { 115 | bytes key = 1; 116 | bytes value = 2; 117 | bytes metadata = 3; 118 | PrivateMetadata internalMetadata = 4; 119 | int64 created = 5; 120 | int64 lastUsed = 6; 121 | } 122 | ``` 123 | We store Metadata implementations as generic bytes to account for custom implementations. 124 | 125 | > If the cache is empty during the backup, then no `.dat` file is created. 126 | 127 | #### Storage MediaTypes 128 | Currently all supported MediaTypes are stored as `byte[]`, with the exception of `application/x-java-object`, so it's 129 | possible to store them as is. If a cache has a key and/or value with `application/x-java-object`, it's necessary to 130 | first convert the objec to a byte[] via the `PersistenceMarshaller`. 131 | 132 | #### Internal Caches 133 | Volatile internal caches should not be included in a backup. Internal caches can be excluded during the creation of the 134 | backup archive, with no `.xml` or `.dat` files created. 135 | 136 | The counter cache should also be omitted in favour of a dedicated `counters.dat` file. 137 | 138 | ### Counters File 139 | The `counters.dat` file that contains the informated required in order to recreate all counters and their values 140 | at backup time. If no counters exist, this file can be omitted from the archive. 
141 | 142 | ```protobuf 143 | message CounterBackupEntry { 144 | string name = 1; 145 | CounterConfiguration configuration = 2; 146 | int64 currentValue = 3; 147 | } 148 | ``` 149 | 150 | # CLI Integration 151 | The CLI will be the first client to expose the backup/restore capabilities, proposed syntax: 152 | 153 | Backup: 154 | ```bash 155 | bin/cli.sh -c localhost:11222 --backup 156 | ``` 157 | 158 | Restore: 159 | ```bash 160 | bin/cli.sh -c localhost:11222 --restore 161 | ``` 162 | 163 | # Missing Capabilities 164 | List of missing features that are a prerequisite for the backup/restore feature. 165 | 166 | ## Endpoints 167 | * Ability to reject all requests in backup mode? 168 | -------------------------------------------------------------------------------- /Asymmetric-Caches-and-Manual-Rehashing-Design.asciidoc: -------------------------------------------------------------------------------- 1 | Manual/delayed rehashing and asymmetric caches have very similar requirements and so they will be implemented together. 2 | 3 | === Virtual Cache Views 4 | A big part of the implementation will be supporting a separate view for each cache with a global component called CacheViewsManager. 5 | 6 | * The CacheViewsManager on the coordinator will receive REQUEST_JOIN(cachename) and REQUEST_LEAVE(cacheName) commands whenever a cache is started/stopped on one of the nodes. 7 | ** A node leaving the JGroups cluster will also be interpreted as a leave. 8 | * The coordinator coalesces join requests, so if several nodes join in rapid succession it will be installed in a single step. Leaves need to be treated differently, in the first phase there will be no coalescing for leaves. 9 | ** Might need a pluggable policy for handling this, or maybe just a configurable policy where the user can configure a "cooldown" interval from the last topology change together with a maximum numbers of joiners, maximum number of leavers and maximum timeout since the first uncommitted topology change. 10 | ** The user should be also able to dynamically disable automatic installation of views and only install new views manually (e.g. when the entire cluster is shutting down). 11 | *** Any member should be able to send a REQUEST_VIEW_UPDATE command to the coordinator in order to trigger a new view with the current members. 12 | *** Any member should be able to send a DISABLE_VIEW_UPDATES command to the coordinator in order to suspend further view updates. 13 | *** Both of these should be accessible through JMX or the AS console. 14 | * The coordinator sends a PREPARE_VIEW(cacheName,viewId,oldMembers,newMembers) command to all the nodes that have the cache started (including the one that sent the join request). Each node can do its state transfer, lock transfer etc while handling the PREPARE_VIEW command and any exception will be propagated back to the coordinator. 15 | * After all the nodes responded to the PREPARE_VIEW command successfully, the coordinator sends a COMMIT_VIEW(viewId) to all the nodes and they install the new view. 16 | * If any of the node fails, the coordinator sends a ROLLBACK_VIEW(viewId) command to all the nodes and they return to the old view. The coordinator should retry to install the view after a "cooldown" amount of time, just like it would do with a join request. 
17 | * If the coordinator changes or in a merge, the new coordinator will have its own copy of the last committed view but it will have to send a RECOVER_VIEW command to all the nodes in the cluster in order to rebuild the pending requests to join or to leave the cluster. 18 | ** The coordinator was the one tallying PREPARE_VIEW responses, so the view should be automatically rolled back by all the members when the coordinator dies. 19 | *** There is a slight possibility of a race condition here, if only some of the nodes got the COMMIT_VIEW command before the old coordinator failed - unless JGroups ensures that either all or none of the recipients will receive the multicast and we really do use JGroups multicast. 20 | ** For a given cache, if the old coordinator didn't have the cache running, the new coordinator will retry to install the view; otherwise there will be a new view without the old coordinator. 21 | * The CacheViewsManager will not decide a "winning partition" or help in any other way with conflict resolution during a merge. 22 | ** Nodes will have different "last committed" views, so each node may need to use its own "old members" list instead of the coordinator's in order to determine what state to transfer. 23 | 24 | NOTE: There is a plan to simplify classloading issues in AS by creating a separate CacheManager for each deployment unit and multiplexing them on a single JGroups channel. That will not work well with our approach, since we are requiring that the CacheManager (and particularly the CacheViewsManager contained within) does exist on the JGroups coordinator even if none of its caches are started. 25 | A possible workaround would be to change CacheViewsManager into a JGroups protocol. That way we know it will always be started and it can keep state for more all the CacheManagers that are sharing that JGroups channel. 26 | 27 | === Blocking State Transfer 28 | The StateTransferManager component will use the view callbacks provided by the CacheViewsManager instead of the JGroups channel in order to trigger state transfer. This will ensure that the "old" consistent hash and "new" consistent hash are the same on all the nodes. 29 | 30 | Delayed view callbacks will mean that at any given time some of the owners of a key may be stopped, so the DistributionInterceptor/ReplicationInterceptor will need to complete a write operation successfully in that case. 31 | 32 | With the current state transfer algorithm, all write operations are blocked during state transfer. Incoming write operations will reply with a StateTransferInProgressException and the originator will have to retry the operation after the state transfer has finished. 33 | 34 | === Non-blocking state transfer / lock transfer 35 | See link:Non-Blocking-State-Transfer[non-blocking state transfer designs] for more details. 36 | 37 | == Comments 38 | === Manik Surtani 39 | * Is this pluggable policy (cooldown, max joiners and leavers, etc) scheduled for 5.1? And if so, for BETA1? Or is it better to split this into a separate JIRA for later? 40 | * Manual rehashing should also be a separate JIRA so we can split it out/defer if necessary. 41 | * Manual rehash control should be JMX only. JBoss AS admin console hooks into JMX. 42 | * Class loading and cache managers per JBoss AS application: the problem you mentioned can be solved by injecting the same CacheViewsManager instance into each of the CacheManagers created. The same way the same JChannel instance is injected. 
This will mean the CacheViewsManager logic can remain in Infinispan and still work in this setup. 43 | * Non-blocking state transfer should be a separate JIRA as well. 44 | * Locking: you say each cache has a view change lock. Should this be each cache, or cache manager? Or cache view manager? 45 | 46 | === Dan Berindei (in response to Manik Surtani) 47 | * Pluggable policy: it's not scheduled for 5.1, I'll create a separate JIRA. 48 | * Manual rehashing: there's already ISPN-1394 49 | * I might be wrong, but I think AS7 is no longer starting the JMX subsystem by default. 50 | * Injecting the same CacheViewsManager in all application: this sounds much simpler than what I had in mind. 51 | * I'll create the JIRA. 52 | * I'd say each cache, because I don't want to block anything on an running cache while starting up a new one. -------------------------------------------------------------------------------- /Schemas-API.adoc: -------------------------------------------------------------------------------- 1 | # Proposal: Schema Manipulation in Hotrod for Protostream Infinispan 2 | 3 | ## Goal 4 | To enhance schema evolution and management in Infinispan's Protostream serialization framework, we propose replacing the current API with a new schema manipulation mechanism. 5 | 6 | ## Hotrod client side marshaller 7 | 8 | ### Client side api today 9 | 10 | [source, java] 11 | ---- 12 | Schema schema = // programmatic schema 13 | MagazineMarshaller magazineMarshaller = new MagazineMarshaller(); // custom marshaller 1 14 | BookMarshaller bookMarshaller = new BookMarshaller(); // custom marshaller 2 15 | 16 | ProtoStreamMarshaller protostreamMarshaller = new ProtoStreamMarshaller(); 17 | SerializationContext serializationContext = marshaller.getSerializationContext(); 18 | FileDescriptorSource fds = FileDescriptorSource.fromString(schema.getName(), schema.toString()); 19 | serializationContext.registerProtoFiles(fds); 20 | serializationContext.registerMarshaller(magazineMarshaller); 21 | serializationContext.registerMarshaller(bookMarshaller); 22 | builder.marshaller(protostreamMarshaller); 23 | builder.remoteCache("cacheName").marshaller(protostreamMarshaller); 24 | ---- 25 | 26 | ### Client side api proposal 27 | 28 | [source, java] 29 | ---- 30 | Schema schema = // programmatic schema 31 | MagazineMarshaller magazineMarshaller = new MagazineMarshaller(); // custom marshaller 1 32 | BookMarshaller bookMarshaller = new BookMarshaller(); // custom marshaller 2 33 | 34 | ProtoStreamMarshaller protostreamMarshaller = new ProtoStreamMarshaller(); 35 | protostreamMarshaller.register(schema, magazineMarshaller, bookMarshaller); 36 | 37 | builder.remoteCache("cacheName").marshaller(marshaller); 38 | ---- 39 | 40 | ### Administration api today 41 | Hotrod specific api does not exist. However, an existing REST API exists. 
42 | 43 | Example of what clients are doing today to manage errors: 44 | 45 | [source, java] 46 | ---- 47 | public static void uploadAndReindexCaches(RemoteCacheManager remoteCacheManager, GeneratedSchema schema, List indexedEntities) { 48 | var key = schema.getProtoFileName(); 49 | var current = schema.getProtoFile(); 50 | 51 | var protostreamMetadataCache = remoteCacheManager.getCache(InternalCacheNames.PROTOBUF_METADATA_CACHE_NAME); 52 | var stored = protostreamMetadataCache.getWithMetadata(key); 53 | if (stored == null) { 54 | if (protostreamMetadataCache.putIfAbsent(key, current) == null) { 55 | logger.info("Infinispan ProtoStream schema uploaded for the first time."); 56 | } else { 57 | logger.info("Failed to update Infinispan ProtoStream schema. Assumed it was updated by other Keycloak server."); 58 | } 59 | checkForProtoSchemaErrors(protostreamMetadataCache); 60 | return; 61 | } 62 | if (Objects.equals(stored.getValue(), current)) { 63 | logger.info("Infinispan ProtoStream schema is up to date!"); 64 | return; 65 | } 66 | if (protostreamMetadataCache.replaceWithVersion(key, current, stored.getVersion())) { 67 | logger.info("Infinispan ProtoStream schema successful updated."); 68 | reindexCaches(remoteCacheManager, stored.getValue(), current, indexedEntities); 69 | } else { 70 | logger.info("Failed to update Infinispan ProtoStream schema. Assumed it was updated by other Keycloak server."); 71 | } 72 | checkForProtoSchemaErrors(protostreamMetadataCache); 73 | } 74 | 75 | private static void reindexCaches(RemoteCacheManager remoteCacheManager, String oldSchema, String newSchema, List indexedEntities) { 76 | if (indexedEntities == null || indexedEntities.isEmpty()) { 77 | return; 78 | } 79 | var oldPS = KeycloakModelSchema.parseProtoSchema(oldSchema); 80 | var newPS = KeycloakModelSchema.parseProtoSchema(newSchema); 81 | var admin = remoteCacheManager.administration(); 82 | 83 | indexedEntities.stream() 84 | .filter(Objects::nonNull) 85 | .filter(indexedEntity -> isEntityChanged(oldPS, newPS, indexedEntity.entity())) 86 | .map(IndexedEntity::cache) 87 | .distinct() 88 | .forEach(cacheName -> updateSchemaAndReIndexCache(admin, cacheName)); 89 | } 90 | ---- 91 | 92 | Example of what clients are doing today to retrieve errors: 93 | 94 | [source, java] 95 | ---- 96 | // For errors 97 | private static void checkForProtoSchemaErrors(RemoteCache protostreamMetadataCache) { 98 | var errors = protostreamMetadataCache.get(ProtobufMetadataManagerConstants.ERRORS_KEY_SUFFIX); 99 | if (errors == null) { 100 | return; 101 | } 102 | for (String errorFile : errors.split("\n")) { 103 | logger.errorf("%nThere was an error in proto file: %s%nError message: %s%nCurrent proto schema: %s%n", 104 | errorFile, 105 | protostreamMetadataCache.get(errorFile + ProtobufMetadataManagerConstants.ERRORS_KEY_SUFFIX), 106 | protostreamMetadataCache.get(errorFile)); 107 | } 108 | } 109 | ---- 110 | 111 | https://github.com/keycloak/keycloak/blob/636fffe0bc37a63bd5a2578b6bbcd815364c41d8/model/infinispan/src/main/java/org/keycloak/marshalling/KeycloakIndexSchemaUtil.java#L71-L97 112 | 113 | ### Administration API 114 | 115 | [source, java] 116 | ---- 117 | public interface RemoteCacheManagerAdmin { 118 | SchemasAdministration schemas(); 119 | } 120 | ---- 121 | 122 | [source, java] 123 | ---- 124 | public interface SchemasAdministration { 125 | CompletionStage createAsync(Schema schema); 126 | SchemaOpResult create(Schema schema); 127 | 128 | CompletionStage updateAsync(Schema schema); 129 | SchemaOpResult update(Schema 
schema); 130 | 131 | CompletionStage createOrUpdateAsync(Schema schema); 132 | SchemaOpResult createOrUpdate(Schema schema); 133 | 134 | CompletionStage deleteAsync(String name); 135 | SchemaOpResult delete(String name); 136 | 137 | CompletionStage retrieveSchemaErrors(); 138 | 139 | SchemaErrors retrieveSchemaErrors(Schema schema); 140 | // # Optional. Check if this is needed or possible. Not creating a schema and validating it only. 141 | CompletionStage validateAsync(Schema schema); 142 | SchemaOpResult validate(Schema schema); 143 | // # end of optional 144 | } 145 | ---- 146 | 147 | SchemaOpResult: Specify the content and return values when implementing it 148 | based on real code examples around (like keycloak or what quarkus or spring users do) 149 | 150 | -------------------------------------------------------------------------------- /Optimistic-Locking-In-Infinispan.asciidoc: -------------------------------------------------------------------------------- 1 | === Context 2 | At the moment (Infinispan 5.0) two locking schemes are supported: 3 | 4 | * *pessimistic*, in which locks are being acquired remotely on each transactional write. This is named eager locking and is detailed here. 5 | * a *hybrid* optimistic-pessimistic locking approach, in which local locks are being acquired as transaction progresses and remote locks are being acquired at prepare time. This is the default locking scheme. 6 | 7 | This document describes a replacement for the hybrid locking scheme with an optimistic scheme. 8 | 9 | === Shortcomings of the hybrid approach 10 | In the current hybrid approach local locks are acquired at write time and remote write locks are acquired at prepare time as well. 11 | E.g. considering the following code that runs on node A and consistentHash("a") = {B}. 12 | 13 | transactionManager.begin(); 14 | cache.put("a", "aVal"); // this acquires a WL on A. "a" is not locked on B 15 | Object result = cache.get("b"); //this doesn't acquire any lock 16 | transactionManager.commit(); // this acquires a WL on B as well, then it release it after applying the modification 17 | 18 | This locking model has some shortcomings, especially in distributed mode: 19 | 20 | ==== Less throughput 21 | The overall throughput of the system is reduced in the following scenario: 22 | On node A two transactions are running: 23 | 24 | * Tx1: writes to keys {a, b, c} in that order 25 | * Tx2: writes to keys {a, c, d} in that order 26 | 27 | Let's assume that consistentHash(a) = {B}. In other nodes node B is the main data owner of key "a". 28 | 29 | These two transactions execute in sequence: after Tx1 locks “a”, Tx2 is not able to make any progress until Tx1 is finished. Making Tx2 wait for "a" 's lock doesn't guarantee Tx1 the chances to complete the transaction: another transaction running on node C might still be able to acquire lock on "a" before Tx1. 30 | 31 | ==== Different locking semantic based on transaction locality 32 | With the hybrid approach, two transactions competing for the same key will be serialized if run on the same node, but would execute in parallel if run on two different nodes. Even more, if a key locked by a transaction maps(consistent hash) to the same node where the transaction runs, an "eager lock" is practically acquired - so again the locking semantic is influenced by where the transaction runs. Whilst this is not necessarily incorrect it certainly brings a degree of unneeded complexity in using and understanding Infinispan's transactions. 
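For illustration, the throughput scenario above could look as follows under the hybrid scheme; both transactions run on node A in separate threads, and the values are arbitrary:

[source,java]
----
// Thread 1 - Tx1 writes {a, b, c}
transactionManager.begin();
cache.put("a", "tx1"); // acquires the local write lock on "a" immediately
cache.put("b", "tx1");
cache.put("c", "tx1");
transactionManager.commit(); // remote locks on the owners are only acquired here

// Thread 2 - Tx2 writes {a, c, d}, concurrently on the same node
transactionManager.begin();
cache.put("a", "tx2"); // blocks here until Tx1 releases the local lock on "a"
cache.put("c", "tx2");
cache.put("d", "tx2");
transactionManager.commit();
----

A third transaction running on another node (e.g. C) is not subject to this local serialization and may still acquire the remote lock on "a" before Tx1, which is the fairness issue noted above.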
33 | 34 | == Optimistic locking 35 | In order to overcome the above shortcomings, an optimistic locking scheme will replace the hybrid one in Infinispan 5.1. With the optimistic approach no locks are being acquire until prepare time. 36 | 37 | === How does it work 38 | Infinispan associates a transaction context (TC) with each Transaction, on the node where the transaction runs. This is a map-like structure used to store all the data touched by the transaction, as follows: 39 | 40 | * On a get(key) operation 41 | .. if key is in the TC then the value associated with it is returned. If not: 42 | .. if key maps, according to the consistent hash, to a remote node then value is fetched from the remote node (rpc). 43 | ... If writeSkewCheck is enabled, then key's version at the moment of the read is also fetched. Then (key,value, version) is placed in the TC. 44 | ... If writeSkewCheck is disabled the (key,value) pair is then placed in TC. 45 | ... Note that in both above cases value can be null. 46 | .. if key maps to the local node the value is obtained from the local data container. The (key,value) and potentially version (see 1.2.1) is then placed in TC 47 | .. value (potentially null) is then returned to the caller. 48 | * On a put(key,value) operation 49 | .. if the TC contains an entry for key 50 | ... existing value associated with is cached to be returned to the caller 51 | ... TC updated to reflect new (key, value) pair 52 | ... value, as read as 2.1.1 is returned to the caller 53 | .. if the TC doesn't contain an entry for key 54 | ... If unreliableReturnValues is enabled then (key, value) is added to the TC and null is returned 55 | ... if unreliableReturnValues is not enabled (default) then 56 | .... a get(key) is executed first, as described in 1. The return of the get is cached to be returned to the caller 57 | .... (key, value) is added to the TC 58 | .... The value cached at 2.2.2.1 is then returned to the user 59 | * During transaction completion: 60 | .. At prepare time, 61 | ... a prepare message is multicasted to all the nodes owning keys written by the transaction. The prepare message contains the keys that should be locked together with the new values and potentially the versions read at 1.2.1 or 1.3. 62 | ... Locks are acquired remotely, on each one of the keys written by the transaction. No locks are acquired for keys that were only read. 63 | ... If a remote lock cannot be acquired within lockAcquistionTimeout milliseconds then an exception is thrown and prepare fails. 64 | ... If all remote locks are successfully acquired 65 | .... If writeSkewCheck is enabled: 66 | ..... for each remotely locked key, if its current version is different than the version read at 1.2.1 or 1.3 then an exception is thrown and transaction is rolledback 67 | ..... these check does not require a new RPC, but executes in the scope of the RPC sent for acquiring the lock 68 | ... If writeSkewCheck is disabled the above check does not take place 69 | .. At commit time: 70 | ... A commit message is sent from the node where transaction originated to all nodes where locks were acquired 71 | ... On each participating node 72 | .... if writeSkewCheck is enabled then the version of the entry is increased 73 | .... the old values are overwritten by new ones 74 | .... 
locks are released 75 | * The atomic operations, as defined by the ConcurrentHashMap, don't fit well within the scope of optimistic transaction: this is because there is a discrepancy between the value returned by the operation and the value and the fact that the operation is applied or not: 76 | ** E.g. putIfAbsent(key, value) might return true as there's no entry for key in TC, but in fact there might be a value committed by another transaction. 77 | ** Later on, at prepare time when the same operation is applied on the node that actually holds key, it might not succeed as another transaction has updated key in between, but the return value of the method was already evaluated long before this point. 78 | ** In order to solve this problem, if an atomic operations happens within the scope of a transaction, Infinispan forces a writeSkewCheck at commit time. The writeSkewCheck, optional otherwise, makes sure that the decision made at prepare time still stands at commit time. 79 | 80 | === Related 81 | * The JIRA tracking the implementation for this is link:https://issues.jboss.org/browse/ISPN-1131[ISPN-1131] -------------------------------------------------------------------------------- /ServerNG.md: -------------------------------------------------------------------------------- 1 | Infinispan ServerNG 2 | ==================== 3 | 4 | Infinispan ServerNG is a reboot of Infinispan's server which addresses the following: 5 | 6 | * small codebase with as little duplication of already existing functionality (e.g. configuration) 7 | * embeddable: the server should allow for easy testability in both single and clustered configurations 8 | * RESTful admin capabilities 9 | * Logging using [JBoss Logging logmanager](https://github.com/jboss-logging/jboss-logmanager) 10 | * Security using [Elytron](https://github.com/wildfly-security/wildfly-elytron) 11 | 12 | # Layout 13 | 14 | The server is laid out as follows: 15 | 16 | * `/bin` scripts 17 | * `server.sh` server startup shell script for Unix/Linux 18 | * `server.ps1` server startup script for Windows Powershell 19 | * `/lib` server jars 20 | * `infinispan-server.jar` uber-jar of all dependencies required to run the server. 21 | * `/server` default server instance folder 22 | * `/server/log` log files 23 | * `/server/configuration` configuration files 24 | * `infinispan.xml` 25 | * keystores 26 | * `logging.properties` for configuring logging 27 | * User/groups property files (e.g. `mgmt-users.properties`, `mgmt-groups.properties`) 28 | * `/server/data` data files organized by container name 29 | * `default` 30 | * `caches.xml` runtime cache configuration 31 | * `___global.state` global state 32 | * `mycache` cachestore data 33 | * `/server/lib` extension jars (custom filter, listeners, etc) 34 | 35 | # Paths 36 | 37 | The following is a list of _paths_ which matter to the server: 38 | 39 | * `infinispan.server.home` defaults to the directory which contains the server files. 40 | * `infinispan.server.root` defaults to the `server` directory under the `infinispan.server.home` 41 | * `infinispan.server.configuration` defaults to `infinispan.xml` and is located in the `configuration` folder under the `infinispan.server.root` 42 | 43 | # Command-line 44 | 45 | The server supports the following command-line arguments: 46 | 47 | * `-b`, `--bind-address=
` 48 | * `-c`, `--server-config=` 49 | * `-o`, `--port-offset=` 50 | * `-s`, `--server-root=` 51 | * `-v`, `--version` 52 | 53 | # Configuration 54 | 55 | The server configuration extends the standard Infinispan configuration adding server-specific elements: 56 | 57 | * `security` configures the available security realms which can be used by the endpoints. 58 | * `cache-container` multiple containers may be configured, distinguished by name. 59 | * `endpoints` lists the enabled endpoint connectors (hotrod, rest, ...). 60 | * `socket-bindings` lists the socket bindings. 61 | 62 | An example skeleton configuration file looks as follows: 63 | 64 | ``` 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | ``` 118 | 119 | # Logging 120 | 121 | Logging is handled by JBoss Logging's LogManager. This is configured through a `logging.properties` file in the 122 | `server/configuration` directory. The following is an example: 123 | 124 | ``` 125 | loggers=org.jboss.logmanager 126 | 127 | # Root logger 128 | logger.level=INFO 129 | logger.handlers=CONSOLE 130 | 131 | logger.org.jboss.logmanager.useParentHandlers=true 132 | logger.org.jboss.logmanager.level=INFO 133 | 134 | handler.CONSOLE=org.jboss.logmanager.handlers.ConsoleHandler 135 | handler.CONSOLE.formatter=PATTERN 136 | handler.CONSOLE.properties=autoFlush,target 137 | handler.CONSOLE.autoFlush=true 138 | handler.CONSOLE.target=SYSTEM_OUT 139 | 140 | formatter.PATTERN=org.jboss.logmanager.formatters.PatternFormatter 141 | formatter.PATTERN.properties=pattern 142 | formatter.PATTERN.pattern=%d{HH:mm:ss,SSS} %-5p [%c] (%t) %s%E%n 143 | ``` 144 | # Internals 145 | 146 | The following is a dump of various internal facts about the server, in no particular order: 147 | 148 | * All containers handled by the same server share the same thread pools and transport. 149 | * When a server starts it locks the `infinispan.server.root` so that it cannot be used by another server concurrently. 150 | * The `admin-connector` endpoint is a special type of `rest-connector` with additional ops 151 | * The CLI connects to the `admin-connector` using either 152 | * the `local-user` SASL mech provided by Eltryon when running on the same host/user 153 | * any HTTP auth supported by the rest endpoint 154 | 155 | 156 | -------------------------------------------------------------------------------- /Single_port.adoc: -------------------------------------------------------------------------------- 1 | = Single port support for Infinispan Server 2 | 3 | The main goal of a single port support for Infinispan Server is to expose some (if not all) endpoints with a single port. 4 | The functionality is targeted mainly (but not limited to) for Cloud environments (where exposing everything as a single port is very convenient). 5 | 6 | == Single port from technical perspective 7 | 8 | The single port implementation requires a mechanism for switching communication protocols. Such mechanism has already been invented 9 | and implemented in HTTP/1.1 and HTTP/2+TLS/ALPN. Other protocols such as Hot Rod, Web Sockets or Memcached don't support it. In other words 10 | the client will have only one chance to negotiate protocol (switch from HTTP to a custom protocol). 
11 | 12 | === Switching from HTTP/1.0 13 | 14 | https://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7230.html#header.upgrade[Not supported]: 15 | 16 | > A server must ignore an Upgrade header field that is received in an HTTP/1.0 request. 17 | 18 | === Switching from HTTP/1.1 19 | 20 | HTTP/1.1 supports so called https://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7230.html#header.upgrade[Upgrade Header], which allows to re-negotiate 21 | protocol for the same HTTP connection. The re-negotiate request might be sent either by the client or the server. 22 | 23 | .Shortened description of the switching procedure 24 | ................................... 25 | The client sends the Upgrade header with a list of supported protocols. The server responds with `101` Switching protocols and its own list of protocols. 26 | After choosing destination protocol, the client starts sending messages using negotiated connection details. 27 | ................................... 28 | 29 | More information: https://issues.jboss.org/browse/ISPN-6676 30 | 31 | === Switching from HTTP/2.0 32 | 33 | HTTP/2.0 supports only backwards-compatible HTTP/1.1 Upgrade procedure. With HTTP/2, negotiating destination protocol happens during connection initialization - 34 | during https://http2.github.io/http2-spec/#rfc.section.3.2[TLS handshake]. The procedure requires ALPN support. 35 | 36 | .Shortened TLS+ALPN handshake procedure 37 | ................................... 38 | Client Server 39 | 40 | ClientHello --------> ServerHello 41 | (ALPN extension & (ALPN extension & 42 | list of protocols) selected protocol) 43 | Certificate* 44 | ServerKeyExchange* 45 | CertificateRequest* 46 | <-------- ServerHelloDone 47 | Certificate* 48 | ClientKeyExchange 49 | CertificateVerify* 50 | [ChangeCipherSpec] 51 | Finished --------> 52 | [ChangeCipherSpec] 53 | <-------- Finished 54 | Application Data <-------> Application Data 55 | ................................... 56 | 57 | Unfortunately ALPN support has been scheduled for JDK9 (http://openjdk.java.net/jeps/244[JEP244]). Luckily, 58 | https://github.com/undertow-io/undertow/blob/master/core/src/main/java/io/undertow/protocols/ssl/ALPNHackSSLEngine.java[Undertow] as well as 59 | http://netty.io/wiki/requirements-for-4.x.html#tls-with-jdk-jetty-alpnnpn[Netty] implemented some hacks to support it. However with Netty 60 | we still need to modify the boot classpath, which is pretty bad. We might consider using Undertow hack for JDK8. 61 | 62 | More information: https://issues.jboss.org/browse/ISPN-6899 63 | 64 | == Rewriting REST Server 65 | 66 | Our REST implementation is based on https://github.com/resteasy/Resteasy/tree/master/server-adapters/resteasy-netty[RestEasy Netty]. 67 | Unfortunately this needs to be changed in the near future because there are plans for intelligent HTTP/2 clients (with topology 68 | information). An early prototype might be found https://github.com/AntonGabov/infinispan/blob/topologyId/server/rest/src/main/scala/org/infinispan/rest/http2/Http2Handler.java[here]. 69 | 70 | Once the REST Server is rewritten into pure Netty, we could start implementing the single port support. 71 | 72 | == Single port support - the implementation 73 | 74 | The single port implementation will be based on the https://github.com/infinispan/infinispan/tree/master/server/router[multi-tenant router]. 
75 | 76 | As a reminder, the multi-tenant router (or shorter, the router) allows to redirect requests from a single endpoint (might be called frontend) 77 | into multiple `CacheContainer`s and `Cache`s (might be called backends). 78 | 79 | The implementation will be slightly altered to support both https://netty.io/4.1/api/io/netty/handler/codec/http/HttpServerUpgradeHandler.html[HTTP/1.1 Upgrade] 80 | and https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/ssl/ApplicationProtocolConfig.java[HTTP/2+TLS/ALPN]. 81 | 82 | Once the implementation is ready, the server endpoint will need to be altered to support new configuration elements. An exemplary 83 | configuration will look like the following: 84 | 85 | .Router configuration 86 | ---- 87 | 88 | 89 | ... existing multi-tenancy support ... 90 | 91 | 92 | 93 | 94 | 95 | 96 | ---- 97 | 98 | Note that it is perfectly possible to use the single port functionality next to with multi-tenancy (but the same port 99 | which will be used for single-port won't support multi-tenancy). Another interesting 100 | aspect is that REST connector is not necessary for the switching. Since the switching logic will be included 101 | inside the Router, it will be possible to have a single port only with Hot Rod endpoint. In that case a client will need 102 | to connect using HTTP/1.1 or TLS/ALPN and switch to Hot Rod client. 103 | 104 | == Limitations 105 | 106 | There are number of features which won't be available after initial implementation: 107 | 108 | * Embedding a Router endpoint inside a single port. Even though recursion might seem like a good idea, it may lead to deadlock 109 | during server startup (Router `A` depends on `B` whereas Router `B` depends on `A`). As a side effect, the single port endpoints 110 | will not support multi-tenancy at the same time. 111 | * Authentication won't be implemented in the Router. Since the authentication mechanisms are slightly different for 112 | Hot Rod and REST, it would hard to implement all of them in the Router. 113 | * In order to make the implementation simple, switching from HTTP/1.1 to HTTP/2 in the REST interface will also be 114 | implemented by the Router. 115 | 116 | == TODO list 117 | 118 | * [ ] Refactor REST interface to Netty 119 | * [ ] (Optional) Implement missing authentication mechanisms for REST 120 | * [ ] Implement switching logic in the Router 121 | * [ ] Implement multi-protocol Hot Rod client -------------------------------------------------------------------------------- /Scattered-Cache-design-doc.md: -------------------------------------------------------------------------------- 1 | This design doc is out of date - please refer to [package documentation](https://github.com/rvansa/infinispan/blob/ISPN-6645/core/src/main/java/org/infinispan/scattered/package-info.java) 2 | that's part of [preview PR #4458](https://github.com/infinispan/infinispan/pull/4458). 3 | 4 | ## Idea 5 | Distributed caches have fixed owners for each key. Operations where originator is one of the owners require less messages (and possibly less roundtrips), so let's make the originator always the owner. In case of node crash, we can retrieve the data by inspecting all nodes. 6 | 7 | To find the last-written entry in case of crashed primary owner, entries will keep write timestamps (counter value rather than wall-clock time) in metadata. These timestamps also determine order of writes; we don't have to use locking anymore. 
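A minimal sketch of the timestamp idea; as explained below, the counter is kept per segment rather than per entry, and all names here are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

final class SegmentTimestamps {
    private final AtomicLong[] counters;

    SegmentTimestamps(int numSegments) {
        counters = IntStream.range(0, numSegments)
              .mapToObj(i -> new AtomicLong())
              .toArray(AtomicLong[]::new);
    }

    // Monotonically increasing value recorded in the entry's metadata on every write.
    long nextTimestamp(int segment) {
        return counters[segment].incrementAndGet();
    }
}
```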
8 | 9 | For the time being, let's implement only the case resilient against one node failure (equivalent to 2 owners in distributed cache). 10 | 11 | * + faster writes 12 | * + no locking contention 13 | * - reads always have to go to the primary (slower writes in small clusters) 14 | * - complex reconciliation during rebalance 15 | 16 | ## Name 17 | I believe that this behaviour definitely desires its own cache mode, not just a configuration attribute. As much as I'd like to call it Radim's Cache as Bela proposed, for practical reasons let's use Scattered Cache. If anyone has better name, please suggest it. 18 | 19 | ## Operations 20 | I'll assume non-transactional mode below. Transactional mode is quite similar, though (but non-tx will be POCed/implemented first). 21 | 22 | We need to keep tombstones with timestamp after entry removal. Luckily, these tombstones have limited lifespan - we keep them around only until the invalidations are applied on all nodes. 23 | 24 | The timestamps have to grow monotonically; therefore the timestamp counter won't be per-entry but per segment (as tombstone will be eventually removed, per-entry timestamp would be lost). 25 | 26 | ### Single entry write (put, getAndPut, putIfAbsent, replace...) 27 | * originator == primary owner: 28 | 29 | 1. Primary increments timestamp for segment 30 | 2. Primary commits entry + timestamp 31 | 3. Primary picks one node (let's say the next member) and sends backup RPC 32 | 4. Backup commits entry 33 | 5. Backup sends RPC response 34 | 6. Primary returns after it gets the response 35 | 7. Primary schedules invalidation of entry with lower timestamps 36 | * This requires just one sync RPC 37 | * Selection of backup could be random, but having it ~fixed probably reduces overall memory consumption 38 | * Updating value on primary before backup finishes does not change data consistency - if backup RPC fails, we can't know whether backup has committed the entry and so it can be published anyway. 39 | 40 | * originator != primary owner 41 | 42 | 1. Origin sends sync RPC to primary owner 43 | 2. Primary checks op if conditional 44 | 3. Primary increments timestamp for segment 45 | 4. Primary commits entry + timestamp 46 | 5. Primary returns response with timestamp (+ return value if appropriate) 47 | 6. Origin commits entry + timestamp 48 | 7. Origin schedules invalidation of entry with lower timestamps 49 | * Invalidation must be scheduled by origin, because primary does not know if backup committed 50 | 51 | ### Single entry read 52 | * originator == primary owner: Just local read 53 | * originator != primary owner: 54 | 55 | 1. Origin locally loads entry + timestamp with SKIP_CACHE_LOAD 56 | 2. Origin sends sync RPC including the timestamp to primary 57 | 3. Primary compares timestamp with it's own 58 | * If timestamp matches, value is not sent back 59 | * If timestamp does not match, value + timestamp has to be sent 60 | 61 | * Optional configuration options: 62 | * Return value after 1) if present (risk of stale reads) 63 | * Store read value locally with expiration (L1 enabled) - as invalidations are broadcast anyway, there's not much overhead with that. This will still require RPC on read (unless stale reads are allowed) but not marshalling the value. 64 | 65 | ### Multiple entries writes 66 | 67 | Entries for which this node is the primary owner won't be backed up just to the next member, but to a node that is primary owner of another entries in the multiwrite. 
That way we can spare some messages by merging the primary -> backup and origin -> backup requests. 68 | 69 | ### Multiple entries reads 70 | 71 | Same as single entry reads, just merge RPCs for the same primary owners. 72 | 73 | ## Invalidations 74 | 75 | It would be inefficient to send invalidations (key + timestamp) one-by-one, so these will be merged and sent in batches. Invalidations from other nodes can remove scheduled 'outdated' invalidations, but it requires additional complexity and the overall gain is debatable. 76 | 77 | ## State Transfer 78 | 79 | ### Node leave/crash 80 | 81 | Operations after node crash are always driven by the new primary owner of given segment 82 | 83 | * If it was the primary owner in previous topology: 84 | * Replicate all data from this segment to the next node 85 | * Send invalidate request with all keys + timestamps to all nodes but the next one 86 | * The above is parallel and processed in chunks. 87 | * Write to entry can proceed in parallel with this process; ST-invalidation cannot overwrite newer entry, though write-invalidation can arrive before the ST-replication (then we would have 3 copies of each entry). 88 | * After chunk is replicated (synchronously), primary owner compares actual timestamps with those replicated and sends invalidations to the backup with the modified timestamps 89 | * If the primary owner has just become a new primary owner: 90 | * Synchronously retrieve highest timestamp for this segment from all nodes 91 | * All writes are blocked until the timestamp is retrieved 92 | * Request all nodes to send you all keys + timestamps from this segment (and do that locally as well) 93 | * Compute max timestamps for each key 94 | * Retrieve values from nodes with highest timestamp, send invalidation to others 95 | * If there is a concurrent write during the retrieval, primary owner has to retrieve key + timestamp (+ value if needed) from all nodes, write can proceed only after all nodes respond 96 | * Second write during the ST to the same entry can be processed without the all-nodes-read. This can be achieved by storing the timestamp retrieved at the start of ST - if local entry timestamp is higher than this value, the write can safely proceed. 97 | 98 | ### Clean rebalance (after join, no data is lost) 99 | 100 | If node has become primary owner: 101 | * Retrieve segment timestamp counter value + old owner starts rejecting further reads & writes 102 | * Send request for all keys + timestamp + value to primary owner 103 | * Update local values as these come 104 | * Write is similar to the crash case, but the value is retrieved just from primary owner 105 | * Request old owner to remove/L1-ize all entries 106 | 107 | ### Crash during clean rebalance 108 | 109 | This shouldn't be different from regular crash, but there might be technical difficulties to cancel the ongoing ST. Node is always considered a new primary owner for a segment that had unfinished ST, since the previous new owner did not have all entries but had some new modifications. 110 | 111 | ## Open questions 112 | 113 | * Do we need remote-thread pool when we don't use locks? 114 | 115 | -------------------------------------------------------------------------------- /Remote Command Handler.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | In clustered caches, the data is always replicated to other nodes. When a node receives a command, it is processed immediately. 
However, this approach has some issues: the commands are not independent of each other (they can depend on or conflict with other commands), which can make the processing thread enter a waiting state. The worst scenario occurs when all threads are waiting for an event and none is available to process the notification. In this scenario, the cluster blocks and exceptions start to be thrown (usually `TimeoutException`).
4 | 
5 | # Goal
6 | 
7 | Implement a "smarter" algorithm to handle remote commands. The algorithm will order the commands in order to minimize conflicts and waiting between them.
8 | 
9 | # Algorithm
10 | 
11 | The following explains how remote commands are handled. A command is considered "blocking" if it needs to wait for an external event, such as a lock release, a reply from another node, etc.
12 | 
13 | ## Introducing the `RemoteLockCommand`, the `RemoteLockOrderManager` and the `LockLatch`
14 | 
15 | The `RemoteLockCommand` is the interface that the `RpcCommands` must implement to be handled by this algorithm. This interface has a single method that returns the key(s) to lock, `Collection getKeysToLock()`.
16 | 
17 | The `RemoteLockOrderManager`'s main goal is to order the commands. Two or more commands conflict if they update one or more common keys. Since the lock can only be acquired by a single command, the `RemoteLockOrderManager` will define which command advances while the others wait. The interface only has a single method, `LockLatch order(RemoteLockCommand)`.
18 | 
19 | The `LockLatch` is a latch that notifies when the command can be processed. Also, it notifies when the command is done with its work and unblocks other commands. It has two methods:
20 | 
21 | * `boolean isReady()`: returns `true` when it is probable that the `Lock` is released.
22 | * `void release()`: must be invoked after the command's work is done and the `Lock` is released. This will notify waiting commands and may unblock other commands.
23 | 
24 | Finally, when a command is ready, it is sent to the remote executor service.
25 | 
26 | ## How is the handling done?
27 | 
28 | ### Read Command
29 | 
30 | No changes are made. When a read command is delivered, it is processed directly in the JGroups executor service.
31 | 
32 | ### Non-Transactional Cache
33 | 
34 | We have two types of write commands: the ones delivered on the primary owner, which acquire the lock and send the update to the backup owners; and the ones delivered on the backup owners. Since the latter do not block, they are processed immediately in the JGroups executor service. The former are managed by the `RemoteLockOrderManager` and processed in the remote executor service.
35 | 
36 | **State Transfer**
37 | No changes needed. If the command is processed in the wrong topology id, an exception is thrown and the command is retried. The `LockLatch` and the `Lock` are released before the retry.
38 | 
39 | **Update: 16/09**
40 | Needs to be implemented for the `PutMapCommand` and the `ClearCommand`. They will be similar to the description above, but the `LockLatch` implementation will be a collection of single-key `LockLatch`es.
41 | 
42 | ### Transactional Cache
43 | 
44 | **Update: 16/09**
45 | Not implemented yet...
46 | 
47 | #### Pessimistic Caches
48 | 
49 | In pessimistic caches, the algorithm handles the following commands:
50 | 
51 | * `LockControlCommand` is managed by the `RemoteLockOrderManager` and the `LockLatch` is associated to the transaction. Later the command is processed in the remote executor service.
52 | * `PrepareCommand` is processed directly in JGroups executor service if L1 is **disabled**. In this case, it will not acquire any locks neither wait for any other events. But if L1 is **enabled**, it is processed in remote executor service because it needs to invalidate the L1 (synchronous operation). Note that it isn't managed by the `RemoteLockOrderManager` and it releases the `LockLatch`es associated to the transaction. 53 | * `RollbackCommand` processed directly in JGroups executor service andit releases the `LockLatch`esassociated to the transaction. 54 | 55 | **State Transfer** 56 | It needs some mechanism to create `LockLatch` when the transactions transferred. Note that the `LockLatch` does not need to send to the other node, but the transaction needs to create open `LockLatch` in the new primary owner. _(to think better about it)_ 57 | 58 | #### Optimistic Caches 59 | 60 | In optimistic caches, the algorithm handles the following commands: 61 | 62 | * `PrepareCommand` it is managed by `RemoteLockOrderManager` and are processed in the remote executor service. The `LockLatch` is associated to the transaction. 63 | * `CommitCommand` processed directly in JGroups executor service when L1 is **disabled**. When L1 is **enabled**, it is processed in remote executor service since it needs to invalidate the L1 (synchronous operation). Also, it releases the `LockLatch`es associated to the transaction. 64 | * `RollbackCommand` processed directly in JGroups executor service and it releases the `LockLatch`es associated to the transaction. 65 | 66 | **State Transfer** 67 | The same as in pessimistic caches _(think...)_. 68 | 69 | # Changes in default configuration 70 | 71 | The default configuration uses a non-queued executor service with a large number of thread. With this new algorithm, it would be better to have a large queue and a reasonable number of threads. 72 | 73 | Also, it would be good to merge the total order executor service with the remote executor service. In the end, they both have the same goal and it is not needed to configure two executor services. 74 | 75 | # Known Issues 76 | 77 | If the remote executor service queue is not large enough to process all the "blocking" commands, it can hit the same problem again. The algorithm assumes that "blocking" commands are processed in remote executor service and the remaining in JGroups executor service. 78 | 79 | An idea to solve this, is the executor service to expose their internal state (running threads and queue occupation). This way, it would make it possible to the algorithm to queue a bit longer the commands in order to prevent an overload of the executor service. 80 | 81 | ## L1 invalidation deadlock 82 | 83 | When L1 is enabled, each update (or transaction) will generate an invalidation message to invalidate the L1 cache in non-owners. 84 | 85 | **non transactional caches** 86 | It has at least 3 messages to be processed in the same pool. The message sent to the primary owner to lock generates an invalidation message and a message to the backup owners. At the same time, the backup owners generate another invalidation message. when the invalidation is synchronous, the executor service may be full with threads awaiting the ACK from invalidation (or from the backup owners) and no threads are available to process other requests. 87 | 88 | **transactional caches** 89 | The commit generates an invalidation message. If the executor service is full, the same problem as described above can occur. 
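
The failure mode in both cases is the classic same-pool request/response deadlock: every thread in the pool is blocked waiting for a reply that can only be processed by that same pool. The toy example below reproduces the pattern with plain `java.util.concurrent` primitives; it only illustrates the shape of the problem and is not Infinispan code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/**
 * Illustration only: every pool thread synchronously waits for an "ACK" task that can
 * only run on the same, already-full pool, so nothing makes progress until the waits
 * time out - the same shape as the L1 invalidation deadlock described above.
 */
public class SamePoolDeadlockDemo {
   public static void main(String[] args) throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(2);
      for (int i = 0; i < 2; i++) {
         pool.submit(() -> {
            // Simulates a remote command that sends a synchronous invalidation and then
            // blocks for the ACK, which is queued behind it on the same pool.
            Future<String> ack = pool.submit(() -> "ACK");
            return ack.get(5, TimeUnit.SECONDS); // times out: no free thread left to produce the ACK
         });
      }
      pool.shutdown();
      pool.awaitTermination(10, TimeUnit.SECONDS);
   }
}
```

Increasing the pool size only postpones the problem, which is why the algorithm keeps non-blocking commands on the JGroups pool and routes potentially blocking ones to a separate, queued remote executor.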
90 | 91 | ## State Transfer Forwarding 92 | 93 | This problem happens only with transactional caches. When a topology changes, the nodes involved in the transaction will forward the command (prepare and commit) for the new owners. Since the forward commands are processed in the same executor service and the "non-forward" commands, it can happen the executor service to be full. -------------------------------------------------------------------------------- /Clustered-listeners.md: -------------------------------------------------------------------------------- 1 | #Context 2 | The current(as of Infinispan 6.0) eventing support in Infinispan has the following limitation: the registered listeners are local to the node where the event has produced. E.g. considering an distributed cache with _numOwners=2_ which runs on nodes {A,B,C,D,E} and _K_ being a key in the cache, with owners(K)={A,B}. A [@CacheListener](http://docs.jboss.org/infinispan/6.0/apidocs/org/infinispan/server/websocket/CacheListener.html) instance registered on node C does not receive notifications on any events related to the key _K_ such as the entry being created, removed or updated. It is only the **local** listeners in A and B which receive this notifications. 3 | 4 | This wiki page describes an design approach for implementing **cluster** listeners: listeners that are to be notified on certain events disregarding where the events actually occurred. Resuming the previous example, a **cluster** listener registered on C can subscribe to notifications of events originated by adding, removing or updating _K_. 5 | 6 | ## Befits 7 | There are several problems that **clustered** listeners help solving: 8 | * [materialized views] (http://en.wikipedia.org/wiki/Materialized_view): for queries that are run very frequently against the grid (e.g. few k times a second) instead of repeating the query on every request, one can register a cluster listener that defines the query and keep track of the query result. This fits very nicely for situations where the query is run frequently but the result of the query is amended rarely. 9 | * [Complex Event Process (CEP)] (http://en.wikipedia.org/wiki/Complex_event_processing) because the listeners allow keeping state between invocations, certain basic CEP conditions can be implemented within the listeners. For more complex state processing (or in order be allowed to define CEP logic dynamically) an proper CEP engine can be plugged in, such as [Drools](http://www.jboss.org/drools/) 10 | 11 | #Suggested solution 12 | For consistency and simplicity we plan to build the cluster listener logic on top of the existing [@CacheListener](http://docs.jboss.org/infinispan/6.0/apidocs/org/infinispan/server/websocket/CacheListener.html) API. 13 | 14 | ##API changes 15 | 16 | The existing `Cache` API adds an overloaded `addListener` method with the following signature: 17 | ```java 18 | /** 19 | * Registers a listener that will be notified on events that pass the filter condition. 20 | * @param filter decides which events should be feed to the listener. Particularly relevant for cluster 21 | * listeners as it reduces the network traffic by only propagating the relevant events to the node where the 22 | * listener is registered. 23 | * @param converter creates a projection on the filtered entry's value which is then returned by the 24 | * CacheEntryCreatedEvent#getValue(). 
Particularly relevant for clustered listeners in order to reduce the 25 | * size of the data sent by the event originating node to the event consuming node (where the listener is 26 | * located). 27 | */ 28 | Cache.addListener(Object listener, Filter filter, Converter converter); 29 | ``` 30 | 31 | The listener annotation is enhanced with two new attributes: 32 | ```java 33 | public @interface Listener { 34 | /** 35 | * Defines whether the annotated listener is clustered or not. 36 | * Important: Clustered listener can only be notified for @CacheEntryRemoved, @CacheEntryCreated and 37 | * CacheEntryModified events. 38 | */ 39 | boolean clustered() default false; 40 | 41 | /** 42 | * For clustered listeners only: if set to true then the entire existing state within the cluster is 43 | * evaluated. For existing matches of the value, an entryAdded events is triggered against the listener 44 | * during registration. 45 | **/ 46 | boolean includeCurrentState() default false; 47 | } 48 | ``` 49 | 50 | ```java 51 | /** 52 | * Accepts or rejects an entry based on key, value and metadata. In the case of entries being removed, the 53 | * value and the metadata are null. 54 | */ 55 | interface Filter { 56 | boolean accept(K key, V value, Metadata metadata) 57 | } 58 | ``` 59 | 60 | ```java 61 | /** 62 | * Builds a projection of V which is send to the node where the listener is registered. 63 | */ 64 | interface Convertor { 65 | C convert(K,V, Metadata metadata); 66 | } 67 | ``` 68 | 69 | ##Lifecycle 70 | The (cluster) listener can be registered with the `Cache.addListener` method described above and is active till one of the following two events happen: 71 | * it is explicitly unregistered by invoking `Cache.removeListener` 72 | * the node on which the listener was registered crashes 73 | 74 | ##Guarantees 75 | ###Ordering 76 | For cluster listeners, the the order in which the cache is updates is reflected in the sequence of notifications received. 77 | ###Singularity 78 | The cluster listener does not guarantee that an event is sent once and only once. Implementors should guard agains a situation in which the same event is send twice or multiple times, i.e. the listener implementation should be idempotent. Implementors can expect singularity to be honored for stable clusters and outside of the time span in which synthetic events are generated as a result of _includeCurrentState_. 79 | 80 | ##Internals 81 | The sequence diagram below describes how we plan to implement listeners. 82 | ![Generated with Astah community](https://lh5.googleusercontent.com/na8Tm1WbHyhyQm0tnydnyFXBSo5mFHigg2mWdirFgcoF5JlkibOpbYE_7ru_Np3mNrb0cNguVUM=w1256-h898) 83 | * 1.1: the filter attached to the listener is broadcasted to all the nodes in the cluster. 
Each node holds a structure Filter -> {Node1, NodeX} mapping the set of nodes interested in events matching the given filter 84 | * 1.1.1.checkFilterAlreadyRegistered is a potential optimization for the case in which the same filter is used by multiple listeners within the cluster 85 | * 2: for each write request, the the registered filters are invoked and the corresponding remote listener(s) are notified 86 | 87 | ###Handling includeCurrentState=true 88 | * in parallel with 1.1.1 a Map/Reduce command is run against the cluster that would feed @CacheEntryCreated events into the listener instance 89 | * after 1.1.1 completes, for every operation matching the filter, instead of 2.1.1:notifyRemoteFilterOnMatch, the node keeps a (bounded) queue of the modifications for future processing 90 | * after the Map/Reduce task is finalized the node where the listener is installed broadcasts a message to the cluster allowing the queue messages to be flushed 91 | * this approach guarantees ordering 92 | 93 | ###Filter housekeeping 94 | If a node crashes, then the filters originated from that node are removed from the list (assuming the filter is not in use by listeners registered on other nodes). This can be implemented using a ViewChangeListener. 95 | 96 | ###Handling topology changes 97 | Cluster registration information should be propagated as part of the state transfer, for new joining nodes. This would be an additional step in the state transfer process, together with the memory state and the persistent state. Other functionality might need to transfer state as well (e.g. remote listeners to transfer the remote listener information) so we might want to make this a pluggable mechanism. 98 | 99 | #Related 100 | JIRA: [ISPN-3355] (https://issues.jboss.org/browse/ISPN-3355) 101 | A relevant [discussion](http://markmail.org/thread/qanjofgmpyvdjnmg) on the mailing list is (even though interlaced with the partitioning one). -------------------------------------------------------------------------------- /A-continuum-of-data-structure-and-query-complexity.asciidoc: -------------------------------------------------------------------------------- 1 | :linkcss!: 2 | :source-highlighter: highlightjs 3 | :toc: 4 | 5 | This essay is a discussion on the various data structure complexities, 6 | and how to implement various queries on top of them. 7 | 8 | ## A continuum of data structure complexity 9 | 10 | Data as *blob* is the simplest form. 11 | Being opaque, the datastore cannot do much about it. 12 | 13 | Data as *map* (set of property/values) offers the ability for the datastore to expose each "property". 14 | 15 | Data as a *self contained object graph* is more flexible than map. 16 | *Embedded objects* and collection of embedded objects are expressible. 17 | A per entry or shared *schema* can be imposed on entries and offer validation. 18 | Note that data is not duplicated between different keys here. 19 | 20 | Data as a set of *connected objects* offer the most flexibility. 21 | While each entity can contain (collections of) embedded objects, 22 | they can also connect to other entities. 23 | 24 | [NOTE] 25 | .Critical definitions 26 | ==== 27 | An embedded object lifecycle is entirely dependent of its owning object. 28 | An entity has an independent lifecycle compared to other objects. 29 | 30 | * Entities connected to each other are connected objects. 31 | * Entities only pointing to embedded objects are self contained object graphs. 
32 | ==== 33 | 34 | For reference, document stores tend to offer the self contained object graph approach. 35 | Connected objects have to be built on top by the application. 36 | 37 | ### OO 38 | 39 | OO implies polymorphism, so when someone looks up or query an +Animal+, 40 | it can be a +Cat+, +Dog+, +Human+ etc. 41 | Some queries are addressed to a specific subtype. Some are to the higher type. 42 | 43 | ## A continuum of query 44 | 45 | This section describes the possible approaches to implement query on your data set. 46 | This realistically imposes restrictions on how query-able data can be stored on a given system. 47 | 48 | I am not trying to list all possible techniques and optimizations. 49 | Enough to get a cost and understanding of what is at stake. 50 | In particular and for brievety, I focus on the filtering part of a query, not the aggregation: 51 | +from ... where ...+ as opposed to +select ... group by ... having+. 52 | 53 | ### Query by primary key 54 | 55 | When data is a blob, the efficient way to query is by its *primary identifier* (key in the data grid). 56 | All other queries require one or more passes of *full scan*. 57 | 58 | ### Query by property value 59 | 60 | When data is a map and to most extend a self contained object graph, 61 | you can additionally query data by their *property values*. 62 | To do that efficiently, an index (property-value->primary key) is maintained. 63 | For a relatively modest cost at CUD time, we avoid the full scan of the data set. 64 | An index can also offer additional features like full-text search. 65 | 66 | An index can be maintained manually by the application or delegated to the datastore. 67 | 68 | IMPORTANT: 69 | Full scan are to be avoided if possible, especially when part of the data set is passivated. 70 | Even when fully in memory, a full scan can keep a thread busy for a while compared to an index. 71 | 72 | An alternative is to physically store data ordered by the property value we are looking for. 73 | This is a trick often used for time series. 74 | This is a single-shot pistol though as you can only select one property. 75 | 76 | For the curious, systems like Google Dremel store data completly differently 77 | and pay the cost at different time. 78 | 79 | ### Query between entities 80 | 81 | Queries between related entities can be done in several ways. 82 | 83 | Materialized view:: 84 | A structure that physically represents the results of a query. 85 | It has to be maintained. 86 | When a CUD happens, all related materialized views need to be updated. 87 | This is essentially a denormalization. 88 | 89 | Data denormalization:: 90 | Store the entity A and its associated entities B (and C and D...) under one key. 91 | When you denormalize, something must keep the duplicated copies synchronized. 92 | The benefit is that these queries are becoming "as simple as" query by property value. 93 | 94 | Index denormalization:: 95 | Index the necessary object graph and store that information in the index. 96 | For a given entity A, the list of Bs (and Cs and Ds) associated will be stored in the index. 97 | To achieve that, something must know that A is associated to these Bs and Cs and must maintain the index. 98 | The benefit is that these queries are becoming "as simple as" query by property value using indexes. 99 | 100 | Computed via full scan (on the fly join):: 101 | This scenario is orders of magnitude worse than the full-scan for a given query by property value. 102 | Assuming a join between A and B. 
103 | For each matching A, you need to find and load the matching Bs. 104 | If the join is applied on the non primary keys, we are looking at best at 2 full scans. 105 | And in a distributed world, the full scan must encompass all primary nodes. 106 | Multiple joins leads to an exponential explosion. 107 | 108 | All of the above:: 109 | In practice, a datastore needs to implement a combination of these techniques. 110 | A query planner is then necessary to decide which technique to use for a given part of a query and data set. 111 | This planner must be fed with data statistics and with the relations between these entities and properties. 112 | Pretty much like a RDBMS. 113 | 114 | Note that when one needs to load a set of data (entities A) to find the related set of data (entities B), 115 | one often end up with something called the n+1 problem. n+1 lookups must be performed. 116 | 117 | ## A continuum of use cases (kind of the conclusion) 118 | 119 | If the system only accesses data by its primary key, life is easy. 120 | 121 | If the system only has maps or self contained object graphs, 122 | the data for a given entity type is all contained in the same cache 123 | and indexes can be built to speed common queries. 124 | Full scan can cover the rest with a higher price. 125 | 126 | If the system stores connected entities (whether in the same cache or not), 127 | life becomes complicated: 128 | 129 | * you need to write a pretty *smart query planner / executor* or be shit slow 130 | * more importantly, *something needs to know about the relations between the entities* and 131 | maintain the denormalization structures (index, actual data duplication etc) 132 | ** to maintain an index with denormalized data 133 | ** to maintain physically denormalized data 134 | ** because multiple full scans for each query should be a non starter 135 | 136 | That something can be: 137 | 138 | 1. the user application (manually) 139 | 2. a framework or paradigm which deals with interconnected entities and store them in the datastore 140 | 3. the datastore itself 141 | 142 | Infinispan is not in the business of 3. 143 | It is not a relational database. 144 | And the currently exposed API are a long way from achieving this. 145 | 146 | The user would be hard pressed to implement and maintain the data denormalizations we are talking about 147 | unless they are very limited and don't evolve. 148 | Same when it comes to use this denormalized data to write efficient queries. 149 | I suspect that in practice, this is a nightmare. 150 | 151 | That leaves a framework or paradigm dealing in interconnected entities. 152 | 153 | I believe a *cache API and Hot Rod are well suited to address up to the self contained object graph* use case 154 | with a couple of relations maintained manually by the application but that cannot be queried. 155 | 156 | For the connected entities use case, only a high level paradigm is suited like JPA. -------------------------------------------------------------------------------- /Conflict-resolution-perf-improvements.md: -------------------------------------------------------------------------------- 1 | Conflict Resolution Performance Improvements 2 | ==================== 3 | Currently conflict resolution (CR) performance is dog slow, due to the following: 4 | 5 | 1. All segments and their cache entities are checked 6 | - Performance degrades rapidly as the size of the cluster increases 7 | 8 | 2. 
Centralised CR 9 | - The merge coordinator is responsible for comparing all cache entities 10 | - In a distributed cache, this requires the coordinator to retrieve all segments from both partitions in order 11 | to compare them 12 | - User triggered CR occurs on the node the request is made 13 | 14 | 3. No parallelism 15 | - A single segment is processed at a given time in order to avoid the centralised CR coordinator suffering a OOM exception 16 | 17 | # Proposals 18 | The following proposals are listed in the order of the anticipated performance benefit, with those with greatest impact 19 | presented first. 20 | 21 | ## Maintaining a Merkle tree 22 | To minimise the number of entries that need to be checked during CR, we should maintain a Merkle tree of cache entry hashes 23 | on a per segment basis, with the root of the tree being a hash of all the entry hashes in the segment. Given the root hash 24 | of a segment from three different nodes, we can determine that no conflict exists simply by checking if all three values 25 | are equal. If any of the three hashes differ, then it's then necessary to compare the segment's entries and perform CR. 26 | Amazon Dynamo DB utilises this technique https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf. 27 | 28 | The simplest solution for producing such a tree, is to simple have a tree of depth 2. Where each cache entry is a leaf node 29 | and the root is a hash of all leaf hash values. This can be easily implemented by simply iterating over the segment container 30 | in order to calculate the hash of individual entries and the combined hash. 31 | 32 | TRACKED BY: [ISPN-8412](https://issues.redhat.com/browse/ISPN-8412) 33 | 34 | ### Hash to use 35 | MurmurHash3? 36 | 37 | ### Calculating the entry hash 38 | Simply hash the `.hashcode()` returned by the `InternalCacheEntry` implementation. 39 | 40 | The `equals` and `hashcode` methods of our `InternalCacheEntry` implementations should be updated to take into account 41 | the `EntryVersion` stored in the Metadata. Currently this is ignored by CR, however including the check would allow the 42 | following conflicts to be resolved deterministically via `LATEST_VERSION` resolution strategy: 43 | 44 | ```java 45 | put(k, v, 1) // 1. put k/v with version 1 46 | put(k, v', 2) // 2. 47 | put(k, v, 3) // 3. Missed by partition 1 48 | ``` 49 | By comparing the entry versions during CR, it's possible to ascertain that the original value missed by a partition in 50 | step 3 is in fact the latest version. 51 | 52 | Even if a different merge strategy was used, maintaining an entries `EntryVersion` value is necessary in order for 53 | HotRod conditional operations to work as expected for the winning value post CR. 54 | 55 | ### Calculating the tree 56 | The computation of non-leaf nodes should only occur at the start of CR, as the additional iteration of entries would 57 | adversely affect cache write operation performance. Furthermore, it's necessary to ensure that both the in-memory and 58 | store entries are included in the creation of the tree. The cost of calculating invidual leaf node hashes should be minimal, 59 | depending on the hash used, so this could be computed actively or lazily. 60 | 61 | Cache operations that occur during the CR phase should be treated as the latest value and should overwrite any writes 62 | that occurr as part of CR. Therefore, once the tree has being created at the start of CR it's state should be immutable 63 | until CR is resolved/aborted. 
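
To make the depth-2 tree concrete, a minimal sketch of the segment root computation is shown below. The class and the `mix` helper are illustrative stand-ins (MurmurHash3 would be used in practice); they are not actual Infinispan APIs.

```java
import java.util.Map;

/**
 * Sketch only: a depth-2 Merkle "tree" for one segment. Each leaf is the hash of an
 * entry (which should incorporate the EntryVersion, as noted above) and the root is
 * a combination of all leaf hashes.
 */
final class SegmentRootHash {

   /** Computes the segment root from the in-memory and store entries of that segment. */
   static int rootHash(Iterable<? extends Map.Entry<?, ?>> segmentEntries) {
      int root = 0;
      for (Map.Entry<?, ?> entry : segmentEntries) {
         // Leaf hash: in the design this would be the InternalCacheEntry.hashCode().
         int leaf = mix(entry.hashCode());
         // Addition keeps the root independent of the container's iteration order.
         root += leaf;
      }
      return mix(root);
   }

   // Stand-in for MurmurHash3 finalization; any well-distributed int mixer works here.
   private static int mix(int h) {
      h ^= h >>> 16;
      h *= 0x85ebca6b;
      h ^= h >>> 13;
      h *= 0xc2b2ae35;
      h ^= h >>> 16;
      return h;
   }
}
```

With this in place, each node only ships one `int` per segment at the start of CR, and a segment's entries are exchanged and compared only when the root hashes differ.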
64 | 65 | ### Increasing Merkle Tree Depth 66 | A segment root with all entries being it's children is not very efficient when a segment contains a large number of values, 67 | as when an inconsistency is discovered, it's still necessary to send all entries within the segment over the wire. Increasing 68 | the depth of the tree means that it's possible to perform more fine-grained consistency checks, with a larger tree depth 69 | and smaller amount of leaf nodes resulting in less entries having to be sent over the wire when inconsistencies do occur. 70 | 71 | #### Constant lookup/removal tree - Depth 3 72 | This is a simple implementation which adds an additional layer of depth to the tree in order to increase CR granularity: 73 | 74 | * A segment is further divided into `X` nodes, which are the root's childen 75 | - These nodes represent a range of a cache's `key` hashcodes, e.g. 0..5, 6..10 etc 76 | - `(hash(key) & Integer.MAX_VALUE) \ X` 77 | - MurmurHash3 can also be used here 78 | - Where `X` could be configurable, but should probably just be an internal constant 79 | 80 | * Each node contains: 81 | - A Map to store leaf nodes, which enables amortised constant lookup and to accomodate `.hashcode()` conflicts on keys 82 | - A node hash field containing a hash of all the leaf node's hashes combined 83 | 84 | * The tree, excluding leafs, can be represented as a int and int[X] (Root+X hashes) in a RPC 85 | 86 | * During CR if two tree's root values don't match, it's then possible to compare all `X` hashes and only request the indexes 87 | of the array which have conflicting hashes. At which point the participating nodes only return the InternalCacheEntries 88 | associated with those nodes. 89 | 90 | ### Offline Backup Consistency Check 91 | The Merkle tree hashes can also be utilised to ensure the consistency of data being dumped to an offline backup. When a 92 | user initiates a backup: 93 | 94 | 1. The cluster initiates CR to ensure that no inconsistencies exist before the backup 95 | 2. A new Merkle tree is created, with the root node being the hash of all primary replica segment hashes 96 | 3. The root hash is stored as part of the backup metadata 97 | 4. When a cluster is restored from a backup, the clusterwide Merkle tree can be recreated and the new root hash can be 98 | compared to the value stored in the metadata. 99 | 100 | > Utilising Merkle trees in this manner would mean that it's not possible to make changes to how entries are hashed 101 | and how the tree is created without an appropriate migration strategy between Infinispan versions. 102 | 103 | ## Distributed CR processing 104 | Instead of CR being coordinated by a single node, the coordinator should instruct the primary owner of each segment to 105 | initiate CR. As it's likely that a single node will be the primary for multiple segments in a small cluster, we should 106 | still limit each node to executing CR for a single segment at a time. 107 | 108 | Distributing CR also benefits from the use of Merkle trees, as it means that the primary owner in the preferred partition 109 | who is coordinating the CR would not have to send their tree over the wire. 110 | 111 | TRACKED BY: [ISPN-9084](https://issues.redhat.com/browse/ISPN-9084) 112 | 113 | ## Prioritise segments based upon requests during merge 114 | During a partition merge, if CR is in progress and a request is made on a specific key before it's segment has been processed, 115 | then we should perform CR on the key in place. 
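
In concrete terms, the read path during a merge could look roughly like the sketch below; `MergeConflictManager`, `isResolutionPending` and `resolveKey` are hypothetical names used for illustration, not the existing conflict-resolution API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Function;
import java.util.function.ToIntFunction;

/**
 * Sketch of "CR on the key in place": if the key's segment has not yet been processed
 * by the merge-wide conflict resolution, resolve just that key before serving the
 * request. All names here are illustrative.
 */
final class InPlaceConflictResolution<K, V> {

   interface MergeConflictManager<K> {
      /** true while the segment is still queued for merge-time conflict resolution */
      boolean isResolutionPending(int segment);

      /** fetch the key's versions from both partitions, apply the merge policy, write the winner */
      CompletionStage<Void> resolveKey(K key);
   }

   private final MergeConflictManager<K> conflictManager;
   private final ToIntFunction<K> segmentOf;                  // e.g. the cache's key partitioner
   private final Function<K, CompletableFuture<V>> readPath;  // the normal read invocation

   InPlaceConflictResolution(MergeConflictManager<K> conflictManager, ToIntFunction<K> segmentOf,
                             Function<K, CompletableFuture<V>> readPath) {
      this.conflictManager = conflictManager;
      this.segmentOf = segmentOf;
      this.readPath = readPath;
   }

   CompletionStage<V> get(K key) {
      if (conflictManager.isResolutionPending(segmentOf.applyAsInt(key))) {
         // Resolve this key ahead of its segment, then serve the read as usual.
         return conflictManager.resolveKey(key).thenCompose(ignored -> readPath.apply(key));
      }
      return readPath.apply(key);
   }
}
```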
116 | 117 | The advantage is that conflicts on an active entitiy are resolved sooner, furthemore, if (Merkle Trees)[Maintaining-a-Merkle-tree>] 118 | are implemented, it potentially could result in a conflict(s) being resolved before the segment is checked, meaning all entries 119 | in the segment do not have to be retried. 120 | 121 | The disadvantages are the increased complexity, which will only increase if (#Distributed CR processing) is implemented. 122 | 123 | TRACKED BY: [ISPN-9079](https://issues.redhat.com/browse/ISPN-9079) 124 | -------------------------------------------------------------------------------- /RAC-implementation.md: -------------------------------------------------------------------------------- 1 | Infinispan Reliable Asynchronous Clustering 2 | =========================================== 3 | 4 | Original Document: [RAC: Reliable Asynchronous Clustering](RAC%3A-Reliable-Asynchronous-Clustering.asciidoc) 5 | Implementation: Pedro Ruivo, 2019 6 | 7 | # Changes from the original design 8 | 9 | * The `Updaters` are the owners of the `key` and the `primary-owner` is responsible to send 10 | the updates to the remote site. 11 | 12 | * The *Conflict Resolution* was re-done, and it uses versioning (`vector-clock`) to detect conflicts and duplicates. 13 | It will be described later in this document. 14 | 15 | # Overview 16 | 17 | This section describes a short overview of the algorithm. 18 | 19 | When a `key` is updated (doesn't matter if it is a transaction or not), all the owners keeps track the `key` by 20 | storing it in a queue. 21 | Periodically, the `primary-owner` iterates over all `keys` queued and sends its value to the remote site. 22 | To fetch the values, it peeks the `DataContainer` and `CacheWriter`. 23 | If the value isn't found, it sends a `remove` request to the remote site, 24 | otherwise an update request with the `key` and its value. 25 | 26 | After a successful replication (and `ack` is sent back from the remote site), the `primary-owner` sends a `cleanup` 27 | request to the `backup-owners`. 28 | They drop the `key` from the queue. 29 | 30 | Note that, the `lock` for the `key` isn't acquired during this process. 31 | 32 | # Conflict Resolution 33 | 34 | The conflict happens when two or more sites update the same `key` concurrently. 35 | It uses `vector-clocks` to detect such scenarios and perform the conflict resolution to decide which 36 | update stays and which update is discarded. 37 | 38 | The vector clock has the following structure: 39 | 40 | ``` 41 | { 42 | site-1: [topology, version] 43 | site-2: [topology, version] 44 | ... 45 | } 46 | ``` 47 | 48 | The `site-1` and `site-2` are the site's names as configured in `RELAY2`. 49 | 50 | The `topology` is a number associated to the topology changes for that site. 51 | When a node joins/leaves/crashes, the `topology` increases and the `version` resets to `0`. 52 | 53 | The `version` is a number that increments on each update for a segment. 54 | To improve the concurrency and reduce the memory footprint, the `version` is associated to a segment, 55 | instead of having one per `key`. 56 | 57 | *Note:* The `clear` operation is destructive, and it doesn't generate any new `version` neither checks for conflicts! 58 | 59 | ### Version comparator 60 | 61 | To define the order for a `[topology, version]` pair, it compares the elements. 62 | The `topology` is compared first and, if `p1.topology` < `p2.topology`, than pair `p1` is less than pair `p2`. 
63 | If the `topology` is the same, then `version` with the same logic as above to determine the ordering. 64 | Examples: 65 | 66 | ``` 67 | [1,10] is less than [2,0] 68 | [1,11] is higher than [1,10] 69 | ``` 70 | 71 | Finally, to compare `vector-clocks`, it compares the pairs for each site. 72 | If for all sites, the pair is lower, then the `vector-clock` is less than the other. 73 | On other hand, if for some sites the pairs are less and for other the pair is higher, then we have a conflict. 74 | Example: 75 | 76 | .vector-clock 1 77 | ``` 78 | { 79 | "LON": [1,1] 80 | "NYC": [0,0] 81 | } 82 | ``` 83 | 84 | .vector-clock 2 85 | ``` 86 | { 87 | "LON": [0,0] 88 | "NYC": [1,1] 89 | } 90 | ``` 91 | 92 | 93 | ### Resolving conflicts 94 | 95 | The current conflict resolution algorithm uses the site names to resolve the conflict. 96 | It uses the lexicographic order to determine which update wins. 97 | In the example above, the update from site `LON` will be applied, and the update from site `NYC` will be discarded. 98 | 99 | The future plan includes a SPI so users can do their own conflict resolution. 100 | See [ISPN-11802](https://issues.redhat.com/browse/ISPN-11802). 101 | 102 | ### Tombstones 103 | 104 | When a `key` is removed, it stores a `tombstone` with the `vector-clock` corresponding to that update operation. 105 | The `primary-owner` sends the `tombstone` to the remote site, together with the `remove` request, 106 | in order the remote site to perform a correct conflict resolution. 107 | 108 | # Implementation details 109 | 110 | ## Classes 111 | 112 | ### `IracManager` 113 | 114 | It is responsible to keep track the `keys` updated and to send them to the `remote site`. 115 | 116 | ### `IracVersionGenerator` 117 | 118 | It is responsible to generate the `vector-clock` and to store the `tombstones`. 119 | 120 | ### `*BackupInterceptor` 121 | 122 | It intercepts the `WriteCommand` and notified the `IracManager` with the `keys` modified by them. 123 | 124 | ### `*IracLocalInterceptor` 125 | 126 | It intercepts the local updates, and it does the following: 127 | 128 | * In the `primary-owner` generates a new `vector-clock` and stores it in the `WriteCommand` for the `backup-owners`. 129 | * In the `backup-owners` just fetches the `vector-clock` from the `WriteCommand` and store it. 130 | 131 | It interacts with the `IracVersionGenerator`. 132 | 133 | ### `*IracRemoteInterceptor` 134 | 135 | It intercepts the updates from the remote site and the `primary-owners`performs the conflict resolution. 136 | 137 | ## Transactions 138 | 139 | The transactional caches need some special handling. 140 | They have the following issues: 141 | 142 | * The `vector-clock` must be generated during the `prepare` phase and not when the `WriteCommand` is executed. 143 | * The conflicts must be detected before the transaction reaches it final outcome. 144 | * The transaction originator may not be the `primary-onwer` for a particular `key` and 145 | only the `primary-onwer` must generate the `vector-clock`. 146 | 147 | ### Optimistic Transactions 148 | 149 | The Optimistic Transactions commits in 2 phases, make it easer to target the issues above. 150 | During the prepare phase the `primary-onwer` generates the `vector-clock` and sends it back to the originator. 151 | 152 | The new `vector-clocks` is sent together with the `CommitCommand` to all the `backup-onwers`. 153 | 154 | When an update from the remote site is received, it is wrapper in a transaction. 
155 | This transaction only contains a single `WriteCommand`, and the `primary-owner` checks for conflict 156 | during the prepare phase. 157 | If a conflict is detected and the update needs to be discarded, it aborts the transaction. 158 | 159 | ### Pessimistic Transactions 160 | 161 | The Pessimistic Transactions commits in one phase and, it requires a bit of work to handle IRAC properly. 162 | 163 | The `primary-owner` acquires lock for a `key` during the transaction execution. 164 | So, after the lock request, the originator sends a request to obtain a new `vector-clock` for that `WriteCommand`. 165 | It sends the request asynchronously and, it only waits for the reply, just before committing the transaction. 166 | 167 | Conflict resolution requires a little more work. 168 | To avoid complicate the code dramatically, when an update from remote site is received 169 | it must be forwarded to the `primary owner`. 170 | It is wrapped in the transaction and the originator (`primary-owner`) has all the data needed to check for conflicts. 171 | 172 | ## Topology Changes 173 | 174 | The `topology` element in the `[topology, version]` pair is here to provide consistency in case of topology changes, 175 | and it changes every time a topology change. 176 | 177 | When the `primary-owner` crashes, the "new" `primary-owner` will start sending the updates to the remote site. 178 | It may send duplicated updates for `keys` sent but yet not confirmed by the previous `primary-owner`. 179 | With the `vector-clock`, the duplicated are detected and ignored. 180 | 181 | Finally, the state transfer sends the pending `keys` and `tombstone` for new nodes that become owners for a segment. 182 | 183 | -------------------------------------------------------------------------------- /Handling-cluster-partitions.md: -------------------------------------------------------------------------------- 1 | # Problem 2 | It is possible and should be expected for a cluster to form multiple partitions (aka. [split brain](http://en.wikipedia.org/wiki/Split-brain_(computing)). E.g. if the cluster has the initial topology {A,B,C,D,E}, because of a network issue (e.g. router crash) it is possible for the cluster to split into two partitions {A,B,C} and {D,E}. Assuming DIST mode with _numOwners=2_, both partitions end up holding a subset of data and can individually be in inconsistent state. 3 | Partitions cause inconsistencies: if in the original cluster the key _K_ is owned by nodes D and E, after the split brain the {A,B,C} partition considers _K_ as null. 4 | At the moment Infinispan allows the users to be notified in the eventuality of a partition by registering [ViewChanged](http://docs.jboss.org/infinispan/6.0/apidocs/index.html?org/infinispan/notifications/class-use/Listener.html) listers but it doesn't offer any support for the user to react to the partition e.g. by making the cluster as inactive in order to avoid inconsistent reads. 5 | This wiki page documents a solution we consider for better handling of cluster partitions. 6 | 7 | # Possible approaches 8 | There are several approaches that can be taken for either mitigating or (eventually) solving the partitioning problem: 9 | 10 | 1. Redundant infrastructure 11 | * using two (or more) physical networks infrastructures the cache for partitions to happen can be reduced significantly. E.g. if the cluster is connected by two networks, each network having an availability rate of 99%, then the overall availability of the system is 99.99%. 
This redundancy can be configured at OS level through [IP bonding](http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces-nic-into-single-interface.html) and doesn't require any additional Infinispan/JGroups configuration 12 | * note that this approach, whilst feasible for many situations doesn't entirely avoid the possibility for the partition to happen 13 | 2. Partition merging 14 | * in this approach the partitions can progress individually accepting read/writes from the user application (might cause inconsistencies as described above). When the two partition discover each other as a result of the network healing, the state of the two partitions is merged. There are several approaches to merge the state: e.g. automatic (would require each entry to be versioned) or pass the merging logic to the application. 15 | * note that when the two partition run in parallel the data is inconsistent(AP from the [CAP](http://en.wikipedia.org/wiki/CAP_theorem) theorem) 16 | 3. Primary partition 17 | * both partitions make a deterministic decision on which partition to stay active and which one to go in read-only mode (or even stop serving users entirely). The decision can be made based on quorum, e.g. the partition having _numMembers/2 + 1_ nodes to win (or both to loose if a deterministic decision cannot be made, e.g. for even clusters). 18 | * when the network heals, the "loosing" partition can merge into the active partition (which might have been modified in between) by wiping out its state and re-fetching it (state transfer) 19 | 20 | # Our approach 21 | In the first iteration we plan to enhance the support for partition detection and allow the user to react to a partition happening (custom policy) by making a partition as UNAVAILABLE (stop answering users' request), READ_ONLY or AVAILABLE. This is along the lines of item 3 (Primary partition) as described in the previous section. 22 | 23 | # Design 24 | 25 | ## New API and Config 26 | The partition handling policy is configured in the _availability_ section of the global configuration: 27 | ```xml 28 | 29 | 30 | 31 | 35 | 36 | 40 | 41 | 42 | ... 43 | 44 | ``` 45 | 46 | ```java 47 | interface PartitionContext { 48 | /** 49 | * returns the list of members before the partition happened. 50 | */ 51 | View getPreviousView(); 52 | 53 | /** 54 | * returns the list of members as seen within this partition. 55 | */ 56 | View getCurrentView(); 57 | 58 | /** 59 | * Returns true if this partition might not contain all the data present in the cluster before 60 | * partitioning happened. E.g. if numOwners=5 and only 3 nodes left in the other partition, then 61 | * this method returns false. If 6 nodes left this method returns true. 62 | * Note: in future release for distributed caches, this method might do some smart computing based on 63 | * segment allocations, so even if > numOwners left, this method might still return true. 64 | */ 65 | boolean isDataLost(); 66 | 67 | /** 68 | * Marks the current partition as read only (writes are rejected with an AvailabilityException). 69 | **/ 70 | void currentPartitionReadOnly(); 71 | 72 | /** 73 | * Marks the current partition as available or not (writes are rejected with a 74 | * AvailabilityException). 
75 | **/ 76 | void currentPartitionAvailable(boolean available); 77 | } 78 | ``` 79 | 80 | ```java 81 | interface PartitionHandlingStrategy { 82 | /** 83 | * Implementations might query the PartitionContext in order to determine if this is the primary 84 | * partition, based on quorum and mark the partition unavailable/readonly. 85 | **/ 86 | void handlePartition(PartitionContext pc); 87 | } 88 | ``` 89 | 90 | ## Implementation details 91 | * a new `AvailabilityInterceptor` is added, having 3 states: available, readOnly, unavailable. Based on its state the interceptor might allow, reject the writes or respectively reject all operations to the cache 92 | * when an operation is rejected a custom exception exception is thrown to the user indicating the fact that the partition is not available (AvailabilityException) 93 | * the `PartitionContext.currentPartitionAvailable` and `PartitionContext.currentPartitionReadOnly` methods set the state of the `AvailabilityInterceptor` and are invoked by configured `PartitionHandlingStrategy` implementation 94 | * the status of `AvailabilityInterceptor` is to be exposed through JMX operations as well (read/write) 95 | * we might also provide an @Merge listener implementation to automatically merge a primary partition with an secondary (unavailable) partition by making the later wipe out it state and re-fetch it from the former. This is a useful auto-healing tool for situations where the partitioning doesn't happen because of an network error but because of e.g. a long GC on an isolated node. 96 | 97 | ## To be further considered 98 | The partition handling strategy is intended for the whole cache manager. Wouldn't it make more sense to have a per cache strategy? E.g. a certain cache might not even be affected by a partition (e.g. if asymmetric clusters are used). 99 | 100 | #Related 101 | * JIRA: [ISPN-263] (https://issues.jboss.org/browse/ISPN-263) 102 | * An interesting [mail discussion](http://infinispan.markmail.org/search/#query:+page:3+mid:qanjofgmpyvdjnmg+state:results) around the subject -------------------------------------------------------------------------------- /Multi-tenancy-for-Hotrod-Server.asciidoc: -------------------------------------------------------------------------------- 1 | Context 2 | ~~~~~~~ 3 | 4 | In cloud enabled environments it would be beneficial to run Infinispan as a Service. Unfortunately current implement lacks multi tenancy and requires spinning separate Hot Rod server per tenant. 5 | 6 | This design doc addresses those concerns and the implementation should allow running Infinispan in the Cloud environment and serving data for multiple tenants. 7 | 8 | Current implementation 9 | ~~~~~~~~~~~~~~~~~~~~~~ 10 | 11 | Currently in Infinispan 8/9 we have some sort of multi tenancy implementation - multiple `cache-containers`s in https://github.com/infinispan/infinispan/blob/278597ce4864e9e857ef5ab2650af5c08badae9d/server/integration/infinispan/src/main/resources/schema/jboss-infinispan-core_8_2.xsd#L39-L39[Infinispan Subsystem]. The problematic part is Infinispan Endpoint, which expects https://github.com/infinispan/infinispan/blob/614e35f3927f2c73b4d24703ef1d9ba0dd40fb39/server/integration/endpoint/src/main/resources/schema/jboss-infinispan-endpoint_8_0.xsd#L26-L26[a single cache-container per server]. 12 | 13 | In other words - it is possible to have a multi tenant Hot Rod server now, but each `cache-container` would have to be served on separate port. 
14 | 15 | Core Infinispan doesn't support multi tenancy (because of symmetry, it parses configuration file which looks like server side xml and it really can't handle multiple `cache-container` elements). 16 | 17 | Finally the Hot Rod protocol also assumes that it connects to a single `cache-container`. 18 | 19 | Scope of changes 20 | ~~~~~~~~~~~~~~~~ 21 | 22 | Our standard use cases don't need to be extended with multi tenancy (library mode). It is only a matter of implementing this functionality on the server side. Having said that, the changes should be limited to: 23 | 24 | * Hot Rod Server - it needs to serve data for multiple tenants on a single endpoint 25 | * Hot Rod Clients - they need to be able to connect to different tenant 26 | * Rest and Hot Rod protocol. Memcached server and Web Socket servers are out of the scope. 27 | 28 | The implementation idea - Multi Tenant Router 29 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 30 | 31 | The main idea for the implementation is to create another layer on the top of Protocol Servers (REST, Hot Rod), which is responsible for routing requests to proper server. The router will be attached to endpoint (instead of a Protocol Server instance) and will accept incoming requests and forward them to the proper server. 32 | 33 | Implementing the Router as a new service attached to the endpoint has the advantage of backwards compatibility in terms of configuration. In other words, the administrators will still be able to attach Protocol Server to the endpoint as they did prior to Infinispan 9 but additionally they could use the router. 34 | 35 | Configuration 36 | ~~~~~~~~~~~~~ 37 | 38 | Server side configuration will look like this: 39 | 40 | ``` 41 | 42 | 43 | 44 | ... 45 | 46 | 47 | ... 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | ``` 68 | 69 | The Hot Rod client will use TLS/SSL SNI mechanism to inform the server about chosen tenant. Here is an example: 70 | 71 | ``` 72 | builder 73 | .security() 74 | .ssl() 75 | .sslContext(cont) 76 | .sniHostName("sni") 77 | .enable(); 78 | ``` 79 | 80 | Implementation details 81 | ~~~~~~~~~~~~~~~~~~~~~~ 82 | 83 | The router will be bootstrapped the same way as all other Protocol Servers - using Netty. It will use the following pipeline: 84 | 85 | ``` 86 | +--------------------+ 87 | | Request | 88 | +--------------------+ 89 | | 90 | V 91 | +--------------------+ 92 | | SNI | 93 | +--------------------+ 94 | | From this handler down we can read the content. 95 | V Up to this point it can be encrypted. 96 | +--------------------+ 97 | | Router | 98 | +--------------------+ 99 | | After the router decided where to send incoming message 100 | V it attachs Protocol Server's handlers below. 101 | +--------------------+ 102 | | Chosen server | 103 | | handler stack | 104 | +--------------------+ 105 | | 106 | V 107 | +--------------------+ 108 | | Exception | 109 | +--------------------+ 110 | ``` 111 | 112 | For HotRod, the router will determine target tenant based on SNI host name. Use cases without SSL are currently out of scope and if implemented in the future - will require additional operation in the Hot Rod Protocol. 113 | 114 | The router implementation for REST protocol will use Path prefixes. There is also possibility to use `host` header but currently it's out of the scope. 115 | 116 | Memcached and WebSocket are out of the scope. 
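
To make the routing step concrete, below is a rough sketch of the router handler using plain Netty. It assumes an earlier handler in the pipeline (for example Netty's `SniHandler`, or a REST path parser) has already stored the tenant key in a channel attribute; the attribute key and the route table are hypothetical, not the actual implementation.

```
import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.AttributeKey;
import io.netty.util.ReferenceCountUtil;

import java.util.Map;

/**
 * Illustrative sketch of the router: once the tenant is known, install the chosen
 * protocol server's handler stack and remove the router from the pipeline.
 * The attribute key and route table are hypothetical names.
 */
public class TenantRouterHandler extends ChannelInboundHandlerAdapter {

   static final AttributeKey<String> TENANT_KEY = AttributeKey.valueOf("tenant"); // e.g. the SNI host name

   private final Map<String, ChannelHandler[]> routeTable; // tenant -> protocol server handlers

   public TenantRouterHandler(Map<String, ChannelHandler[]> routeTable) {
      this.routeTable = routeTable;
   }

   @Override
   public void channelRead(ChannelHandlerContext ctx, Object msg) {
      String tenant = ctx.channel().attr(TENANT_KEY).get();
      ChannelHandler[] serverHandlers = tenant == null ? null : routeTable.get(tenant);
      if (serverHandlers == null) {
         // Unknown tenant: drop the message and reject the connection.
         ReferenceCountUtil.release(msg);
         ctx.close();
         return;
      }
      // Attach the chosen server's handlers after this one, step aside, and let the
      // first request flow into the newly installed stack.
      ctx.pipeline().addLast(serverHandlers);
      ctx.pipeline().remove(this);
      ctx.fireChannelRead(msg);
   }
}
```

The same shape works for both protocols: for Hot Rod the lookup key is the SNI host name, while for REST it would be the path prefix.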
117 | 118 | Note that existing Protocol Servers would need to be adjusted to support the router - they should not start the Netty server if there is no host/port binding. They should start Cache Managers instead. 119 | 120 | Note that this effectively means that each type of Protocol Server will spin up its own Router. 121 | 122 | Alternative approach 123 | ~~~~~~~~~~~~~~~~~~~~ 124 | 125 | Embed routing logic into the Hot Rod, REST and Memcached (to be considered if it's worth it) servers. 126 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 127 | 128 | This approach would require modifying `ProtocolServer` and its configuration (`ProtocolServerConfiguration`) and putting multiple `EmbeddedCacheManager`s into it. We would also need to put additional `RoutingHandler` into the Netty configuration which would select proper `CacheManager` for the client. Each Multi Tenant `CacheManager` would possibly share the same transport (this is the easiest way for implementing it). 129 | 130 | What do we gain by this? 131 | 132 | * We won't introduce another component on server side 133 | * The implementation will be simpler 134 | 135 | What are the pain points? 136 | 137 | * We will have `if(isMultiTenant())` statements all over the place 138 | * Since we host multiple `CacheContainer`s inside a single `ProtocolServer`, using different configuration (e.g. container 1 uses SSL whereas container 2 does not) might be problematic. 139 | 140 | Questions to confirm 141 | ~~~~~~~~~~~~~~~~~~~~ 142 | 143 | * Perhaps we can enforce all our clients to use SSL+SNI? If so - the routing protocol could be removed. 144 | ** Update: Yes, we would like to enforce all clients to use SSL+SNI for now (possibly we adjust this in the future) 145 | * How to dynamically add a new Protocol Server to existing configuration? Does current implementation support this? 146 | ** Yes, DMR supports it 147 | * Should we allow switching tenant once the client was started? How does this should work together with Near Caching (I'm assuming we should flush everything)? 148 | ** Currently out of the scope 149 | * How to protect against scanning for non-protected tenants in Cloud Environment? This could be potentially used as an attack vector. 150 | ** Since we use HTTPS, we should be good here 151 | 152 | 153 | -------------------------------------------------------------------------------- /Infinispan-query-language-syntax-and-considerations.asciidoc: -------------------------------------------------------------------------------- 1 | Authors: Adrian Nistor, Emmanuel Bernard 2 | 3 | == Syntax 4 | 5 | The basic syntax is a cleanup JP-QL + the embedding of the Lucene query syntax. 6 | 7 | TODO: describe the syntax 8 | 9 | The following sections enhance the syntax with more advanced full-text query support. 10 | 11 | === Sort 12 | 13 | Full-text search allow sorting by score, by order in the index and by distance. 14 | 15 | [SOURCE] 16 | ---- 17 | select u from User u where u.firstname : "Emmanuel" 18 | order by u.lastname, score(), index() 19 | ---- 20 | 21 | We use a function approach to differentiate properties from the full-text search operators. 22 | 23 | One can also order by distance from a given point. 
24 | 25 | [SOURCE] 26 | ---- 27 | # when a single spatial predicate is present 28 | select u from User u 29 | where u within (5 km) of (:latitude, :longitude) 30 | order by distance() 31 | 32 | # when no or several spatial predicates are present, we need to express the distance and point to distance from 33 | # option 1 34 | select u from User u 35 | order by distance(u, :latitude, :longitude) 36 | # option 2 37 | select u from User u 38 | order by distance(u) from(:latitude, :longitude) desc 39 | ---- 40 | 41 | === Spatial 42 | 43 | [SOURCE] 44 | ---- 45 | # here we use the default @Spatial index (Hibernate Search style 46 | # the latitude/longitude fields an driven by the annotation metadata 47 | select u from User u 48 | where u within (5 km) of (:latitude, :longitude) 49 | 50 | # here we use an explicit spatial name (either the coordinates property 51 | # or a synthetic property representing the tuple latitude, longitude 52 | select u from User u 53 | where u.location within (5 km) of (:latitude, :longitude) 54 | 55 | # here we use the actual latitude and longitude properties explicitly 56 | # this is not the Hibernate Search style, so one would need to find the 57 | # spatial index from these properties. 58 | # or a synthetic property representing the tuple latitude, longitude 59 | select u from User u 60 | where (u.address,latitude, u.address.longitude) within (5 km) of (:latitude, :longitude) 61 | 62 | ---- 63 | 64 | The third option does not have my preference but felt more natural initially. 65 | Both variation 1 and 2 are the one targeted. 66 | 67 | [NOTE] 68 | ==== 69 | I'm not happy about the confusion of latitude vs longitude (ordering). 70 | Which one comes first, could we have a syntax making this explicit. 71 | Apparently, latitude, then longitude is the standard. 72 | Maybe that's goo enough? 73 | ==== 74 | 75 | === All except 76 | 77 | This is a way to negate a (list of) full-text predicate meaning get all entries except the ones matching the sub predicates. 78 | 79 | [source] 80 | ---- 81 | # Preferred as it's more natural in the language 82 | select t from Transaction t 83 | where not ( t.status : +"failed" or t.status : +"blocked" ) 84 | 85 | # Alternative closer to the Hibernate Search Query DSL 86 | select t from Transaction t 87 | where all except ( t.status : +"failed" or t.status : +"blocked" ) 88 | ---- 89 | 90 | === Boosting and constant score 91 | 92 | One can boost per term or per field. 93 | One can also force a score to be ignored or made constant for a subtree of the full-text queries. 
94 | 95 | [source] 96 | ---- 97 | # term boosting 98 | # TODO check term boosting with Gustavo, does Hibernate Search supports term boosting 99 | select u from User u 100 | where u.firstname : ("Adrian"^3 "Emmanuel") 101 | 102 | # field boosting option 1 (DO NOT IMPLEMENT) 103 | select u from User u 104 | where u.firstname^3: ("Adrian" "Emmanuel") 105 | # field boosting option 2 106 | select u from User u 107 | where u.firstname: ("Adrian" "Emmanuel")^3 108 | 109 | # sub query boosting based on option2 110 | select u from User u 111 | where (u.firstname : "Adrian")^3 OR (u.firstname : "Emmanuel") 112 | ---- 113 | 114 | Now let's tackle constant score 115 | 116 | 117 | [source] 118 | ---- 119 | select u from User u 120 | where ( (u.firstname : "Adrian")^3 OR (u.firstname : "Emmanuel") ) 121 | and (u.lastname : ("Nistor" "Bernard")) )^[constant=10] 122 | ---- 123 | 124 | === Analyzer 125 | 126 | It is sometimes necessary to force a different analyzer between query time and index time. 127 | 128 | 129 | [source] 130 | ---- 131 | select u from User u 132 | where u.firstname : "Emmanuel" with analyzer "ngram" 133 | and u.lastname_3gram : ("ber" "rna" "nar" "ard")^6 with no analyzer 134 | ---- 135 | 136 | this syntax `with (no) analyzer` can only be present after a predicate and not on a composed query. 137 | 138 | === More like this 139 | 140 | This compares an entity and expect to find similar entities. 141 | 142 | We will pass the entity or its key as parameter. 143 | This means Hot Rod clients need to extract the entity type to know the protobuf schema to then properly serialize the parameter payload. 144 | 145 | [NOTE] 146 | .TODO: General question on parameters 147 | ==== 148 | Adrian, do you plan on *guessing* the parameter type from the protobuf of the targeted entity? 149 | If not, how do you plan on serializing parameter values with the query? 150 | ==== 151 | 152 | [source] 153 | ---- 154 | # Compare to a user instance (not necessarily persisted) 155 | select u from User u 156 | where u like :user 157 | comparing (u.lastname^3, u.firstname) 158 | with options (favorSignificantTermsWithFactor=3, excludeEntityUsedForComparison=true) 159 | 160 | # Compare to a user instance stored on a given key in the grid 161 | select u from User u 162 | where u likeByKey :key 163 | # we compare all fields since we did not give the compare keyword 164 | with options (favorSignificantTermsWithFactor=3, excludeEntityUsedForComparison=true) 165 | ---- 166 | 167 | [NOTE] 168 | .Many options and workaround for it 169 | ==== 170 | Some full-text options require a bunch of fine-tuning options which would be hard to embed int he syntax unless we offer a generic system. 171 | more like this offers a possible solution 172 | 173 | [source] 174 | ---- 175 | where u.property someMagicFullTextSearchOperator [some values or parameters] with options (option1=value1, option2=value2) 176 | ---- 177 | 178 | In this model, options are generic key/values deemed less important and not requiring a keyword. 179 | This could be useful for things like boolean query options like `minimumNumberShouldMatch`, dismax query etc. 180 | ==== 181 | 182 | === Explore DisMax 183 | 184 | Dismax is like a boolean query except the score of matching document is computed differently, it takes the best of the score of the document from all of the subqueries. 
185 | https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-dis-max-query.html 186 | https://lucidworks.com/blog/2010/05/23/whats-a-dismax/ 187 | 188 | Elasticsearch and Solr expose different approach to use DisMaxQuery and exposing it differently to the user. 189 | 190 | [source] 191 | ---- 192 | select u from User u 193 | where u.fistname: " 194 | ---- 195 | 196 | === Meta thinking 197 | 198 | Express all full-text things as function calls that can be nested: 199 | 200 | * AND(query1, queryn), OR 201 | * within(,,,) 202 | * morelikethis 203 | * fuzzy(field, fuzzyFactor) 204 | * etc 205 | 206 | And have a flat syntactic sugar to replace them (or a subset of their usage) 207 | 208 | === Remaining syntax TODOs 209 | 210 | * discuss generic options (see above) 211 | * how to do the additional fuzzy options (since fuzzy is not a keyword but is embedded in the Lucene syntax 212 | 213 | == Query features around the syntax 214 | 215 | Both are operations atop a query that return additional and specific informations 216 | 217 | * explain 218 | * faceting 219 | 220 | 221 | == Security 222 | 223 | At the implementation detail, we need to ensure we are not susceptible to DoS attack from a rogue client query. 224 | Possible ideas: 225 | 226 | * stop if the query payload looks too big and could lead to a huge memory consumption upon parsing 227 | * TODO: what else --------------------------------------------------------------------------------