├── .gitignore ├── .clang-format ├── resmoke.sh ├── .arcconfig ├── resmokelist ├── src ├── KNOWN_ISSUES.md ├── rocks_record_store_mock.cpp ├── rocks_options_init.cpp ├── rocks_prepare_conflict.cpp ├── rocks_server_status.h ├── rocks_counter_manager.h ├── rocks_global_options.h ├── totdb │ ├── totransaction.h │ ├── totransaction_impl.h │ ├── totransaction_prepare_iterator.h │ ├── totransaction_db.h │ └── totransaction_db_impl.h ├── rocks_global_options.cpp ├── mongo_rate_limiter_checker.h ├── rocks_util.h ├── rocks_begin_transaction_block.h ├── rocks_util.cpp ├── rocks_durability_manager.cpp ├── rocks_index_test.cpp ├── rocks_parameters.idl ├── rocks_begin_transaction_block.cpp ├── rocks_snapshot_manager.h ├── rocks_durability_manager.h ├── rocks_counter_manager.cpp ├── rocks_snapshot_manager.cpp ├── rocks_oplog_manager.h ├── rocks_compaction_scheduler.h ├── rocks_record_store_test.cpp ├── rocks_prepare_conflict.h ├── rocks_index.h ├── rocks_init.cpp ├── rocks_global_options.idl ├── mongo_rate_limiter_checker.cpp ├── rocks_record_store_mongod.cpp ├── rocks_parameters.cpp └── rocks_oplog_manager.cpp ├── README.md ├── BUILD.md ├── CONTRIBUTING.md └── SConscript /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.sw* 3 | -------------------------------------------------------------------------------- /.clang-format: -------------------------------------------------------------------------------- 1 | BasedOnStyle: Google 2 | AccessModifierOffset: -4 3 | ColumnLimit: 100 4 | IndentWidth: 4 5 | BreakBeforeBraces: Attach 6 | NamespaceIndentation: All 7 | -------------------------------------------------------------------------------- /resmoke.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ulimit -n 100000 4 | 5 | for testcase in `cat resmokelist` 6 | do 7 | echo "run suite: " $testcase 8 | python buildscripts/resmoke.py --storageEngine rocksdb 
--suite=$testcase --dbpathPrefix=/root/mongo/ci -j1 1>$testcase.log 2>&1 9 | done 10 | -------------------------------------------------------------------------------- /.arcconfig: -------------------------------------------------------------------------------- 1 | { 2 | "project_id" : "mongo-rocks", 3 | "conduit_uri" : "https://reviews.facebook.net/", 4 | "load" : [], 5 | "base" : "git:HEAD^, hg:.^", 6 | "git.default-relative-commit" : "HEAD^", 7 | "git:arc.feature.start.default" : "origin/master", 8 | "arc.feature.start.default" : "master", 9 | "history.immutable" : false 10 | } 11 | -------------------------------------------------------------------------------- /resmokelist: -------------------------------------------------------------------------------- 1 | write_concern_majority_passthrough 2 | change_streams_whole_cluster_mongos_passthrough 3 | change_streams_whole_db_mongos_passthrough 4 | sharding_auth_12 5 | core_minimum_batch_size 6 | causally_consistent_jscore_txns_passthrough 7 | concurrency_simultaneous 8 | change_streams_secondary_reads 9 | replica_sets_initsync_jscore_passthrough 10 | sharding_10 11 | sharding_14 12 | core_op_query 13 | session_jscore_passthrough 14 | secondary_reads_passthrough 15 | -------------------------------------------------------------------------------- /src/KNOWN_ISSUES.md: -------------------------------------------------------------------------------- 1 | MongoRocks r4.2.5 2 | 1) RocksDB layer bottommost compaction may be triggered frequently with no progress when enableMajorityReadConcern=true, TODO: add a issue somewhere 3 | 2) jstests/core/txns/commit_prepared_transaction_errors.js wont pass now because mongo-wt introduced the timestamped-safe unique index, which does dupkey check in wt-layer. 
MongoRocks does this check in the MongoRocks layer, where it gets stuck on a PrepareConflict error, whereas mongo-wt throws WriteConflict 4 | 3) src/mongo/db/storage/sorted_data_interface_test_dupkeycheck.cpp:TEST(SortedDataInterface, DupKeyCheckWithDuplicates) won't pass, because MongoRocks does not currently have a timestamped-safe unique index 5 | 6 | MongoRocks r4.0.3 7 | 1) RocksDB layer bottommost compaction may be triggered frequently with no progress when enableMajorityReadConcern=true, TODO: add an issue somewhere 8 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## RocksDB Storage Engine Module for MongoDB 2 | 3 | ### Stable Versions/Branches 4 | + v3.2 5 | + v3.4 6 | + v4.0.3 7 | + v4.2.5 8 | 9 | ### How to build 10 | See BUILD.md 11 | 12 | ### More information 13 | To use this module, it has to be linked from `mongo/src/mongo/db/modules`. The build system will automatically recognize it. In the `mongo` repository directory, do the following: 14 | 15 | mkdir -p src/mongo/db/modules/ 16 | ln -sf ~/mongo-rocks src/mongo/db/modules/rocks 17 | 18 | To build, you first need to install the RocksDB library; see `INSTALL.md` 19 | at https://github.com/facebook/rocksdb for more information. If you install 20 | in non-standard locations, you may need to set `CPPPATH` and `LIBPATH` 21 | environment variables: 22 | 23 | CPPPATH=/myrocksdb/include LIBPATH=/myrocksdb/lib scons 24 | 25 | ### Reach out 26 | If you have any issues with MongoRocks, open an issue on the GitHub issue tracker. 
27 | -------------------------------------------------------------------------------- /BUILD.md: -------------------------------------------------------------------------------- 1 | Execute this series of commands to compile MongoDB with the RocksDB storage engine: 2 | ``` 3 | # install compression libraries (zlib, bzip2, snappy) 4 | sudo apt-get install zlib1g-dev; sudo apt-get install libbz2-dev; sudo apt-get install libsnappy-dev 5 | # get rocksdb 6 | git clone https://github.com/facebook/rocksdb.git 7 | git -C rocksdb checkout main 8 | # compile rocksdb 9 | cd rocksdb; USE_RTTI=1 CFLAGS=-fPIC make static_lib; sudo INSTALL_PATH=/usr make install; cd .. 10 | # get mongo 11 | git clone https://github.com/mongodb/mongo.git 12 | git -C mongo checkout tags/r4.2.5 -b branch_tags_4.2.5 13 | # get mongorocks 14 | git clone https://github.com/mongodb-partners/mongo-rocks 15 | git -C mongo-rocks checkout master 16 | # add rocksdb module to mongo (the link target assumes mongo-rocks was cloned into your home directory) 17 | mkdir -p mongo/src/mongo/db/modules/ 18 | ln -sf ~/mongo-rocks mongo/src/mongo/db/modules/rocks 19 | # compile mongo 20 | cd mongo; scons 21 | ``` 22 | Start `mongod` using the `--storageEngine=rocksdb` option. 23 | 24 | 25 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | When contributing to this repository, please first discuss the change you wish to make via issue, 4 | email, or any other method with the owners of this repository before making a change. 5 | 6 | Please note that we have a code of conduct; please follow it in all your interactions with the project. 7 | 8 | ## Pull Request Process 9 | 10 | 1. Ensure any install or build dependencies are removed before the end of the layer when doing a build. 11 | 2. Add unit tests (C++ tests) for the code you write; this is encouraged but not required. Unfortunately, it is not possible to add more resmoke/JS tests here because they live in MongoDB's code base. 12 | 3. 
Pull requests should preferably target the **master** branch, which tracks the latest development activity. Pull requests to other branches are limited to bug fixes. 13 | 14 | ## Code of Conduct 15 | 16 | ### Code Standards 17 | 18 | MongoRocks follows MongoDB's code style, which is similar to the Google C++ style. Running **clang-format** on your code before opening a pull request is strongly recommended but not required. 19 | 20 | 21 | -------------------------------------------------------------------------------- /src/rocks_record_store_mock.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see <http://www.gnu.org/licenses/>. 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. 
If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #include "mongo/platform/basic.h" 31 | 32 | #include "mongo/base/init.h" 33 | #include "mongo/db/namespace_string.h" 34 | #include "mongo/db/operation_context_noop.h" 35 | #include "mongo/db/service_context.h" 36 | #include "mongo/stdx/memory.h" 37 | 38 | #include "rocks_engine.h" 39 | 40 | namespace mongo { 41 | 42 | // static 43 | bool RocksEngine::initRsOplogBackgroundThread(StringData ns) { 44 | return NamespaceString::oplog(ns); 45 | } 46 | 47 | } // namespace mongo 48 | -------------------------------------------------------------------------------- /src/rocks_options_init.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. 
You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #include <iostream> 30 | 31 | #include "mongo/util/exit_code.h" 32 | #include "mongo/util/options_parser/startup_option_init.h" 33 | #include "mongo/util/options_parser/startup_options.h" 34 | 35 | #include "rocks_global_options.h" 36 | 37 | namespace moe = mongo::optionenvironment; 38 | 39 | namespace mongo { 40 | 41 | MONGO_STARTUP_OPTIONS_STORE(RocksOptions)(InitializerContext* context) { 42 | Status ret = rocksGlobalOptions.store(moe::startupOptionsParsed); 43 | if (!ret.isOK()) { 44 | std::cerr << ret.toString() << std::endl; 45 | std::cerr << "try '" << context->args()[0] << " --help' for more information" 46 | << std::endl; 47 | ::_exit(EXIT_BADOPTIONS); 48 | } 49 | return Status::OK(); 50 | } 51 | } // namespace mongo 52 | -------------------------------------------------------------------------------- /src/rocks_prepare_conflict.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2018-present MongoDB, Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the Server Side Public License, version 1, 6 | * as published by MongoDB, Inc. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * Server Side Public License for more details. 
12 | * 13 | * You should have received a copy of the Server Side Public License 14 | * along with this program. If not, see 15 | * <http://www.mongodb.com/licensing/server-side-public-license>. 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the Server Side Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 31 | 32 | #include "mongo/platform/basic.h" 33 | #include "rocks_prepare_conflict.h" 34 | #include "mongo/util/fail_point_service.h" 35 | #include "mongo/util/log.h" 36 | 37 | namespace mongo { 38 | 39 | // When set, simulates a prepare conflict on reads (the RocksDB counterpart of WiredTiger's WT_PREPARE_CONFLICT). 40 | MONGO_FAIL_POINT_DEFINE(RocksPrepareConflictForReads); 41 | 42 | MONGO_FAIL_POINT_DEFINE(RocksSkipPrepareConflictRetries); 43 | 44 | MONGO_FAIL_POINT_DEFINE(RocksPrintPrepareConflictLog); 45 | 46 | void rocksPrepareConflictLog(int attempts) { 47 | LOG(1) << "Caught ROCKS_PREPARE_CONFLICT, attempt " << attempts 48 | << ". 
Waiting for unit of work to commit or abort."; 49 | } 50 | 51 | void rocksPrepareConflictFailPointLog() { 52 | log() << "RocksPrintPrepareConflictLog fail point enabled."; 53 | } 54 | 55 | } // namespace mongo 56 | -------------------------------------------------------------------------------- /src/rocks_server_status.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | 29 | #pragma once 30 | 31 | #include "mongo/db/commands/server_status.h" 32 | 33 | namespace mongo { 34 | 35 | class RocksEngine; 36 | 37 | /** 38 | * Adds "rocksdb" to the results of db.serverStatus(). 39 | */ 40 | class RocksServerStatusSection : public ServerStatusSection { 41 | public: 42 | RocksServerStatusSection(RocksEngine* engine); 43 | bool includeByDefault() const override; 44 | BSONObj generateSection(OperationContext* opCtx, 45 | const BSONElement& configElement) const override; 46 | 47 | protected: 48 | virtual void generatePropertiesSection(BSONObjBuilder* bob) const; 49 | virtual void generateThreadStatusSection(BSONObjBuilder* bob) const; 50 | virtual void generateCountersSection(BSONObjBuilder* bob) const; 51 | virtual void generateTxnStatsSection(BSONObjBuilder* bob) const; 52 | virtual void generateOplogDelStatsSection(BSONObjBuilder* bob) const; 53 | virtual void generateCompactSchedulerSection(BSONObjBuilder* bob) const; 54 | virtual void generateDefaultCFEntriesNumSection(BSONObjBuilder* bob) const; 55 | private: 56 | RocksEngine* _engine; 57 | }; 58 | 59 | } // namespace mongo 60 | -------------------------------------------------------------------------------- /src/rocks_counter_manager.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 
15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #pragma once 30 | 31 | #include 32 | #include 33 | #include 34 | #include 35 | #include 36 | #include 37 | 38 | #include 39 | #include 40 | #include "mongo/db/modules/rocks/src/totdb/totransaction.h" 41 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 42 | 43 | #include "mongo/base/string_data.h" 44 | #include "mongo/platform/mutex.h" 45 | 46 | namespace mongo { 47 | 48 | class RocksCounterManager { 49 | public: 50 | RocksCounterManager(rocksdb::TOTransactionDB* db, rocksdb::ColumnFamilyHandle* cf, bool crashSafe) 51 | : _db(db), _cf(cf), _crashSafe(crashSafe), _syncCounter(0) {} 52 | 53 | long long loadCounter(const std::string& counterKey); 54 | 55 | void updateCounter(const std::string& counterKey, long long count); 56 | 57 | void sync(); 58 | 59 | bool crashSafe() const { return _crashSafe; } 60 | 61 | private: 62 | static rocksdb::Slice _encodeCounter(long long counter, int64_t* storage); 63 | 64 | rocksdb::TOTransactionDB* _db; // not owned 65 | 66 | rocksdb::ColumnFamilyHandle* _cf; // not owned 67 | 68 | const bool _crashSafe; 69 | 70 | Mutex _lock = MONGO_MAKE_LATCH("RocksCounterManager::_lock"); 71 
| 72 | // protected by _lock 73 | std::unordered_map<std::string, long long> _counters; 74 | // protected by _lock 75 | int _syncCounter; 76 | 77 | static const int kSyncEvery = 10000; 78 | }; 79 | } // namespace mongo 80 | -------------------------------------------------------------------------------- /src/rocks_global_options.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see <http://www.gnu.org/licenses/>. 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | 29 | #pragma once 30 | 31 | #include "mongo/util/options_parser/startup_option_init.h" 32 | #include "mongo/util/options_parser/startup_options.h" 33 | 34 | namespace mongo { 35 | 36 | class RocksGlobalOptions { 37 | public: 38 | RocksGlobalOptions() 39 | : cacheSizeGB(0), 40 | maxWriteMBPerSec(1024), 41 | compression("snappy"), 42 | crashSafeCounters(false), 43 | counters(true), 44 | singleDeleteIndex(false), 45 | logLevel("info"), 46 | maxConflictCheckSizeMB(200) {} 47 | 48 | Status store(const optionenvironment::Environment& params); 49 | static Status validateRocksdbLogLevel(const std::string& value); 50 | static Status validateRocksdbCompressor(const std::string& value); 51 | size_t cacheSizeGB; 52 | int maxWriteMBPerSec; 53 | 54 | std::string compression; 55 | std::string configString; 56 | 57 | bool crashSafeCounters; 58 | bool counters; 59 | bool singleDeleteIndex; 60 | 61 | std::string logLevel; 62 | int maxConflictCheckSizeMB; 63 | int maxBackgroundJobs; 64 | long maxTotalWalSize; 65 | long dbWriteBufferSize; 66 | long writeBufferSize; 67 | long delayedWriteRate; 68 | int numLevels; 69 | int maxWriteBufferNumber; 70 | int level0FileNumCompactionTrigger; 71 | int level0SlowdownWritesTrigger; 72 | int level0StopWritesTrigger; 73 | long maxBytesForLevelBase; 74 | int softPendingCompactionMBLimit; 75 | int hardPendingCompactionMBLimit; 76 | }; 77 | 78 | extern RocksGlobalOptions rocksGlobalOptions; 79 | } // namespace mongo 80 | -------------------------------------------------------------------------------- /src/totdb/totransaction.h: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #ifndef ROCKSDB_LITE 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | 10 | #include "rocksdb/comparator.h" 11 | #include "rocksdb/db.h" 12 | #include "rocksdb/status.h" 13 | 14 | namespace rocksdb { 15 | 16 | class Iterator; 17 | class TransactionDB; 18 | class WriteBatchWithIndex; 19 | 20 | 
using TransactionName = std::string; 21 | 22 | using TransactionID = uint64_t; 23 | 24 | // Timestamp type used in RocksDB 25 | using RocksTimeStamp = uint64_t; 26 | 27 | // Timestamp-ordering transaction 28 | class TOTransaction { 29 | public: 30 | virtual ~TOTransaction() {} 31 | 32 | // set prepare timestamp for transaction, if the application sets the prepare 33 | // timestamp twice, an error will be returned 34 | virtual Status SetPrepareTimeStamp(const RocksTimeStamp& timestamp) = 0; 35 | 36 | virtual Status SetCommitTimeStamp(const RocksTimeStamp& timestamp) = 0; 37 | 38 | virtual Status SetDurableTimeStamp(const RocksTimeStamp& timestamp) = 0; 39 | 40 | // set read timestamp for transaction, if the application sets the read timestamp twice, an error will be returned 41 | virtual Status SetReadTimeStamp(const RocksTimeStamp& timestamp) = 0; 42 | 43 | virtual Status GetReadTimeStamp(RocksTimeStamp* timestamp) const = 0; 44 | 45 | virtual Status Prepare() = 0; 46 | 47 | virtual Status Commit(std::function* hook = nullptr) = 0; 48 | 49 | virtual Status Rollback() = 0; 50 | 51 | virtual Status Get(ReadOptions& options, 52 | ColumnFamilyHandle* column_family, const Slice& key, 53 | std::string* value) = 0; 54 | 55 | virtual Status Get(ReadOptions& options, const Slice& key, 56 | std::string* value) = 0; 57 | 58 | virtual Iterator* GetIterator(ReadOptions& read_options) = 0; 59 | 60 | virtual Iterator* GetIterator(ReadOptions& read_options, 61 | ColumnFamilyHandle* column_family) = 0; 62 | 63 | virtual Status Put(ColumnFamilyHandle* column_family, const Slice& key, 64 | const Slice& value) = 0; 65 | virtual Status Put(const Slice& key, const Slice& value) = 0; 66 | 67 | virtual Status Delete(ColumnFamilyHandle* column_family, const Slice& key) = 0; 68 | virtual Status Delete(const Slice& key) = 0; 69 | 70 | virtual Status GetForUpdate(ColumnFamilyHandle* column_family, const Slice& key) = 0; 71 | virtual Status GetForUpdate(const Slice& key) = 0; 72 | 73 | virtual 
WriteBatchWithIndex* GetWriteBatch() = 0; 74 | 75 | virtual Status SetName(const TransactionName& name) = 0; 76 | 77 | virtual TransactionName GetName() const { return name_; } 78 | 79 | virtual TransactionID GetID() const { return 0; } 80 | 81 | enum TOTransactionState { 82 | kStarted = 0, 83 | kPrepared = 1, 84 | kCommitted = 2, 85 | kRollback = 3, 86 | }; 87 | 88 | virtual TOTransactionState GetState() const = 0; 89 | 90 | static void enableTimestamp(const std::string& prefix); 91 | static bool isEnableTimestamp(const Slice& key); 92 | static std::set<std::string> timestampPrefixes; 93 | static std::mutex prefixes_mutex; 94 | 95 | protected: 96 | explicit TOTransaction(const DB* /*db*/) {} 97 | TOTransaction() {} 98 | 99 | TransactionName name_; 100 | }; 101 | 102 | } // namespace rocksdb 103 | 104 | #endif 105 | 106 | -------------------------------------------------------------------------------- /src/rocks_global_options.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see <http://www.gnu.org/licenses/>. 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. 
You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 30 | 31 | #include "mongo/platform/basic.h" 32 | 33 | #include "mongo/base/status.h" 34 | #include "mongo/util/log.h" 35 | #include "mongo/util/options_parser/constraints.h" 36 | 37 | #include "rocks_global_options.h" 38 | 39 | namespace moe = mongo::optionenvironment; 40 | 41 | namespace mongo { 42 | 43 | RocksGlobalOptions rocksGlobalOptions; 44 | 45 | Status RocksGlobalOptions::store(const optionenvironment::Environment& params) { 46 | return Status::OK(); 47 | } 48 | 49 | Status RocksGlobalOptions::validateRocksdbLogLevel(const std::string& value) { 50 | constexpr auto kDebug = "debug"_sd; 51 | constexpr auto kInfo = "info"_sd; 52 | constexpr auto kWarn = "warn"_sd; 53 | constexpr auto kError = "error"_sd; 54 | 55 | if (!kDebug.equalCaseInsensitive(value) && !kInfo.equalCaseInsensitive(value) && 56 | !kWarn.equalCaseInsensitive(value) && !kError.equalCaseInsensitive(value)) { 57 | return {ErrorCodes::BadValue, 58 | "Log level option must be one of: 'debug', 'info', 'warn', or 'error'"}; 59 | } 60 | 61 | return Status::OK(); 62 | } 63 | 64 | Status RocksGlobalOptions::validateRocksdbCompressor(const std::string& value) { 65 | constexpr auto kNone = "none"_sd; 66 | constexpr auto kSnappy = "snappy"_sd; 67 | constexpr auto kZlib = "zlib"_sd; 68 | constexpr auto kZstd = "zstd"_sd; 69 | 70 | if (!kNone.equalCaseInsensitive(value) && 
!kSnappy.equalCaseInsensitive(value) && 71 | !kZlib.equalCaseInsensitive(value) && !kZstd.equalCaseInsensitive(value)) { 72 | return {ErrorCodes::BadValue, 73 | "Compression option must be one of: 'none', 'snappy', 'zlib', or 'zstd'"}; 74 | } 75 | 76 | return Status::OK(); 77 | } 78 | } // namespace mongo 79 | -------------------------------------------------------------------------------- /src/mongo_rate_limiter_checker.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. 
If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #pragma once 31 | 32 | #ifdef __linux__ 33 | #include 34 | #include "mongo/bson/bsonobjbuilder.h" 35 | 36 | namespace mongo { 37 | 38 | const uint64_t kMinMongoRateLimitRequestTokens = 100; 39 | const uint64_t kInitMongoRateLimitRequestTokens = 1000000; 40 | 41 | class MongoRateLimiter { 42 | public: 43 | MongoRateLimiter(rocksdb::RateLimiter* rateLimiter) 44 | : _rateLimiter(rateLimiter), _requestTokens(0) {} 45 | virtual ~MongoRateLimiter() {} 46 | 47 | virtual void resetRequestTokens() { 48 | _requestTokens.store(0, std::memory_order_relaxed); 49 | } 50 | virtual int64_t getRequestTokens() { 51 | return _requestTokens.load(std::memory_order_relaxed); 52 | } 53 | virtual void resetTokensPerSecond(int64_t tokens_per_second) { 54 | _rateLimiter->SetBytesPerSecond(tokens_per_second); 55 | } 56 | virtual int64_t getTokensPerSecond() { 57 | return _rateLimiter->GetBytesPerSecond(); 58 | } 59 | virtual void request(const int64_t bytes) { 60 | // Add atomically; a separate load and store can lose concurrent updates. 61 | _requestTokens.fetch_add(bytes, std::memory_order_relaxed); 62 | _rateLimiter->Request(bytes, rocksdb::Env::IOPriority::IO_HIGH); 63 | } 64 | 65 | private: 66 | std::unique_ptr _rateLimiter; 67 | std::atomic _requestTokens; 68 | }; 69 | 70 | struct DiskStats { 71 | DiskStats() : micros(0), reads(0), writes(0), read_sectors(0), write_sectors(0) {} 72 | DiskStats(const BSONObj& diskStatsObj) { 73 | micros = curTimeMicros64(); 74 | reads = static_cast(diskStatsObj.getField("reads").safeNumberLong()); 75 | writes = static_cast(diskStatsObj.getField("writes").safeNumberLong()); 76 | read_sectors = 77 | static_cast(diskStatsObj.getField("read_sectors").safeNumberLong()); 78 | write_sectors = 79 | static_cast(diskStatsObj.getField("write_sectors").safeNumberLong()); 80 | } 81 | uint64_t micros; 82 | 
uint64_t reads; 83 | uint64_t writes; 84 | uint64_t read_sectors; 85 | uint64_t write_sectors; 86 | }; 87 | 88 | MongoRateLimiter* getMongoRateLimiter(); 89 | 90 | void startMongoRateLimiterChecker(); 91 | } 92 | #endif 93 | -------------------------------------------------------------------------------- /src/rocks_util.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | 29 | #pragma once 30 | 31 | #include 32 | #include 33 | #include 34 | #include "mongo/util/assert_util.h" 35 | 36 | namespace mongo { 37 | class MongoRocksLogger : public rocksdb::Logger { 38 | public: 39 | MongoRocksLogger() : rocksdb::Logger(rocksdb::InfoLogLevel::INFO_LEVEL) {} 40 | 41 | // Write an entry to the log file with the specified format. 42 | virtual void Logv(const char* format, va_list ap) override; 43 | using rocksdb::Logger::Logv; 44 | }; 45 | 46 | inline std::string rocksGetNextPrefix(const rocksdb::Slice& prefix) { 47 | // next prefix lexicographically, assume same length 48 | std::string nextPrefix(prefix.data(), prefix.size()); 49 | for (int i = static_cast(nextPrefix.size()) - 1; i >= 0; --i) { 50 | nextPrefix[i]++; 51 | // if it's == 0, that means we've overflowed, so need to keep adding 52 | if (nextPrefix[i] != 0) { 53 | break; 54 | } 55 | } 56 | return nextPrefix; 57 | } 58 | 59 | std::string encodePrefix(uint32_t prefix); 60 | bool extractPrefix(const rocksdb::Slice& slice, uint32_t* prefix); 61 | int get_internal_delete_skipped_count(); 62 | 63 | Status rocksToMongoStatus_slow(const rocksdb::Status& status, const char* prefix); 64 | 65 | /** 66 | * converts rocksdb status to mongodb status 67 | */ 68 | inline Status rocksToMongoStatus(const rocksdb::Status& status, const char* prefix = NULL) { 69 | if (MONGO_likely(status.ok())) { 70 | return Status::OK(); 71 | } 72 | return rocksToMongoStatus_slow(status, prefix); 73 | } 74 | 75 | #define invariantRocksOK(expression) \ 76 | do { \ 77 | auto _invariantRocksOK_status = expression; \ 78 | if (MONGO_unlikely(!_invariantRocksOK_status.ok())) { \ 79 | invariantOKFailed(#expression, rocksToMongoStatus(_invariantRocksOK_status), __FILE__, \ 80 | __LINE__); \ 81 | } \ 82 | } while (false) 83 | 84 | #define checkRocks 85 | 86 | } // namespace mongo 87 | -------------------------------------------------------------------------------- /src/rocks_begin_transaction_block.h: 
-------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2018 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 
28 | */ 29 | 30 | #pragma once 31 | 32 | #include "mongo/db/modules/rocks/src/totdb/totransaction.h" 33 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 34 | #include "mongo/base/status.h" 35 | #include "mongo/bson/timestamp.h" 36 | #include "mongo/db/storage/recovery_unit.h" 37 | 38 | namespace mongo { 39 | 40 | /** 41 | * When constructed, this object begins a Rocks transaction on the provided session. The 42 | * transaction will be rolled back if done() is not called before the object is destructed. 43 | */ 44 | class RocksBeginTxnBlock { 45 | public: 46 | // Whether or not to round up to the oldest timestamp when the read timestamp is behind it. 47 | enum class RoundUpReadTimestamp { 48 | kNoRound, // Do not round to the oldest timestamp. BadValue error may be returned. 49 | kRound // Round the read timestamp up to the oldest timestamp when it is behind. 50 | }; 51 | 52 | // Dictates whether to round up prepare and commit timestamp of a prepared transaction. 53 | // 'kNoRound' - Does not round up prepare and commit timestamp of a prepared transaction. 54 | // 'kRound' - The prepare timestamp will be rounded up to the oldest timestamp if found to 55 | // be earlier; and the commit timestamp will be rounded up to the prepare timestamp if 56 | // found to be earlier. 57 | enum class RoundUpPreparedTimestamps { kNoRound, kRound }; 58 | 59 | RocksBeginTxnBlock( 60 | rocksdb::TOTransactionDB* db, std::unique_ptr* txn, 61 | PrepareConflictBehavior prepareConflictBehavior, 62 | RoundUpPreparedTimestamps roundUpPreparedTimestamps, 63 | RoundUpReadTimestamp roundUpReadTimestamp = RoundUpReadTimestamp::kNoRound); 64 | 65 | ~RocksBeginTxnBlock(); 66 | 67 | /** 68 | * End the begin transaction block. Must be called to ensure the opened transaction 69 | * is not rolled back. 70 | */ 71 | void done(); 72 | 73 | /** 74 | * Sets the read timestamp on the opened transaction. Cannot be called after a call to 75 | * done(). 
76 | */ 77 | Status setReadSnapshot(Timestamp); 78 | 79 | /* Get the read timestamp on the opened transaction */ 80 | Timestamp getTimestamp() const; 81 | 82 | private: 83 | rocksdb::TOTransactionDB* _db; // not owned 84 | rocksdb::TOTransaction* _transaction; // not owned 85 | bool _rollback = false; // roll back in the destructor unless done() was called 86 | Timestamp _readTimestamp; // set by setReadSnapshot() 87 | }; 88 | 89 | } // namespace mongo 90 | -------------------------------------------------------------------------------- /src/rocks_util.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. 
If you do not wish to do so, 24 | * delete this exception statement from your version. 
If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 29 | 30 | 31 | 32 | #include "rocks_util.h" 33 | 34 | #include 35 | #include 36 | #include 37 | #include 38 | 39 | // Temporary fix for https://github.com/facebook/rocksdb/pull/2336#issuecomment-303226208 40 | #define ROCKSDB_SUPPORT_THREAD_LOCAL 41 | #include 42 | #include 43 | 44 | #include "mongo/db/concurrency/write_conflict_exception.h" 45 | #include "mongo/platform/endian.h" 46 | #include "mongo/util/log.h" 47 | 48 | namespace mongo { 49 | std::string encodePrefix(uint32_t prefix) { 50 | uint32_t bigEndianPrefix = endian::nativeToBig(prefix); 51 | return std::string(reinterpret_cast(&bigEndianPrefix), sizeof(uint32_t)); 52 | } 53 | 54 | // we encode prefixes in big endian because we want to quickly jump to the max prefix 55 | // (iter->SeekToLast()) 56 | bool extractPrefix(const rocksdb::Slice& slice, uint32_t* prefix) { 57 | if (slice.size() < sizeof(uint32_t)) { 58 | return false; 59 | } 60 | *prefix = endian::bigToNative(*reinterpret_cast(slice.data())); 61 | return true; 62 | } 63 | 64 | int get_internal_delete_skipped_count() { 65 | #if ROCKSDB_MAJOR > 5 || (ROCKSDB_MAJOR == 5 && ROCKSDB_MINOR >= 6) 66 | return rocksdb::get_perf_context()->internal_delete_skipped_count; 67 | #else 68 | return rocksdb::perf_context.internal_delete_skipped_count; 69 | #endif 70 | } 71 | 72 | Status rocksToMongoStatus_slow(const rocksdb::Status& status, const char* prefix) { 73 | if (status.ok()) { 74 | return Status::OK(); 75 | } 76 | 77 | if (status.IsBusy()) { 78 | throw WriteConflictException(); 79 | } 80 | 81 | if (status.IsCorruption() || status.IsInvalidArgument()) { 82 | return Status(ErrorCodes::BadValue, status.ToString()); 83 | } 84 | 85 | return 
Status(ErrorCodes::InternalError, status.ToString()); 86 | } 87 | 88 | void MongoRocksLogger::Logv(const char* format, va_list ap) { 89 | char buffer[8192]; 90 | int len = snprintf(buffer, sizeof(buffer), "[RocksDB]:"); 91 | if (0 > len) { 92 | mongo::log() << "MongoRocksLogger::Logv return NEGATIVE value."; 93 | return; 94 | } 95 | vsnprintf(buffer + len, sizeof(buffer) - len, format, ap); 96 | log() << buffer; 97 | } 98 | 99 | } // namespace mongo 100 | -------------------------------------------------------------------------------- /src/rocks_durability_manager.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. 
If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #include "mongo/db/storage/journal_listener.h" 30 | 31 | #include 32 | 33 | #include "rocks_durability_manager.h" 34 | #include "rocks_util.h" 35 | 36 | namespace mongo { 37 | RocksDurabilityManager::RocksDurabilityManager(rocksdb::DB* db, bool durable, 38 | rocksdb::ColumnFamilyHandle* defaultCf, 39 | rocksdb::ColumnFamilyHandle* oplogCf) 40 | : _db(db), 41 | _durable(durable), 42 | _defaultCf(defaultCf), 43 | _oplogCf(oplogCf), 44 | _journalListener(&NoOpJournalListener::instance) {} 45 | 46 | void RocksDurabilityManager::setJournalListener(JournalListener* jl) { 47 | stdx::unique_lock lk(_journalListenerMutex); 48 | _journalListener = jl; 49 | } 50 | 51 | // TODO(cuixin): rtt should modify waitUntilDurable 52 | void RocksDurabilityManager::waitUntilDurable(bool forceFlush) { 53 | uint32_t start = _lastSyncTime.load(); 54 | // Do the remainder in a critical section that ensures only a single thread at a time 55 | // will attempt to synchronize. 56 | stdx::unique_lock lk(_lastSyncMutex); 57 | uint32_t current = _lastSyncTime.loadRelaxed(); // synchronized with writes through mutex 58 | if (current != start) { 59 | // Someone else synced already since we read lastSyncTime, so we're done! 
60 | return; 61 | } 62 | _lastSyncTime.store(current + 1); 63 | 64 | stdx::unique_lock jlk(_journalListenerMutex); 65 | JournalListener::Token token = _journalListener->getToken(); 66 | if (!_durable || forceFlush) { 67 | invariantRocksOK(_db->Flush(rocksdb::FlushOptions(), {_defaultCf, _oplogCf})); 68 | } else { 69 | invariantRocksOK(_db->SyncWAL()); 70 | } 71 | _journalListener->onDurable(token); 72 | } 73 | 74 | void RocksDurabilityManager::waitUntilPreparedUnitOfWorkCommitsOrAborts( 75 | OperationContext* opCtx, std::uint64_t lastCount) { 76 | invariant(opCtx); 77 | stdx::unique_lock lk(_prepareCommittedOrAbortedMutex); 78 | if (lastCount == _prepareCommitOrAbortCounter.loadRelaxed()) { 79 | opCtx->waitForConditionOrInterrupt(_prepareCommittedOrAbortedCond, lk, [&] { 80 | return _prepareCommitOrAbortCounter.loadRelaxed() > lastCount; 81 | }); 82 | } 83 | } 84 | 85 | void RocksDurabilityManager::notifyPreparedUnitOfWorkHasCommittedOrAborted() { 86 | stdx::unique_lock lk(_prepareCommittedOrAbortedMutex); 87 | _prepareCommitOrAbortCounter.fetchAndAdd(1); 88 | _prepareCommittedOrAbortedCond.notify_all(); 89 | } 90 | } // namespace mongo 91 | -------------------------------------------------------------------------------- /src/rocks_index_test.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. 
If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #include "mongo/platform/basic.h" 30 | 31 | #include 32 | #include 33 | 34 | #include 35 | #include 36 | #include 37 | #include 38 | 39 | #include "mongo/base/init.h" 40 | #include "mongo/db/concurrency/write_conflict_exception.h" 41 | #include "mongo/db/storage/sorted_data_interface_test_harness.h" 42 | #include "mongo/stdx/memory.h" 43 | #include "mongo/unittest/temp_dir.h" 44 | #include "mongo/unittest/unittest.h" 45 | 46 | #include "rocks_engine.h" 47 | #include "rocks_index.h" 48 | #include "rocks_recovery_unit.h" 49 | #include "rocks_snapshot_manager.h" 50 | 51 | namespace mongo { 52 | namespace { 53 | 54 | using std::string; 55 | 56 | class RocksIndexHarness final : public SortedDataInterfaceHarnessHelper { 57 | public: 58 | RocksIndexHarness() 59 | : _order(Ordering::make(BSONObj())), 60 | _dbpath("rocks_test"), 61 | _engine(_dbpath.path(), true /* durable */, 3 /* kRocksFormatVersion */, 62 | false /* readOnly */) {} 63 | 64 | virtual ~RocksIndexHarness() {} 65 | 66 | std::unique_ptr newSortedDataInterface(bool unique, bool partial) { 67 | BSONObjBuilder configBuilder; 68 | 
RocksIndexBase::generateConfig(&configBuilder, 3, 69 | IndexDescriptor::IndexVersion::kV2); 70 | if (unique) { 71 | return stdx::make_unique( 72 | _engine.getDB(), _engine.getDefaultCf_ForTest(), "prefix", "ident", _order, 73 | configBuilder.obj(), "test.rocks", "testIndex", BSONObj(), partial); 74 | } else { 75 | return stdx::make_unique( 76 | _engine.getDB(), _engine.getDefaultCf_ForTest(), "prefix", "ident", _order, 77 | configBuilder.obj()); 78 | } 79 | } 80 | 81 | std::unique_ptr newRecoveryUnit() final { 82 | return stdx::make_unique(true /* durable */, &_engine); 83 | } 84 | 85 | private: 86 | Ordering _order; 87 | unittest::TempDir _dbpath; 88 | RocksEngine _engine; 89 | }; 90 | 91 | std::unique_ptr makeHarnessHelper() { 92 | return stdx::make_unique(); 93 | } 94 | 95 | MONGO_INITIALIZER(RegisterHarnessFactory)(InitializerContext* const) { 96 | mongo::registerHarnessHelperFactory(makeHarnessHelper); 97 | return Status::OK(); 98 | } 99 | 100 | } // namespace 101 | } // namespace mongo 102 | -------------------------------------------------------------------------------- /src/rocks_parameters.idl: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2018-present MongoDB, Inc. 2 | # 3 | # This program is free software: you can redistribute it and/or modify 4 | # it under the terms of the Server Side Public License, version 1, 5 | # as published by MongoDB, Inc. 6 | # 7 | # This program is distributed in the hope that it will be useful, 8 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 9 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 10 | # Server Side Public License for more details. 11 | # 12 | # You should have received a copy of the Server Side Public License 13 | # along with this program. If not, see 14 | # . 
15 | # 16 | # As a special exception, the copyright holders give permission to link the 17 | # code of portions of this program with the OpenSSL library under certain 18 | # conditions as described in each individual source file and distribute 19 | # linked combinations including the program with the OpenSSL library. You 20 | # must comply with the Server Side Public License in all respects for 21 | # all of the code used other than as permitted herein. If you modify file(s) 22 | # with this exception, you may extend this exception to your version of the 23 | # file(s), but you are not obligated to do so. If you do not wish to do so, 24 | # delete this exception statement from your version. If you delete this 25 | # exception statement from all source files in the program, then also delete 26 | # it in the license file. 27 | # 28 | 29 | global: 30 | cpp_namespace: "mongo" 31 | cpp_includes: 32 | - "mongo/db/modules/rocks/src/rocks_engine.h" 33 | - "mongo/util/concurrency/ticketholder.h" 34 | - "mongo/util/debug_util.h" 35 | 36 | server_parameters: 37 | rocksConcurrentWriteTransactions: 38 | description: "Rocks Concurrent Write Transactions" 39 | set_at: [ startup, runtime ] 40 | cpp_class: 41 | name: ROpenWriteTransactionParam 42 | data: 'TicketHolder*' 43 | override_ctor: true 44 | rocksConcurrentReadTransactions: 45 | description: "Rocks Concurrent Read Transactions" 46 | set_at: [ startup, runtime ] 47 | cpp_class: 48 | name: ROpenReadTransactionParam 49 | data: 'TicketHolder*' 50 | override_ctor: true 51 | 52 | rocksdbRuntimeConfigMaxWriteMBPerSec: 53 | description: 'rate limiter to MB/s' 54 | set_at: [ startup, runtime ] 55 | cpp_class: 56 | name: RocksRateLimiterServerParameter 57 | data: 'RocksEngine*' 58 | override_set: true 59 | condition: { expr: false } 60 | 61 | rocksdbBackup: 62 | description: 'rocksdb backup' 63 | set_at: runtime 64 | cpp_class: 65 | name: RocksBackupServerParameter 66 | data: 'RocksEngine*' 67 | override_set: true 68 | condition: { 
expr: false } 69 | 70 | rocksdbCompact: 71 | description: 'rocksdb compact' 72 | set_at: runtime 73 | cpp_class: 74 | name: RocksCompactServerParameter 75 | data: 'RocksEngine*' 76 | override_set: true 77 | condition: { expr: false } 78 | 79 | rocksdbRuntimeConfigCacheSizeGB: 80 | description: 'rocks cache sizeGB' 81 | set_at: startup 82 | cpp_class: 83 | name: RocksCacheSizeParameter 84 | data: 'RocksEngine*' 85 | override_set: true 86 | condition: { expr: false } 87 | 88 | rocksdbOptions: 89 | description: 'set rocksdb options' 90 | set_at: [ startup, runtime ] 91 | cpp_class: 92 | name: RocksOptionsParameter 93 | data: 'RocksEngine*' 94 | override_set: true 95 | condition: { expr: false } 96 | 97 | minSSTFileCountReserved: 98 | description: 'delete oplogs until minSSTFileCountReserved files exceeds the total max size' 99 | set_at: [ startup, runtime ] 100 | cpp_class: 101 | name: ExportedMinSSTFileCountReservedParameter 102 | data: 'AtomicWord*' 103 | override_ctor: true 104 | 105 | rocksdbRuntimeConfigMaxConflictCheckSize: 106 | description: 'rocksdb max conflict check size' 107 | set_at: startup 108 | cpp_class: 109 | name: RocksdbMaxConflictCheckSizeParameter 110 | data: 'RocksEngine*' 111 | override_set: true 112 | condition: { expr: false } 113 | -------------------------------------------------------------------------------- /src/rocks_begin_transaction_block.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2018 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 
28 | */ 29 | 30 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 31 | 32 | #include "rocks_begin_transaction_block.h" 33 | #include 34 | #include "mongo/platform/basic.h" 35 | #include "mongo/util/log.h" 36 | #include "rocks_util.h" 37 | 38 | namespace mongo { 39 | RocksBeginTxnBlock::RocksBeginTxnBlock(rocksdb::TOTransactionDB* db, 40 | std::unique_ptr* txn, 41 | PrepareConflictBehavior prepareConflictBehavior, 42 | RoundUpPreparedTimestamps roundUpPreparedTimestamps, 43 | RoundUpReadTimestamp roundUpReadTimestamp) 44 | : _db(db) { 45 | invariant(!_rollback); 46 | rocksdb::WriteOptions wOpts; 47 | rocksdb::TOTransactionOptions txnOpts; 48 | 49 | if (prepareConflictBehavior == PrepareConflictBehavior::kIgnoreConflicts) { 50 | txnOpts.ignore_prepare = true; 51 | txnOpts.read_only = true; 52 | } else if (prepareConflictBehavior == 53 | PrepareConflictBehavior::kIgnoreConflictsAllowWrites) { 54 | txnOpts.ignore_prepare = true; 55 | } 56 | 57 | if (roundUpPreparedTimestamps == RoundUpPreparedTimestamps::kRound) { 58 | txnOpts.timestamp_round_prepared = true; 59 | } 60 | if (roundUpReadTimestamp == RoundUpReadTimestamp::kRound) { 61 | txnOpts.timestamp_round_read = true; 62 | } 63 | 64 | _transaction = _db->BeginTransaction(wOpts, txnOpts); 65 | invariant(_transaction); 66 | txn->reset(_transaction); 67 | _rollback = true; 68 | } 69 | 70 | RocksBeginTxnBlock::~RocksBeginTxnBlock() { 71 | if (_rollback) { 72 | invariant(_transaction->Rollback().ok()); 73 | } 74 | } 75 | 76 | Status RocksBeginTxnBlock::setReadSnapshot(Timestamp readTs) { 77 | invariant(_rollback); 78 | rocksdb::RocksTimeStamp ts(readTs.asULL()); 79 | auto status = _transaction->SetReadTimeStamp(ts); 80 | if (!status.ok()) { 81 | if (status.IsInvalidArgument()) { 82 | return Status(ErrorCodes::SnapshotTooOld, 83 | str::stream() << "Read timestamp " << ts 84 | << " is older than the oldest available timestamp."); 85 | } 86 | return rocksToMongoStatus(status); 87 | } 88 | 89 | 
status = _transaction->GetReadTimeStamp(&ts); 90 | invariant(status.ok(), status.ToString()); 91 | _readTimestamp = Timestamp(ts); 92 | return Status::OK(); 93 | } 94 | 95 | void RocksBeginTxnBlock::done() { 96 | invariant(_rollback); 97 | _rollback = false; 98 | } 99 | 100 | Timestamp RocksBeginTxnBlock::getTimestamp() const { 101 | invariant(!_readTimestamp.isNull()); 102 | return _readTimestamp; 103 | } 104 | 105 | } // namespace mongo 106 | -------------------------------------------------------------------------------- /src/rocks_snapshot_manager.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. 
If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #include 30 | 31 | #include 32 | #include "mongo/db/modules/rocks/src/totdb/totransaction.h" 33 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 34 | 35 | #include "mongo/db/storage/recovery_unit.h" 36 | #include "mongo/db/storage/snapshot_manager.h" 37 | #include "mongo/platform/mutex.h" 38 | 39 | #include "rocks_begin_transaction_block.h" 40 | 41 | #pragma once 42 | 43 | namespace mongo { 44 | 45 | using RoundUpPreparedTimestamps = RocksBeginTxnBlock::RoundUpPreparedTimestamps; 46 | 47 | class RocksRecoveryUnit; 48 | 49 | class RocksSnapshotManager final : public SnapshotManager { 50 | RocksSnapshotManager(const RocksSnapshotManager&) = delete; 51 | RocksSnapshotManager& operator=(const RocksSnapshotManager&) = delete; 52 | 53 | public: 54 | RocksSnapshotManager() = default; 55 | virtual ~RocksSnapshotManager() {} 56 | void setCommittedSnapshot(const Timestamp& ts) final; 57 | void dropAllSnapshots() final; 58 | void setLocalSnapshot(const Timestamp& ts) final; 59 | boost::optional getLocalSnapshot() final; 60 | 61 | // 62 | // RocksDB-specific methods 63 | // 64 | 65 | /** 66 | * Starts a transaction and returns the SnapshotName used. 67 | * 68 | * Throws if there is currently no committed snapshot. 69 | */ 70 | Timestamp beginTransactionOnCommittedSnapshot( 71 | rocksdb::TOTransactionDB* db, std::unique_ptr* txn, 72 | PrepareConflictBehavior prepareConflictBehavior, 73 | RoundUpPreparedTimestamps roundUpPreparedTimestamps) const; 74 | 75 | /** 76 | * Starts a transaction on the last stable local timestamp, set by setLocalSnapshot. 77 | * 78 | * Throws if no local snapshot has been set. 
79 | */ 80 | Timestamp beginTransactionOnLocalSnapshot( 81 | rocksdb::TOTransactionDB* db, std::unique_ptr* txn, 82 | PrepareConflictBehavior prepareConflictBehavior, 83 | RoundUpPreparedTimestamps roundUpPreparedTimestamps) const; 84 | 85 | /** 86 | * Returns lowest SnapshotName that could possibly be used by a future call to 87 | * beginTransactionOnCommittedSnapshot, or boost::none if there is currently no committed 88 | * snapshot. 89 | * 90 | * This should not be used for starting a transaction on this SnapshotName since the named 91 | * snapshot may be deleted by the time you start the transaction. 92 | */ 93 | boost::optional getMinSnapshotForNextCommittedRead() const; 94 | 95 | private: 96 | // Snapshot to use for reads at a commit timestamp. 97 | mutable Mutex _committedSnapshotMutex = // Guards _committedSnapshot. 98 | MONGO_MAKE_LATCH("RocksSnapshotManager::_committedSnapshotMutex"); 99 | boost::optional _committedSnapshot; 100 | 101 | // Snapshot to use for reads at a local stable timestamp. 102 | mutable Mutex _localSnapshotMutex = // Guards _localSnapshot. 103 | MONGO_MAKE_LATCH("RocksSnapshotManager::_localSnapshotMutex"); 104 | boost::optional _localSnapshot; 105 | }; 106 | } // namespace mongo 107 | -------------------------------------------------------------------------------- /src/rocks_durability_manager.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 
12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #pragma once 30 | 31 | #include "mongo/db/operation_context.h" 32 | #include "mongo/platform/basic.h" 33 | #include "mongo/stdx/condition_variable.h" 34 | 35 | namespace rocksdb { 36 | class DB; 37 | } 38 | 39 | namespace mongo { 40 | 41 | class JournalListener; 42 | 43 | class RocksDurabilityManager { 44 | RocksDurabilityManager(const RocksDurabilityManager&) = delete; 45 | RocksDurabilityManager& operator=(const RocksDurabilityManager&) = delete; 46 | 47 | public: 48 | RocksDurabilityManager(rocksdb::DB* db, bool durable, 49 | rocksdb::ColumnFamilyHandle* defaultCf, 50 | rocksdb::ColumnFamilyHandle* oplogCf); 51 | 52 | void setJournalListener(JournalListener* jl); 53 | 54 | void waitUntilDurable(bool forceFlush); 55 | 56 | /** 57 | * Waits until a prepared unit of work has ended (either been committed or aborted). This 58 | * should be used when encountering ROCKS_PREPARE_CONFLICT errors. The caller is required to 59 | * retry the conflicting RocksDB API operation.
A return from this function does not 60 | * guarantee that the conflicting transaction has ended, only that one prepared unit of work 61 | * in the process has signaled that it has ended. Accepts an OperationContext that will 62 | * throw an AssertionException when interrupted. 63 | * 64 | * This method is provided in RocksDurabilityManager and not RecoveryUnit because all 65 | * recovery units share the same RocksDurabilityManager, and we want a recovery unit on one 66 | * thread to signal all recovery units waiting for prepare conflicts across all other 67 | * threads. 68 | */ 69 | void waitUntilPreparedUnitOfWorkCommitsOrAborts(OperationContext* opCtx, 70 | uint64_t lastCount); 71 | 72 | /** 73 | * Notifies waiters that the caller's prepared unit of work has ended 74 | * (either committed or aborted). 75 | */ 76 | void notifyPreparedUnitOfWorkHasCommittedOrAborted(); 77 | 78 | std::uint64_t getPrepareCommitOrAbortCount() const { 79 | return _prepareCommitOrAbortCounter.loadRelaxed(); 80 | } 81 | 82 | private: 83 | rocksdb::DB* _db; // not owned 84 | 85 | bool _durable; 86 | rocksdb::ColumnFamilyHandle* _defaultCf; // not owned 87 | rocksdb::ColumnFamilyHandle* _oplogCf; // not owned 88 | // Notified when we commit to the journal. 89 | JournalListener* _journalListener; 90 | 91 | // Protects _journalListener. 92 | Mutex _journalListenerMutex = 93 | MONGO_MAKE_LATCH("RocksDurabilityManager::_journalListenerMutex"); 94 | AtomicWord _lastSyncTime; 95 | Mutex _lastSyncMutex = MONGO_MAKE_LATCH("RocksDurabilityManager::_lastSyncMutex"); 96 | 97 | // Mutex and cond var for waiting on prepare commit or abort.
98 | Mutex _prepareCommittedOrAbortedMutex = 99 | MONGO_MAKE_LATCH("RocksDurabilityManager::_prepareCommittedOrAbortedMutex"); 100 | 101 | stdx::condition_variable _prepareCommittedOrAbortedCond; 102 | 103 | AtomicWord _prepareCommitOrAbortCounter{0}; 104 | }; 105 | } // namespace mongo 106 | -------------------------------------------------------------------------------- /src/rocks_counter_manager.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | 29 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 30 | 31 | #include "mongo/platform/basic.h" 32 | #include "mongo/platform/endian.h" 33 | 34 | #include "rocks_counter_manager.h" 35 | 36 | #include 37 | #include 38 | #include 39 | #include 40 | 41 | // for invariant() 42 | #include 43 | #include "mongo/platform/mutex.h" 44 | #include "mongo/util/assert_util.h" 45 | #include "mongo/util/log.h" 46 | 47 | #include "rocks_util.h" 48 | 49 | namespace mongo { 50 | 51 | long long RocksCounterManager::loadCounter(const std::string& counterKey) { 52 | { 53 | stdx::lock_guard lk(_lock); 54 | auto itr = _counters.find(counterKey); 55 | if (itr != _counters.end()) { 56 | return itr->second; 57 | } 58 | } 59 | std::string value; 60 | { 61 | auto txn = _db->makeTxn(); 62 | auto readopts = rocksdb::ReadOptions(); 63 | auto s = txn->Get(readopts, _cf, counterKey, &value); 64 | if (s.IsNotFound()) { 65 | return 0; 66 | } 67 | invariantRocksOK(s); 68 | } 69 | int64_t ret; 70 | invariant(sizeof(ret) == value.size()); 71 | memcpy(&ret, value.data(), sizeof(ret)); 72 | // we store counters in little endian 73 | return static_cast(endian::littleToNative(ret)); 74 | } 75 | 76 | void RocksCounterManager::updateCounter(const std::string& counterKey, long long count) { 77 | if (_crashSafe) { 78 | int64_t storage; 79 | auto txn = _db->makeTxn(); 80 | invariantRocksOK(txn->Put(_cf, counterKey, _encodeCounter(count, &storage))); 81 | invariantRocksOK(txn->Commit()); 82 | } else { 83 | stdx::lock_guard lk(_lock); 84 | _counters[counterKey] = count; 85 | ++_syncCounter; 86 | if (_syncCounter >= kSyncEvery) { 87 | // let's sync this now. 
piggyback on writeBatch 88 | int64_t storage; 89 | auto txn = _db->makeTxn(); 90 | for (const auto& counter : _counters) { 91 | invariantRocksOK( 92 | txn->Put(_cf, counter.first, _encodeCounter(counter.second, &storage))); 93 | } 94 | _counters.clear(); 95 | _syncCounter = 0; 96 | invariantRocksOK(txn->Commit()); 97 | } 98 | } 99 | } 100 | 101 | void RocksCounterManager::sync() { 102 | stdx::lock_guard lk(_lock); 103 | if (_counters.size() == 0) { 104 | return; 105 | } 106 | auto txn = _db->makeTxn(); 107 | int64_t storage; 108 | for (const auto& counter : _counters) { 109 | invariantRocksOK(txn->Put(_cf, counter.first, _encodeCounter(counter.second, &storage))); 110 | } 111 | _counters.clear(); 112 | _syncCounter = 0; 113 | invariantRocksOK(txn->Commit()); 114 | } 115 | 116 | rocksdb::Slice RocksCounterManager::_encodeCounter(long long counter, int64_t* storage) { 117 | *storage = static_cast<int64_t>(endian::nativeToLittle(counter)); 118 | return rocksdb::Slice(reinterpret_cast<const char*>(storage), sizeof(*storage)); 119 | } 120 | 121 | } // namespace mongo 122 | -------------------------------------------------------------------------------- /src/rocks_snapshot_manager.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see .
15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 30 | 31 | #include "mongo/platform/basic.h" 32 | 33 | #include "rocks_begin_transaction_block.h" 34 | #include "rocks_recovery_unit.h" 35 | #include "rocks_snapshot_manager.h" 36 | 37 | #include 38 | 39 | #include "mongo/base/checked_cast.h" 40 | #include "mongo/db/server_options.h" 41 | #include "mongo/util/log.h" 42 | 43 | namespace mongo { 44 | void RocksSnapshotManager::setCommittedSnapshot(const Timestamp& ts) { 45 | stdx::lock_guard lock(_committedSnapshotMutex); 46 | 47 | invariant(!_committedSnapshot || *_committedSnapshot <= ts); 48 | _committedSnapshot = ts; 49 | } 50 | 51 | void RocksSnapshotManager::setLocalSnapshot(const Timestamp& timestamp) { 52 | stdx::lock_guard lock(_localSnapshotMutex); 53 | if(timestamp.isNull()) { 54 | _localSnapshot = boost::none; 55 | } else { 56 | _localSnapshot = timestamp; 57 | } 58 | } 59 | 60 | boost::optional RocksSnapshotManager::getLocalSnapshot() { 61 | stdx::lock_guard lock(_localSnapshotMutex); 62 | return _localSnapshot; 63 | } 64 | 65 | void RocksSnapshotManager::dropAllSnapshots() { 66 | 
stdx::lock_guard lock(_committedSnapshotMutex); 67 | _committedSnapshot = boost::none; 68 | } 69 | 70 | boost::optional RocksSnapshotManager::getMinSnapshotForNextCommittedRead() const { 71 | if (!serverGlobalParams.enableMajorityReadConcern) { 72 | return boost::none; 73 | } 74 | 75 | stdx::lock_guard lock(_committedSnapshotMutex); 76 | return _committedSnapshot; 77 | } 78 | 79 | Timestamp RocksSnapshotManager::beginTransactionOnCommittedSnapshot( 80 | rocksdb::TOTransactionDB* db, std::unique_ptr* txn, 81 | PrepareConflictBehavior prepareConflictBehavior, 82 | RoundUpPreparedTimestamps roundUpPreparedTimestamps) const { 83 | RocksBeginTxnBlock txnOpen(db, txn, prepareConflictBehavior, roundUpPreparedTimestamps); 84 | stdx::lock_guard lock(_committedSnapshotMutex); 85 | uassert(ErrorCodes::ReadConcernMajorityNotAvailableYet, 86 | "Committed view disappeared while running operation", _committedSnapshot); 87 | 88 | auto status = txnOpen.setReadSnapshot(_committedSnapshot.get()); 89 | invariant(status.isOK(), status.reason()); 90 | 91 | txnOpen.done(); 92 | return *_committedSnapshot; 93 | } 94 | 95 | Timestamp RocksSnapshotManager::beginTransactionOnLocalSnapshot( 96 | rocksdb::TOTransactionDB* db, std::unique_ptr* txn, 97 | PrepareConflictBehavior prepareConflictBehavior, 98 | RoundUpPreparedTimestamps roundUpPreparedTimestamps) const { 99 | RocksBeginTxnBlock txnOpen(db, txn, prepareConflictBehavior, roundUpPreparedTimestamps); 100 | stdx::lock_guard lock(_localSnapshotMutex); 101 | invariant(_localSnapshot); 102 | LOG(3) << "begin_transaction on local snapshot " << _localSnapshot.get().toString(); 103 | auto status = txnOpen.setReadSnapshot(_localSnapshot.get()); 104 | invariant(status.isOK(), status.reason()); 105 | 106 | txnOpen.done(); 107 | return *_localSnapshot; 108 | } 109 | 110 | } // namespace mongo 111 | -------------------------------------------------------------------------------- /SConscript: 
-------------------------------------------------------------------------------- 1 | # -*- mode: python -*- 2 | Import("env") 3 | 4 | env = env.Clone() 5 | 6 | dynamic_syslibdeps = [] 7 | conf = Configure(env) 8 | 9 | if conf.CheckLibWithHeader("lz4", ["lz4.h","lz4hc.h"], "C", "LZ4_versionNumber();", autoadd=False ): 10 | dynamic_syslibdeps.append("lz4") 11 | 12 | 13 | env.InjectMongoIncludePaths() 14 | 15 | env.InjectThirdParty(libraries=['s2',]) # for Encoder and Decoder 16 | 17 | conf.Finish() 18 | 19 | env.Library( 20 | target= 'storage_rocks_base', 21 | source= [ 22 | 'src/rocks_compaction_scheduler.cpp', 23 | 'src/rocks_counter_manager.cpp', 24 | 'src/rocks_global_options.cpp', 25 | 'src/rocks_engine.cpp', 26 | 'src/rocks_record_store.cpp', 27 | 'src/rocks_recovery_unit.cpp', 28 | 'src/rocks_index.cpp', 29 | 'src/rocks_durability_manager.cpp', 30 | 'src/rocks_snapshot_manager.cpp', 31 | 'src/rocks_util.cpp', 32 | 'src/rocks_oplog_manager.cpp', 33 | 'src/rocks_begin_transaction_block.cpp', 34 | 'src/rocks_prepare_conflict.cpp', 35 | # TODO(wolfkdy): move totdb files into a separate compile-unit 36 | 'src/totdb/totransaction_impl.cpp', 37 | 'src/totdb/totransaction_db_impl.cpp', 38 | 'src/totdb/totransaction_prepare_iterator.cpp', 39 | env.Idlc('src/rocks_parameters.idl')[0], 40 | env.Idlc('src/rocks_global_options.idl')[0], 41 | 'src/rocks_parameters.cpp', 42 | ], 43 | LIBDEPS= [ 44 | '$BUILD_DIR/mongo/base', 45 | '$BUILD_DIR/mongo/db/namespace_string', 46 | '$BUILD_DIR/mongo/db/commands/test_commands_enabled', 47 | '$BUILD_DIR/mongo/db/prepare_conflict_tracker', 48 | '$BUILD_DIR/mongo/db/catalog/collection_options', 49 | '$BUILD_DIR/mongo/db/concurrency/lock_manager', 50 | '$BUILD_DIR/mongo/db/concurrency/write_conflict_exception', 51 | '$BUILD_DIR/mongo/db/curop', 52 | '$BUILD_DIR/mongo/db/index/index_descriptor', 53 | '$BUILD_DIR/mongo/db/storage/bson_collection_catalog_entry', 54 | '$BUILD_DIR/mongo/db/storage/index_entry_comparison', 55 |
'$BUILD_DIR/mongo/db/storage/journal_listener', 56 | '$BUILD_DIR/mongo/db/storage/key_string', 57 | '$BUILD_DIR/mongo/db/storage/oplog_hack', 58 | '$BUILD_DIR/mongo/db/storage/kv/kv_prefix', 59 | '$BUILD_DIR/mongo/util/background_job', 60 | '$BUILD_DIR/mongo/util/concurrency/ticketholder', 61 | '$BUILD_DIR/mongo/util/processinfo', 62 | '$BUILD_DIR/third_party/shim_snappy', 63 | '$BUILD_DIR/third_party/s2/util/coding/coding', 64 | ], 65 | LIBDEPS_PRIVATE= [ 66 | '$BUILD_DIR/mongo/db/snapshot_window_options', 67 | ], 68 | SYSLIBDEPS=["rocksdb", 69 | "z", 70 | "bz2"] #z and bz2 are dependencies for rocks 71 | + dynamic_syslibdeps 72 | ) 73 | 74 | env.Library( 75 | target= 'storage_rocks', 76 | source= [ 77 | 'src/rocks_init.cpp', 78 | 'src/rocks_options_init.cpp', 79 | 'src/rocks_record_store_mongod.cpp', 80 | 'src/rocks_server_status.cpp', 81 | ], 82 | LIBDEPS= [ 83 | 'storage_rocks_base', 84 | ], 85 | PROGDEPS_DEPENDENTS=['$BUILD_DIR/mongo/mongod'] 86 | ) 87 | 88 | env.Library( 89 | target= 'storage_rocks_mock', 90 | source= [ 91 | 'src/rocks_record_store_mock.cpp', 92 | ], 93 | LIBDEPS= [ 94 | 'storage_rocks_base', 95 | ] 96 | ) 97 | 98 | 99 | env.CppUnitTest( 100 | target='storage_rocks_index_test', 101 | source=[ 102 | 'src/rocks_index_test.cpp' 103 | ], 104 | LIBDEPS=[ 105 | 'storage_rocks_mock', 106 | '$BUILD_DIR/mongo/db/storage/sorted_data_interface_test_harness' 107 | ] 108 | ) 109 | 110 | 111 | env.CppUnitTest( 112 | target='storage_rocks_record_store_test', 113 | source=[ 114 | 'src/rocks_record_store_test.cpp' 115 | ], 116 | LIBDEPS=[ 117 | '$BUILD_DIR/mongo/db/auth/authmocks', 118 | '$BUILD_DIR/mongo/db/storage/record_store_test_harness', 119 | '$BUILD_DIR/mongo/db/repl/repl_coordinator_interface', 120 | '$BUILD_DIR/mongo/db/repl/replmocks', 121 | '$BUILD_DIR/mongo/util/clock_source_mock', 122 | 'storage_rocks_mock', 123 | ] 124 | ) 125 | 126 | env.CppUnitTest( 127 | target='storage_rocks_recovery_unit_test', 128 | source=[ 129 | 
'src/rocks_recovery_unit_test.cpp', 130 | ], 131 | LIBDEPS=[ 132 | 'storage_rocks_mock', 133 | '$BUILD_DIR/mongo/util/clock_source_mock', 134 | '$BUILD_DIR/mongo/db/storage/test_harness_helper', 135 | ], 136 | LIBDEPS_PRIVATE=[ 137 | '$BUILD_DIR/mongo/db/auth/authmocks', 138 | '$BUILD_DIR/mongo/db/index/index_access_methods', 139 | '$BUILD_DIR/mongo/db/repl/repl_coordinator_interface', 140 | '$BUILD_DIR/mongo/db/repl/replmocks', 141 | ], 142 | ) 143 | 144 | env.CppUnitTest( 145 | target='totdb_test', 146 | source=[ 147 | 'src/totdb/totransaction_test.cpp', 148 | ], 149 | LIBDEPS=[ 150 | 'storage_rocks_mock', 151 | ], 152 | ) 153 | -------------------------------------------------------------------------------- /src/rocks_oplog_manager.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2017 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. 
If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #pragma once 31 | 32 | #include "mongo/platform/mutex.h" 33 | #include "mongo/stdx/condition_variable.h" 34 | #include "mongo/stdx/thread.h" 35 | #include "mongo/util/concurrency/with_lock.h" 36 | #include "rocks_engine.h" 37 | #include "rocks_record_store.h" 38 | 39 | namespace rocksdb { 40 | class TOTransactionDB; 41 | } // namespace rocksdb 42 | 43 | namespace mongo { 44 | class RocksEngine; 45 | 46 | // Manages oplog visibility, by periodically querying RocksDB's all_committed timestamp value 47 | // and 48 | // then using that timestamp for all transactions that read the oplog collection. 49 | class RocksOplogManager { 50 | RocksOplogManager(const RocksOplogManager&) = delete; 51 | RocksOplogManager& operator=(const RocksOplogManager&) = delete; 52 | 53 | public: 54 | RocksOplogManager(rocksdb::TOTransactionDB* db, RocksEngine* kvEngine, 55 | RocksDurabilityManager* durabilityManager); 56 | virtual ~RocksOplogManager(){}; 57 | 58 | void init(rocksdb::TOTransactionDB* db, RocksDurabilityManager* durabilityManager); 59 | 60 | void start(OperationContext* opCtx, RocksRecordStore* oplogRecordStore); 61 | 62 | void halt(); 63 | 64 | bool isRunning() { 65 | stdx::lock_guard lk(_oplogVisibilityStateMutex); 66 | return _isRunning && !_shuttingDown; 67 | } 68 | // The oplogReadTimestamp is the timestamp used for oplog reads, to prevent readers from 69 | // reading past uncommitted transactions (which may create "holes" in the oplog after an 70 | // unclean shutdown). 
71 | std::uint64_t getOplogReadTimestamp() const; 72 | void setOplogReadTimestamp(Timestamp ts); 73 | 74 | // Triggers the oplogJournal thread to update its oplog read timestamp, by flushing the 75 | // journal. 76 | void triggerJournalFlush(); 77 | 78 | // Waits until all committed writes at this point to become visible (that is, no holes exist 79 | // in 80 | // the oplog.) 81 | void waitForAllEarlierOplogWritesToBeVisible(const RocksRecordStore* oplogRecordStore, 82 | OperationContext* opCtx); 83 | 84 | // Returns the all_durable timestamp. All transactions with timestamps earlier than the 85 | // all_durable timestamp are committed. 86 | Timestamp fetchAllDurableValue(); 87 | 88 | private: 89 | void _oplogJournalThreadLoop(RocksRecordStore* oplogRecordStore) noexcept; 90 | 91 | void _setOplogReadTimestamp(WithLock, uint64_t newTimestamp); 92 | 93 | stdx::thread _oplogJournalThread; 94 | mutable Mutex _oplogVisibilityStateMutex = 95 | MONGO_MAKE_LATCH("RocksOplogManager::_oplogVisibilityStateMutex"); 96 | mutable stdx::condition_variable 97 | _opsWaitingForJournalCV; // Signaled to trigger a journal flush. 98 | mutable stdx::condition_variable 99 | _opsBecameVisibleCV; // Signaled when a journal flush is complete. 100 | 101 | bool _isRunning = false; // Guarded by the oplogVisibilityStateMutex. 102 | bool _shuttingDown = false; // Guarded by oplogVisibilityStateMutex. 103 | 104 | bool _opsWaitingForJournal = false; // Guarded by oplogVisibilityStateMutex. 105 | 106 | // When greater than 0, indicates that there are operations waiting for oplog visibility, 107 | // and 108 | // journal flushing should not be delayed. 109 | std::int64_t _opsWaitingForVisibility = 0; // Guarded by oplogVisibilityStateMutex. 
110 | 111 | AtomicWord _oplogReadTimestamp; 112 | 113 | rocksdb::TOTransactionDB* _db; // not owned 114 | 115 | RocksEngine* _kvEngine; // not owned 116 | 117 | RocksDurabilityManager* _durabilityManager; // not owned 118 | }; 119 | } // namespace mongo 120 | -------------------------------------------------------------------------------- /src/totdb/totransaction_impl.h: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #ifndef ROCKSDB_LITE 3 | 4 | #include "rocksdb/db.h" 5 | #include "rocksdb/slice.h" 6 | #include "rocksdb/status.h" 7 | #include "rocksdb/types.h" 8 | #include "mongo/db/modules/rocks/src/totdb/totransaction.h" 9 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 10 | #include "rocksdb/utilities/write_batch_with_index.h" 11 | 12 | 13 | namespace rocksdb { 14 | 15 | using TxnKey = std::pair; 16 | 17 | // TimeStamp Ordering Transaction Options 18 | struct TOTxnOptions { 19 | size_t max_write_batch_size = 1000; 20 | Logger* log_ = nullptr; 21 | }; 22 | 23 | class TOTransactionDBImpl; 24 | 25 | class SwapSnapshotGuard { 26 | public: 27 | // No copying allowed 28 | SwapSnapshotGuard(const SwapSnapshotGuard&) = delete; 29 | SwapSnapshotGuard& operator=(const SwapSnapshotGuard&) = delete; 30 | SwapSnapshotGuard(ReadOptions* readOpt, const Snapshot* newSnapshot) { 31 | read_opt_ = readOpt; 32 | old_snapshot_ = read_opt_->snapshot; 33 | read_opt_->snapshot = newSnapshot; 34 | } 35 | ~SwapSnapshotGuard() { read_opt_->snapshot = old_snapshot_; } 36 | 37 | private: 38 | ReadOptions* read_opt_; 39 | const Snapshot* old_snapshot_; 40 | }; 41 | 42 | class TOTransactionImpl : public TOTransaction { 43 | public: 44 | struct ActiveTxnNode { 45 | // NOTE(deyukong): txn_id_ is indeed duplicated with txn_snapshot 46 | // consider using txn_snapshot 47 | TransactionID txn_id_; 48 | TransactionID commit_txn_id_; 49 | bool commit_ts_set_; 50 | RocksTimeStamp commit_ts_; 51 | RocksTimeStamp
first_commit_ts_; 52 | bool read_ts_set_; 53 | RocksTimeStamp read_ts_; 54 | char read_ts_buffer_[sizeof(RocksTimeStamp)]; 55 | Slice read_ts_slice_; 56 | bool prepare_ts_set_; 57 | RocksTimeStamp prepare_ts_; 58 | bool durable_ts_set_; 59 | RocksTimeStamp durable_ts_; 60 | bool timestamp_published_; 61 | bool timestamp_round_prepared_; 62 | bool timestamp_round_read_; 63 | bool read_only_; 64 | bool ignore_prepare_; 65 | std::atomic state_; 66 | const Snapshot* txn_snapshot; 67 | WriteBatchWithIndex write_batch_; 68 | 69 | public: 70 | ActiveTxnNode(const ActiveTxnNode&) = delete; 71 | ActiveTxnNode& operator=(const ActiveTxnNode&) = delete; 72 | ActiveTxnNode(); 73 | }; 74 | 75 | TOTransactionImpl(TOTransactionDB* db, 76 | const WriteOptions& options, 77 | const TOTxnOptions& txn_options, 78 | const std::shared_ptr& core); 79 | 80 | virtual ~TOTransactionImpl(); 81 | 82 | virtual Status SetPrepareTimeStamp(const RocksTimeStamp& timestamp) override; 83 | 84 | virtual Status SetCommitTimeStamp(const RocksTimeStamp& timestamp) override; 85 | 86 | virtual Status SetDurableTimeStamp(const RocksTimeStamp& timestamp) override; 87 | 88 | virtual Status SetReadTimeStamp(const RocksTimeStamp& timestamp) override; 89 | 90 | virtual Status GetReadTimeStamp(RocksTimeStamp* timestamp) const override; 91 | 92 | virtual Status Prepare() override; 93 | 94 | virtual Status Commit(std::function* hook = nullptr) override; 95 | 96 | virtual Status Rollback() override; 97 | 98 | virtual Status Get(ReadOptions& options, 99 | ColumnFamilyHandle* column_family, const Slice& key, 100 | std::string* value) override; 101 | 102 | virtual Status Get(ReadOptions& options, const Slice& key, 103 | std::string* value) override; 104 | 105 | virtual Iterator* GetIterator(ReadOptions& read_options) override; 106 | 107 | virtual Iterator* GetIterator(ReadOptions& read_options, 108 | ColumnFamilyHandle* column_family) override; 109 | 110 | virtual Status Put(ColumnFamilyHandle* column_family, const 
Slice& key, 111 | const Slice& value) override; 112 | 113 | virtual Status Put(const Slice& key, const Slice& value) override; 114 | 115 | virtual Status Delete(ColumnFamilyHandle* column_family, const Slice& key) override; 116 | 117 | virtual Status Delete(const Slice& key) override; 118 | 119 | virtual Status GetForUpdate(ColumnFamilyHandle* column_family, const Slice& key) override; 120 | 121 | virtual Status GetForUpdate(const Slice& key) override; 122 | 123 | virtual Status SetName(const TransactionName& name) override; 124 | 125 | virtual TransactionID GetID() const override; 126 | 127 | virtual TOTransactionState GetState() const override; 128 | 129 | virtual WriteBatchWithIndex* GetWriteBatch() override; 130 | 131 | const ActiveTxnNode* GetCore() const; 132 | 133 | // Check write conflict. If there is no write conflict, add the key to uncommitted keys 134 | Status CheckWriteConflict(const TxnKey& key); 135 | 136 | // Generate a new unique transaction identifier 137 | static TransactionID GenTxnID(); 138 | 139 | private: 140 | // Used to create unique ids for transactions. 141 | static std::atomic txn_id_counter_; 142 | 143 | // Unique ID for this transaction 144 | TransactionID txn_id_; 145 | 146 | // Updated keys in this transaction 147 | // TODO(deyukong): writtenKeys_ is duplicated with core_->Write_batch_, remove 148 | // this 149 | std::set written_keys_; 150 | 151 | std::set get_for_updates_; 152 | 153 | DB* db_; 154 | TOTransactionDBImpl* txn_db_impl_; 155 | 156 | WriteOptions write_options_; 157 | TOTxnOptions txn_option_; 158 | 159 | std::shared_ptr core_; 160 | 161 | std::vector asof_commit_timestamps_; 162 | }; 163 | 164 | } // namespace rocksdb 165 | #endif 166 | -------------------------------------------------------------------------------- /src/totdb/totransaction_prepare_iterator.h: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2011-present, Facebook, Inc. All rights reserved. 
2 | // This source code is licensed under both the GPLv2 (found in the 3 | // COPYING file in the root directory) and Apache 2.0 License 4 | // (found in the LICENSE.Apache file in the root directory). 5 | 6 | #pragma once 7 | #ifndef ROCKSDB_LITE 8 | 9 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db_impl.h" 10 | #include "mongo/db/modules/rocks/src/totdb/totransaction_impl.h" 11 | 12 | namespace rocksdb { 13 | 14 | // PrepareFilterIterator 15 | // | 16 | // --PrepareMergingIterator 17 | // | 18 | // -- PrepareMapIterator 19 | // | 20 | // -- BaseIterator 21 | // | 22 | // -- WriteBatchWithIndexIterator 23 | // | 24 | // -- Normal LsmTree Iterator 25 | // | 26 | // -- ...... 27 | // 28 | // 1) `PrepareFilterIterator` checks prepare status of an input entry and 29 | // decides 30 | // to return, wait, or advance to the next record 31 | // 2) PrepareMergingIterator arranges PrepareMapIterator and BaseIterator into 32 | // total-order. if PrepareMapIterator and BaseIterator has the same key, they 33 | // are both returned by `ShadowValue`, it is impossible that the same key 34 | // comes from me(or WriteBatchWithIndexIterator) because the only operations 35 | // after `prepare` is rollback or commit 36 | 37 | class PrepareMapIterator { 38 | public: 39 | PrepareMapIterator(ColumnFamilyHandle* cf, PrepareHeap* ph, 40 | TOTransactionImpl::ActiveTxnNode* core) 41 | : cf_(cf), ph_(ph), core_(core), valid_(false) {} 42 | 43 | bool Valid() const { return valid_; } 44 | 45 | void SeekToFirst(); 46 | 47 | void SeekToLast(); 48 | 49 | void Seek(const Slice& target); 50 | 51 | void SeekForPrev(const Slice& target); 52 | 53 | void Next(); 54 | 55 | void Prev(); 56 | 57 | Slice key() const; 58 | 59 | const std::shared_ptr& value() const; 60 | 61 | TOTransaction::TOTransactionState valueState() const; 62 | 63 | TOTransaction::TOTransactionState UpdatePrepareState(); 64 | 65 | PrepareHeap* getPrepareHeapMap() const; 66 | 67 | private: 68 | void 
TryPosValueToCorrectMvccVersionInLock( 69 | const std::list>& 70 | prepare_mvccs); 71 | 72 | ColumnFamilyHandle* cf_; // not owned 73 | 74 | PrepareHeap* ph_; // not owned 75 | 76 | TOTransactionImpl::ActiveTxnNode* core_; // not owned 77 | 78 | bool forward_; 79 | 80 | bool valid_; 81 | 82 | std::string pos_; 83 | 84 | std::shared_ptr val_; 85 | 86 | TOTransaction::TOTransactionState val_state_; 87 | }; 88 | 89 | struct ShadowValue { 90 | bool has_prepare; 91 | bool has_base; 92 | TOTransactionImpl::ActiveTxnNode* prepare_value; 93 | TOTransaction::TOTransactionState prepare_value_state; 94 | Slice base_value; 95 | Slice base_key; 96 | RocksTimeStamp base_timestamp; 97 | }; 98 | 99 | class PrepareMergingIterator { 100 | public: 101 | PrepareMergingIterator(std::unique_ptr base_iterator, 102 | std::unique_ptr pmap_iterator); 103 | 104 | bool Valid() const; 105 | 106 | Slice key() const; 107 | 108 | ShadowValue value() const; 109 | 110 | void SeekToFirst(); 111 | 112 | void SeekToLast(); 113 | 114 | void Seek(const Slice& k); 115 | 116 | void SeekForPrev(const Slice& k); 117 | 118 | void Next(); 119 | 120 | void Prev(); 121 | 122 | Status status() const; 123 | 124 | TOTransaction::TOTransactionState UpdatePrepareState(); 125 | 126 | private: 127 | void Advance(); 128 | 129 | void AdvanceDelta(); 130 | 131 | void AdvanceBase(); 132 | 133 | void AssertInvariants(); 134 | 135 | void UpdateCurrent(); 136 | 137 | bool BaseValid() const; 138 | 139 | bool DeltaValid() const; 140 | 141 | bool forward_; 142 | 143 | bool current_at_base_; 144 | 145 | bool equal_keys_; 146 | 147 | Status status_; 148 | 149 | std::unique_ptr base_iterator_; 150 | 151 | std::unique_ptr delta_iterator_; 152 | 153 | const Comparator* comparator_; // not owned 154 | }; 155 | 156 | class PrepareFilterIterator : public Iterator { 157 | public: 158 | PrepareFilterIterator(DB* db, ColumnFamilyHandle* cf, 159 | const std::shared_ptr& core, 160 | std::unique_ptr input, 161 | Logger* info_log = nullptr); 
162 | 163 | bool Valid() const final; 164 | 165 | Slice key() const final; 166 | 167 | Slice value() const final; 168 | 169 | void SeekToFirst() final; 170 | 171 | void SeekToLast() final; 172 | 173 | void Seek(const Slice& k) final; 174 | 175 | void SeekForPrev(const Slice& k) final; 176 | 177 | void Next() final; 178 | 179 | void Prev() final; 180 | 181 | Status status() const; 182 | 183 | private: 184 | // rocksdb internal api, for sanity check 185 | // WBWIIteratorImpl::Result GetFromBatch(WriteBatchWithIndex* batch, 186 | // const Slice& key, 187 | // std::string* val); 188 | 189 | void AdvanceInputNoFilter(); 190 | 191 | void UpdateCurrent(); 192 | 193 | void UpdatePrepareState(); 194 | 195 | DB* db_; 196 | 197 | ColumnFamilyHandle* cf_; 198 | 199 | // Iterator's lifetime should be shorter than the Transaction who created it. 200 | // So here core_ should be a raw pointer rather than shared_ptr. However, 201 | // MultiIndexBlock::insertAllDocumentsInCollection breaks this. The cursor 202 | // in exec has longer lifetime than WriteUnitOfWork, so we have to workaround 203 | // with shared_ptr. 204 | std::shared_ptr core_; 205 | 206 | std::unique_ptr input_; 207 | 208 | Slice key_; 209 | 210 | std::string val_; 211 | 212 | ShadowValue sval_; 213 | 214 | bool valid_; 215 | 216 | bool forward_; 217 | 218 | Status status_; 219 | }; 220 | 221 | } // namespace rocksdb 222 | 223 | #endif // ROCKSDB_LITE 224 | -------------------------------------------------------------------------------- /src/rocks_compaction_scheduler.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 
7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | 29 | #pragma once 30 | 31 | #include 32 | #include 33 | #include 34 | #include 35 | #include 36 | #include 37 | #include 38 | 39 | #include "mongo/base/status.h" 40 | #include "mongo/platform/mutex.h" 41 | #include "mongo/util/timer.h" 42 | #include "mongo/util/concurrency/notification.h" 43 | #include "mongo/bson/bsonobj.h" 44 | 45 | namespace rocksdb { 46 | class CompactionFilterFactory; 47 | class ColumnFamilyHandle; 48 | class DB; 49 | class Iterator; 50 | struct WriteOptions; 51 | class WriteBatch; 52 | class TOTransaction; 53 | class TOTransactionDB; 54 | } 55 | 56 | namespace mongo { 57 | 58 | class CompactionBackgroundJob; 59 | 60 | struct OplogDelCompactStats { 61 | uint64_t oplogEntriesDeleted; 62 | uint64_t oplogSizeDeleted; 63 | uint64_t oplogCompactSkip; 64 | uint64_t oplogCompactKeep; 65 | }; 66 | 67 | class RocksCompactionScheduler { 68 | public: 69 | RocksCompactionScheduler(); 70 | ~RocksCompactionScheduler(); 71 | 72 | void start(rocksdb::TOTransactionDB* db, rocksdb::ColumnFamilyHandle* cf); 73 | void stop(); 74 | 75 | static int getSkippedDeletionsThreshold() { return kSkippedDeletionsThreshold; } 76 | 77 | void reportSkippedDeletionsAboveThreshold(rocksdb::ColumnFamilyHandle* cf, const std::string& prefix); 78 | 79 | // schedule compact range operation for execution in _compactionThread 80 | void compactAll(); 81 | Status compactOplog(rocksdb::ColumnFamilyHandle* cf, const std::string& begin, const std::string& end); 82 | 83 | rocksdb::CompactionFilterFactory* createCompactionFilterFactory(); 84 | std::unordered_map getDroppedPrefixes() const; 85 | boost::optional>> getOplogDeleteUntil() const; 86 | 87 | // load dropped prefixes, and re-schedule compaction of each dropped prefix. 88 | // as we don't know which cf a prefix exists in, we have to compact each prefix out of each cf. 89 | // since prefix is globally unique, we don't worry about deleting unexpected data.
90 | uint32_t loadDroppedPrefixes(rocksdb::Iterator* iter, const std::vector); 91 | Status dropPrefixesAtomic(rocksdb::ColumnFamilyHandle* cf, 92 | const std::vector& prefixesToDrop, 93 | rocksdb::TOTransaction* txn, 94 | const BSONObj& debugInfo); 95 | void notifyCompacted(const std::string& begin, const std::string& end, bool rangeDropped, 96 | bool opSucceeded); 97 | 98 | // add to the count of deleted oplog entries 99 | void addOplogEntriesDeleted(const uint64_t entries); 100 | // add to the total size of deleted oplog data 101 | void addOplogSizeDeleted(const uint64_t size); 102 | // add to the count of oplog entries removed by compaction 103 | void addOplogCompactRemoved(); 104 | // add to the count of oplog entries preserved by compaction 105 | void addOplogCompactPreserved(); 106 | // query all oplog delete and compaction stats 107 | const OplogDelCompactStats getOplogDelCompactStats() const; 108 | 109 | private: 110 | void compactPrefix(rocksdb::ColumnFamilyHandle* cf, const std::string& prefix); 111 | void compactDroppedPrefix(rocksdb::ColumnFamilyHandle* cf, const std::string& prefix); 112 | void compact(rocksdb::ColumnFamilyHandle* cf, const std::string& begin, 113 | const std::string& end, bool rangeDropped, uint32_t order, 114 | boost::optional>>); 115 | void droppedPrefixCompacted(const std::string& prefix, bool opSucceeded); 116 | 117 | private: 118 | Mutex _lock = MONGO_MAKE_LATCH("RocksCompactionScheduler::_lock"); 119 | // protected by _lock 120 | Timer _timer; 121 | 122 | rocksdb::TOTransactionDB* _db; // not owned 123 | 124 | // not owned, cf where compaction_scheduler's metadata exists.
125 | rocksdb::ColumnFamilyHandle* _metaCf; 126 | 127 | // Don't trigger compactions more often than every 10min 128 | static const int kMinCompactionIntervalMins = 10; 129 | // We'll compact the prefix if any operation on the prefix reports more than 50.000 130 | // deletions it had to skip over (this is about 10ms extra overhead) 131 | static const int kSkippedDeletionsThreshold = 50000; 132 | 133 | // thread for async execution of range compactions 134 | std::unique_ptr _compactionJob; 135 | 136 | // set of all prefixes that are deleted. we delete them in the background thread 137 | 138 | mutable Mutex _droppedDataMutex = 139 | MONGO_MAKE_LATCH("RocksCompactionScheduler::_droppedDataMutex"); 140 | 141 | std::unordered_map _droppedPrefixes; 142 | 143 | std::atomic _droppedPrefixesCount; 144 | boost::optional>> _oplogDeleteUntil; 145 | static const std::string kDroppedPrefix; 146 | 147 | std::atomic _oplogEntriesDeleted; 148 | std::atomic _oplogSizeDeleted; 149 | std::atomic _oplogCompactSkip; 150 | std::atomic _oplogCompactKeep; 151 | }; 152 | } // namespace mongo 153 | -------------------------------------------------------------------------------- /src/totdb/totransaction_db.h: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #ifndef ROCKSDB_LITE 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | 9 | #include "mongo/db/modules/rocks/src/totdb/totransaction.h" 10 | #include "mongo/util/str.h" 11 | #include "rocksdb/comparator.h" 12 | #include "rocksdb/db.h" 13 | #include "rocksdb/utilities/stackable_db.h" 14 | #include "third_party/s2/util/coding/coder.h" 15 | 16 | namespace rocksdb { 17 | 18 | //TimeStamp Ordering Transaction DB Options 19 | #define DEFAULT_NUM_STRIPES 32 20 | 21 | struct TOTransactionStat { 22 | size_t max_conflict_bytes; 23 | size_t cur_conflict_bytes; 24 | size_t uk_num; 25 | size_t ck_num; 26 | size_t alive_txns_num; 27 | size_t read_q_num; 28 | size_t commit_q_num; 29 | uint64_t 
oldest_ts; 30 | uint64_t min_read_ts; 31 | uint64_t max_commit_ts; 32 | uint64_t committed_max_txnid; 33 | uint64_t min_uncommit_ts; 34 | uint64_t update_max_commit_ts_times; 35 | uint64_t update_max_commit_ts_retries; 36 | uint64_t txn_commits; 37 | uint64_t txn_aborts; 38 | uint64_t commit_without_ts_times; 39 | uint64_t read_without_ts_times; 40 | uint64_t read_with_ts_times; 41 | uint64_t read_q_walk_len_sum; 42 | uint64_t read_q_walk_times; 43 | uint64_t commit_q_walk_len_sum; 44 | uint64_t commit_q_walk_times; 45 | }; 46 | 47 | struct TOTransactionDBOptions { 48 | size_t num_stripes = DEFAULT_NUM_STRIPES; 49 | size_t max_conflict_check_bytes_size = 200 * 1024 * 1024; 50 | TOTransactionDBOptions() {} 51 | TOTransactionDBOptions(int max_conflict_check_bytes_size_mb) 52 | : max_conflict_check_bytes_size(static_cast<size_t>(max_conflict_check_bytes_size_mb) * 53 | 1024 * 1024) {} 54 | }; 55 | 56 | enum TimeStampType { 57 | kOldest = 0, 58 | kStable = 1, // kStable is not used 59 | kCommitted = 2, 60 | kAllCommitted = 3, 61 | kTimeStampMax, 62 | }; 63 | 64 | Status PrepareConflict(); 65 | bool IsPrepareConflict(const Status& s); 66 | 67 | // Timestamp Ordering Transaction Options 68 | struct TOTransactionOptions { 69 | size_t max_write_batch_size = 1000; 70 | // Whether or not to round up to the oldest timestamp when the read timestamp 71 | // is behind it. 72 | bool timestamp_round_read = false; 73 | // If true, the prepare timestamp will be rounded up to the oldest timestamp 74 | // if found to be earlier, 75 | // and the commit timestamp will be rounded up to the prepare timestamp if 76 | // found to be earlier. 77 | // If false, does not round up the prepare and commit timestamps of a prepared 78 | // transaction.
79 | bool timestamp_round_prepared = false; 80 | 81 | bool read_only = false; 82 | 83 | bool ignore_prepare = false; 84 | }; 85 | 86 | class TOComparator : public Comparator { 87 | public: 88 | TOComparator() : Comparator(sizeof(RocksTimeStamp)), cmp_without_ts_(BytewiseComparator()) {} 89 | TOComparator(size_t ts_size) : Comparator(ts_size), cmp_without_ts_(BytewiseComparator()) {} 90 | 91 | static size_t TimestampSize() { return sizeof(RocksTimeStamp); } 92 | const char* Name() const override { return "TOComparator"; } 93 | 94 | void FindShortSuccessor(std::string*) const override {} 95 | 96 | void FindShortestSeparator(std::string*, const Slice&) const override {} 97 | 98 | int Compare(const Slice& a, const Slice& b) const override { 99 | invariant(timestamp_size() > 0); 100 | int r = cmp_without_ts_->Compare(StripTimestampFromUserKey(a, timestamp_size()), 101 | StripTimestampFromUserKey(b, timestamp_size())); 102 | if (r != 0) { 103 | return r; 104 | } 105 | return -CompareTimestamp( 106 | Slice(a.data() + a.size() - timestamp_size(), timestamp_size()), 107 | Slice(b.data() + b.size() - timestamp_size(), timestamp_size())); 108 | } 109 | 110 | int CompareWithoutTimestamp(const Slice& a, bool a_has_ts, const Slice& b, 111 | bool b_has_ts) const override { 112 | invariant(timestamp_size() > 0); 113 | if (a_has_ts) { 114 | invariant(a.size() >= timestamp_size()); 115 | } 116 | if (b_has_ts) { 117 | invariant(b.size() >= timestamp_size()); 118 | } 119 | Slice lhs = a_has_ts ? StripTimestampFromUserKey(a, timestamp_size()) : a; 120 | Slice rhs = b_has_ts ? 
StripTimestampFromUserKey(b, timestamp_size()) : b; 121 | return cmp_without_ts_->Compare(lhs, rhs); 122 | } 123 | 124 | int CompareTimestamp(const Slice& ts1, const Slice& ts2) const override; 125 | 126 | static const Slice StripTimestampFromUserKey(const Slice& user_key, size_t ts_sz) { 127 | Slice ret = user_key; 128 | ret.remove_suffix(ts_sz); 129 | return ret; 130 | } 131 | 132 | void forceSetOldestTs(RocksTimeStamp ts); 133 | void clearSetOldestTs(); 134 | 135 | private: 136 | const Comparator* cmp_without_ts_; 137 | }; 138 | 139 | class ShouldNotCheckOldestTsBlock { 140 | ShouldNotCheckOldestTsBlock(const ShouldNotCheckOldestTsBlock&) = delete; 141 | ShouldNotCheckOldestTsBlock& operator=(const ShouldNotCheckOldestTsBlock&) = delete; 142 | 143 | public: 144 | explicit ShouldNotCheckOldestTsBlock(TOComparator* comparator, RocksTimeStamp ts) 145 | : _to_comparator(comparator) { 146 | invariant(_to_comparator); 147 | _to_comparator->forceSetOldestTs(ts); 148 | } 149 | 150 | ~ShouldNotCheckOldestTsBlock() { 151 | _to_comparator->clearSetOldestTs(); 152 | } 153 | 154 | private: 155 | TOComparator* const _to_comparator; 156 | }; 157 | 158 | 159 | 160 | 161 | class TOTransactionDB : public StackableDB { 162 | public: 163 | static Status Open(const Options& options, 164 | const TOTransactionDBOptions& txn_db_options, 165 | const std::string& dbname, 166 | const std::string stableTsKey, 167 | TOTransactionDB** dbptr); 168 | 169 | static Status Open(const DBOptions& db_options, 170 | const TOTransactionDBOptions& txn_db_options, 171 | const std::string& dbname, 172 | const std::vector& open_cfds, 173 | std::vector* handles, 174 | const std::vector& trim_cfds, 175 | const bool trimHistory, 176 | const std::string stableTsKey, 177 | TOTransactionDB** dbptr); 178 | 179 | virtual void SetMaxConflictBytes(uint64_t bytes) = 0; 180 | 181 | // The lifecycle of returned pointer should be managed by the application level 182 | virtual TOTransaction* BeginTransaction( 183 | 
const WriteOptions& write_options, 184 | const TOTransactionOptions& txn_options) = 0; 185 | 186 | virtual Status SetTimeStamp(const TimeStampType& ts_type, 187 | const RocksTimeStamp& ts, bool force = false) = 0; 188 | 189 | virtual Status QueryTimeStamp(const TimeStampType& ts_type, RocksTimeStamp* timestamp) = 0; 190 | 191 | virtual Status Stat(TOTransactionStat* stat) = 0; 192 | //virtual Status Close(); 193 | 194 | virtual std::unique_ptr makeTxn() = 0; 195 | 196 | protected: 197 | //std::shared_ptr info_log_ = nullptr; 198 | // To Create an ToTransactionDB, call Open() 199 | explicit TOTransactionDB(DB* db) : StackableDB(db) {} 200 | }; 201 | 202 | } // namespace rocksdb 203 | 204 | #endif 205 | 206 | -------------------------------------------------------------------------------- /src/rocks_record_store_test.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. 
You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 27 | */ 28 | 29 | #include "mongo/platform/basic.h" 30 | 31 | #include 32 | #include 33 | #include 34 | 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include "mongo/db/modules/rocks/src/totdb/totransaction.h" 40 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 41 | 42 | #include "mongo/base/checked_cast.h" 43 | #include "mongo/base/init.h" 44 | #include "mongo/base/string_data.h" 45 | #include "mongo/bson/bsonobjbuilder.h" 46 | #include "mongo/db/concurrency/write_conflict_exception.h" 47 | #include "mongo/db/json.h" 48 | #include "mongo/db/operation_context_noop.h" 49 | #include "mongo/db/repl/repl_settings.h" 50 | #include "mongo/db/repl/replication_coordinator_mock.h" 51 | #include "mongo/db/service_context.h" 52 | #include "mongo/db/storage/kv/kv_engine_test_harness.h" 53 | #include "mongo/db/storage/kv/kv_prefix.h" 54 | #include "mongo/db/storage/record_store_test_harness.h" 55 | #include "mongo/stdx/memory.h" 56 | #include "mongo/unittest/temp_dir.h" 57 | #include "mongo/unittest/unittest.h" 58 | #include "mongo/util/clock_source_mock.h" 59 | #include "mongo/util/fail_point.h" 60 | #include "mongo/util/scopeguard.h" 61 | 62 | #include "rocks_compaction_scheduler.h" 63 | #include "rocks_oplog_manager.h" 64 | #include "rocks_record_store.h" 65 | #include "rocks_recovery_unit.h" 66 | #include "rocks_snapshot_manager.h" 67 | 68 | namespace mongo { 69 | 70 | using std::string; 71 | 72 | class RocksHarnessHelper final : 
public RecordStoreHarnessHelper { 73 | public: 74 | RocksHarnessHelper() 75 | : _dbpath("rocks_test"), 76 | _engine(_dbpath.path(), true /* durable */, 3 /* kRocksFormatVersion */, 77 | false /* readOnly */) { 78 | repl::ReplicationCoordinator::set(serviceContext(), 79 | std::make_unique( 80 | serviceContext(), repl::ReplSettings())); 81 | } 82 | 83 | virtual ~RocksHarnessHelper() {} 84 | 85 | virtual std::unique_ptr newNonCappedRecordStore() { 86 | return newNonCappedRecordStore("a.b"); 87 | } 88 | 89 | std::unique_ptr newNonCappedRecordStore(const std::string& ns) { 90 | RocksRecoveryUnit* ru = dynamic_cast(_engine.newRecoveryUnit()); 91 | OperationContextNoop opCtx(ru); 92 | RocksRecordStore::Params params; 93 | params.ns = ns; 94 | params.ident = "1"; 95 | params.prefix = "prefix"; 96 | params.isCapped = false; 97 | params.cappedMaxSize = -1; 98 | params.cappedMaxDocs = -1; 99 | return stdx::make_unique(&_engine, _engine.getCf_ForTest(ns), &opCtx, params); 100 | } 101 | 102 | std::unique_ptr newCappedRecordStore(int64_t cappedMaxSize, 103 | int64_t cappedMaxDocs) final { 104 | return newCappedRecordStore("a.b", cappedMaxSize, cappedMaxDocs); 105 | } 106 | 107 | std::unique_ptr newCappedRecordStore(const std::string& ns, 108 | int64_t cappedMaxSize, 109 | int64_t cappedMaxDocs) { 110 | RocksRecoveryUnit* ru = dynamic_cast(_engine.newRecoveryUnit()); 111 | OperationContextNoop opCtx(ru); 112 | struct RocksRecordStore::Params params; 113 | params.ns = ns; 114 | params.ident = "1"; 115 | params.prefix = "prefix"; 116 | params.isCapped = true; 117 | params.cappedMaxSize = cappedMaxSize; 118 | params.cappedMaxDocs = cappedMaxDocs; 119 | return stdx::make_unique(&_engine, _engine.getCf_ForTest(ns), &opCtx, params); 120 | } 121 | 122 | std::unique_ptr newRecoveryUnit() final { 123 | return std::unique_ptr(_engine.newRecoveryUnit()); 124 | } 125 | 126 | bool supportsDocLocking() final { return true; } 127 | 128 | RocksEngine* getEngine() { return &_engine; } 129 | 130 | 
private: 131 | unittest::TempDir _dbpath; 132 | ClockSourceMock _cs; 133 | 134 | RocksEngine _engine; 135 | }; 136 | 137 | std::unique_ptr makeHarnessHelper() { 138 | return stdx::make_unique(); 139 | } 140 | 141 | MONGO_INITIALIZER(RegisterHarnessFactory)(InitializerContext* const) { 142 | mongo::registerHarnessHelperFactory(makeHarnessHelper); 143 | return Status::OK(); 144 | } 145 | 146 | TEST(RocksRecordStoreTest, CounterManager1) { 147 | std::unique_ptr harnessHelper(new RocksHarnessHelper()); 148 | std::unique_ptr rs(harnessHelper->newNonCappedRecordStore()); 149 | 150 | int N = 12; 151 | 152 | { 153 | ServiceContext::UniqueOperationContext opCtx(harnessHelper->newOperationContext()); 154 | { 155 | WriteUnitOfWork uow(opCtx.get()); 156 | for (int i = 0; i < N; i++) { 157 | StatusWith res = rs->insertRecord(opCtx.get(), "a", 2, Timestamp()); 158 | ASSERT_OK(res.getStatus()); 159 | } 160 | uow.commit(); 161 | } 162 | } 163 | 164 | { 165 | ServiceContext::UniqueOperationContext opCtx(harnessHelper->newOperationContext()); 166 | ASSERT_EQUALS(N, rs->numRecords(opCtx.get())); 167 | } 168 | 169 | { 170 | ServiceContext::UniqueOperationContext opCtx(harnessHelper->newOperationContext()); 171 | rs = harnessHelper->newNonCappedRecordStore(); 172 | ASSERT_EQUALS(N, rs->numRecords(opCtx.get())); 173 | } 174 | rs.reset(nullptr); // this has to be deleted before ss 175 | } 176 | 177 | } // namespace mongo 178 | -------------------------------------------------------------------------------- /src/rocks_prepare_conflict.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2018-present MongoDB, Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the Server Side Public License, version 1, 6 | * as published by MongoDB, Inc. 
7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * Server Side Public License for more details. 12 | * 13 | * You should have received a copy of the Server Side Public License 14 | * along with this program. If not, see 15 | * . 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the Server Side Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #pragma once 31 | 32 | #include 33 | 34 | #include "mongo/db/curop.h" 35 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 36 | #include "mongo/db/prepare_conflict_tracker.h" 37 | #include "mongo/util/fail_point_service.h" 38 | #include "rocks_recovery_unit.h" 39 | 40 | #include 41 | 42 | namespace mongo { 43 | 44 | // When set, simulates returning a rocks prepare conflict status. 45 | MONGO_FAIL_POINT_DECLARE(RocksPrepareConflictForReads); 46 | 47 | // When set, rocksdb::Status::Busy is returned in place of retrying on prepare conflict errors.
48 | MONGO_FAIL_POINT_DECLARE(RocksSkipPrepareConflictRetries); 49 | 50 | MONGO_FAIL_POINT_DECLARE(RocksPrintPrepareConflictLog); 51 | 52 | /** 53 | * Logs a message with the number of prepare conflict retry attempts. 54 | */ 55 | void rocksPrepareConflictLog(int attempt); 56 | 57 | /** 58 | * Logs a message to confirm we've hit the RocksPrintPrepareConflictLog fail point. 59 | */ 60 | void rocksPrepareConflictFailPointLog(); 61 | 62 | /** 63 | * Runs the argument function f as many times as needed for f to return a status other than 64 | * a prepare conflict. Each time f returns a prepare conflict we wait until a prepared unit 65 | * of work commits or aborts, and then try f again. Imposes no upper limit on the number of 66 | * times to re-try f, so any required timeout behavior must be enforced within f. The function f 67 | * must return an error code. 68 | */ 69 | template <typename F> 70 | rocksdb::Status rocksPrepareConflictRetry(OperationContext* opCtx, F&& f) { 71 | invariant(opCtx); 72 | 73 | auto recoveryUnit = RocksRecoveryUnit::getRocksRecoveryUnit(opCtx); 74 | int attempts = 1; 75 | // If we return from this function, we have either returned successfully or we've 76 | // returned an error other than conflict. Reset PrepareConflictTracker accordingly. 77 | ON_BLOCK_EXIT([opCtx] { PrepareConflictTracker::get(opCtx).endPrepareConflict(); }); 78 | // If the failpoint is enabled, don't call the function, just simulate a conflict. 79 | rocksdb::Status s = MONGO_FAIL_POINT(RocksPrepareConflictForReads) 80 | ? rocksdb::PrepareConflict() 81 | : ROCKS_READ_CHECK(f()); 82 | if (!IsPrepareConflict(s)) return s; 83 | 84 | PrepareConflictTracker::get(opCtx).beginPrepareConflict(); 85 | 86 | // It is contradictory to be running into a prepare conflict when we are ignoring 87 | // interruptions, particularly when running code inside an 88 | // OperationContext::runWithoutInterruptionExceptAtGlobalShutdown block.
89 | // Operations executed in this way are expected to be set to ignore prepare conflicts. 90 | invariant(!opCtx->isIgnoringInterrupts()); 91 | 92 | if (MONGO_FAIL_POINT(RocksPrintPrepareConflictLog)) { 93 | rocksPrepareConflictFailPointLog(); 94 | } 95 | 96 | CurOp::get(opCtx)->debug().additiveMetrics.incrementPrepareReadConflicts(1); 97 | rocksPrepareConflictLog(attempts); 98 | 99 | const auto lockerInfo = opCtx->lockState()->getLockerInfo(boost::none); 100 | invariant(lockerInfo); 101 | for (const auto& lock : lockerInfo->locks) { 102 | const auto type = lock.resourceId.getType(); 103 | // If a user operation on secondaries acquires a lock in MODE_S and then blocks on a 104 | // prepare 105 | // conflict with a prepared transaction, deadlock will occur at the commit time of the 106 | // prepared transaction when it attempts to reacquire (since locks were yielded on 107 | // secondaries) an IX lock that conflicts with the MODE_S lock held by the user 108 | // operation. 109 | // User operations that acquire MODE_X locks and block on prepare conflicts could lead 110 | // to 111 | // the same problem. However, user operations on secondaries should never hold MODE_X 112 | // locks. 113 | // Since prepared transactions will not reacquire RESOURCE_MUTEX / RESOURCE_METADATA 114 | // locks 115 | // at commit time, these lock types are safe. Therefore, invariant here that we do not 116 | // get a 117 | // prepare conflict while holding a global, database, or collection MODE_S lock (or 118 | // MODE_X 119 | // lock for completeness). 
120 | if (type == RESOURCE_GLOBAL || type == RESOURCE_DATABASE || type == RESOURCE_COLLECTION) 121 | invariant(lock.mode != MODE_S && lock.mode != MODE_X, 122 | str::stream() 123 | << lock.resourceId.toString() << " in " << modeName(lock.mode)); 124 | } 125 | 126 | if (MONGO_FAIL_POINT(RocksSkipPrepareConflictRetries)) { 127 | // Callers of rocksPrepareConflictRetry() should eventually convert the returned status via 128 | // invariantRocksOK() and have the rocksdb::Status::Busy error bubble up as a 129 | // WriteConflictException. Enabling the "skipWriteConflictRetries" failpoint in 130 | // conjunction with the "RocksSkipPrepareConflictRetries" failpoint prevents the higher 131 | // layers from retrying the entire operation. 132 | return rocksdb::Status::Busy("failpoint simulate"); 133 | } 134 | 135 | while (true) { 136 | attempts++; 137 | auto lastCount = recoveryUnit->getDurabilityManager()->getPrepareCommitOrAbortCount(); 138 | // If the failpoint is enabled, don't call the function, just simulate a conflict. 139 | rocksdb::Status s = MONGO_FAIL_POINT(RocksPrepareConflictForReads) 140 | ? rocksdb::PrepareConflict() 141 | : ROCKS_READ_CHECK(f()); 142 | 143 | if (!IsPrepareConflict(s)) return s; 144 | 145 | CurOp::get(opCtx)->debug().additiveMetrics.incrementPrepareReadConflicts(1); 146 | rocksPrepareConflictLog(attempts); 147 | 148 | // Wait on the durability manager to signal that a prepared unit of work has been committed or 149 | // aborted. 150 | recoveryUnit->getDurabilityManager()->waitUntilPreparedUnitOfWorkCommitsOrAborts( 151 | opCtx, lastCount); 152 | } 153 | } 154 | } // namespace mongo 155 | -------------------------------------------------------------------------------- /src/rocks_index.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc.
3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | 29 | #include "mongo/db/storage/sorted_data_interface.h" 30 | 31 | #include 32 | #include 33 | 34 | #include 35 | 36 | #include "mongo/bson/ordering.h" 37 | #include "mongo/db/index/index_descriptor.h" 38 | #include "mongo/db/storage/key_string.h" 39 | 40 | #pragma once 41 | 42 | namespace rocksdb { 43 | class DB; 44 | class ColumnFamilyHandle; 45 | } 46 | 47 | namespace mongo { 48 | 49 | class RocksRecoveryUnit; 50 | 51 | class RocksIndexBase : public SortedDataInterface { 52 | RocksIndexBase(const RocksIndexBase&) = delete; 53 | RocksIndexBase& operator=(const RocksIndexBase&) = delete; 54 | 55 | public: 56 | RocksIndexBase(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf, std::string prefix, 57 | std::string ident, Ordering order, const BSONObj& config); 58 | 59 | virtual SortedDataBuilderInterface* getBulkBuilder(OperationContext* opCtx, 60 | bool dupsAllowed) = 0; 61 | 62 | virtual void fullValidate(OperationContext* opCtx, long long* numKeysOut, 63 | ValidateResults* fullResults) const; 64 | 65 | virtual bool appendCustomStats(OperationContext* /* opCtx */, BSONObjBuilder* /* output */, 66 | double /* scale */) const { 67 | // nothing to say here, really 68 | return false; 69 | } 70 | 71 | virtual bool isEmpty(OperationContext* opCtx); 72 | 73 | virtual long long getSpaceUsedBytes(OperationContext* opCtx) const; 74 | 75 | virtual Status initAsEmpty(OperationContext* opCtx); 76 | 77 | static void generateConfig(BSONObjBuilder* configBuilder, int formatVersion, 78 | IndexDescriptor::IndexVersion descVersion); 79 | 80 | protected: 81 | static std::string _makePrefixedKey(const std::string& prefix, const KeyString& encodedKey); 82 | 83 | rocksdb::DB* _db; // not owned 84 | 85 | rocksdb::ColumnFamilyHandle* _cf; // not owned 86 | 87 | // Each key in the index is prefixed with _prefix 88 | std::string _prefix; 89 | std::string _ident; 90 | 91 | // very approximate index storage size 92 | std::atomic _indexStorageSize; 93 | 94 | // used to construct 
RocksCursors 95 | const Ordering _order; 96 | KeyString::Version _keyStringVersion; 97 | 98 | class StandardBulkBuilder; 99 | class UniqueBulkBuilder; 100 | friend class UniqueBulkBuilder; 101 | }; 102 | 103 | class RocksUniqueIndex : public RocksIndexBase { 104 | public: 105 | RocksUniqueIndex(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf, std::string prefix, 106 | std::string ident, Ordering order, const BSONObj& config, 107 | std::string collectionNamespace, std::string indexName, 108 | const BSONObj& keyPattern, bool partial = false, bool isIdIdx = false); 109 | 110 | virtual StatusWith insert(OperationContext* opCtx, 111 | const BSONObj& key, const RecordId& loc, 112 | bool dupsAllowed); 113 | virtual void unindex(OperationContext* opCtx, const BSONObj& key, const RecordId& loc, 114 | bool dupsAllowed); 115 | virtual std::unique_ptr newCursor(OperationContext* opCtx, 116 | bool forward) const; 117 | 118 | virtual Status dupKeyCheck(OperationContext* opCtx, const BSONObj& key); 119 | 120 | virtual SortedDataBuilderInterface* getBulkBuilder(OperationContext* opCtx, 121 | bool dupsAllowed) override; 122 | 123 | private: 124 | StatusWith _insertTimestampSafe(OperationContext* opCtx, 125 | const BSONObj& key, 126 | const RecordId& loc, 127 | bool dupsAllowed); 128 | 129 | StatusWith _insertTimestampUnsafe(OperationContext* opCtx, 130 | const BSONObj& key, 131 | const RecordId& loc, 132 | bool dupsAllowed); 133 | 134 | void _unindexTimestampUnsafe(OperationContext* opCtx, const BSONObj& key, 135 | const RecordId& loc, bool dupsAllowed); 136 | 137 | void _unindexTimestampSafe(OperationContext* opCtx, const BSONObj& key, const RecordId& loc, 138 | bool dupsAllowed); 139 | 140 | bool _keyExistsTimestampSafe(OperationContext* opCtx, const KeyString& prefixedKey); 141 | 142 | std::string _collectionNamespace; 143 | std::string _indexName; 144 | const BSONObj _keyPattern; 145 | const bool _partial; 146 | const bool _isIdIndex; 147 | }; 148 | 149 | class 
RocksStandardIndex : public RocksIndexBase { 150 | public: 151 | RocksStandardIndex(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf, std::string prefix, 152 | std::string ident, Ordering order, const BSONObj& config); 153 | 154 | virtual StatusWith insert(OperationContext* opCtx, 155 | const BSONObj& key, const RecordId& loc, 156 | bool dupsAllowed); 157 | virtual void unindex(OperationContext* opCtx, const BSONObj& key, const RecordId& loc, 158 | bool dupsAllowed); 159 | virtual std::unique_ptr newCursor(OperationContext* opCtx, 160 | bool forward) const; 161 | virtual Status dupKeyCheck(OperationContext* opCtx, const BSONObj& key) { 162 | // dupKeyCheck shouldn't be called for non-unique indexes 163 | invariant(false); 164 | return Status::OK(); 165 | } 166 | 167 | virtual SortedDataBuilderInterface* getBulkBuilder(OperationContext* opCtx, 168 | bool dupsAllowed) override; 169 | 170 | void enableSingleDelete() { useSingleDelete = true; } 171 | 172 | private: 173 | bool useSingleDelete; 174 | }; 175 | 176 | } // namespace mongo 177 | -------------------------------------------------------------------------------- /src/rocks_init.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 
16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #include "mongo/platform/basic.h" 31 | 32 | #include "mongo/base/init.h" 33 | #include "mongo/db/service_context.h" 34 | #include "mongo/db/storage/storage_engine_impl.h" 35 | #include "mongo/db/storage/storage_engine_init.h" 36 | #include "mongo/db/storage/storage_engine_metadata.h" 37 | #include "mongo/db/storage/storage_options.h" 38 | #include "mongo/util/str.h" 39 | 40 | #include "mongo/db/modules/rocks/src/rocks_parameters_gen.h" 41 | #include "rocks_engine.h" 42 | #include "rocks_server_status.h" 43 | 44 | #if __has_feature(address_sanitizer) 45 | #include 46 | #endif 47 | 48 | namespace mongo { 49 | const std::string kRocksDBEngineName = "rocksdb"; 50 | 51 | namespace { 52 | 53 | class RocksFactory : public StorageEngine::Factory { 54 | public: 55 | virtual ~RocksFactory() {} 56 | virtual StorageEngine* create(const StorageGlobalParams& params, 57 | const StorageEngineLockFile* lockFile) const { 58 | StorageEngineOptions options; 59 | options.directoryPerDB = params.directoryperdb; 60 | options.forRepair = params.repair; 61 | // Mongo keeps some files in params.dbpath. 
To avoid collision, put our files under 62 | // db/ directory 63 | if (formatVersion == -1) { 64 | // it's a new database, set it to the newest RocksDB format version, kRocksFormatVersion 65 | formatVersion = kRocksFormatVersion; 66 | } 67 | auto engine = new RocksEngine(params.dbpath + "/db", params.dur, formatVersion, 68 | params.readOnly); 69 | // Intentionally leaked. 70 | auto leaked __attribute__((unused)) = new RocksServerStatusSection(engine); 71 | auto leaked2 __attribute__((unused)) = new RocksRateLimiterServerParameter( 72 | "rocksdbRateLimiter", ServerParameterType::kRuntimeOnly); 73 | auto leaked3 __attribute__((unused)) = new RocksBackupServerParameter( 74 | "rocksdbBackup", ServerParameterType::kRuntimeOnly); 75 | auto leaked4 __attribute__((unused)) = new RocksCompactServerParameter( 76 | "rocksdbCompact", ServerParameterType::kRuntimeOnly); 77 | auto leaked5 __attribute__((unused)) = new RocksCacheSizeParameter( 78 | "rocksdbRuntimeConfigCacheSizeGB", ServerParameterType::kRuntimeOnly); 79 | auto leaked6 __attribute__((unused)) = 80 | new RocksOptionsParameter("rocksdbOptions", ServerParameterType::kRuntimeOnly); 81 | auto leaked7 __attribute__((unused)) = new RocksdbMaxConflictCheckSizeParameter( 82 | "rocksdbRuntimeConfigMaxWriteMBPerSec", ServerParameterType::kRuntimeOnly); 83 | leaked2->_data = engine; 84 | leaked3->_data = engine; 85 | leaked4->_data = engine; 86 | leaked5->_data = engine; 87 | leaked6->_data = engine; 88 | leaked7->_data = engine; 89 | 90 | return new StorageEngineImpl(engine, options); 91 | } 92 | 93 | virtual StringData getCanonicalName() const { return kRocksDBEngineName; } 94 | 95 | virtual Status validateMetadata(const StorageEngineMetadata& metadata, 96 | const StorageGlobalParams& params) const { 97 | const BSONObj& options = metadata.getStorageEngineOptions(); 98 | BSONElement element = options.getField(kRocksFormatVersionString); 99 | if (element.eoo() || !element.isNumber()) { 100 | return
Status(ErrorCodes::UnsupportedFormat, 101 | "Storage engine metadata format not recognized. If you created " 102 | "this database with older version of mongo, please reload the " 103 | "database using mongodump and mongorestore"); 104 | } 105 | if (element.numberInt() < kMinSupportedRocksFormatVersion) { 106 | // database is older than what we can understand 107 | return Status( 108 | ErrorCodes::UnsupportedFormat, 109 | str::stream() 110 | << "Database was created with old format version " 111 | << element.numberInt() 112 | << " and this version only supports format versions from " 113 | << kMinSupportedRocksFormatVersion << " to " << kRocksFormatVersion 114 | << ". Please reload the database using mongodump and mongorestore"); 115 | } else if (element.numberInt() > kRocksFormatVersion) { 116 | // database is newer than what we can understand 117 | return Status( 118 | ErrorCodes::UnsupportedFormat, 119 | str::stream() 120 | << "Database was created with newer format version " 121 | << element.numberInt() 122 | << " and this version only supports format versions from " 123 | << kMinSupportedRocksFormatVersion << " to " << kRocksFormatVersion 124 | << ". Please reload the database using mongodump and mongorestore"); 125 | } 126 | formatVersion = element.numberInt(); 127 | return Status::OK(); 128 | } 129 | 130 | virtual BSONObj createMetadataOptions(const StorageGlobalParams& params) const { 131 | BSONObjBuilder builder; 132 | builder.append(kRocksFormatVersionString, kRocksFormatVersion); 133 | return builder.obj(); 134 | } 135 | 136 | bool supportsReadOnly() const final { return false; } 137 | 138 | private: 139 | // Current disk format. We bump this number when we change the disk format. MongoDB will 140 | // fail to start if the versions don't match. In that case a user needs to run mongodump 141 | // and mongorestore. 
142 | // * Version 0 was the format with many column families -- one column family for each 143 | // collection and index 144 | // * Version 1 keeps all collections and indexes in a single column family 145 | // * Version 2 reserves two prefixes for oplog. one prefix keeps the oplog 146 | // documents and another only keeps keys. That way, we can cleanup the oplog without 147 | // reading full documents 148 | // * Version 3 (current) understands the Decimal128 index format. It also understands 149 | // the version 2, so it's backwards compatible, but not forward compatible 150 | const int kRocksFormatVersion = 3; 151 | const int kMinSupportedRocksFormatVersion = 2; 152 | const std::string kRocksFormatVersionString = "rocksFormatVersion"; 153 | int mutable formatVersion = -1; 154 | }; 155 | 156 | ServiceContext::ConstructorActionRegisterer registerRocks( 157 | "RocksEngineInit", [](ServiceContext* service) { 158 | registerStorageEngine(service, std::make_unique()); 159 | }); 160 | } // namespace 161 | } // namespace mongo 162 | -------------------------------------------------------------------------------- /src/rocks_global_options.idl: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2019-present MongoDB, Inc. 2 | # 3 | # This program is free software: you can redistribute it and/or modify 4 | # it under the terms of the Server Side Public License, version 1, 5 | # as published by MongoDB, Inc. 6 | # 7 | # This program is distributed in the hope that it will be useful, 8 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 9 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 10 | # Server Side Public License for more details. 11 | # 12 | # You should have received a copy of the Server Side Public License 13 | # along with this program. If not, see 14 | # . 
15 | # 16 | # As a special exception, the copyright holders give permission to link the 17 | # code of portions of this program with the OpenSSL library under certain 18 | # conditions as described in each individual source file and distribute 19 | # linked combinations including the program with the OpenSSL library. You 20 | # must comply with the Server Side Public License in all respects for 21 | # all of the code used other than as permitted herein. If you modify file(s) 22 | # with this exception, you may extend this exception to your version of the 23 | # file(s), but you are not obligated to do so. If you do not wish to do so, 24 | # delete this exception statement from your version. If you delete this 25 | # exception statement from all source files in the program, then also delete 26 | # it in the license file. 27 | # 28 | 29 | global: 30 | cpp_namespace: "mongo" 31 | cpp_includes: 32 | - "mongo/db/modules/rocks/src/rocks_global_options.h" 33 | configs: 34 | section: 'RocksDB options' 35 | source: [ cli, ini, yaml ] 36 | 37 | configs: 38 | # Rocks storage engine options 39 | "storage.rocksdb.cacheSizeGB": 40 | description: >- 41 | maximum amount of memory to allocate for cache; 42 | Defaults to 3/10 of physical RAM 43 | arg_vartype: Int 44 | cpp_varname: 'rocksGlobalOptions.cacheSizeGB' 45 | short_name: rocksdbCacheSizeGB 46 | validator: 47 | gte: 0 48 | lte: 10000 49 | "storage.rocksdb.compression": 50 | description: >- 51 | block compression algorithm for collection data [none|snappy|zlib|lz4|lz4hc] 52 | arg_vartype: String 53 | cpp_varname: 'rocksGlobalOptions.compression' 54 | short_name: rocksdbCompression 55 | default: 'snappy' 56 | validator: 57 | callback: 'RocksGlobalOptions::validateRocksdbCompressor' 58 | "storage.rocksdb.maxWriteMBPerSec": 59 | description: >- 60 | Maximum speed that RocksDB will write to storage. Reducing this can 61 | help reduce read latency spikes during compactions. 
However, reducing this 62 | below a certain point might slow down writes. Defaults to 1GB/sec 63 | arg_vartype: Int 64 | cpp_varname: 'rocksGlobalOptions.maxWriteMBPerSec' 65 | short_name: rocksdbMaxWriteMBPerSec 66 | default: 1024 67 | validator: 68 | gte: 1 69 | lte: 1024 70 | "storage.rocksdb.configString": 71 | description: 'RocksDB storage engine custom' 72 | arg_vartype: String 73 | cpp_varname: 'rocksGlobalOptions.configString' 74 | short_name: rocksdbConfigString 75 | hidden: true 76 | "storage.rocksdb.crashSafeCounters": 77 | description: >- 78 | If true, numRecord and dataSize counter will be consistent 79 | even after power failure. If false, numRecord and dataSize 80 | might be a bit inconsistent after power failure, but 81 | should be correct under normal conditions. Setting this to 82 | true will make database inserts a bit slower 83 | arg_vartype: Bool 84 | cpp_varname: 'rocksGlobalOptions.crashSafeCounters' 85 | short_name: rocksdbCrashSafeCounters 86 | default: false 87 | hidden: true 88 | "storage.rocksdb.counters": 89 | description: 'This is still experimental. Use this only if you know what you are doing' 90 | arg_vartype: Bool 91 | cpp_varname: 'rocksGlobalOptions.counters' 92 | short_name: rocksdbCounters 93 | default: true 94 | "storage.rocksdb.singleDeleteIndex": 95 | description: 'This is still experimental. Use this only if you know what you are doing' 96 | arg_vartype: Bool 97 | cpp_varname: 'rocksGlobalOptions.singleDeleteIndex' 98 | short_name: rocksdbSingleDeleteIndex 99 | default: false 100 | "storage.rocksdb.logLevel": 101 | description: >- 102 | rocksdb log level [debug|info|warn|error] 103 | arg_vartype: String 104 | cpp_varname: 'rocksGlobalOptions.logLevel' 105 | short_name: rocksdbLogLevel 106 | default: 'info' 107 | validator: 108 | callback: 'RocksGlobalOptions::validateRocksdbLogLevel' 109 | "storage.rocksdb.maxConflictCheckSizeMB": 110 | description: 'This is still experimental. 
Use this only if you know what you are doing' 111 | arg_vartype: Int 112 | cpp_varname: 'rocksGlobalOptions.maxConflictCheckSizeMB' 113 | short_name: rocksdbMaxConflictCheckSizeMB 114 | default: 200 115 | validator: 116 | gte: 1 117 | lte: 10000 118 | "storage.rocksdb.maxBackgroundJobs": 119 | description: 'rocksdb engine max background jobs' 120 | arg_vartype: Int 121 | cpp_varname: 'rocksGlobalOptions.maxBackgroundJobs' 122 | short_name: rocksdbMaxBackgroundJobs 123 | default: 2 124 | "storage.rocksdb.maxTotalWalSize": 125 | description: 'rocksdb engine max total wal size' 126 | arg_vartype: Long 127 | cpp_varname: 'rocksGlobalOptions.maxTotalWalSize' 128 | short_name: rocksdbMaxTotalWalSize 129 | # 100 MB 130 | default: 104857600 131 | "storage.rocksdb.dbWriteBufferSize": 132 | description: 'rocksdb engine db write buffer size' 133 | arg_vartype: Long 134 | cpp_varname: 'rocksGlobalOptions.dbWriteBufferSize' 135 | short_name: rocksdbDbWriteBufferSize 136 | # 128 MB 137 | default: 134217728 138 | "storage.rocksdb.writeBufferSize": 139 | description: 'rocksdb engine write buffer size' 140 | arg_vartype: Long 141 | cpp_varname: 'rocksGlobalOptions.writeBufferSize' 142 | short_name: rocksdbWriteBufferSize 143 | # 16 MB 144 | default: 16777216 145 | "storage.rocksdb.delayedWriteRate": 146 | description: 'rocksdb engine delay write rate' 147 | arg_vartype: Long 148 | cpp_varname: 'rocksGlobalOptions.delayedWriteRate' 149 | short_name: rocksdbDelayedWriteRate 150 | # 512 MB 151 | default: 536870912 152 | "storage.rocksdb.numLevels": 153 | description: 'rocksdb engine num levels' 154 | arg_vartype: Int 155 | cpp_varname: 'rocksGlobalOptions.numLevels' 156 | short_name: rocksdbNumLevels 157 | default: 5 158 | "storage.rocksdb.maxWriteBufferNumber": 159 | description: 'rocksdb engine max write buffer number' 160 | arg_vartype: Int 161 | cpp_varname: 'rocksGlobalOptions.maxWriteBufferNumber' 162 | short_name: rocksdbMaxWriteBufferNumber 163 | default: 4 164 | 
"storage.rocksdb.level0FileNumCompactionTrigger": 165 | description: 'rocksdb engine level0 file num compaction trigger' 166 | arg_vartype: Int 167 | cpp_varname: 'rocksGlobalOptions.level0FileNumCompactionTrigger' 168 | short_name: rocksdbLevel0FileNumCompactionTrigger 169 | default: 4 170 | "storage.rocksdb.level0SlowdownWritesTrigger": 171 | description: 'rocksdb engine level0 slowdown writes trigger' 172 | arg_vartype: Int 173 | cpp_varname: 'rocksGlobalOptions.level0SlowdownWritesTrigger' 174 | short_name: rocksdbLevel0SlowdownWritesTrigger 175 | default: 128 176 | "storage.rocksdb.level0_stop_writes_trigger": 177 | description: 'rocksdb engine level0 stop writes trigger' 178 | arg_vartype: Int 179 | cpp_varname: 'rocksGlobalOptions.level0StopWritesTrigger' 180 | short_name: rocksdbLevel0StopWritesTrigger 181 | default: 512 182 | "storage.rocksdb.maxBytesForLevelBase": 183 | description: 'rocksdb engine max bytes for level base' 184 | arg_vartype: Long 185 | cpp_varname: 'rocksGlobalOptions.maxBytesForLevelBase' 186 | short_name: rocksdbMaxBytesForLevelBase 187 | # 512 MB 188 | default: 536870912 189 | "storage.rocksdb.softPendingCompactionMBLimit": 190 | description: 'rocksdb engine soft pending compaction MB limit' 191 | arg_vartype: Int 192 | cpp_varname: 'rocksGlobalOptions.softPendingCompactionMBLimit' 193 | short_name: rocksdbSoftPendingCompactionMBLimit 194 | # 300 GB 195 | default: 307200 196 | "storage.rocksdb.hardPendingCompactionMBLimit": 197 | description: 'rocksdb engine hard pending compaction MB limit' 198 | arg_vartype: Int 199 | cpp_varname: 'rocksGlobalOptions.hardPendingCompactionMBLimit' 200 | short_name: rocksdbHardPendingCompactionMBLimit 201 | # 500 GB 202 | default: 512000 203 | -------------------------------------------------------------------------------- /src/mongo_rate_limiter_checker.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc.
3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see <http://www.gnu.org/licenses/>. 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file.
28 | */ 29 | 30 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kControl 31 | 32 | #include "mongo_rate_limiter_checker.h" 33 | 34 | #ifdef __linux__ 35 | #include "mongo/db/modules/rocks/src/rocks_parameters_gen.h" 36 | #include "rocks_util.h" 37 | #include "mongo/db/server_options.h" 38 | #include "mongo/db/service_context.h" 39 | #include "mongo/db/storage/storage_options.h" 40 | #include "mongo/util/background.h" 41 | #include "mongo/util/exit.h" 42 | #include "mongo/util/log.h" 43 | #include "mongo/util/time_support.h" 44 | #include "mongo/db/service_context.h" 45 | 46 | namespace mongo { 47 | 48 | /** 49 | * Thread for Mongo Rate Limiter Checker 50 | */ 51 | class MongoRateLimiterChecker : public BackgroundJob { 52 | public: 53 | std::string name() const { 54 | return "MongoRateLimiterChecker"; 55 | } 56 | 57 | Status init() { 58 | log() << "[Mongo Rate Limiter Checker]: inited: " 59 | << "disk is " << getMongoRateLimitParameter().getDisk() << "; " 60 | << "iops is " << getMongoRateLimitParameter().getIops() << "; " 61 | << "mbps is " << getMongoRateLimitParameter().getMbps() << "; "; 62 | if (getMongoRateLimitParameter().getDisk() == "") { 63 | return Status(ErrorCodes::InternalError, str::stream() << "disk is empty"); 64 | } 65 | if (_mongoRateLimiter.get() == nullptr) { 66 | _mongoRateLimiter = std::make_unique(rocksdb::NewGenericRateLimiter(kInitMongoRateLimitRequestTokens)); 67 | } 68 | 69 | // read /proc/diskstats data 70 | auto readStatus = readProcDiskStats(getMongoRateLimitParameter().getDisk()); 71 | if (!readStatus.isOK()) { 72 | return Status(ErrorCodes::InternalError, 73 | str::stream() << "failed to read /proc/diskstats: " 74 | << readStatus.getStatus().toString()); 75 | } 76 | 77 | DiskStats currStats(readStatus.getValue().getField(getMongoRateLimitParameter().getDisk()).Obj()); 78 | _prevStats = currStats; 79 | 80 | return Status::OK(); 81 | } 82 | 83 | void run() { 84 | while (!globalInShutdownDeprecated()) { 85 | 
invariant(_mongoRateLimiter.get() != nullptr); 86 | 87 | // Check whether the request tokens were exhausted during the last second 88 | _mongoRateLimiter->resetRequestTokens(); 89 | int64_t requestTokensBefore = _mongoRateLimiter->getRequestTokens(); 90 | sleepsecs(1); 91 | int64_t requestTokensAfter = _mongoRateLimiter->getRequestTokens(); 92 | int64_t requestTokensGap = requestTokensAfter - requestTokensBefore; 93 | bool exhausted = requestTokensGap >= _mongoRateLimiter->getTokensPerSecond(); 94 | 95 | // read /proc/diskstats data 96 | auto readStatus = readProcDiskStats(getMongoRateLimitParameter().getDisk()); 97 | if (!readStatus.isOK()) { 98 | log() << "[Mongo Rate Limiter Checker]: disk is " << getMongoRateLimitParameter().getDisk() << "; readStatus is " 99 | << readStatus.getStatus().toString() << ";"; 100 | break; 101 | } 102 | 103 | DiskStats currStats(readStatus.getValue().getField(getMongoRateLimitParameter().getDisk()).Obj()); 104 | auto ratioStatus = calculateResetRatio(exhausted, currStats); 105 | if (ratioStatus.isOK()) { 106 | int64_t newTokens = static_cast(_mongoRateLimiter->getTokensPerSecond() * 107 | ratioStatus.getValue()); 108 | _mongoRateLimiter->resetTokensPerSecond( 109 | std::max(newTokens, static_cast(kMinMongoRateLimitRequestTokens))); 110 | } 111 | 112 | _prevStats = currStats; 113 | } 114 | } 115 | 116 | StatusWith readProcDiskStats(const std::string& targetDisk) { 117 | if (targetDisk.empty()) { 118 | return Status(ErrorCodes::InternalError, str::stream() << "disk info is empty"); 119 | } 120 | 121 | std::vector disks{targetDisk}; 122 | BSONObjBuilder disksBuilder; 123 | auto status = parseProcDiskStatsFileByDevNo("/proc/diskstats", disks, &disksBuilder); 124 | if (!status.isOK()) { 125 | return Status(ErrorCodes::InternalError, 126 | str::stream() << "failed to read /proc/diskstats: " << status.toString()); 127 | } 128 | disksBuilder.doneFast(); 129 | 130 | auto wholeDiskInfo = disksBuilder.obj(); 131 | if (!wholeDiskInfo.hasField(targetDisk)) { 132 | return
Status(ErrorCodes::InternalError, 133 | str::stream() << "no disk(" << targetDisk << ") info: " << wholeDiskInfo); 134 | } 135 | return wholeDiskInfo; 136 | } 137 | 138 | StatusWith calculateResetRatio(bool exhausted, DiskStats& currStats) { 139 | // NOTE: iops and kbps are normalized to per-second values; ratio scales the sampled interval (in microseconds) to 1 second 140 | double ratio = 1000000 / static_cast(currStats.micros - _prevStats.micros); 141 | uint64_t iops = 0; 142 | uint64_t kbps = 0; 143 | if (currStats.reads >= _prevStats.reads && currStats.writes >= _prevStats.writes) { 144 | iops = static_cast( 145 | ((currStats.reads - _prevStats.reads) + (currStats.writes - _prevStats.writes)) * 146 | ratio); 147 | } 148 | if (currStats.read_sectors >= _prevStats.read_sectors && 149 | currStats.write_sectors >= _prevStats.write_sectors) { 150 | // NOTE: a sector is 512 bytes, so dividing the sector count by 2 converts it to KB 151 | kbps = static_cast(((currStats.read_sectors - _prevStats.read_sectors) + 152 | (currStats.write_sectors - _prevStats.write_sectors)) * 153 | ratio / 2); 154 | } 155 | if (iops == 0 || kbps == 0) { 156 | return Status(ErrorCodes::InternalError, 157 | str::stream() << "do not reset: iops is " << iops << "; kbps is " 158 | << kbps); 159 | } 160 | double resetRatio = std::min(getMongoRateLimitParameter().getIops() / static_cast(iops), 161 | (getMongoRateLimitParameter().getMbps() * 1024) / static_cast(kbps)); 162 | double actualResetRatio = std::max(0.9, std::min(resetRatio, 1.1)); 163 | bool willReset = 164 | exhausted || iops >= getMongoRateLimitParameter().getIops() || kbps >= getMongoRateLimitParameter().getMbps() * 1024; 165 | LOG(1) << "[Mongo Rate Limiter Checker]: exhausted: " << exhausted 166 | << "; tokens: " << _mongoRateLimiter->getTokensPerSecond() << "; iops: " << iops 167 | << "/" << getMongoRateLimitParameter().getIops() << "; kbps: " << kbps << "/" 168 | << (getMongoRateLimitParameter().getMbps() * 1024) << "; resetRatio: " << resetRatio << "=>" 169 | << actualResetRatio << "; (" << (willReset ?
"true" : "false") << ");"; 170 | if (willReset) { 171 | return actualResetRatio; 172 | } 173 | return Status(ErrorCodes::InternalError, 174 | str::stream() << "do not reset: not meet condition"); 175 | } 176 | 177 | MongoRateLimiter* getMongoRateLimiter() { 178 | return _mongoRateLimiter.get(); 179 | } 180 | private: 181 | std::unique_ptr _mongoRateLimiter; 182 | DiskStats _prevStats; 183 | }; 184 | 185 | namespace { 186 | // Only one instance of the MongoRateLimiterChecker exists 187 | MongoRateLimiterChecker mongoRateLimiterChecker; 188 | } 189 | 190 | MongoRateLimiter* getMongoRateLimiter() { 191 | return mongoRateLimiterChecker.getMongoRateLimiter(); 192 | } 193 | 194 | void startMongoRateLimiterChecker() { 195 | auto status = mongoRateLimiterChecker.init(); 196 | if (!status.isOK()) { 197 | log() << "[Mongo Rate Limiter Checker]: init failed; status is " << status.toString() << ";"; 198 | return; 199 | } 200 | mongoRateLimiterChecker.go(); 201 | } 202 | 203 | } 204 | #endif 205 | -------------------------------------------------------------------------------- /src/rocks_record_store_mongod.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 
16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 28 | */ 29 | 30 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 31 | 32 | #include "mongo/platform/basic.h" 33 | 34 | #include 35 | #include 36 | 37 | #include "mongo/base/checked_cast.h" 38 | #include "mongo/db/catalog/collection.h" 39 | #include "mongo/db/catalog/database.h" 40 | #include "mongo/db/catalog/database_holder.h" 41 | #include "mongo/db/client.h" 42 | #include "mongo/db/concurrency/d_concurrency.h" 43 | #include "mongo/db/db_raii.h" 44 | #include "mongo/db/dbdirectclient.h" 45 | #include "mongo/db/namespace_string.h" 46 | #include "mongo/db/operation_context.h" 47 | #include "mongo/db/service_context.h" 48 | #include "mongo/db/session_txn_record_gen.h" 49 | #include "mongo/util/background.h" 50 | #include "mongo/util/exit.h" 51 | #include "mongo/util/log.h" 52 | 53 | #include "rocks_engine.h" 54 | #include "rocks_record_store.h" 55 | #include "rocks_recovery_unit.h" 56 | 57 | namespace mongo { 58 | 59 | namespace { 60 | Timestamp getoldestPrepareTs(OperationContext* opCtx) { 61 | auto alterClient = opCtx->getServiceContext()->makeClient("get-oldest-prepared-txn"); 62 | AlternativeClientRegion 
acr(alterClient); 63 | const auto tmpOpCtx = cc().makeOperationContext(); 64 | tmpOpCtx->recoveryUnit()->setTimestampReadSource( 65 | RecoveryUnit::ReadSource::kNoTimestamp); 66 | DBDirectClient client(tmpOpCtx.get()); 67 | Query query = QUERY("txnState" 68 | << "kPrepared") 69 | .sort("lastWriteOpTime", 1); 70 | auto c = client.query(NamespaceString::kSessionTransactionsTableNamespace, query, 1); 71 | if (c->more()) { 72 | auto raw = c->next(); 73 | SessionTxnRecord record = 74 | SessionTxnRecord::parse(IDLParserErrorContext("init prepared txns"), raw); 75 | return record.getLastWriteOpTime().getTimestamp(); 76 | } 77 | return Timestamp::max(); 78 | } 79 | 80 | std::set<NamespaceString> _backgroundThreadNamespaces; 81 | Mutex _backgroundThreadMutex; 82 | 83 | class RocksRecordStoreThread : public BackgroundJob { 84 | public: 85 | RocksRecordStoreThread(const NamespaceString& ns) 86 | : BackgroundJob(true /* deleteSelf */), _ns(ns) { 87 | _name = std::string("RocksRecordStoreThread-for-") + _ns.toString(); 88 | } 89 | 90 | virtual std::string name() const { return _name; } 91 | 92 | /** 93 | * @return true if any oplog records were deleted. 94 | */ 95 | bool _deleteExcessDocuments() { 96 | if (!getGlobalServiceContext()->getStorageEngine()) { 97 | LOG(1) << "no global storage engine yet"; 98 | return false; 99 | } 100 | auto engine = getGlobalServiceContext()->getStorageEngine(); 101 | const auto opCtx = cc().makeOperationContext(); 102 | 103 | try { 104 | const Timestamp oldestPreparedTxnTs = getoldestPrepareTs(opCtx.get()); 105 | // A Global IX lock should be good enough to protect the oplog truncation from 106 | // interruptions such as restartCatalog. PBWM, database lock or collection lock is not 107 | // needed. This improves concurrency if oplog truncation takes a long time.
108 | ShouldNotConflictWithSecondaryBatchApplicationBlock shouldNotConflictBlock( 109 | opCtx.get()->lockState()); 110 | Lock::GlobalLock lk(opCtx.get(), MODE_IX); 111 | 112 | RocksRecordStore* rs = nullptr; 113 | { 114 | // Release the database lock right away because we don't want to 115 | // block other operations on the local database and given the 116 | // fact that oplog collection is so special, Global IX lock can 117 | // make sure the collection exists. 118 | Lock::DBLock dbLock(opCtx.get(), _ns.db(), MODE_IX); 119 | auto databaseHolder = DatabaseHolder::get(opCtx.get()); 120 | auto db = databaseHolder->getDb(opCtx.get(), _ns.db()); 121 | if (!db) { 122 | LOG(2) << "no local database yet"; 123 | return false; 124 | } 125 | // We need to hold the database lock while getting the collection. Otherwise a 126 | // concurrent collection creation would write to the map in the Database object 127 | // while we concurrently read the map. 128 | Collection* collection = db->getCollection(opCtx.get(), _ns); 129 | if (!collection) { 130 | LOG(2) << "no collection " << _ns; 131 | return false; 132 | } 133 | rs = checked_cast<RocksRecordStore*>(collection->getRecordStore()); 134 | } 135 | if (!engine->supportsRecoverToStableTimestamp()) { 136 | // For non-RTT storage engines, the oplog can always be truncated. 137 | return rs->reclaimOplog(opCtx.get(), oldestPreparedTxnTs); 138 | } 139 | const auto lastStableCheckpointTsPtr = engine->getLastStableRecoveryTimestamp(); 140 | Timestamp lastStableCheckpointTimestamp = 141 | lastStableCheckpointTsPtr ?
*lastStableCheckpointTsPtr : Timestamp::min(); 142 | Timestamp persistedTimestamp = 143 | std::min(oldestPreparedTxnTs, lastStableCheckpointTimestamp); 144 | return rs->reclaimOplog(opCtx.get(), persistedTimestamp); 145 | } catch (const ExceptionForCat<ErrorCategory::Interruption>&) { 146 | return false; 147 | } catch (const std::exception& e) { 148 | severe() << "error in RocksRecordStoreThread: " << redact(e.what()); 149 | fassertFailedNoTrace(!"error in RocksRecordStoreThread"); 150 | } catch (...) { 151 | fassertFailedNoTrace(!"unknown error in RocksRecordStoreThread"); 152 | } 153 | MONGO_UNREACHABLE 154 | } 155 | 156 | virtual void run() { 157 | ThreadClient tc(_name, getGlobalServiceContext()); 158 | 159 | while (!globalInShutdownDeprecated()) { 160 | bool removed = _deleteExcessDocuments(); 161 | LOG(2) << "RocksRecordStoreThread deleted " << removed; 162 | if (!removed) { 163 | // If we removed 0 documents, sleep a bit in case we're on a laptop 164 | // or something to be nice. 165 | sleepmillis(1000); 166 | } else { 167 | // wake up every 100ms 168 | sleepmillis(100); 169 | } 170 | } 171 | 172 | log() << "shutting down"; 173 | } 174 | 175 | private: 176 | NamespaceString _ns; 177 | std::string _name; 178 | }; 179 | 180 | } // namespace 181 | 182 | // static 183 | bool RocksEngine::initRsOplogBackgroundThread(StringData ns) { 184 | if (!NamespaceString::oplog(ns)) { 185 | return false; 186 | } 187 | 188 | if (storageGlobalParams.repair || storageGlobalParams.readOnly) { 189 | LOG(1) << "not starting RocksRecordStoreThread for " << ns 190 | << " because we are either in repair or read-only mode"; 191 | return false; 192 | } 193 | 194 | stdx::lock_guard lock(_backgroundThreadMutex); 195 | NamespaceString nss(ns); 196 | if (_backgroundThreadNamespaces.count(nss)) { 197 | log() << "RocksRecordStoreThread " << ns << " already started"; 198 | } else { 199 | log() << "Starting RocksRecordStoreThread " << ns; 200 | BackgroundJob* backgroundThread = new RocksRecordStoreThread(nss); 201 |
backgroundThread->go(); 202 | _backgroundThreadNamespaces.insert(nss); 203 | } 204 | return true; 205 | } 206 | 207 | } // namespace mongo 208 | -------------------------------------------------------------------------------- /src/rocks_parameters.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2014 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * This program is distributed in the hope that it will be useful, 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 11 | * GNU Affero General Public License for more details. 12 | * 13 | * You should have received a copy of the GNU Affero General Public License 14 | * along with this program. If not, see . 15 | * 16 | * As a special exception, the copyright holders give permission to link the 17 | * code of portions of this program with the OpenSSL library under certain 18 | * conditions as described in each individual source file and distribute 19 | * linked combinations including the program with the OpenSSL library. You 20 | * must comply with the GNU Affero General Public License in all respects for 21 | * all of the code used other than as permitted herein. If you modify file(s) 22 | * with this exception, you may extend this exception to your version of the 23 | * file(s), but you are not obligated to do so. If you do not wish to do so, 24 | * delete this exception statement from your version. If you delete this 25 | * exception statement from all source files in the program, then also delete 26 | * it in the license file. 
27 | */ 28 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 29 | 30 | #include "mongo/platform/basic.h" 31 | 32 | #include "mongo/db/modules/rocks/src/rocks_parameters_gen.h" 33 | #include "rocks_util.h" 34 | 35 | #include "mongo/db/json.h" 36 | #include "mongo/logger/parse_log_component_settings.h" 37 | #include "mongo/util/log.h" 38 | #include "mongo/util/str.h" 39 | #include "rocks_global_options.h" 40 | 41 | #include 42 | #include 43 | #include 44 | #include 45 | #include 46 | #include 47 | #include 48 | 49 | namespace mongo { 50 | 51 | namespace { 52 | Status RocksRateLimiterServerParameterSet(int newNum, const std::string& name, 53 | RocksEngine* engine) { 54 | if (newNum <= 0) { 55 | return Status(ErrorCodes::BadValue, str::stream() << name << " has to be > 0"); 56 | } 57 | log() << "RocksDB: changing rate limiter to " << newNum << "MB/s"; 58 | engine->setMaxWriteMBPerSec(newNum); 59 | 60 | return Status::OK(); 61 | } 62 | } // namespace 63 | 64 | void RocksRateLimiterServerParameter::append(OperationContext* opCtx, BSONObjBuilder& b, 65 | const std::string& name) { 66 | b.append(name, _data->getMaxWriteMBPerSec()); 67 | } 68 | 69 | Status RocksRateLimiterServerParameter::set(const BSONElement& newValueElement) { 70 | if (!newValueElement.isNumber()) { 71 | return Status(ErrorCodes::BadValue, str::stream() << name() << " has to be a number"); 72 | } 73 | return RocksRateLimiterServerParameterSet(newValueElement.numberInt(), name(), _data); 74 | } 75 | 76 | Status RocksRateLimiterServerParameter::setFromString(const std::string& str) { 77 | int num = 0; 78 | Status status = parseNumberFromString(str, &num); 79 | if (!status.isOK()) return status; 80 | return RocksRateLimiterServerParameterSet(num, name(), _data); 81 | } 82 | 83 | void RocksBackupServerParameter::append(OperationContext* opCtx, BSONObjBuilder& b, 84 | const std::string& name) { 85 | b.append(name, ""); 86 | } 87 | 88 | Status RocksBackupServerParameter::set(const 
BSONElement& newValueElement) { 89 | auto str = newValueElement.str(); 90 | if (str.size() == 0) { 91 | return Status(ErrorCodes::BadValue, str::stream() << name() << " has to be a non-empty string"); 92 | } 93 | return setFromString(str); 94 | } 95 | 96 | Status RocksBackupServerParameter::setFromString(const std::string& str) { 97 | return _data->backup(str); 98 | } 99 | 100 | void RocksCompactServerParameter::append(OperationContext* opCtx, BSONObjBuilder& b, 101 | const std::string& name) { 102 | b.append(name, ""); 103 | } 104 | 105 | Status RocksCompactServerParameter::set(const BSONElement& newValueElement) { 106 | return setFromString(""); 107 | } 108 | 109 | Status RocksCompactServerParameter::setFromString(const std::string& str) { 110 | _data->getCompactionScheduler()->compactAll(); 111 | return Status::OK(); 112 | } 113 | 114 | namespace { 115 | Status RocksCacheSizeParameterSet(int newNum, const std::string& name, 116 | RocksEngine* engine) { 117 | if (newNum <= 0) { 118 | return Status(ErrorCodes::BadValue, str::stream() << name << " has to be > 0"); 119 | } 120 | log() << "RocksDB: changing block cache size to " << newNum << "GB"; 121 | const long long bytesInGB = 1024 * 1024 * 1024LL; 122 | size_t newSizeInBytes = static_cast<size_t>(newNum * bytesInGB); 123 | engine->getBlockCache()->SetCapacity(newSizeInBytes); 124 | 125 | return Status::OK(); 126 | } 127 | } // namespace 128 | 129 | void RocksCacheSizeParameter::append(OperationContext* opCtx, BSONObjBuilder& b, 130 | const std::string& name) { 131 | const long long bytesInGB = 1024 * 1024 * 1024LL; 132 | long long cacheSizeInGB = _data->getBlockCache()->GetCapacity() / bytesInGB; 133 | b.append(name, cacheSizeInGB); 134 | } 135 | 136 | Status RocksCacheSizeParameter::set(const BSONElement& newValueElement) { 137 | if (!newValueElement.isNumber()) { 138 | return Status(ErrorCodes::BadValue, str::stream() << name() << " has to be a number"); 139 | } 140 | return
RocksCacheSizeParameterSet(newValueElement.numberInt(), name(), _data); 141 | } 142 | 143 | Status RocksCacheSizeParameter::setFromString(const std::string& str) { 144 | int num = 0; 145 | Status status = parseNumberFromString(str, &num); 146 | if (!status.isOK()) return status; 147 | return RocksCacheSizeParameterSet(num, name(), _data); 148 | } 149 | 150 | void RocksOptionsParameter::append(OperationContext* opCtx, BSONObjBuilder& b, 151 | const std::string& name) { 152 | std::string columnOptions; 153 | std::string dbOptions; 154 | std::string fullOptionsStr; 155 | rocksdb::Options fullOptions = _data->getDB()->GetOptions(); 156 | rocksdb::Status s = GetStringFromColumnFamilyOptions(&columnOptions, fullOptions); 157 | if (!s.ok()) { // If we failed, append the error for the user to see. 158 | b.append(name, s.ToString()); 159 | return; 160 | } 161 | 162 | fullOptionsStr.append(columnOptions); 163 | 164 | s = GetStringFromDBOptions(&dbOptions, fullOptions); 165 | if (!s.ok()) { // If we failed, append the error for the user to see. 166 | b.append(name, s.ToString()); 167 | return; 168 | } 169 | 170 | fullOptionsStr.append(dbOptions); 171 | 172 | b.append(name, fullOptionsStr); 173 | } 174 | 175 | Status RocksOptionsParameter::set(const BSONElement& newValueElement) { 176 | // In case the BSON element is not a string, the conversion will fail, 177 | // raising an exception caught by the outer layer.
178 | // Which will generate an error message that looks like this: 179 | // wrong type for field (rocksdbOptions) 3 != 2 180 | return setFromString(newValueElement.String()); 181 | } 182 | 183 | Status RocksOptionsParameter::setFromString(const std::string& str) { 184 | log() << "RocksDB: Attempting to apply settings: " << str; 185 | std::set<std::string> supported_db_options = {"db_write_buffer_size", "delayed_write_rate", 186 | "max_background_jobs", "max_total_wal_size"}; 187 | 188 | std::set<std::string> supported_cf_options = {"max_write_buffer_number", 189 | "disable_auto_compactions", 190 | "level0_slowdown_writes_trigger", 191 | "level0_stop_writes_trigger", 192 | "soft_pending_compaction_bytes_limit", 193 | "hard_pending_compaction_bytes_limit"}; 194 | std::unordered_map<std::string, std::string> optionsMap; 195 | rocksdb::Status s = rocksdb::StringToMap(str, &optionsMap); 196 | if (!s.ok()) { 197 | return Status(ErrorCodes::BadValue, s.ToString()); 198 | } 199 | for (const auto& v : optionsMap) { 200 | if (supported_db_options.find(v.first) != supported_db_options.end()) { 201 | s = _data->getDB()->SetDBOptions({v}); 202 | } else if (supported_cf_options.find(v.first) != supported_cf_options.end()) { 203 | s = _data->getDB()->SetOptions({v}); 204 | } else { 205 | return Status(ErrorCodes::BadValue, str::stream() << "unknown param: " << v.first); 206 | } 207 | if (!s.ok()) { 208 | return Status(ErrorCodes::BadValue, s.ToString()); 209 | } 210 | } 211 | 212 | return Status::OK(); 213 | } 214 | 215 | void RocksdbMaxConflictCheckSizeParameter::append(OperationContext* opCtx, BSONObjBuilder& b, 216 | const std::string& name) { 217 | b << name << rocksGlobalOptions.maxConflictCheckSizeMB; 218 | } 219 | 220 | Status RocksdbMaxConflictCheckSizeParameter::set(const BSONElement& newValueElement) { 221 | return setFromString(newValueElement.toString(false)); 222 | } 223 | 224 | Status RocksdbMaxConflictCheckSizeParameter::setFromString(const std::string& str) { 225 | std::string trimStr = str; 226 | size_t pos =
str.find('.'); 227 | if (pos != std::string::npos) { 228 | trimStr = str.substr(0, pos); 229 | } 230 | int newValue; 231 | Status status = parseNumberFromString(trimStr, &newValue); 232 | if (!status.isOK()) { 233 | return status; 234 | } 235 | rocksGlobalOptions.maxConflictCheckSizeMB = newValue; 236 | _data->getDB()->SetMaxConflictBytes(static_cast<uint64_t>(newValue) * 1024 * 1024); 237 | return Status::OK(); 238 | } 239 | } // namespace mongo 240 | -------------------------------------------------------------------------------- /src/totdb/totransaction_db_impl.h: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2011-present, Facebook, Inc. All rights reserved. 2 | // This source code is licensed under both the GPLv2 (found in the 3 | // COPYING file in the root directory) and Apache 2.0 License 4 | // (found in the LICENSE.Apache file in the root directory). 5 | 6 | #pragma once 7 | #ifndef ROCKSDB_LITE 8 | 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | 19 | #include "rocksdb/db.h" 20 | #include "rocksdb/options.h" 21 | #include "mongo/db/modules/rocks/src/totdb/totransaction_db.h" 22 | #include "mongo/db/modules/rocks/src/totdb/totransaction_impl.h" 23 | 24 | namespace rocksdb { 25 | 26 | class PrepareHeap { 27 | public: 28 | PrepareHeap(); 29 | 30 | ~PrepareHeap() = default; 31 | 32 | std::shared_ptr Find( 33 | const TOTransactionImpl::ActiveTxnNode* core, const TxnKey& key, 34 | TOTransaction::TOTransactionState* state); 35 | 36 | void Insert(const std::shared_ptr& core); 37 | 38 | uint32_t Remove(TOTransactionImpl::ActiveTxnNode* core); 39 | 40 | void Purge(TransactionID oldest_txn_id, RocksTimeStamp oldest_ts); 41 | 42 | private: 43 | friend class PrepareMapIterator; 44 | 45 | std::shared_mutex mutex_; 46 | 47 | using PMAP = 48 | std::map>>; 50 | PMAP map_; 51 | 52 | static const TxnKey sentinal_; 53 | }; 54 | 55 | class TOTransactionDBImpl :
public TOTransactionDB { 56 | public: 57 | TOTransactionDBImpl(DB* db, const TOTransactionDBOptions& txn_db_options, 58 | bool read_only, 59 | const std::string stable_ts_key) 60 | : TOTransactionDB(db), 61 | read_only_(read_only), 62 | txn_db_options_(txn_db_options), 63 | num_stripes_(DEFAULT_NUM_STRIPES), 64 | committed_max_txnid_(0), 65 | current_conflict_bytes_(0), 66 | max_conflict_bytes_(1.1 * txn_db_options.max_conflict_check_bytes_size), 67 | txn_commits_(0), 68 | txn_aborts_(0), 69 | committed_max_ts_(0), 70 | has_commit_ts_(false), 71 | update_max_commit_ts_times_(0), 72 | update_max_commit_ts_retries_(0), 73 | commit_without_ts_times_(0), 74 | read_without_ts_times_(0), 75 | read_with_ts_times_(0), 76 | read_q_walk_times_(0), 77 | read_q_walk_len_sum_(0), 78 | commit_q_walk_times_(0), 79 | commit_q_walk_len_sum_(0), 80 | oldest_ts_(nullptr), 81 | stable_ts_key_(stable_ts_key) { 82 | if (max_conflict_bytes_ == 0) { 83 | // we preserve at least 100MB for conflict check 84 | max_conflict_bytes_ = 100 * 1024 * 1024; 85 | } 86 | active_txns_.clear(); 87 | 88 | // Init default num_stripes 89 | num_stripes_ = (txn_db_options.num_stripes > 0) ? 
txn_db_options.num_stripes 90 | : DEFAULT_NUM_STRIPES; 91 | 92 | uncommitted_keys_.lock_map_stripes_.reserve(num_stripes_); 93 | for (size_t i = 0; i < num_stripes_; i++) { 94 | UnCommittedLockMapStripe* stripe = new UnCommittedLockMapStripe(); 95 | uncommitted_keys_.lock_map_stripes_.push_back(stripe); 96 | } 97 | 98 | committed_keys_.lock_map_stripes_.reserve(num_stripes_); 99 | for (size_t i = 0; i < num_stripes_; i++) { 100 | CommittedLockMapStripe* stripe = new CommittedLockMapStripe(); 101 | committed_keys_.lock_map_stripes_.push_back(stripe); 102 | } 103 | 104 | keys_mutex_.reserve(num_stripes_); 105 | for (size_t i = 0; i < num_stripes_; i++) { 106 | std::mutex* key_mutex = new std::mutex(); 107 | keys_mutex_.push_back(key_mutex); 108 | } 109 | } 110 | 111 | ~TOTransactionDBImpl() { 112 | // Clean resources 113 | clean_job_.StopThread(); 114 | clean_thread_.join(); 115 | 116 | { 117 | for (auto& it : uncommitted_keys_.lock_map_stripes_) { 118 | delete it; 119 | } 120 | uncommitted_keys_.lock_map_stripes_.clear(); 121 | 122 | for (auto& it : committed_keys_.lock_map_stripes_) { 123 | delete it; 124 | } 125 | committed_keys_.lock_map_stripes_.clear(); 126 | 127 | for (auto& it : keys_mutex_) { 128 | delete it; 129 | } 130 | keys_mutex_.clear(); 131 | } 132 | std::lock_guard lock(active_txns_mutex_); 133 | active_txns_.clear(); 134 | } 135 | 136 | void StartBackgroundCleanThread(); 137 | 138 | void SetMaxConflictBytes(uint64_t bytes) override { 139 | max_conflict_bytes_.store(bytes, std::memory_order_relaxed); 140 | } 141 | 142 | virtual TOTransaction* BeginTransaction(const WriteOptions& write_options, 143 | const TOTransactionOptions& txn_options) override; 144 | 145 | using ATN = TOTransactionImpl::ActiveTxnNode; 146 | Status CommitTransaction(const std::shared_ptr& core, 147 | const std::set& written_keys, 148 | const std::set& get_for_updates); 149 | 150 | Status RollbackTransaction(const std::shared_ptr& core, 151 | const std::set& written_keys, 152 | 
const std::set& get_for_updates); 153 | 154 | Status SetTimeStamp(const TimeStampType& ts_type, const RocksTimeStamp& ts, 155 | bool force) override; 156 | 157 | Status QueryTimeStamp(const TimeStampType& ts_type, RocksTimeStamp* timestamp) override; 158 | 159 | Status Stat(TOTransactionStat* stat) override; 160 | 161 | Status CheckWriteConflict(const TxnKey& key, const TransactionID& txn_id, 162 | const RocksTimeStamp& readts); 163 | 164 | Status PrepareTransaction(const std::shared_ptr& core); 165 | 166 | Status SetCommitTimeStamp(const std::shared_ptr& core, 167 | const RocksTimeStamp& timesamp); 168 | 169 | Status SetDurableTimeStamp(const std::shared_ptr& core, 170 | const RocksTimeStamp& timesamp); 171 | 172 | Status AddReadQueue(const std::shared_ptr& core, 173 | const RocksTimeStamp& ts); 174 | 175 | Status SetPrepareTimeStamp(const std::shared_ptr& core, 176 | const RocksTimeStamp& timestamp); 177 | 178 | void AdvanceTS(RocksTimeStamp* maxToCleanTs); 179 | 180 | void CleanCommittedKeys(); 181 | 182 | bool IsReadOnly() const { return read_only_; } 183 | 184 | Status GetConsiderPrepare(const std::shared_ptr& core, 185 | ReadOptions& options, 186 | ColumnFamilyHandle* column_family, const Slice& key, 187 | std::string* value); 188 | 189 | Iterator* NewIteratorConsiderPrepare(const std::shared_ptr& core, 190 | ColumnFamilyHandle* column_family, 191 | Iterator* db_iter); 192 | 193 | std::unique_ptr makeTxn() override; 194 | 195 | // Committed key, first commit txnid, second prepare ts, third commit ts 196 | // TODO: remove prepare ts from KeyModifyHistory 197 | using KeyModifyHistory = 198 | std::tuple; 199 | using TSTXN = std::pair; 200 | 201 | protected: 202 | bool read_only_; 203 | const TOTransactionDBOptions txn_db_options_; 204 | size_t num_stripes_; 205 | TransactionID committed_max_txnid_; 206 | std::atomic current_conflict_bytes_; 207 | std::atomic max_conflict_bytes_; 208 | std::atomic txn_commits_; 209 | std::atomic txn_aborts_; 210 | 211 | class 
BackgroundCleanJob { 212 | std::mutex thread_mutex_; 213 | TransactionID txnid_; 214 | RocksTimeStamp ts_; 215 | 216 | enum ThreadState { 217 | kRunning, 218 | kStopped 219 | }; 220 | 221 | ThreadState thread_state_; 222 | public: 223 | BackgroundCleanJob() 224 | :txnid_(0),ts_(0) { 225 | thread_state_ = kRunning; 226 | } 227 | 228 | ~BackgroundCleanJob() { 229 | } 230 | 231 | Status SetCleanInfo(const TransactionID& txn_id, 232 | const RocksTimeStamp& time_stamp); 233 | 234 | bool IsRunning(); 235 | 236 | bool NeedToClean(TransactionID* txn_id, 237 | RocksTimeStamp* time_stamp); 238 | 239 | void FinishClean(const TransactionID& txn_id, 240 | const RocksTimeStamp& time_stamp); 241 | 242 | void StopThread(); 243 | }; 244 | 245 | private: 246 | Status PublushTimeStamp(const std::shared_ptr& active_txn); 247 | 248 | // Add txn to active txns 249 | Status AddToActiveTxns(const std::shared_ptr& active_txn); 250 | 251 | void RemoveUncommittedKeysOnCleanup(const std::set& written_keys); 252 | 253 | Status TxnAssertAfterReads(const std::shared_ptr& core, const char* op, 254 | const RocksTimeStamp& timestamp); 255 | 256 | // Active txns 257 | std::mutex active_txns_mutex_; 258 | std::map> active_txns_; 259 | 260 | // txns sorted by {commit_ts, txnid} 261 | std::shared_mutex commit_ts_mutex_; 262 | std::map> commit_q_; 263 | 264 | // txns sorted by {read_ts, txnid} 265 | std::shared_mutex read_ts_mutex_; 266 | std::map> read_q_; 267 | 268 | PrepareHeap prepare_heap_; 269 | 270 | struct UnCommittedLockMapStripe { 271 | std::map uncommitted_keys_map_; 272 | }; 273 | 274 | size_t GetStripe(const TxnKey& key) const { 275 | invariant(num_stripes_ > 0); 276 | static std::hash hash; 277 | size_t stripe = hash(key.second) % num_stripes_; 278 | return stripe; 279 | } 280 | // Uncommitted keys 281 | struct UnCommittedKeys { 282 | std::vector lock_map_stripes_; 283 | public: 284 | // Remove key from uncommitted keys 285 | Status RemoveKeyInLock(const TxnKey& key, const size_t& 
stripe_num, 286 | std::atomic* mem_usage); 287 | // Check write conflict and add the key to uncommitted keys 288 | Status CheckKeyAndAddInLock(const TxnKey& key, const TransactionID& txn_id, 289 | const size_t& stripe_num, 290 | uint64_t max_mem_usage, 291 | std::atomic* mem_usage); 292 | 293 | size_t CountInLock() const; 294 | }; 295 | 296 | struct CommittedLockMapStripe { 297 | //std::mutex map_mutex_; 298 | std::map committed_keys_map_; 299 | }; 300 | 301 | struct CommittedKeys { 302 | std::vector lock_map_stripes_; 303 | public: 304 | // Add key to committed keys 305 | Status AddKeyInLock(const TxnKey& key, const TransactionID& commit_txn_id, 306 | const RocksTimeStamp& prepare_ts, 307 | const RocksTimeStamp& commit_ts, 308 | const size_t& stripe_num, 309 | std::atomic* mem_usage); 310 | 311 | // Check write conflict 312 | Status CheckKeyInLock(const TxnKey& key, const TransactionID& txn_id, 313 | const RocksTimeStamp& timestamp, 314 | const size_t& stripe_num); 315 | 316 | size_t CountInLock() const; 317 | }; 318 | 319 | std::vector keys_mutex_; 320 | 321 | UnCommittedKeys uncommitted_keys_; 322 | 323 | CommittedKeys committed_keys_; 324 | 325 | BackgroundCleanJob clean_job_; 326 | 327 | std::thread clean_thread_; 328 | 329 | // NOTE(xxxxxxxx): commit_ts_ is not protected by ts_meta_mutex_ 330 | // remember to publish commit_ts_ before has_commit_ts_ 331 | std::atomic committed_max_ts_; 332 | std::atomic has_commit_ts_; 333 | std::atomic update_max_commit_ts_times_; 334 | std::atomic update_max_commit_ts_retries_; 335 | std::atomic commit_without_ts_times_; 336 | std::atomic read_without_ts_times_; 337 | std::atomic read_with_ts_times_; 338 | std::atomic read_q_walk_times_; 339 | std::atomic read_q_walk_len_sum_; 340 | std::atomic commit_q_walk_times_; 341 | std::atomic commit_q_walk_len_sum_; 342 | 343 | // TODO(xxxxxxxx): use optional<> 344 | std::shared_mutex ts_meta_mutex_; 345 | // protected by ts_meta_mutex_ 346 | std::unique_ptr oldest_ts_; 347 | 348 | 
std::string stable_ts_key_; 349 | }; 350 | 351 | } // namespace rocksdb 352 | #endif // ROCKSDB_LITE 353 | -------------------------------------------------------------------------------- /src/rocks_oplog_manager.cpp: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2017 MongoDB Inc. 3 | * 4 | * This program is free software: you can redistribute it and/or modify 5 | * it under the terms of the GNU Affero General Public License, version 3, 6 | * as published by the Free Software Foundation. 7 | * 8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU Affero General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU Affero General Public License 15 | * along with this program. If not, see . 16 | * 17 | * As a special exception, the copyright holders give permission to link the 18 | * code of portions of this program with the OpenSSL library under certain 19 | * conditions as described in each individual source file and distribute 20 | * linked combinations including the program with the OpenSSL library. You 21 | * must comply with the GNU Affero General Public License in all respects for 22 | * all of the code used other than as permitted herein. If you modify file(s) 23 | * with this exception, you may extend this exception to your version of the 24 | * file(s), but you are not obligated to do so. If you do not wish to do so, 25 | * delete this exception statement from your version. If you delete this 26 | * exception statement from all source files in the program, then also delete 27 | * it in the license file. 
28 | */ 29 | 30 | #define MONGO_LOG_DEFAULT_COMPONENT ::mongo::logger::LogComponent::kStorage 31 | 32 | #include "mongo/platform/basic.h" 33 | 34 | #include 35 | 36 | #include "mongo/platform/mutex.h" 37 | #include "mongo/util/concurrency/idle_thread_block.h" 38 | #include "mongo/util/fail_point.h" 39 | #include "mongo/util/fail_point_service.h" 40 | #include "mongo/util/log.h" 41 | #include "mongo/util/scopeguard.h" 42 | #include "rocks_engine.h" 43 | #include "rocks_oplog_manager.h" 44 | 45 | namespace mongo { 46 | namespace { 47 | // This is the minimum valid timestamp; it can be used for reads that need to see all 48 | // untimestamped data but no timestamped data. We cannot use 0 here because 0 49 | // means see all timestamped data. 50 | const uint64_t kMinimumTimestamp = 1; 51 | } // namespace 52 | 53 | MONGO_FAIL_POINT_DEFINE(RocksPausePrimaryOplogDurabilityLoop); 54 | 55 | RocksOplogManager::RocksOplogManager(rocksdb::TOTransactionDB* db, RocksEngine* kvEngine, 56 | RocksDurabilityManager* durabilityManager) 57 | : _db(db), _kvEngine(kvEngine), _durabilityManager(durabilityManager) {} 58 | 59 | void RocksOplogManager::init(rocksdb::TOTransactionDB* db, RocksDurabilityManager* durabilityManager) { 60 | _db = db; 61 | _durabilityManager = durabilityManager; 62 | } 63 | 64 | void RocksOplogManager::start(OperationContext* opCtx, RocksRecordStore* oplogRecordStore) { 65 | invariant(!_isRunning); 66 | auto reverseOplogCursor = 67 | oplogRecordStore->getCursor(opCtx, false /* false = reverse cursor */); 68 | auto lastRecord = reverseOplogCursor->next(); 69 | if (lastRecord) { 70 | // Although the oplog may have holes, using the top of the oplog should be safe. In the 71 | // event of a secondary crashing, replication recovery will truncate the oplog, 72 | // resetting 73 | // visibility to the truncate point. In the event of a primary crashing, it will perform 74 | // rollback before servicing oplog reads. 
75 | auto oplogVisibility = Timestamp(lastRecord->id.repr()); 76 | setOplogReadTimestamp(oplogVisibility); 77 | LOG(1) << "Setting oplog visibility at startup. Val: " << oplogVisibility; 78 | } else { 79 | // Avoid setting oplog visibility to 0. That means "everything is visible". 80 | setOplogReadTimestamp(Timestamp(kMinimumTimestamp)); 81 | } 82 | 83 | // Need to obtain the mutex before starting the thread, as otherwise it may race ahead 84 | // see _shuttingDown as true and quit prematurely. 85 | stdx::lock_guard lk(_oplogVisibilityStateMutex); 86 | _oplogJournalThread = 87 | stdx::thread(&RocksOplogManager::_oplogJournalThreadLoop, this, oplogRecordStore); 88 | _isRunning = true; 89 | _shuttingDown = false; 90 | } 91 | 92 | void RocksOplogManager::halt() { 93 | { 94 | stdx::lock_guard lk(_oplogVisibilityStateMutex); 95 | invariant(_isRunning); 96 | _shuttingDown = true; 97 | _isRunning = false; 98 | } 99 | 100 | if (_oplogJournalThread.joinable()) { 101 | _opsWaitingForJournalCV.notify_one(); 102 | _oplogJournalThread.join(); 103 | } 104 | } 105 | 106 | void RocksOplogManager::waitForAllEarlierOplogWritesToBeVisible( 107 | const RocksRecordStore* oplogRecordStore, OperationContext* opCtx) { 108 | invariant(opCtx->lockState()->isNoop() || !opCtx->lockState()->inAWriteUnitOfWork()); 109 | 110 | // In order to reliably detect rollback situations, we need to fetch the 111 | // latestVisibleTimestamp 112 | // prior to querying the end of the oplog. 113 | auto currentLatestVisibleTimestamp = getOplogReadTimestamp(); 114 | 115 | // Procedure: issue a read on a reverse cursor (which is not subject to the oplog visibility 116 | // rules), see what is last, and wait for that to become visible. 
117 | std::unique_ptr cursor = 118 | oplogRecordStore->getCursor(opCtx, false /* false = reverse cursor */); 119 | auto lastRecord = cursor->next(); 120 | if (!lastRecord) { 121 | LOG(2) << "Trying to query an empty oplog"; 122 | opCtx->recoveryUnit()->abandonSnapshot(); 123 | return; 124 | } 125 | const auto waitingFor = lastRecord->id; 126 | // Close transaction before we wait. 127 | opCtx->recoveryUnit()->abandonSnapshot(); 128 | 129 | stdx::unique_lock lk(_oplogVisibilityStateMutex); 130 | 131 | // Prevent any scheduled journal flushes from being delayed and blocking this wait 132 | // excessively. 133 | _opsWaitingForVisibility++; 134 | invariant(_opsWaitingForVisibility > 0); 135 | auto exitGuard = makeGuard([&] { _opsWaitingForVisibility--; }); 136 | 137 | opCtx->waitForConditionOrInterrupt(_opsBecameVisibleCV, lk, [&] { 138 | auto newLatestVisibleTimestamp = getOplogReadTimestamp(); 139 | if (newLatestVisibleTimestamp < currentLatestVisibleTimestamp) { 140 | LOG(1) 141 | << "Oplog latest visible timestamp went backwards. newLatestVisibleTimestamp: " 142 | << Timestamp(newLatestVisibleTimestamp) << " currentLatestVisibleTimestamp: " 143 | << Timestamp(currentLatestVisibleTimestamp); 144 | // If the visibility went backwards, this means a rollback occurred. 145 | // Thus, we are finished waiting. 146 | return true; 147 | } 148 | currentLatestVisibleTimestamp = newLatestVisibleTimestamp; 149 | 150 | // currentLatestVisibleTimestamp might be Timestamp "1" if there are no oplog documents 151 | // inserted since the last mongod restart. In this case, we need to simulate what 152 | // timestamp 153 | // the last oplog document had when it was written, which is the _oplogMaxAtStartup 154 | // value. 
155 | RecordId latestVisible = RecordId(currentLatestVisibleTimestamp); 156 | if (latestVisible < waitingFor) { 157 | LOG(2) << "Operation is waiting for " << waitingFor << "; latestVisible is " 158 | << Timestamp(currentLatestVisibleTimestamp); 159 | } 160 | return latestVisible >= waitingFor; 161 | }); 162 | } 163 | 164 | void RocksOplogManager::triggerJournalFlush() { 165 | stdx::lock_guard<stdx::mutex> lk(_oplogVisibilityStateMutex); 166 | if (!_opsWaitingForJournal) { 167 | _opsWaitingForJournal = true; 168 | _opsWaitingForJournalCV.notify_one(); 169 | } 170 | } 171 | 172 | void RocksOplogManager::_oplogJournalThreadLoop(RocksRecordStore* oplogRecordStore) noexcept { 173 | Client::initThread("RocksOplogJournalThread"); 174 | 175 | // This thread updates the oplog read timestamp, the timestamp used to read from the oplog 176 | // with 177 | // forward cursors. The timestamp is used to hide oplog entries that might be committed but 178 | // have uncommitted entries ahead of them. 179 | while (true) { 180 | stdx::unique_lock<stdx::mutex> lk(_oplogVisibilityStateMutex); 181 | { 182 | MONGO_IDLE_THREAD_BLOCK; 183 | _opsWaitingForJournalCV.wait( 184 | lk, [&] { return _shuttingDown || _opsWaitingForJournal; }); 185 | 186 | // If we're not shutting down and nobody is actively waiting for the oplog to become 187 | // durable, delay journaling a bit to reduce the sync rate. 188 | auto journalDelay = 189 | Milliseconds(storageGlobalParams.journalCommitIntervalMs.load()); 190 | auto now = Date_t::now(); 191 | auto deadline = now + journalDelay; 192 | auto shouldSyncOpsWaitingForJournal = [&] { 193 | return _shuttingDown || _opsWaitingForVisibility || 194 | oplogRecordStore->haveCappedWaiters(); 195 | }; 196 | 197 | // Eventually it would be more optimal to merge this with the normal journal 198 | // flushing 199 | // and block for either oplog tailers or operations waiting for oplog visibility.
200 | // For 201 | // now this loop will poll once a millisecond up to the journalDelay to see if we 202 | // have 203 | // any waiters yet. This reduces sync-related I/O on the primary when secondaries 204 | // are 205 | // lagged, but will avoid significant delays in confirming majority writes on 206 | // replica 207 | // sets with infrequent writes. 208 | // Callers of waitForAllEarlierOplogWritesToBeVisible() like causally consistent 209 | // reads 210 | // will preempt this delay. 211 | while (now < deadline && 212 | !_opsWaitingForJournalCV.wait_until(lk, now.toSystemTimePoint(), 213 | shouldSyncOpsWaitingForJournal)) { 214 | now += Milliseconds(1); 215 | } 216 | } 217 | 218 | while (!_shuttingDown && MONGO_FAIL_POINT(RocksPausePrimaryOplogDurabilityLoop)) { 219 | lk.unlock(); 220 | sleepmillis(10); 221 | lk.lock(); 222 | } 223 | 224 | if (_shuttingDown) { 225 | log() << "Oplog journal thread loop shutting down"; 226 | return; 227 | } 228 | invariant(_opsWaitingForJournal); 229 | _opsWaitingForJournal = false; 230 | lk.unlock(); 231 | 232 | const uint64_t newTimestamp = fetchAllDurableValue().asULL(); 233 | 234 | // The newTimestamp may actually go backward during secondary batch application, 235 | // where we commit data file changes separately from oplog changes, so ignore 236 | // a non-incrementing timestamp. 237 | if (newTimestamp <= _oplogReadTimestamp.load()) { 238 | LOG(2) << "No new oplog entries were made visible: " << Timestamp(newTimestamp); 239 | continue; 240 | } 241 | 242 | // In order to avoid oplog holes after an unclean shutdown, we must ensure this proposed 243 | // oplog read timestamp's documents are durable before publishing that timestamp. 244 | _durabilityManager->waitUntilDurable(false); 245 | 246 | lk.lock(); 247 | // Publish the new timestamp value. Avoid going backward. 
248 | auto oldTimestamp = getOplogReadTimestamp(); 249 | if (newTimestamp > oldTimestamp) { 250 | _setOplogReadTimestamp(lk, newTimestamp); 251 | } 252 | lk.unlock(); 253 | 254 | // Wake up any await_data cursors and tell them more data might be visible now. 255 | oplogRecordStore->notifyCappedWaitersIfNeeded(); 256 | } 257 | } 258 | 259 | std::uint64_t RocksOplogManager::getOplogReadTimestamp() const { 260 | return _oplogReadTimestamp.load(); 261 | } 262 | 263 | void RocksOplogManager::setOplogReadTimestamp(Timestamp ts) { 264 | stdx::lock_guard<stdx::mutex> lk(_oplogVisibilityStateMutex); 265 | _setOplogReadTimestamp(lk, ts.asULL()); 266 | } 267 | 268 | void RocksOplogManager::_setOplogReadTimestamp(WithLock, uint64_t newTimestamp) { 269 | _oplogReadTimestamp.store(newTimestamp); 270 | _opsBecameVisibleCV.notify_all(); 271 | LOG(2) << "Setting new oplogReadTimestamp: " << Timestamp(newTimestamp); 272 | } 273 | 274 | Timestamp RocksOplogManager::fetchAllDurableValue() { 275 | rocksdb::RocksTimeStamp ts(0); 276 | // kAllDurable is the same as kAllCommitted 277 | auto status = _db->QueryTimeStamp(rocksdb::TimeStampType::kAllCommitted, &ts); 278 | if (status.IsNotFound()) { 279 | return Timestamp(kMinimumTimestamp); 280 | } else { 281 | invariant(status.ok(), status.ToString()); 282 | } 283 | return Timestamp(ts); 284 | } 285 | 286 | } // namespace mongo 287 | --------------------------------------------------------------------------------
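The visibility handshake in rocks_oplog_manager.cpp (a publisher stores a new read timestamp and notifies; a waiter blocks until its target is visible or visibility moves backwards) can be sketched with the standard library alone. This is a minimal illustration, not the MongoRocks implementation: `VisibilityTracker`, `publish`, `readTimestamp`, and `waitUntilVisible` are hypothetical names, and `std::mutex` / `std::condition_variable` stand in for the `stdx` types and for `OperationContext::waitForConditionOrInterrupt`, so interruption is not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the oplog visibility state; not the MongoRocks class.
class VisibilityTracker {
public:
    // Publish a new read timestamp under the mutex and wake every waiter,
    // mirroring the store-then-notify_all shape of _setOplogReadTimestamp().
    void publish(std::uint64_t ts) {
        std::lock_guard<std::mutex> lk(_mutex);
        _readTimestamp.store(ts);
        _becameVisibleCV.notify_all();
    }

    std::uint64_t readTimestamp() const {
        return _readTimestamp.load();
    }

    // Block until `target` is visible. Returns false if the read timestamp
    // moved backwards while waiting (treated as a rollback), true otherwise.
    bool waitUntilVisible(std::uint64_t target) {
        // Sample the current value first so a later decrease can be detected.
        std::uint64_t lastSeen = readTimestamp();
        std::unique_lock<std::mutex> lk(_mutex);
        bool rolledBack = false;
        _becameVisibleCV.wait(lk, [&] {
            const std::uint64_t current = readTimestamp();
            if (current < lastSeen) {
                rolledBack = true;  // visibility went backwards: stop waiting
                return true;
            }
            lastSeen = current;
            return current >= target;
        });
        return !rolledBack;
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _becameVisibleCV;
    std::atomic<std::uint64_t> _readTimestamp{1};  // assumed non-zero minimum
};
```

In use, one thread would call `publish()` as entries become durable while readers call `waitUntilVisible()` with the timestamp of the last record they observed. Because the predicate is re-evaluated under the mutex, a notification cannot be lost between the check and the wait.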