├── CMakeLists.txt
├── LICENSE
├── README.md
├── datasets
    ├── patterns.7z
    └── texts.7z
├── internal
    ├── definitions.hpp
    ├── huff_string.hpp
    ├── r_index.hpp
    ├── rle_string.hpp
    ├── sparse_hyb_vector.hpp
    ├── sparse_sd_vector.hpp
    ├── succinct_bit_vector.hpp
    └── utils.hpp
├── ri-build.cpp
├── ri-count.cpp
├── ri-locate.cpp
└── ri-space.cpp

/CMakeLists.txt:
--------------------------------------------------------------------------------
1 | cmake_minimum_required(VERSION 2.6)
2 |
3 | # Set a default build type if none was specified
4 | if(NOT CMAKE_BUILD_TYPE)
5 |   message(STATUS "Setting build type to 'Release' as none was specified.")
6 |   set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE)
7 | endif()
8 |
9 | #set( CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/.. )
10 |
11 | project (s-rlbwt)
12 |
13 | include_directories(${PROJECT_SOURCE_DIR})
14 | include_directories(${PROJECT_SOURCE_DIR}/internal)
15 | include_directories(~/include) #SDSL headers are here
16 |
17 | LINK_DIRECTORIES(~/lib) #SDSL lib are here
18 |
19 | message("Building in ${CMAKE_BUILD_TYPE} mode")
20 |
21 | set(CMAKE_CXX_FLAGS "--std=c++11")
22 |
23 | set(CMAKE_CXX_FLAGS_DEBUG "-O0 -ggdb -g")
24 | set(CMAKE_CXX_FLAGS_RELEASE "-g -ggdb -Ofast -fstrict-aliasing -DNDEBUG -march=native")
25 | set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "-g -ggdb -Ofast -fstrict-aliasing -march=native")
26 |
27 | add_executable(ri-build ri-build.cpp)
28 | TARGET_LINK_LIBRARIES(ri-build sdsl)
29 | TARGET_LINK_LIBRARIES(ri-build divsufsort)
30 | TARGET_LINK_LIBRARIES(ri-build divsufsort64)
31 |
32 | add_executable(ri-locate ri-locate.cpp)
33 | TARGET_LINK_LIBRARIES(ri-locate sdsl)
34 | TARGET_LINK_LIBRARIES(ri-locate divsufsort)
35 | TARGET_LINK_LIBRARIES(ri-locate divsufsort64)
36 |
37 | add_executable(ri-count ri-count.cpp)
38 | TARGET_LINK_LIBRARIES(ri-count sdsl)
39 | TARGET_LINK_LIBRARIES(ri-count divsufsort)
40 | TARGET_LINK_LIBRARIES(ri-count divsufsort64)
41 |
42 | #add_executable(ri-space ri-space.cpp)
43 | #TARGET_LINK_LIBRARIES(ri-space sdsl)
44 | #TARGET_LINK_LIBRARIES(ri-space divsufsort)
45 | #TARGET_LINK_LIBRARIES(ri-space divsufsort64)
46 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Nicola Prezza nicola.prezza@gmail.com
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | r-index: the run-length BWT index
2 | ===============
3 | Author: Nicola Prezza (nicola.prezza@gmail.com)
4 | Joint work with Travis Gagie and Gonzalo Navarro
5 |
6 | cite as:
7 |
8 | Gagie T, Navarro G, Prezza N. Optimal-time text indexing in BWT-runs bounded space.
In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018.
9 |
10 | Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. J. ACM 67, 1, Article 2 (April 2020)
11 |
12 | ### Brief description
13 |
14 | The r-index is the first full-text index of size O(r), r being the number of BWT runs of the input text (of size n), that supports fast (almost optimal) locate queries for pattern occurrences. The r-index employs a novel suffix array sampling of size 2r; in classical FM-indexes, this sampling would result in a locate time of Omega(n/r) per occurrence. The r-index, on the other hand, reduces this time to O(log(n/r)).
15 |
16 | Let s be the alphabet size and fix a constant eps>0. The r-index offers the following tradeoffs:
17 |
18 | - Space: r * ( log s + (1+eps)log(n/r) + 2log n ) bits
19 | - Count time: O( (m/eps) * (log (n/r) + log s) )
20 | - Locate time: After count, O( log(n/r) ) time per occurrence
21 |
22 | On very repetitive datasets, the r-index locates orders of magnitude faster than the RLCSA (with a sampling rate resulting in the same size for the two indexes).
23 |
24 | NEWS: refactored locate strategy. Let (l,r) be the SA range. Now, the index first finds SA[r] and then applies function Phi to locate SA[r-1], SA[r-2], ..., SA[l]. This is both faster and more space efficient than the strategy originally implemented and described in the paper.
25 |
26 | ### Download
27 |
28 | To clone the repository, run:
29 |
30 | > git clone http://github.com/nicolaprezza/r-index
31 |
32 | ### Compile
33 |
34 | The library has been tested under Linux using gcc 6.2.0. You need the SDSL library installed on your system (https://github.com/simongog/sdsl-lite).
35 |
36 | We use cmake to generate the Makefile. Create a build folder in the main r-index folder:
37 |
38 | > mkdir build
39 |
40 | run cmake:
41 |
42 | > cd build; cmake ..
43 |
44 | and compile:
45 |
46 | > make
47 |
48 | ### Run
49 |
50 | After compiling, run
51 |
52 | > ri-build input
53 |
54 | This command will create the r-index of the text file "input" and will store it as "input.ri". Use option -o to specify a different basename for the index file.
55 |
56 | Run
57 |
58 | > ri-count index.ri patterns
59 |
60 | to count the number of occurrences of the patterns, where "patterns" is a file containing the patterns in Pizza&Chili format (http://pizzachili.dcc.uchile.cl/experiments.html). To generate pattern files, use the tool http://pizzachili.dcc.uchile.cl/utils/genpatterns.c. Note: the Pizza&Chili format consists of a header, followed by a newline, followed by all the patterns concatenated (without any separator). The patterns must all be of the same length.
61 |
62 | Run
63 |
64 | > ri-locate index.ri patterns
65 |
66 | to locate all occurrences of the patterns.
67 |
68 | Be aware that the above executables are just benchmarking tools: no output is generated (pattern occurrences are deleted after being extracted and not printed to output).
69 |
70 | ### Funding
71 |
72 | Nicola Prezza has been supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n. RBSI146R5L, PI: Giovanna Rosone.
Link: http://pages.di.unipi.it/rosone/CMACBioSeq.html
73 |
--------------------------------------------------------------------------------
/datasets/patterns.7z:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolaprezza/r-index/7009b5374282aa1d8e9b594d0a06c49a0bc1330f/datasets/patterns.7z
--------------------------------------------------------------------------------
/datasets/texts.7z:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nicolaprezza/r-index/7009b5374282aa1d8e9b594d0a06c49a0bc1330f/datasets/texts.7z
--------------------------------------------------------------------------------
/internal/definitions.hpp:
--------------------------------------------------------------------------------
1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved.
2 | // Use of this source code is governed
3 | // by a MIT license that can be found in the LICENSE file.
4 |
5 | /*
6 |  * definitions.hpp
7 |  *
8 |  *  Created on: Apr 13, 2017
9 |  *      Author: nico
10 |  */
11 |
12 | #ifndef INCLUDE_DEFINITIONS_HPP_
13 | #define INCLUDE_DEFINITIONS_HPP_
14 |
15 | #include
16 | #include
17 | #include
18 | #include
19 | #include
20 | #include
21 | #include
22 | #include
23 | #include
24 | #include
25 | #include
26 | #include "stdint.h"
27 | #include
28 | #include
29 | #include
30 | #include
31 | #include
32 | #include
33 | #include
34 |
35 | using namespace std;
36 |
37 | namespace ri{
38 |
39 | typedef uint64_t ulint;
40 | typedef long int lint;
41 | typedef unsigned int uint;
42 | typedef unsigned short int t_char; ///< Type for char conversion
43 | typedef unsigned short int t_errors; ///< Type for ERRORS
44 |
45 | typedef unsigned char uchar;
46 | typedef unsigned char symbol;
47 | typedef unsigned char uint8;
48 |
49 | typedef pair<ulint,ulint> range_t;
50 |
51 | }
52 |
53 | #endif /* INCLUDE_DEFINITIONS_HPP_ */
54 |
--------------------------------------------------------------------------------
/internal/huff_string.hpp:
--------------------------------------------------------------------------------
1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved.
2 | // Use of this source code is governed
3 | // by a MIT license that can be found in the LICENSE file.
4 |
5 | /*
6 |  * huff_string.hpp
7 |  *
8 |  *  Created on: May 18, 2015
9 |  *      Author: nicola
10 |  *
11 |  * Huffman-compressed string with access/rank/select.
The class is a wrapper on sdsl::wt_huff, with a simpler constructor 12 | */ 13 | 14 | #ifndef HUFF_STRING_HPP_ 15 | #define HUFF_STRING_HPP_ 16 | 17 | #include 18 | 19 | using namespace sdsl; 20 | using namespace std; 21 | 22 | namespace ri{ 23 | 24 | class huff_string{ 25 | 26 | public: 27 | 28 | huff_string(){} 29 | 30 | huff_string(string &s){ 31 | 32 | s.push_back(0); 33 | construct_im(wt, s.c_str(), 1); 34 | 35 | assert(wt.size()==s.size()-1); 36 | 37 | } 38 | 39 | uchar operator[](ulint i){ 40 | 41 | assert(i wt; 87 | 88 | wt_huff<> wt; 89 | 90 | }; 91 | 92 | } 93 | 94 | #endif /* HUFF_STRING_HPP_ */ 95 | -------------------------------------------------------------------------------- /internal/r_index.hpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 4 | 5 | /* 6 | * r_index.hpp 7 | * 8 | * Created on: Apr 13, 2017 9 | * Author: nico 10 | * 11 | * Small version of the r-index: O(r) words of space, O(log(n/r)) locate time per occurrence 12 | * 13 | */ 14 | 15 | #ifndef R_INDEX_S_H_ 16 | #define R_INDEX_S_H_ 17 | 18 | #include 19 | #include 20 | #include "sparse_sd_vector.hpp" 21 | #include "sparse_hyb_vector.hpp" 22 | #include "utils.hpp" 23 | 24 | using namespace sdsl; 25 | 26 | namespace ri{ 27 | 28 | template < class sparse_bv_type = sparse_sd_vector, 29 | class rle_string_t = rle_string_sd 30 | > 31 | class r_index{ 32 | 33 | public: 34 | 35 | using triple = std::tuple; 36 | 37 | r_index(){} 38 | 39 | /* 40 | * Build index 41 | */ 42 | r_index(string &input, bool sais = true){ 43 | 44 | this->sais = sais; 45 | 46 | if(contains_reserved_chars(input)){ 47 | 48 | cout << "Error: input string contains one of the reserved characters 0x0, 0x1" << endl; 49 | exit(1); 50 | 51 | } 52 | 53 | cout << "Text length = " << input.size() << endl << endl; 54 | 55 | cout 
<< "(1/3) Building BWT and computing SA samples"; 56 | if(sais) cout << " (SE-SAIS) ... " << flush; 57 | else cout << "(DIVSUFSORT) ... " << flush; 58 | 59 | //build run-length encoded BWT 60 | 61 | auto bwt_and_samples = sufsort(input); 62 | 63 | string& bwt_s = get<0>(bwt_and_samples); 64 | vector >& samples_first_vec = get<1>(bwt_and_samples); 65 | vector& samples_last_vec = get<2>(bwt_and_samples); 66 | 67 | cout << "done.\n(2/3) RLE encoding BWT ... " << flush; 68 | 69 | bwt = rle_string_t(bwt_s); 70 | 71 | //build F column 72 | F = vector(256,0); 73 | for(uchar c : bwt_s) 74 | F[c]++; 75 | 76 | for(ulint i=255;i>0;--i) 77 | F[i] = F[i-1]; 78 | 79 | F[0] = 0; 80 | 81 | for(ulint i=1;i<256;++i) 82 | F[i] += F[i-1]; 83 | 84 | for(ulint i=0;i(bwt_s.size(),false); 114 | 115 | for(auto p : samples_first_vec){ 116 | 117 | assert(p.first < pred_bv.size()); 118 | pred_bv[p.first] = true; 119 | 120 | } 121 | 122 | pred = sparse_bv_type(pred_bv); 123 | 124 | } 125 | 126 | assert(pred.rank(pred.size()) == r); 127 | 128 | //last text position must be sampled 129 | assert(pred[pred.size()-1]); 130 | 131 | samples_last = int_vector<>(r,0,log_n); //text positions corresponding to last characters in BWT runs, in BWT order 132 | pred_to_run = int_vector<>(r,0,log_r); //stores the BWT run (0...R-1) corresponding to each position in pred, in text order 133 | 134 | for(ulint i=0;i=F[c+1]) 175 | return {1,0}; 176 | 177 | //number of c before the interval 178 | ulint c_before = bwt.rank(rn.first,c); 179 | 180 | //number of c inside the interval rn 181 | ulint c_inside = bwt.rank(rn.second+1,c) - c_before; 182 | 183 | //if there are no c in the interval, return empty range 184 | if(c_inside==0) return {1,0}; 185 | 186 | ulint l = F[c] + c_before; 187 | 188 | return {l,l+c_inside-1}; 189 | 190 | } 191 | 192 | /* 193 | * Phi function. 
Phi(SA[0]) is undefined 194 | */ 195 | ulint Phi(ulint i){ 196 | 197 | assert(i != bwt.size()-1); 198 | 199 | //jr is the rank of the predecessor of i (circular) 200 | ulint jr = pred.predecessor_rank_circular(i); 201 | 202 | assert(jr<=r-1); 203 | 204 | //the actual predecessor 205 | ulint j = pred.select(jr); 206 | 207 | assert(jr0); 214 | 215 | //sample at the end of previous run 216 | assert(pred_to_run[jr]-1 < samples_last.size()); 217 | ulint prev_sample = samples_last[ pred_to_run[jr]-1 ]; 218 | 219 | return (prev_sample + delta) % bwt.size(); 220 | 221 | } 222 | 223 | //backward navigation of the BWT 224 | ulint LF(ulint i){ 225 | 226 | auto c = bwt[i]; 227 | return F[c] + bwt.rank(i,c); 228 | 229 | } 230 | 231 | //forward navigation of the BWT 232 | ulint FL(ulint i){ 233 | 234 | //i-th character in first BWT column 235 | auto c = F_at(i); 236 | 237 | //this c is the j-th (counting from 0) 238 | ulint j = i - F[c]; 239 | 240 | return bwt.select(j,uchar(c)); 241 | 242 | } 243 | 244 | //forward navigation of the BWT, where for efficiency we give c=F[i] as input 245 | ulint FL(ulint i, uchar c){ 246 | 247 | //i-th character in first BWT column 248 | assert(c == F_at(i)); 249 | 250 | //this c is the j-th (counting from 0) 251 | ulint j = i - F[c]; 252 | 253 | return bwt.select(j,uchar(c)); 254 | 255 | } 256 | 257 | /* 258 | * access column F at position i 259 | */ 260 | uchar F_at(ulint i){ 261 | 262 | ulint c = (upper_bound(F.begin(),F.end(),i) - F.begin()) - 1; 263 | assert(c<256); 264 | assert(i>=F[c]); 265 | 266 | return uchar(c); 267 | 268 | } 269 | 270 | /* 271 | * Return BWT range of character c 272 | */ 273 | range_t get_char_range(uchar c){ 274 | 275 | //if character does not appear in the text, return empty pair 276 | if((c==255 and F[c]==bwt_size()) || F[c]>=F[c+1]) 277 | return {1,0}; 278 | 279 | ulint l = F[c]; 280 | ulint r = bwt_size()-1; 281 | 282 | if(c<255) 283 | r = F[c+1]-1; 284 | 285 | return {l,r}; 286 | 287 | } 288 | 289 | /* 290 | * 
Return BWT range of pattern P 291 | */ 292 | range_t count(string &P){ 293 | 294 | auto range = full_range(); 295 | ulint m = P.size(); 296 | 297 | for(ulint i=0;i=range.first;++i) 298 | range = LF(range,P[m-i-1]); 299 | 300 | return range; 301 | 302 | } 303 | 304 | /* 305 | * Return number of occurrences of P in the text 306 | */ 307 | ulint occ(string &P){ 308 | 309 | auto rn = count(P); 310 | 311 | return rn.second>=rn.first ? (rn.second-rn.first)+1 : 0; 312 | 313 | } 314 | 315 | /* 316 | * iterator locate(string &P){ 317 | * 318 | * return iterator to iterate over all occurrences without storing them 319 | * in memory 320 | * 321 | * } 322 | */ 323 | 324 | /* 325 | * locate all occurrences of P and return them in an array 326 | * (space consuming if result is big). 327 | */ 328 | vector locate_all(string& P){ 329 | 330 | vector OCC; 331 | 332 | pair res = count_and_get_occ(P); 333 | 334 | ulint L = std::get<0>(res).first; 335 | ulint R = std::get<0>(res).second; 336 | ulint k = std::get<1>(res); //SA[R] 337 | 338 | ulint n_occ = R>=L ? 
(R-L)+1 : 0; 339 | 340 | if(n_occ>0){ 341 | 342 | OCC.push_back(k); 343 | 344 | for(ulint i=1;i0); 387 | assert(bwt.size()>0); 388 | 389 | out.write((char*)&terminator_position,sizeof(terminator_position)); 390 | out.write((char*)F.data(),256*sizeof(ulint)); 391 | 392 | w_bytes += sizeof(terminator_position) + 256*sizeof(ulint); 393 | 394 | w_bytes += bwt.serialize(out); 395 | 396 | w_bytes += pred.serialize(out); 397 | w_bytes += samples_last.serialize(out); 398 | w_bytes += pred_to_run.serialize(out); 399 | 400 | return w_bytes; 401 | 402 | } 403 | 404 | /* load the structure from the istream 405 | * \param in the istream 406 | */ 407 | void load(std::istream& in) { 408 | 409 | in.read((char*)&terminator_position,sizeof(terminator_position)); 410 | 411 | F = vector(256); 412 | in.read((char*)F.data(),256*sizeof(ulint)); 413 | 414 | bwt.load(in); 415 | 416 | r = bwt.number_of_runs(); 417 | 418 | pred.load(in); 419 | samples_last.load(in); 420 | pred_to_run.load(in); 421 | 422 | } 423 | 424 | /* 425 | * save the structure to the path specified. 426 | * \param path_prefix prefix of the index files. suffix ".ri" will be automatically added 427 | */ 428 | void save_to_file(string path_prefix){ 429 | 430 | string path = string(path_prefix).append(".ri"); 431 | 432 | std::ofstream out(path); 433 | serialize(out); 434 | out.close(); 435 | 436 | } 437 | 438 | /* 439 | * load the structure from the path specified. 440 | * \param path: full file name 441 | */ 442 | void load_from_file(string path){ 443 | 444 | std::ifstream in(path); 445 | load(in); 446 | in.close(); 447 | 448 | } 449 | 450 | ulint text_size(){ 451 | return bwt.size()-1; 452 | } 453 | 454 | ulint bwt_size(){ 455 | return bwt.size(); 456 | } 457 | 458 | uchar get_terminator(){ 459 | return TERMINATOR; 460 | } 461 | 462 | ulint print_space(){ 463 | 464 | cout << "Number of runs = " << bwt.number_of_runs() << endl<, SA[r] >, where l,r are the inclusive ranges of the pattern P. 
If P does not occur, then l>r 478 | * 479 | * returns 480 | * 481 | */ 482 | pair count_and_get_occ(string &P){ 483 | 484 | //k = SA[r] 485 | ulint k = 0; 486 | 487 | range_t range = full_range(); 488 | assert(r-1 < samples_last.size()); 489 | k = (samples_last[r-1]+1) % bwt.size(); 490 | 491 | range_t range1; 492 | 493 | ulint m = P.size(); 494 | 495 | for(ulint i=0;i=range.first;++i){ 496 | 497 | uchar c = P[m-i-1]; 498 | 499 | range1 = LF(range,c); 500 | 501 | //if suffix can be left-extended with char 502 | if(range1.first <= range1.second){ 503 | 504 | 505 | if(bwt[range.second] == c){ 506 | 507 | // last c is at the end of range. Then, we have this sample by induction! 508 | assert(k>0); 509 | k--; 510 | 511 | }else{ 512 | 513 | //find last c in range (there must be one because range1 is not empty) 514 | //and get its sample (must be sampled because it is at the end of a run) 515 | //note: by previous check, bwt[range.second] != c, so we can use argument range.second 516 | ulint rnk = bwt.rank(range.second,c); 517 | 518 | //there must be at least one c before range.second 519 | assert(rnk>0); 520 | 521 | //this is the rank of the last c 522 | rnk--; 523 | 524 | //jump to the corresponding BWT position 525 | ulint j = bwt.select(rnk,c); 526 | 527 | //the c must be in the range 528 | assert(j>=range.first and j < range.second); 529 | 530 | //run of position j 531 | ulint run_of_j = bwt.run_of_position(j); 532 | 533 | k = samples_last[run_of_j]; 534 | 535 | } 536 | 537 | } 538 | 539 | range = range1; 540 | 541 | } 542 | 543 | return {range, k}; 544 | 545 | } 546 | 547 | /* 548 | * returns a triple containing BWT of input string 549 | * (uses 0x1 character as terminator), text positions corresponding 550 | * to first letters in BWT runs (plus their ranks from 0 to R-1), and text positions corresponding 551 | * to last letters in BWT runs (in BWT order) 552 | */ 553 | tuple >, vector > sufsort(string &s){ 554 | 555 | string bwt_s; 556 | 557 | cache_config cc; 558 
| 559 | int_vector<8> text(s.size()); 560 | assert(text.size()==s.size()); 561 | 562 | for(ulint i=0;i(cc); 573 | 574 | //now build BWT from SA 575 | int_vector_buffer<> sa(cache_file_name(conf::KEY_SA, cc)); 576 | 577 | vector > samples_first; //text positions corresponding to first characters in BWT runs, and their ranks 0...R-1 578 | vector samples_last; //text positions corresponding to last characters in BWT runs 579 | 580 | { 581 | 582 | for (ulint i=0; i 0 ) 588 | bwt_s.push_back((uchar)text[x-1]); 589 | else 590 | bwt_s.push_back(TERMINATOR); 591 | 592 | //Insert samples at begin of runs 593 | if(i>0){ 594 | 595 | if( i==1 || //case 1: i-1 == 0 is at run begin 596 | (i>1 && bwt_s[i-1] != bwt_s[i-2]) //case 2: i-1 is at the begin of a run 597 | ){ 598 | 599 | samples_first.push_back( {sa[i-1]>0?sa[i-1]-1:sa.size()-1, samples_first.size()} ); 600 | 601 | } 602 | 603 | //check last BWT letter 604 | if(i==sa.size()-1 && bwt_s[i]!=bwt_s[i-1]) samples_first.push_back( {sa[i]>0?sa[i]-1:sa.size()-1, samples_first.size()} ); 605 | 606 | } 607 | 608 | //Insert samples at end of runs 609 | if(i>0){ 610 | 611 | if( bwt_s[i-1] != bwt_s[i] //i-1 is at the end of a run 612 | ){ 613 | 614 | samples_last.push_back( sa[i-1]>0?sa[i-1]-1:sa.size()-1 ); 615 | 616 | } 617 | 618 | //last BWT letter is always at end of a run and is never checked in the previous if 619 | if(i==sa.size()-1) samples_last.push_back( sa[i]>0?sa[i]-1:sa.size()-1 ); 620 | 621 | } 622 | 623 | } 624 | 625 | } 626 | 627 | assert(samples_first.size() == samples_last.size()); 628 | 629 | sdsl::remove(cache_file_name(conf::KEY_TEXT, cc)); 630 | sdsl::remove(cache_file_name(conf::KEY_SA, cc)); 631 | 632 | return tuple >, vector >(bwt_s, samples_first, samples_last); 633 | 634 | } 635 | 636 | static bool contains_reserved_chars(string &s){ 637 | 638 | for(auto c : s) 639 | if(c == 0 or c == 1) 640 | return true; 641 | 642 | return false; 643 | 644 | } 645 | 646 | static const uchar TERMINATOR = 1; 647 | 648 | 
bool sais = true; 649 | 650 | /* 651 | * sparse RLBWT: r (log sigma + (1+epsilon) * log (n/r)) (1+o(1)) bits 652 | */ 653 | 654 | //F column of the BWT (vector of 256 elements) 655 | vector F; 656 | //L column of the BWT, run-length compressed 657 | rle_string_t bwt; 658 | ulint terminator_position = 0; 659 | ulint r = 0;//number of BWT runs 660 | 661 | 662 | //the predecessor structure on positions corresponding to first chars in BWT runs 663 | sparse_bv_type pred; 664 | int_vector<> samples_last; //text positions corresponding to last characters in BWT runs, in BWT order 665 | int_vector<> pred_to_run; //stores the BWT run (0...R-1) corresponding to each position in pred, in text order 666 | 667 | }; 668 | 669 | } 670 | 671 | #endif /* R_INDEX_S_H_ */ 672 | -------------------------------------------------------------------------------- /internal/rle_string.hpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 4 | 5 | /* 6 | * rle_string.hpp 7 | * 8 | * Created on: May 18, 2015 9 | * Author: nicola 10 | * 11 | * A run-length encoded string with rank/access functionalities. 12 | * 13 | * 14 | * space of the structure: R * (H0 + log(n/R) + log(n/R)/B ) (1+o(1)) bits, n being text length, 15 | * R number of runs, B block length, and H0 zero-order entropy of the run heads. 16 | * 17 | * Time for all operations: O( B*(log(n/R)+H0) ) 18 | * 19 | * From the paper 20 | * 21 | * Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza and Mathieu Raffinot. 22 | * Flexible Indexing of Repetitive Collections. 
Computability in Europe (CiE) 2017) 23 | * 24 | */ 25 | 26 | #ifndef RLE_STRING_HPP_ 27 | #define RLE_STRING_HPP_ 28 | 29 | #include "definitions.hpp" 30 | #include "huff_string.hpp" 31 | #include "sparse_sd_vector.hpp" 32 | #include "sparse_hyb_vector.hpp" 33 | 34 | namespace ri{ 35 | 36 | template< 37 | class sparse_bitvector_t = sparse_sd_vector, //predecessor structure storing run length 38 | class string_t = huff_string //run heads 39 | > 40 | class rle_string{ 41 | 42 | public: 43 | 44 | rle_string(){} 45 | 46 | /* 47 | * constructor: build structure on the input string 48 | * \param input the input string without 0x0 bytes in it. 49 | * \param B block size. The main sparse bitvector has R/B bits set (R being number of runs) 50 | * 51 | */ 52 | rle_string(string &input, ulint B = 2){ 53 | 54 | assert(not contains0(input)); 55 | 56 | this->B = B; 57 | n = input.size(); 58 | R = 0; 59 | 60 | auto runs_per_letter_bv = vector >(256); 61 | 62 | //runs in main bitvector 63 | vector runs_bv; 64 | string run_heads_s; 65 | 66 | uchar last_c = input[0]; 67 | 68 | for(ulint i=1;i 512 | pair run_of(ulint i){ 513 | 514 | ulint last_block = runs.rank(i); 515 | ulint current_run = last_block*B; 516 | 517 | //current position in the string: the first of a block 518 | ulint pos = 0; 519 | if(last_block>0) 520 | pos = runs.select(last_block-1)+1; 521 | 522 | assert(pos <= i); 523 | 524 | while(pos < i){ 525 | 526 | pos += run_at(current_run); 527 | current_run++; 528 | 529 | } 530 | 531 | assert(pos >= i); 532 | 533 | if(pos>i){ 534 | 535 | current_run--; 536 | 537 | }else{//pos==i 538 | 539 | pos += run_at(current_run); 540 | 541 | } 542 | 543 | assert(pos>0); 544 | assert(current_run runs_per_letter; 566 | 567 | //store run heads in a compressed string supporting access/rank 568 | string_t run_heads; 569 | 570 | //text length and number of runs 571 | ulint n=0; 572 | ulint R=0; 573 | 574 | }; 575 | 576 | typedef rle_string rle_string_sd; 577 | typedef rle_string 
rle_string_hyb; 578 | 579 | } 580 | 581 | #endif /* RLE_STRING_HPP_ */ 582 | -------------------------------------------------------------------------------- /internal/sparse_hyb_vector.hpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 4 | 5 | /* 6 | * sparse_hyb_vector: a wrapper on hyb_vector<> of the sdsl library, with support for rank/select1 7 | */ 8 | 9 | //============================================================================ 10 | 11 | 12 | #ifndef INTERNAL_SPARSE_HYB_VECTOR_HPP_ 13 | #define INTERNAL_SPARSE_HYB_VECTOR_HPP_ 14 | 15 | #include 16 | 17 | using namespace std; 18 | using namespace sdsl; 19 | 20 | #ifndef ulint 21 | typedef uint64_t ulint; 22 | #endif 23 | 24 | #ifndef uint 25 | typedef uint32_t uint; 26 | #endif 27 | 28 | namespace ri{ 29 | 30 | class sparse_hyb_vector{ 31 | 32 | public: 33 | 34 | /* 35 | * empty constructor. Initialize bitvector with length 0. 36 | */ 37 | sparse_hyb_vector(){} 38 | 39 | /* 40 | * constructor. build bitvector given a vector of bools 41 | */ 42 | sparse_hyb_vector(vector &b){ 43 | 44 | if(b.size()==0) return; 45 | 46 | u = b.size(); 47 | 48 | bit_vector bv(b.size()); 49 | 50 | for(uint64_t i=0;i(bv); 54 | rank1 = hyb_vector<>::rank_1_type(&sdv); 55 | select1 = hyb_vector<>::select_1_type(&sdv); 56 | 57 | } 58 | 59 | /* 60 | * constructor. 
build bitvector given a bit_vector 61 | */ 62 | sparse_hyb_vector(bit_vector &bv){ 63 | 64 | sdv = hyb_vector<>(bv); 65 | rank1 = hyb_vector<>::rank_1_type(&sdv); 66 | select1 = hyb_vector<>::select_1_type(&sdv); 67 | 68 | } 69 | 70 | sparse_hyb_vector & operator= (const sparse_hyb_vector & other) { 71 | 72 | u = other.sdv.size(); 73 | sdv = hyb_vector<>(other.sdv); 74 | rank1 = hyb_vector<>::rank_1_type(&sdv); 75 | select1 = hyb_vector<>::select_1_type(&sdv); 76 | 77 | return *this; 78 | } 79 | 80 | /* 81 | * not implemented 82 | * argument: a boolean b 83 | * behavior: append b at the end of the bitvector. 84 | */ 85 | //void push_back(bool b){} 86 | 87 | /* 88 | * argument: position i in the bitvector 89 | * returns: bit in position i 90 | * only access! the bitvector is static. 91 | */ 92 | bool operator[](ulint i){ 93 | 94 | assert(i0); 125 | 126 | return rank(i)-1; 127 | 128 | } 129 | 130 | /* 131 | * input: position 0<=i<=n 132 | * output: predecessor of i (i excluded) in 133 | * bitvector space 134 | */ 135 | ulint predecessor(ulint i){ 136 | 137 | /* 138 | * i must have a predecessor 139 | */ 140 | assert(rank(i)>0); 141 | 142 | return select(rank(i)-1); 143 | 144 | 145 | } 146 | 147 | /* 148 | * retrieve length of the i-th gap (i>=0). gap length includes the leading 1 149 | * \param i0 164 | * returns: position of the i-th one in the bitvector. i starts from 0! 
165 | */ 166 | ulint select(ulint i){ 167 | 168 | assert(i, i starts from 1 170 | 171 | } 172 | 173 | /* 174 | * returns: size of the bitvector 175 | */ 176 | ulint size(){return u;} 177 | 178 | /* 179 | * returns: number of 1s in the bitvector 180 | */ 181 | ulint number_of_1(){return rank1(size()); } 182 | 183 | /* serialize the structure to the ostream 184 | * \param out the ostream 185 | */ 186 | ulint serialize(std::ostream& out){ 187 | 188 | ulint w_bytes = 0; 189 | 190 | out.write((char*)&u, sizeof(u)); 191 | 192 | w_bytes += sizeof(u); 193 | 194 | if(u==0) return w_bytes; 195 | 196 | w_bytes += sdv.serialize(out); 197 | 198 | return w_bytes; 199 | 200 | } 201 | 202 | /* load the structure from the istream 203 | * \param in the istream 204 | */ 205 | void load(std::istream& in) { 206 | 207 | in.read((char*)&u, sizeof(u)); 208 | 209 | if(u==0) return; 210 | 211 | sdv.load(in); 212 | rank1 = hyb_vector<>::rank_1_type(&sdv); 213 | select1 = hyb_vector<>::select_1_type(&sdv); 214 | 215 | } 216 | 217 | private: 218 | 219 | //bitvector length 220 | ulint u = 0; 221 | 222 | hyb_vector<> sdv; 223 | hyb_vector<>::rank_1_type rank1; 224 | hyb_vector<>::select_1_type select1; 225 | 226 | }; 227 | 228 | } 229 | 230 | 231 | #endif /* INTERNAL_SPARSE_HYB_VECTOR_HPP_ */ 232 | -------------------------------------------------------------------------------- /internal/sparse_sd_vector.hpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 
4 | 5 | /* 6 | * sparse_sd_vector: a wrapper on sd_vector of the sdsl library, with support for rank/select1 7 | */ 8 | 9 | //============================================================================ 10 | 11 | 12 | #ifndef INTERNAL_SPARSE_SD_VECTOR_HPP_ 13 | #define INTERNAL_SPARSE_SD_VECTOR_HPP_ 14 | 15 | #include 16 | 17 | using namespace std; 18 | using namespace sdsl; 19 | 20 | #ifndef ulint 21 | typedef uint64_t ulint; 22 | #endif 23 | 24 | #ifndef uint 25 | typedef uint32_t uint; 26 | #endif 27 | 28 | namespace ri{ 29 | 30 | class sparse_sd_vector{ 31 | 32 | public: 33 | 34 | /* 35 | * empty constructor. Initialize bitvector with length 0. 36 | */ 37 | sparse_sd_vector(){} 38 | 39 | /* 40 | * constructor. build bitvector given a vector of bools 41 | */ 42 | sparse_sd_vector(vector &b){ 43 | 44 | if(b.size()==0) return; 45 | 46 | u = b.size(); 47 | 48 | bit_vector bv(b.size()); 49 | 50 | for(uint64_t i=0;i(bv); 54 | rank1 = sd_vector<>::rank_1_type(&sdv); 55 | select1 = sd_vector<>::select_1_type(&sdv); 56 | 57 | } 58 | 59 | /* 60 | * constructor. build bitvector given a bit_vector 61 | */ 62 | sparse_sd_vector(bit_vector &bv){ 63 | 64 | sdv = sd_vector<>(bv); 65 | rank1 = sd_vector<>::rank_1_type(&sdv); 66 | select1 = sd_vector<>::select_1_type(&sdv); 67 | 68 | } 69 | 70 | sparse_sd_vector & operator= (const sparse_sd_vector & other) { 71 | 72 | u = other.sdv.size(); 73 | sdv = sd_vector<>(other.sdv); 74 | rank1 = sd_vector<>::rank_1_type(&sdv); 75 | select1 = sd_vector<>::select_1_type(&sdv); 76 | 77 | return *this; 78 | } 79 | 80 | /* 81 | * not implemented 82 | * argument: a boolean b 83 | * behavior: append b at the end of the bitvector. 84 | */ 85 | //void push_back(bool b){} 86 | 87 | /* 88 | * argument: position i in the bitvector 89 | * returns: bit in position i 90 | * only access! the bitvector is static. 
91 | */ 92 | bool operator[](ulint i){ 93 | 94 | assert(i0); 125 | 126 | return rank(i)-1; 127 | 128 | } 129 | 130 | /* 131 | * input: position 0<=i<=n 132 | * output: predecessor of i (i excluded) in 133 | * bitvector space 134 | */ 135 | ulint predecessor(ulint i){ 136 | 137 | /* 138 | * i must have a predecessor 139 | */ 140 | assert(rank(i)>0); 141 | 142 | return select(rank(i)-1); 143 | 144 | 145 | } 146 | 147 | /* 148 | * input: position 0<=i<=n 149 | * output: rank of predecessor of i (i excluded) in 150 | * bitvector space. If i does not have a predecessor, 151 | * return rank of the last bit set in the bitvector 152 | */ 153 | ulint predecessor_rank_circular(ulint i){ 154 | 155 | return rank(i)==0 ? number_of_1()-1 : rank(i)-1; 156 | 157 | } 158 | 159 | /* 160 | * retrieve length of the i-th gap (i>=0). gap length includes the leading 1 161 | * \param i0 176 | * returns: position of the i-th one in the bitvector. i starts from 0! 177 | */ 178 | ulint select(ulint i){ 179 | 180 | assert(i::rank_1_type(&sdv); 225 | select1 = sd_vector<>::select_1_type(&sdv); 226 | 227 | } 228 | 229 | private: 230 | 231 | //bitvector length 232 | ulint u = 0; 233 | 234 | sd_vector<> sdv; 235 | sd_vector<>::rank_1_type rank1; 236 | sd_vector<>::select_1_type select1; 237 | 238 | }; 239 | 240 | } 241 | 242 | 243 | #endif /* INTERNAL_SPARSE_SD_VECTOR_HPP_ */ 244 | -------------------------------------------------------------------------------- /internal/succinct_bit_vector.hpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 
4 | 5 | /* 6 | * succinct_bit_vector: a wrapper on bit_vector of the sdsl library, with support for rank and select 7 | */ 8 | 9 | 10 | #ifndef INTERNAL_SUCCINCT_BIT_VECTOR_HPP_ 11 | #define INTERNAL_SUCCINCT_BIT_VECTOR_HPP_ 12 | 13 | #include 14 | 15 | using namespace std; 16 | using namespace sdsl; 17 | 18 | #ifndef ulint 19 | typedef uint64_t ulint; 20 | #endif 21 | 22 | #ifndef uint 23 | typedef uint32_t uint; 24 | #endif 25 | 26 | namespace ri{ 27 | 28 | class succinct_bit_vector{ 29 | 30 | public: 31 | 32 | /* 33 | * empty constructor. Initialize bitvector with length 0. 34 | */ 35 | succinct_bit_vector(){} 36 | 37 | /* 38 | * constructor. build bitvector given a vector of bools 39 | */ 40 | succinct_bit_vector(vector<bool> b){ 41 | 42 | bv = bit_vector(b.size()); 43 | 44 | for(uint64_t i=0;i=0 93 | * returns: position of the i-th one in the bitvector. i starts from 0! 94 | */ 95 | ulint select(ulint i){ 96 | 97 | assert(i0); 122 | 123 | return size; 124 | 125 | } 126 | 127 | /* load the structure from the istream 128 | * \param in the istream 129 | */ 130 | void load(std::istream& in) { 131 | 132 | bv.load(in); 133 | rank1 = bit_vector::rank_1_type(&bv); 134 | select1 = bit_vector::select_1_type(&bv); 135 | 136 | } 137 | 138 | private: 139 | 140 | bit_vector bv; 141 | bit_vector::rank_1_type rank1; 142 | bit_vector::select_1_type select1; 143 | 144 | }; 145 | 146 | } 147 | 148 | 149 | #endif /* INTERNAL_SUCCINCT_BIT_VECTOR_HPP_ */ 150 | -------------------------------------------------------------------------------- /internal/utils.hpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file.
4 | 5 | #include 6 | 7 | using namespace std; 8 | 9 | #ifndef UTILS_RI_HPP_ 10 | #define UTILS_RI_HPP_ 11 | 12 | using ulint = uint64_t; 13 | 14 | string get_time(uint64_t time){ 15 | 16 | stringstream ss; 17 | 18 | if(time>=3600){ 19 | 20 | uint64_t h = time/3600; 21 | uint64_t m = (time%3600)/60; 22 | uint64_t s = (time%3600)%60; 23 | 24 | ss << time << " seconds. ("<< h << "h " << m << "m " << s << "s" << ")"; 25 | 26 | }else if (time>=60){ 27 | 28 | uint64_t m = time/60; 29 | uint64_t s = time%60; 30 | 31 | ss << time << " seconds. ("<< m << "m " << s << "s" << ")"; 32 | 33 | }else{ 34 | 35 | ss << time << " seconds."; 36 | 37 | } 38 | 39 | return ss.str(); 40 | 41 | } 42 | 43 | uint8_t bitsize(uint64_t x){ 44 | 45 | if(x==0) return 1; 46 | return 64 - __builtin_clzll(x); //number of bits needed to represent x 47 | 48 | } 49 | 50 | //parse pizza&chili patterns header: 51 | void header_error(){ 52 | cerr << "Error: malformed header in patterns file" << endl; 53 | cerr << "Take a look here for more info on the file format: http://pizzachili.dcc.uchile.cl/experiments.html" << endl; 54 | exit(1); 55 | } 56 | 57 | ulint get_number_of_patterns(string header){ 58 | 59 | ulint start_pos = header.find("number="); 60 | if (start_pos == std::string::npos or start_pos+7>=header.size()) 61 | header_error(); 62 | 63 | start_pos += 7; 64 | 65 | ulint end_pos = header.substr(start_pos).find(" "); 66 | if (end_pos == std::string::npos) 67 | header_error(); 68 | 69 | ulint n = std::strtoull(header.substr(start_pos,end_pos).c_str(), nullptr, 10); 70 | 71 | return n; 72 | 73 | } 74 | 75 | ulint get_patterns_length(string header){ 76 | 77 | ulint start_pos = header.find("length="); 78 | if (start_pos == std::string::npos or start_pos+7>=header.size()) 79 | header_error(); 80 | 81 | start_pos += 7; 82 | 83 | ulint end_pos = header.substr(start_pos).find(" "); 84 | if (end_pos == std::string::npos) 85 | header_error(); 86 | 87 | ulint n = std::strtoull(header.substr(start_pos,end_pos).c_str(), nullptr, 10); 88 | 89 | return n; 90 | 91
| } 92 | 93 | 94 | 95 | #endif /* UTILS_RI_HPP_ */ 96 | -------------------------------------------------------------------------------- /ri-build.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 4 | 5 | #include 6 | 7 | #include "internal/r_index.hpp" 8 | #include "utils.hpp" 9 | #include "internal/r_index.hpp" 10 | 11 | using namespace ri; 12 | using namespace std; 13 | 14 | string out_basename=string(); 15 | string input_file=string(); 16 | int sa_rate = 512; 17 | bool sais=true; 18 | ulint T = 0;//Build fast index with SA rate = T 19 | bool fast = false;//build fast index 20 | bool hyb = false; //use hybrid bitvectors instead of sd_vectors? 21 | 22 | void help(){ 23 | cout << "ri-build: builds the r-index. Extension .ri is automatically added to output index file" << endl << endl; 24 | cout << "Usage: ri-build [options] " << endl; 25 | cout << " -o use 'basename' as prefix for all index files. Default: basename is the specified input_file_name"< T>0. if used, build the fast index (see option -fast) storing T SA samples before and after each"< input text file." << endl; 35 | exit(0); 36 | } 37 | 38 | void parse_args(char** argv, int argc, int &ptr){ 39 | 40 | assert(ptr=argc-1){ 48 | cout << "Error: missing parameter after -o option." 
<< endl; 49 | help(); 50 | } 51 | 52 | out_basename = string(argv[ptr]); 53 | ptr++; 54 | 55 | }else if(s.compare("-divsufsort")==0){ 56 | 57 | sais = false; 58 | 59 | }/*else if(s.compare("-h")==0){ 60 | 61 | hyb=true; 62 | 63 | }/*else if(s.compare("-fast")==0){ 64 | 65 | fast=true; 66 | 67 | }else if(s.compare("-T")==0){ 68 | 69 | T = atoi(argv[ptr]); 70 | 71 | if(T<=0){ 72 | cout << "Error: parameter T must be T>0" << endl; 73 | help(); 74 | } 75 | 76 | ptr++; 77 | fast=true; 78 | 79 | }*/else{ 80 | cout << "Error: unrecognized '" << s << "' option." << endl; 81 | help(); 82 | } 83 | 84 | } 85 | 86 | int main(int argc, char** argv){ 87 | 88 | using std::chrono::high_resolution_clock; 89 | using std::chrono::duration_cast; 90 | using std::chrono::duration; 91 | 92 | auto t1 = high_resolution_clock::now(); 93 | 94 | //parse options 95 | 96 | out_basename=string(); 97 | input_file=string(); 98 | int ptr = 1; 99 | 100 | if(argc<2) help(); 101 | 102 | while(ptr(input,sais); 139 | //idx.serialize(out); 140 | 141 | }else{ 142 | 143 | auto idx = r_index<>(input,sais); 144 | idx.serialize(out); 145 | 146 | } 147 | 148 | 149 | auto t2 = high_resolution_clock::now(); 150 | ulint total = duration_cast>>(t2 - t1).count(); 151 | cout << "Build time : " << get_time(total) << endl; 152 | 153 | 154 | out.close(); 155 | 156 | } 157 | -------------------------------------------------------------------------------- /ri-count.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 
4 | 5 | #include 6 | 7 | #include "internal/r_index.hpp" 8 | 9 | #include "internal/utils.hpp" 10 | 11 | using namespace ri; 12 | using namespace std; 13 | 14 | string check = string();//check occurrences on this text 15 | 16 | bool hyb=false; 17 | 18 | void help(){ 19 | cout << "ri-count: number of occurrences of the input patterns." << endl << endl; 20 | 21 | cout << "Usage: ri-count " << endl; 22 | //cout << " -h use hybrid bitvectors instead of elias-fano in both RLBWT and predecessor structures. -h is required "< index file (with extension .ri)" << endl; 25 | cout << " file in pizza&chili format containing the patterns." << endl; 26 | exit(0); 27 | } 28 | 29 | void parse_args(char** argv, int argc, int &ptr){ 30 | 31 | assert(ptr 51 | void count(std::ifstream& in, string patterns){ 52 | 53 | using std::chrono::high_resolution_clock; 54 | using std::chrono::duration_cast; 55 | using std::chrono::duration; 56 | 57 | string text; 58 | bool c = false; 59 | 60 | if(check.compare(string()) != 0){ 61 | 62 | c = true; 63 | 64 | ifstream ifs1(check); 65 | stringstream ss; 66 | ss << ifs1.rdbuf();//read the file 67 | text = ss.str(); 68 | 69 | } 70 | 71 | auto t1 = high_resolution_clock::now(); 72 | 73 | idx_t idx; 74 | 75 | idx.load(in); 76 | 77 | auto t2 = high_resolution_clock::now(); 78 | 79 | cout << "searching patterns ... " << endl; 80 | ifstream ifs(patterns); 81 | 82 | //read header of the pizza&chilli input file 83 | //header example: 84 | //# number=7 length=10 file=genome.fasta forbidden=\n\t 85 | string header; 86 | std::getline(ifs, header); 87 | 88 | ulint n = get_number_of_patterns(header); 89 | ulint m = get_patterns_length(header); 90 | 91 | uint last_perc = 0; 92 | 93 | ulint occ_tot=0; 94 | 95 | //extract patterns from file and search them in the index 96 | for(ulint i=0;ilast_perc){ 100 | cout << perc << "% done ..." 
<< endl; 101 | last_perc=perc; 102 | } 103 | 104 | string p = string(); 105 | 106 | for(ulint j=0;j(t2 - t1).count(); 127 | cout << "Load time : " << load << " milliseconds" << endl; 128 | 129 | uint64_t search = std::chrono::duration_cast(t3 - t2).count(); 130 | cout << "number of patterns n = " << n << endl; 131 | cout << "pattern length m = " << m << endl; 132 | cout << "total number of occurrences occ_t = " << occ_tot << endl; 133 | 134 | cout << "Total time : " << search << " milliseconds" << endl; 135 | cout << "Search time : " << (double)search/n << " milliseconds/pattern (total: " << n << " patterns)" << endl; 136 | cout << "Search time : " << (double)search/occ_tot << " milliseconds/occurrence (total: " << occ_tot << " occurrences)" << endl; 137 | 138 | } 139 | 140 | int main(int argc, char** argv){ 141 | 142 | if(argc < 3) 143 | help(); 144 | 145 | int ptr = 1; 146 | 147 | while(ptr >(in, patt_file); 165 | 166 | }else{ 167 | 168 | count >(in, patt_file); 169 | 170 | } 171 | 172 | in.close(); 173 | 174 | } 175 | -------------------------------------------------------------------------------- /ri-locate.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 4 | 5 | #include 6 | 7 | #include "internal/r_index.hpp" 8 | 9 | #include "internal/utils.hpp" 10 | 11 | using namespace ri; 12 | using namespace std; 13 | 14 | string check = string();//check occurrences on this text 15 | bool hyb=false; 16 | string ofile; 17 | 18 | void help(){ 19 | cout << "ri-locate: locate all occurrences of the input patterns." 
<< endl << endl; 20 | 21 | cout << "Usage: ri-locate [options] " << endl; 22 | cout << " -c check correctness of each pattern occurrence on this text file (must be the same indexed)" << endl; 23 | //cout << " -h use hybrid bitvectors instead of elias-fano in both RLBWT and predecessor structures. -h is required "< write pattern occurrences to this file (ASCII)" << endl; 26 | cout << " index file (with extension .ri)" << endl; 27 | cout << " file in pizza&chili format containing the patterns." << endl; 28 | exit(0); 29 | } 30 | 31 | void parse_args(char** argv, int argc, int &ptr){ 32 | 33 | assert(ptr=argc-1){ 41 | cout << "Error: missing parameter after -c option." << endl; 42 | help(); 43 | } 44 | 45 | check = string(argv[ptr]); 46 | ptr++; 47 | 48 | }else if(s.compare("-o")==0){ 49 | 50 | if(ptr>=argc-1){ 51 | cout << "Error: missing parameter after -o option." << endl; 52 | help(); 53 | } 54 | 55 | ofile = string(argv[ptr]); 56 | ptr++; 57 | 58 | }/*else if(s.compare("-h")==0){ 59 | 60 | hyb=true; 61 | 62 | }*/else{ 63 | 64 | cout << "Error: unknown option " << s << endl; 65 | help(); 66 | 67 | } 68 | 69 | } 70 | 71 | 72 | template 73 | void locate(std::ifstream& in, string patterns){ 74 | 75 | using std::chrono::high_resolution_clock; 76 | using std::chrono::duration_cast; 77 | using std::chrono::duration; 78 | 79 | string text; 80 | bool c = false; 81 | 82 | ofstream out; 83 | 84 | if(ofile.compare(string())!=0){ 85 | 86 | out = ofstream(ofile); 87 | 88 | } 89 | 90 | if(check.compare(string()) != 0){ 91 | 92 | c = true; 93 | 94 | ifstream ifs1(check); 95 | stringstream ss; 96 | ss << ifs1.rdbuf();//read the file 97 | text = ss.str(); 98 | 99 | } 100 | 101 | auto t1 = high_resolution_clock::now(); 102 | 103 | idx_t idx; 104 | 105 | idx.load(in); 106 | 107 | auto t2 = high_resolution_clock::now(); 108 | 109 | cout << "searching patterns ... 
" << endl; 110 | ifstream ifs(patterns); 111 | 112 | //read header of the pizza&chilli input file 113 | //header example: 114 | //# number=7 length=10 file=genome.fasta forbidden=\n\t 115 | string header; 116 | std::getline(ifs, header); 117 | 118 | ulint n = get_number_of_patterns(header); 119 | ulint m = get_patterns_length(header); 120 | 121 | uint last_perc = 0; 122 | 123 | ulint occ_tot=0; 124 | 125 | //extract patterns from file and search them in the index 126 | for(ulint i=0;ilast_perc){ 130 | cout << perc << "% done ..." << endl; 131 | last_perc=perc; 132 | } 133 | 134 | string p = string(); 135 | 136 | for(ulint j=0;j(t2 - t1).count(); 205 | cout << "Load time : " << load << " milliseconds" << endl; 206 | 207 | uint64_t search = std::chrono::duration_cast(t3 - t2).count(); 208 | cout << "number of patterns n = " << n << endl; 209 | cout << "pattern length m = " << m << endl; 210 | cout << "total number of occurrences occ_t = " << occ_tot << endl; 211 | 212 | cout << "Total time : " << search << " milliseconds" << endl; 213 | cout << "Search time : " << (double)search/n << " milliseconds/pattern (total: " << n << " patterns)" << endl; 214 | cout << "Search time : " << (double)search/occ_tot << " milliseconds/occurrence (total: " << occ_tot << " occurrences)" << endl; 215 | 216 | } 217 | 218 | int main(int argc, char** argv){ 219 | 220 | if(argc < 3) 221 | help(); 222 | 223 | int ptr = 1; 224 | 225 | while(ptr >(in, patt_file); 243 | 244 | }else{ 245 | 246 | locate >(in, patt_file); 247 | 248 | } 249 | 250 | in.close(); 251 | 252 | } 253 | -------------------------------------------------------------------------------- /ri-space.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2017, Nicola Prezza. All rights reserved. 2 | // Use of this source code is governed 3 | // by a MIT license that can be found in the LICENSE file. 
4 | 5 | #include 6 | 7 | #include "internal/r_index.hpp" 8 | #include "internal/utils.hpp" 9 | 10 | using namespace ri; 11 | using namespace std; 12 | 13 | bool hyb=false; 14 | 15 | void help(){ 16 | cout << "ri-space: breakdown of index space usage" << endl; 17 | cout << "Usage: ri-space " << endl; 18 | //cout << " -h use hybrid bitvectors instead of elias-fano in both RLBWT and predecessor structures. -h is required "< index file (with extension .ri)" << endl; 21 | exit(0); 22 | } 23 | 24 | 25 | void parse_args(char** argv, int argc, int &ptr){ 26 | 27 | assert(ptr idx; 57 | idx.load_from_file(argv[ptr]); 58 | auto space = idx.print_space(); 59 | cout << "\nTOT space: " << space << " Bytes" < idx; 64 | idx.load_from_file(argv[ptr]); 65 | auto space = idx.print_space(); 66 | cout << "\nTOT space: " << space << " Bytes" <
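The ri-count and ri-locate tools above read a pizza&chili header line of the form `# number=7 length=10 file=genome.fasta ...` before searching. As a minimal standalone sketch of that parsing step, the helper below extracts one numeric field from such a line. The name `header_field` is hypothetical and not part of the repository. Unlike `get_number_of_patterns` in internal/utils.hpp, it also accepts a value at the very end of the line and throws an exception instead of calling `exit`; both choices are assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper (not in the repository): returns the unsigned integer
// that follows `key` (for example "number=") in a pizza&chili header line.
// Throws std::runtime_error if the key is absent or the value is not numeric.
std::uint64_t header_field(const std::string& header, const std::string& key) {
    const std::size_t pos = header.find(key);
    if (pos == std::string::npos)
        throw std::runtime_error("missing field " + key);

    const std::size_t start = pos + key.size();

    // The value runs up to the next space, or to the end of the line.
    const std::size_t end = header.find(' ', start);
    const std::string value = header.substr(
        start, end == std::string::npos ? std::string::npos : end - start);

    // Reject empty values and anything that is not a plain decimal number.
    if (value.empty() ||
        value.find_first_not_of("0123456789") != std::string::npos)
        throw std::runtime_error("malformed value for " + key);

    return std::strtoull(value.c_str(), nullptr, 10);
}
```

With the example header from the source comments, `header_field(header, "number=")` gives the number of patterns and `header_field(header, "length=")` gives their common length. Throwing rather than exiting lets the caller decide how to report a malformed file.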