├── .gitignore ├── LICENSE ├── README.md ├── bitcoin_ingest ├── blockutil.py ├── continuous_ingest.py ├── fetch_blocks.py ├── fetch_exchange_rates.py ├── ingest_data.py ├── requirements.txt ├── sample └── block_0.bin └── schema_raw.cql /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | __pycache__/ 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 AIT Austrian Institute of Technology 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GraphSense Datafeed 2 | 3 | A service for ingesting raw data into Apache Cassandra. 
4 | 5 | At the moment, the following sources are supported: 6 | 7 | * Bitcoin transactions extracted from the [Bitcoin Client][bitcoin-client] 8 | * currency conversion rates from [ariva.de][ariva.de] 9 | 10 | ## Prerequisites 11 | 12 | ### Python 13 | 14 | Make sure Python 3 is installed 15 | 16 | python --version 17 | 18 | Install dependencies (`requests` and `cassandra-driver`), e.g. via `pip` 19 | 20 | pip install -r requirements.txt 21 | 22 | ### Apache Cassandra 23 | 24 | Download and install [Apache Cassandra][apache-cassandra] >= 3.11 25 | in `$CASSANDRA_HOME`. 26 | 27 | Start Cassandra (in the foreground for development purposes): 28 | 29 | $CASSANDRA_HOME/bin/cassandra -f 30 | 31 | Connect to Cassandra via CQL 32 | 33 | $CASSANDRA_HOME/bin/cqlsh 34 | 35 | and test if it is running 36 | 37 | cqlsh> SELECT cluster_name, listen_address FROM system.local; 38 | 39 | cluster_name | listen_address 40 | --------------+---------------- 41 | Test Cluster | 127.0.0.1 42 | 43 | (1 rows) 44 | 45 | ## Ingest Bitcoin transactions 46 | 47 | Create raw data keyspace in Cassandra 48 | 49 | $CASSANDRA_HOME/bin/cqlsh -f schema_raw.cql 50 | 51 | Use the following script to retrieve blocks in bulk from a running 52 | Bitcoin client. The following command retrieves 50000 blocks from the Bitcoin client 53 | and stores them in binary format as `blocks_*.bin` files in the `data` folder 54 | (replace `BITCOIN_CLIENT` with the hostname or IP address of the Bitcoin REST 55 | interface) 56 | 57 | mkdir data 58 | python fetch_blocks.py -d ./data -h BITCOIN_CLIENT -n 50000 59 | 60 | The -f option specifies the prefix of the generated files (which are written 61 | in chunks of roughly 32 MB to the `data` folder) and the -h option specifies 62 | the host running the Bitcoin REST interface.
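For reference, each generated file is simply a sequence of pickled block records appended by `fetch_blocks.py`. A minimal sketch of how such a file can be read back (the function name `read_blocks` is illustrative and not part of this repository):

```python
import pickle

def read_blocks(filename):
    """Read all pickled block records from one blocks_*.bin file."""
    blocks = []
    with open(filename, "rb") as f:
        while True:
            try:
                # fetch_blocks.py writes records with pickle.dump(block, out, -1),
                # so records are read back one by one until end of file
                blocks.append(pickle.load(f))
            except EOFError:
                break
    return blocks
```

This mirrors the loop `ingest_data.py` uses internally when ingesting the dumped files.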
63 | 64 | Ingest the blocks from the data directory into Cassandra 65 | 66 | python ingest_data.py -d ./data/ -c localhost -p 2 67 | 68 | The -c option specifies that your Cassandra server is running on the 69 | localhost and the -p option specifies the number of worker processes to 70 | be used by the ingest. 71 | 72 | If necessary, the `continuous_ingest.py` script can be started after the 73 | `ingest_data.py` script in order to retrieve and ingest the blocks 74 | present in the blockchain but not yet ingested into Cassandra. This script 75 | checks periodically (specified via the -s option) for newly available 76 | blocks and ingests them. 77 | 78 | python continuous_ingest.py -h BITCOIN_CLIENT -c localhost -s 10 79 | 80 | ## Ingest exchange rates 81 | 82 | python fetch_exchange_rates.py -c localhost 83 | 84 | The -c option specifies that the Cassandra server is running on the 85 | localhost. 86 | 87 | [bitcoin-client]: https://github.com/graphsense/bitcoin-client 88 | [ariva.de]: http://www.ariva.de 89 | [apache-cassandra]: http://cassandra.apache.org/download/ 90 | -------------------------------------------------------------------------------- /bitcoin_ingest: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | ### BEGIN INIT INFO 4 | # Provides: bitcoin_ingest 5 | # Required-Start: $remote_fs $syslog 6 | # Required-Stop: $remote_fs $syslog 7 | # Default-Start: 2 3 4 5 8 | # Default-Stop: 0 1 6 9 | # Short-Description: Bitcoin ingest service 10 | # Description: ingest Bitcoin block/transaction data from the Bitcoin REST interface into Apache Cassandra 11 | ### END INIT INFO 12 | 13 | ### Adjust the following lines to fit your needs 14 | DIR= 15 | DAEMON=$DIR/continuous_ingest.py 16 | DAEMON_NAME=bitcoin_ingest 17 | DAEMON_OPTS="-h localhost -c localhost -l /var/log/bitcoin_ingest.log" 18 | ### 19 | 20 | # This next line determines what user the script runs as.
21 | # Running as root is generally not recommended; set DAEMON_USER to a dedicated service account if possible. 22 | DAEMON_USER=root 23 | 24 | # The process ID of the script when it runs is stored here: 25 | PIDFILE=/var/run/$DAEMON_NAME.pid 26 | 27 | . /lib/lsb/init-functions 28 | 29 | do_start () { 30 | log_daemon_msg "Starting system $DAEMON_NAME daemon" 31 | echo "start-stop-daemon --start --background --pidfile $PIDFILE --make-pidfile --user $DAEMON_USER --chuid $DAEMON_USER --startas $DAEMON -- $DAEMON_OPTS" 32 | start-stop-daemon -v --start --background --pidfile $PIDFILE --make-pidfile --user $DAEMON_USER --chuid $DAEMON_USER --startas $DAEMON -- $DAEMON_OPTS 33 | log_end_msg $? 34 | } 35 | do_stop () { 36 | log_daemon_msg "Stopping system $DAEMON_NAME daemon" 37 | start-stop-daemon -v --stop --pidfile $PIDFILE --retry 10 38 | log_end_msg $? 39 | } 40 | 41 | case "$1" in 42 | 43 | start|stop) 44 | do_${1} 45 | ;; 46 | 47 | restart|reload|force-reload) 48 | do_stop 49 | do_start 50 | ;; 51 | 52 | status) 53 | status_of_proc "$DAEMON_NAME" "$DAEMON" && exit 0 || exit $?
54 | ;; 55 | 56 | *) 57 | echo "Usage: /etc/init.d/$DAEMON_NAME {start|stop|restart|status}" 58 | exit 1 59 | ;; 60 | 61 | esac 62 | exit 0 63 | -------------------------------------------------------------------------------- /blockutil.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import sys 3 | 4 | BLOCKCHAIN_API = '' 5 | 6 | 7 | def hash_str(bytebuffer): 8 | return "".join(("%02x" % a) for a in bytebuffer) 9 | 10 | 11 | def transform_json(raw_block): 12 | transaction_ids = [] 13 | transactions = [] 14 | for raw_tx in raw_block["tx"]: 15 | tx = [] 16 | tx.append(bytearray.fromhex(raw_tx["hash"])) 17 | tx.append(raw_block["height"]) 18 | tx.append(raw_block["time"]) 19 | tx.append(raw_block["size"]) 20 | 21 | coinbase = False 22 | vins = [] 23 | for raw_vin in raw_tx["vin"]: 24 | if "coinbase" in raw_vin.keys(): 25 | coinbase = True 26 | else: 27 | vin = [] 28 | if "txid" in raw_vin.keys(): 29 | vin.append(bytearray.fromhex(raw_vin["txid"])) 30 | if "vout" in raw_vin.keys(): 31 | vin.append(raw_vin["vout"]) 32 | vins.append(vin) 33 | tx.append(coinbase) 34 | tx.append(vins) 35 | vouts = [] 36 | for raw_vout in raw_tx["vout"]: 37 | vout = [] 38 | if "value" in raw_vout.keys(): 39 | vout.append(int(raw_vout["value"] * 1e8 + 0.1))  # BTC -> satoshi; + 0.1 avoids float truncation errors 40 | if "n" in raw_vout.keys(): 41 | vout.append(raw_vout["n"]) 42 | addresses = [] 43 | if "scriptPubKey" in raw_vout.keys(): 44 | if "addresses" in raw_vout["scriptPubKey"].keys(): 45 | addresses = raw_vout["scriptPubKey"]["addresses"] 46 | vout.append(addresses) 47 | vouts.append(vout) 48 | tx.append(vouts) 49 | transactions.append(tx) 50 | transaction_ids.append(bytearray.fromhex(raw_tx["hash"])) 51 | block = [] 52 | block.append(raw_block["height"]) 53 | block.append(bytearray.fromhex(raw_block["hash"])) 54 | block.append(raw_block["time"]) 55 | block.append(raw_block["version"]) 56 | block.append(raw_block["size"]) 57 | block.append(transaction_ids) 58 | 59 | if 
"nextblockhash" in raw_block.keys(): 60 | next_block = raw_block["nextblockhash"] 61 | else: 62 | next_block = None 63 | return (next_block, block, transactions) 64 | 65 | 66 | def fetch_block(block_hash): 67 | global BLOCKCHAIN_API 68 | sz_req = BLOCKCHAIN_API + block_hash + ".json" 69 | while True: 70 | try: 71 | r = requests.get(sz_req) 72 | if r.status_code == requests.codes.ok: 73 | return r 74 | except KeyboardInterrupt: 75 | print("Ctrl-c pressed ...") 76 | sys.exit(1) 77 | except requests.exceptions.RequestException: 78 | print("Request failed. Retrying ...", end="\r") 79 | 80 | 81 | def fetch_block_json(block_hash): 82 | return fetch_block(block_hash).json() 83 | 84 | 85 | def fetch_block_text(block_hash): 86 | return fetch_block(block_hash).text 87 | 88 | 89 | def set_blockchain_api(endpoint): 90 | global BLOCKCHAIN_API 91 | BLOCKCHAIN_API = endpoint 92 | -------------------------------------------------------------------------------- /continuous_ingest.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import argparse 3 | import sys 4 | import time 5 | import logging 6 | from logging.handlers import TimedRotatingFileHandler 7 | from pathlib import Path 8 | from cassandra.cluster import Cluster 9 | from cassandra.query import BatchStatement 10 | import blockutil 11 | 12 | LOG_LEVEL = logging.INFO # could be e.g. 
"DEBUG" or "WARNING" 13 | BLOCK_0 = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f" 14 | LOCKFILE = "/var/lock/graphsense_transformation.lock" 15 | 16 | 17 | # class to capture stdout and stderr in the log 18 | class Logger(object): 19 | def __init__(self, logger, level): 20 | self.logger = logger 21 | self.level = level 22 | 23 | def write(self, message): 24 | # only log if there is a message (not just a new line) 25 | if message.rstrip() != "": 26 | self.logger.log(self.level, message.rstrip()) 27 | 28 | 29 | class FakeRS(object): 30 | def __init__(self, block_hash, height): 31 | self.block_hash = block_hash 32 | self.height = height 33 | 34 | 35 | class BlockchainIngest: 36 | 37 | def __init__(self, session): 38 | self.__session = session 39 | cql_stmt = """INSERT INTO block 40 | (height, block_hash, timestamp, block_version, size, txs) 41 | VALUES (?, ?, ?, ?, ?, ?);""" 42 | self.__insert_block_stmt = session.prepare(cql_stmt) 43 | cql_stmt = """INSERT INTO transaction 44 | (block_group, tx_number, tx_hash, 45 | height, timestamp, size, coinbase, vin, vout) 46 | VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);""" 47 | self.__insert_transaction_stmt = session.prepare(cql_stmt) 48 | 49 | def write_next_blocks(self, start_block): 50 | next_block = blockutil.hash_str(start_block) 51 | while next_block: 52 | block_json = blockutil.fetch_block_json(next_block) 53 | next_block, block, txs = blockutil.transform_json(block_json) 54 | batchStmt = BatchStatement() 55 | batchStmt.add(self.__insert_block_stmt, block) 56 | block_group = block[0] // 10000 57 | tx_number = 0 58 | for transaction in txs: 59 | batchStmt.add(self.__insert_transaction_stmt, 60 | [block_group, tx_number] + transaction) 61 | tx_number += 1 62 | while True: 63 | try: 64 | self.__session.execute(batchStmt) 65 | except Exception as err: 66 | print("Exception ", err, " retrying...", end="\r") 67 | continue 68 | break 69 | print("Wrote block %d" % (block[0]), end="\r") 70 | 71 | def 
get_last_block(self, keyspace): 72 | select_stmt = "SELECT height, block_hash FROM " + keyspace + \ 73 | ".block WHERE height = ?;" 74 | block_max = 0 75 | block_inc = 100000 76 | last_rs = None 77 | rs = None 78 | while True: 79 | last_rs = rs 80 | rs = self.__session.execute(self.__session.prepare(select_stmt), 81 | [block_max]) 82 | if not rs: 83 | if block_max == 0: 84 | return [FakeRS(bytearray.fromhex(BLOCK_0), 0)] 85 | if block_inc == 1: 86 | return last_rs 87 | else: 88 | block_max -= block_inc 89 | block_inc //= 10 90 | else: 91 | block_max += block_inc 92 | 93 | 94 | def main(): 95 | parser = argparse.ArgumentParser(description="Bitcoin ingest service", 96 | add_help=False) 97 | parser.add_argument('--help', action='help', 98 | help='show this help message and exit') 99 | parser.add_argument("-h", "--host", dest="host", required=True, 100 | default="localhost", metavar="RPC_HOST", 101 | help="host running bitcoin RPC interface") 102 | parser.add_argument("-p", "--port", dest="port", 103 | type=int, default=8332, 104 | help="port number of RPC interface") 105 | parser.add_argument("-c", "--cassandra", dest="cassandra", 106 | default="localhost", metavar="CASSANDRA_NODE", 107 | help="address or name of cassandra database") 108 | parser.add_argument("-k", "--keyspace", dest="keyspace", 109 | help="keyspace to import data to", 110 | default="graphsense_raw") 111 | parser.add_argument("-s", "--sleep", dest="sleep", 112 | type=int, default=600, 113 | help="numbers of seconds to sleep " + 114 | "before checking for new blocks") 115 | parser.add_argument("-l", "--log", dest="log", 116 | help="Location of log file") 117 | args = parser.parse_args() 118 | 119 | if args.log: 120 | logger = logging.getLogger(__name__) 121 | logger.setLevel(LOG_LEVEL) 122 | # handler that writes to a file, creates a new file at midnight 123 | # and keeps 3 backups 124 | handler = TimedRotatingFileHandler(args.log, when="midnight", 125 | backupCount=3) 126 | log_fmt = "%(asctime)s 
%(levelname)-8s %(message)s" 127 | formatter = logging.Formatter(log_fmt) 128 | handler.setFormatter(formatter) 129 | logger.addHandler(handler) 130 | # log stdout to file at INFO level 131 | sys.stdout = Logger(logger, logging.INFO) 132 | # log stderr to file at ERROR level 133 | sys.stderr = Logger(logger, logging.ERROR) 134 | 135 | cluster = Cluster([args.cassandra]) 136 | session = cluster.connect() 137 | session.default_timeout = 60 138 | session.set_keyspace(args.keyspace) 139 | bc_ingest = BlockchainIngest(session) 140 | 141 | blockutil.set_blockchain_api("http://%s:%d/rest/block/" % 142 | (args.host, args.port)) 143 | 144 | while True: 145 | if Path(LOCKFILE).is_file(): 146 | print("Found lockfile %s, pausing ingest." % LOCKFILE) 147 | else: 148 | last_rs = bc_ingest.get_last_block(args.keyspace) 149 | if last_rs: 150 | hash_val = last_rs[0].block_hash 151 | print("Found last block:") 152 | print("\tHeight:\t%d" % last_rs[0].height) 153 | print("\tHash:\t%s" % blockutil.hash_str(hash_val)) 154 | if hash_val: 155 | bc_ingest.write_next_blocks(hash_val) 156 | else: 157 | print("Could not get last block. 
Exiting...") 158 | break 159 | time.sleep(args.sleep) 160 | 161 | 162 | if __name__ == "__main__": 163 | main() 164 | -------------------------------------------------------------------------------- /fetch_blocks.py: -------------------------------------------------------------------------------- 1 | from argparse import ArgumentParser 2 | import json 3 | import os 4 | import pickle 5 | import blockutil 6 | 7 | BLOCK_0 = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f" 8 | 9 | 10 | def write_blocks_to_file(directory, file_prefix, start_block, no_blocks): 11 | 12 | next_block = start_block 13 | counter = 0 14 | out = None 15 | filesize = 1024 * 1024 * 32 16 | while True: 17 | if not out or (out and (out.tell() > filesize)): 18 | if out: 19 | out.close() 20 | filename = os.path.join(directory, 21 | "%s_%s.bin" % (file_prefix, counter)) 22 | print("Writing to file " + filename) 23 | out = open(filename, "wb") 24 | block_json = json.loads(blockutil.fetch_block_text(next_block)) 25 | 26 | next_block, block, txs = blockutil.transform_json(block_json) 27 | block.append(txs) 28 | pickle.dump(block, out, -1) 29 | 30 | print("Wrote block %s" % counter, end="\r") 31 | if not next_block: 32 | break 33 | if no_blocks != 0: 34 | if counter >= no_blocks: 35 | break 36 | counter += 1 37 | out.close() 38 | 39 | 40 | def main(): 41 | parser = ArgumentParser(add_help=False) 42 | parser.add_argument('--help', action='help', 43 | help='show this help message and exit') 44 | parser.add_argument("-d", "--directory", dest="directory", required=True, 45 | help="directory containing exported block files") 46 | parser.add_argument("-f", "--filename", dest="file_prefix", 47 | default="blocks", 48 | help="file prefix of exported block files") 49 | parser.add_argument("-h", "--host", dest="host", required=True, 50 | default="localhost", metavar="RPC_HOST", 51 | help="host running bitcoin RPC interface") 52 | parser.add_argument("-p", "--port", dest="port", 53 | type=int, 
default=8332, 54 | help="port number of RPC interface") 55 | parser.add_argument("-s", "--startblock", dest="startblock", 56 | default=BLOCK_0, 57 | help="hash of first block to export") 58 | parser.add_argument("-n", "--numblocks", dest="numblocks", 59 | type=int, default=0, 60 | help="number of blocks to write " + 61 | "(default value 0 exports all blocks)") 62 | 63 | args = parser.parse_args() 64 | 65 | blockutil.set_blockchain_api("http://%s:%d/rest/block/" % 66 | (args.host, args.port)) 67 | write_blocks_to_file(args.directory, args.file_prefix, 68 | args.startblock, args.numblocks) 69 | 70 | 71 | if __name__ == "__main__": 72 | main() 73 | -------------------------------------------------------------------------------- /fetch_exchange_rates.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import requests 3 | from argparse import ArgumentParser 4 | from datetime import datetime 5 | from cassandra.cluster import Cluster 6 | 7 | BTC_EUR = (111697700, 163) 8 | BTC_USD = (111720171, 167) 9 | 10 | CSV_URL = "http://www.ariva.de/quote/historic/" + \ 11 | "historic.csv?secu={}&boerse_id={}" 12 | 13 | 14 | def ingest_exchange_rates(session, currency="eur", 15 | security_no=111697700, boerse_id=163): 16 | 17 | fetch_url = CSV_URL.format(security_no, boerse_id) 18 | print("Fetching exchange rates from {}\n".format(fetch_url)) 19 | 20 | insert_stmt = """INSERT INTO exchange_rates 21 | (timestamp, {}) VALUES (?, ?)""".format(currency) 22 | prep_stmt = session.prepare(insert_stmt) 23 | 24 | print("Ingesting exchange rates into Cassandra.\n") 25 | with requests.Session() as s: 26 | download = s.get(fetch_url) 27 | decoded_content = download.content.decode("utf-8") 28 | 29 | cr = csv.reader(decoded_content.splitlines(), delimiter=";") 30 | my_list = list(cr) 31 | for index, row in enumerate(my_list): 32 | if index > 0 and len(row) != 0: 33 | timestamp = datetime.strptime(row[0], "%Y-%m-%d").timestamp() 34 | value = 
float(row[1].replace(".", "").replace(",", ".")) 35 | session.execute(prep_stmt, (int(timestamp), value)) 36 | 37 | print("Finished ingest for currency {}.".format(currency)) 38 | 39 | 40 | def main(): 41 | parser = ArgumentParser() 42 | parser.add_argument("-c", "--cassandra", dest="cassandra", 43 | metavar="CASSANDRA_NODE", default="localhost", 44 | help="cassandra node") 45 | parser.add_argument("-k", "--keyspace", dest="keyspace", 46 | help="keyspace to import data to", 47 | default="graphsense_raw") 48 | 49 | args = parser.parse_args() 50 | 51 | cluster = Cluster([args.cassandra]) 52 | session = cluster.connect() 53 | session.set_keyspace(args.keyspace) 54 | 55 | ingest_exchange_rates(session, "eur", BTC_EUR[0], BTC_EUR[1]) 56 | ingest_exchange_rates(session, "usd", BTC_USD[0], BTC_USD[1]) 57 | 58 | 59 | if __name__ == "__main__": 60 | main() 61 | -------------------------------------------------------------------------------- /ingest_data.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import time 4 | from argparse import ArgumentParser 5 | from multiprocessing import Pool, Value 6 | from cassandra.cluster import Cluster 7 | from cassandra.query import BatchStatement 8 | 9 | 10 | def split_list(alist, wanted_parts=1): 11 | length = len(alist) 12 | return [alist[i * length // wanted_parts: (i + 1) * length // wanted_parts] 13 | for i in range(wanted_parts)] 14 | 15 | 16 | class QueryManager(object): 17 | # chosen to match the default in execute_concurrent_with_args 18 | concurrency = 100 19 | counter = Value("d", 0) 20 | 21 | def __init__(self, cluster, keyspace, process_count=1): 22 | self.processes = process_count 23 | self.pool = Pool(processes=process_count, 24 | initializer=self._setup, 25 | initargs=(cluster, keyspace)) 26 | 27 | @classmethod 28 | def _setup(cls, cluster, keyspace): 29 | cls.cluster = Cluster([cluster]) 30 | cls.session = cls.cluster.connect() 31 | 
cls.session.default_timeout = 60 32 | cls.session.set_keyspace(keyspace) 33 | cql_stmt = """INSERT INTO block 34 | (height, block_hash, timestamp, block_version, size, txs) 35 | VALUES (?, ?, ?, ?, ?, ?);""" 36 | cls.insert_block_stmt = cls.session.prepare(cql_stmt) 37 | 38 | cql_stmt = """INSERT INTO transaction 39 | (block_group, tx_number, tx_hash, 40 | height, timestamp, size, coinbase, vin, vout) 41 | VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);""" 42 | cls.insert_transaction_stmt = cls.session.prepare(cql_stmt) 43 | 44 | def insert(self, files): 45 | params = list(files) 46 | self.pool.map(_multiprocess_insert, split_list(params, self.processes)) 47 | 48 | @classmethod 49 | def insertBlocks(cls, params): 50 | print("ingesting files", params) 51 | for filename in params: 52 | pickle_input = open(filename, "rb") 53 | 54 | while True: 55 | try: 56 | block = pickle.load(pickle_input) 57 | except EOFError: 58 | break 59 | if (cls.counter.value % 1000) == 0: 60 | print("Read block %d" % (cls.counter.value), end="\r") 61 | with cls.counter.get_lock(): 62 | cls.counter.value += 1 63 | 64 | transactions = block[6] 65 | block.pop(6) 66 | 67 | batchStmt = BatchStatement() 68 | batchStmt.add(cls.insert_block_stmt, block) 69 | block_group = block[0] // 10000 70 | tx_number = 0 71 | for transaction in transactions: 72 | batchStmt.add(cls.insert_transaction_stmt, 73 | [block_group, tx_number] + transaction) 74 | tx_number += 1 75 | 76 | while True: 77 | try: 78 | cls.session.execute(batchStmt) 79 | except Exception as err: 80 | print("Exception ", err, " retrying...", end="\r") 81 | continue 82 | break 83 | 84 | 85 | def _multiprocess_insert(params): 86 | return QueryManager.insertBlocks(params) 87 | 88 | 89 | def main(): 90 | parser = ArgumentParser() 91 | parser.add_argument("-c", "--cassandra", dest="cassandra", 92 | help="cassandra node", 93 | default="localhost") 94 | parser.add_argument("-k", "--keyspace", dest="keyspace", 95 | help="keyspace to import data to", 96 | 
default="graphsense_raw") 97 | parser.add_argument("-d", "--directory", dest="directory", required=True, 98 | help="source directory for raw block dumps") 99 | parser.add_argument("-p", "--processes", dest="num_proc", 100 | type=int, default=1, 101 | help="number of processes") 102 | 103 | args = parser.parse_args() 104 | 105 | files = [os.path.join(args.directory, f) 106 | for f in os.listdir(args.directory) 107 | if os.path.isfile(os.path.join(args.directory, f))] 108 | 109 | qm = QueryManager(args.cassandra, args.keyspace, args.num_proc) 110 | start = time.time() 111 | qm.insert(files) 112 | delta = time.time() - start 113 | print("\n%.1fs" % delta) 114 | 115 | 116 | if __name__ == "__main__": 117 | main() 118 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests==2.13.0 2 | cassandra-driver 3 | -------------------------------------------------------------------------------- /sample/block_0.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/behas/graphsense-datafeed/91d2fa4948dbb114a546889e5640ce394f16f4ff/sample/block_0.bin -------------------------------------------------------------------------------- /schema_raw.cql: -------------------------------------------------------------------------------- 1 | DROP KEYSPACE IF EXISTS graphsense_raw; 2 | CREATE KEYSPACE graphsense_raw WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; 3 | 4 | USE graphsense_raw; 5 | 6 | // BITCOIN blockchain data 7 | 8 | CREATE TYPE input ( 9 | txid blob, 10 | vout int 11 | ); 12 | 13 | CREATE TYPE output ( 14 | value bigint, 15 | n int, 16 | addresses list<text> 17 | ); 18 | 19 | CREATE TABLE block ( 20 | height int PRIMARY KEY, 21 | block_hash blob, 22 | timestamp int, 23 | block_version int, 24 | size int, 25 | txs list<blob> 26 | ); 27 | 28 | CREATE TABLE 
transaction ( 29 | block_group int, 30 | height int, 31 | size int, 32 | tx_hash blob, 33 | tx_number int, 34 | timestamp int, 35 | coinbase boolean, 36 | vin list<FROZEN<input>>, 37 | vout list<FROZEN<output>>, 38 | PRIMARY KEY (block_group, height, tx_hash) 39 | ); 40 | 41 | CREATE TABLE exchange_rates ( 42 | timestamp int PRIMARY KEY, 43 | eur double, 44 | usd double 45 | ); 46 | 47 | CREATE TABLE tag ( 48 | address text, 49 | tag text, 50 | tag_uri text, 51 | description text, 52 | actor_category text, 53 | source text, 54 | source_uri text, 55 | timestamp int, 56 | PRIMARY KEY (address, tag, source) 57 | ); 58 | 59 | --------------------------------------------------------------------------------
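The `transaction` table is partitioned by `block_group`, which both `ingest_data.py` and `continuous_ingest.py` compute as the block height divided (integer division) by 10000, so 10000 consecutive blocks share one Cassandra partition. A minimal sketch of that mapping (the helper name `block_group` is illustrative):

```python
def block_group(height, group_size=10000):
    # Partition key derivation used by the ingest scripts:
    # blocks 0..9999 land in partition 0, 10000..19999 in partition 1, etc.
    return height // group_size
```

Keeping partitions bounded this way avoids the unbounded-partition problem that would arise from partitioning transactions by a single constant key.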