├── README.md
├── import.py
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
 1 | Firebase Streaming Import
 2 | =========================
 3 | 
 4 | - Uses the ijson Python streaming-JSON library along with requests to import a large JSON file piecemeal into Firebase.
 5 | 
 6 | - This is a **two-pass** script. Run it once in normal mode to write all the data, then run it again in --priority_mode to write priority data.
 7 | 
 8 | - Defaults to 8-thread parallelization. Tweak this argument for your own best performance.
 9 | 
10 | - Repeats efforts already done in [firebase-import](https://github.com/firebase/firebase-import); however, firebase-import doesn't handle large JSON files well, because Node runs out of memory. This script streams the data in, so there is no size limit, though it may not be as fast or efficient as the other tool.
11 | 
12 | - The root of the tree does not need to be empty, since we make REST PATCH calls.
13 | 
14 | - Speed: about 30 seconds/MB for datasets with many small leaf values. Performance improves when leaves hold larger values.
15 | 
16 | Requirements:
17 | - Run `pip install -r requirements.txt`
18 | - You may need to run `pip install pp --allow-unverified pp` to install the pp module.
19 | 
20 | ```
21 | usage: import.py [-h] [-a AUTH] [-t THREADS] [-s] [-p] firebase_url json_file
22 | 
23 | Import a large json file into a Firebase via json Streaming. Uses HTTP PATCH
24 | requests. Two-pass script, run once normally, then again in --priority_mode.
25 | 
26 | positional arguments:
27 |   firebase_url          Specify the Firebase URL (e.g.
28 |                         https://test.firebaseio.com/dest/path/).
29 |   json_file             The JSON file to import.
30 | 
31 | optional arguments:
32 |   -h, --help            show this help message and exit
33 |   -a AUTH, --auth AUTH  Optional Auth token if necessary to write to Firebase.
34 |   -t THREADS, --threads THREADS
35 |                         Number of parallel threads to use, default 8.
36 |   -s, --silent          Silences the server response, speeding up the
37 |                         connection.
38 |   -p, --priority_mode   Run this script in priority mode after running it in
39 |                         normal mode to write all priority values.
40 | ```
41 | 
--------------------------------------------------------------------------------
/import.py:
--------------------------------------------------------------------------------
 1 | import ijson
 2 | import requests
 3 | import argparse
 4 | import json
 5 | import sys
 6 | import re
 7 | import traceback
 8 | import pp
 9 | import time
10 | 
11 | 
12 | def main(args):
13 |     print("started at {0}".format(time.time()))
14 | 
15 |     parser = ijson.parse(open(args.json_file))
16 |     session = requests.Session()
17 |     parallelJobs = pp.Server()
18 |     parallelJobs.set_ncpus(args.threads)
19 | 
20 |     for prefix, event, value in parser:
21 |         if value is not None and event != 'map_key':
22 | 
23 |             # ijson sends the prefix as a string of keys connected by periods,
24 |             # but Firebase uses periods for special values such as priority.
25 |             # 1. Find '..', and store the indexes of the second period
26 |             doublePeriodIndexes = [m.start() + 1 for m in re.finditer(r'\.\.', prefix)]
27 |             # 2. Replace all '.' with ' '
28 |             prefix = prefix.replace('.', ' ')
29 |             # 3. Use stored indexes of '..' to recreate second periods in the pairs of periods
30 |             prefixList = list(prefix)
31 |             for index in doublePeriodIndexes:
32 |                 prefixList[index] = '.'
33 |             prefix = "".join(prefixList)
34 |             # 4. Split on whitespace
35 |             prefixes = prefix.split(' ')
36 |             lastPrefix = prefixes[-1]
37 |             prefixes = prefixes[:-1]
38 | 
39 |             url = args.firebase_url
40 |             for prefix in prefixes:
41 |                 url += prefix + '/'
42 |             url += '.json'
43 |             if args.silent:
44 |                 url += '?print=silent'
45 | 
46 |             if not args.priority_mode:
47 |                 if lastPrefix == '.priority':
48 |                     continue
49 |             else:
50 |                 if lastPrefix != '.priority':
51 |                     continue
52 | 
53 |             if event == 'number':
54 |                 dataObj = {lastPrefix: float(value)}
55 |             else:
56 |                 dataObj = {lastPrefix: value}
57 | 
58 |             try:
59 |                 parallelJobs.submit(sendData, (url, dataObj, session, args), (), ("json", "requests"))
60 |             except Exception:
61 |                 print('Caught an error: ' + traceback.format_exc())
62 |                 print(prefix, event, value)
63 | 
64 |     # If we don't wait for all jobs to finish, the script will end and kill all still-open threads
65 |     parallelJobs.wait()
66 |     print("finished at {0}".format(time.time()))
67 | 
68 | 
69 | def sendData(url, dataObject, session, args):
70 |     if args.auth is not None:
71 |         authObj = {'auth': args.auth}
72 |         session.patch(url, data=json.dumps(dataObject), params=authObj)
73 |     else:
74 |         session.patch(url, data=json.dumps(dataObject))
75 | 
76 | 
77 | if __name__ == '__main__':
78 |     argParser = argparse.ArgumentParser(description="Import a large json file into a Firebase via json Streaming.\
79 |                                         Uses HTTP PATCH requests. Two-pass script, run once normally,\
80 |                                         then again in --priority_mode.")
81 |     argParser.add_argument('firebase_url', help="Specify the Firebase URL (e.g. https://test.firebaseio.com/dest/path/).")
82 |     argParser.add_argument('json_file', help="The JSON file to import.")
83 |     argParser.add_argument('-a', '--auth', help="Optional Auth token if necessary to write to Firebase.")
84 |     argParser.add_argument('-t', '--threads', type=int, default=8, help='Number of parallel threads to use, default 8.')
85 |     argParser.add_argument('-s', '--silent', action='store_true',
86 |                            help="Silences the server response, speeding up the connection.")
87 |     argParser.add_argument('-p', '--priority_mode', action='store_true',
88 |                            help='Run this script in priority mode after running it in normal mode to write all priority values.')
89 | 
90 |     main(argParser.parse_args())
91 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | argparse
3 | ijson
4 | pp
--------------------------------------------------------------------------------
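The four-step prefix escaping in import.py (turning an ijson prefix into a Firebase key path) can be illustrated with a standalone sketch. The `prefix_to_path` helper name is chosen here for illustration and is not part of the script:

```python
import re

def prefix_to_path(prefix):
    # ijson joins nested keys with '.', but Firebase keys such as
    # '.priority' themselves begin with '.', which shows up as '..'
    # in the prefix. Record the position of the second period in
    # each '..' pair so it can be restored later.
    double_period_indexes = [m.start() + 1 for m in re.finditer(r'\.\.', prefix)]
    # Replace every '.' with a space, then put back the recorded
    # periods that belong to special key names.
    chars = list(prefix.replace('.', ' '))
    for index in double_period_indexes:
        chars[index] = '.'
    # Split on the remaining spaces to recover the list of keys.
    return ''.join(chars).split(' ')

# A value stored under users/u1 with a '.priority' key arrives
# from ijson with the prefix 'users.u1..priority':
print(prefix_to_path('users.u1..priority'))  # ['users', 'u1', '.priority']
```

Note this works only because no ordinary Firebase key may contain a period, so any `..` in the prefix must mark a special key like `.priority`.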