├── README.md
├── import.py
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
 1 | Firebase Streaming Import
 2 | =========================
 3 | 
 4 | - Uses the ijson Python streaming-JSON library along with requests to import a large JSON file piecemeal into Firebase.
 5 | 
 6 | - This is a **two-pass** script. Run it once in normal mode to write all the data, then run it again in --priority_mode to write priority data.
 7 | 
 8 | - Defaults to 8-thread parallelization. Tweak this argument for your own best performance.
 9 | 
10 | - Repeats efforts already done in [firebase-import](https://github.com/firebase/firebase-import); however, firebase-import doesn't handle large JSON files well, because Node runs out of memory. This script streams the data in, so there is no size limit, though it may not be as fast or efficient as the other tool.
11 | 
12 | - The root of the tree does not need to be empty, since we make REST PATCH calls.
13 | 
14 | - Speed: about 30 seconds/MB for datasets with many small leaf values. Performance improves when leaves hold larger values.
15 | 
16 | Requirements:
17 | - Run `pip install -r requirements.txt`
18 | - You may need to run `pip install pp --allow-unverified pp` to install the pp module.
19 | 
20 | ```
21 | usage: import.py [-h] [-a AUTH] [-t THREADS] [-s] [-p] firebase_url json_file
22 | 
23 | Import a large json file into a Firebase via json Streaming. Uses HTTP PATCH
24 | requests. Two-pass script, run once normally, then again in --priority_mode.
25 | 
26 | positional arguments:
27 |   firebase_url          Specify the Firebase URL (e.g.
28 |                         https://test.firebaseio.com/dest/path/).
29 |   json_file             The JSON file to import.
30 | 
31 | optional arguments:
32 |   -h, --help            show this help message and exit
33 |   -a AUTH, --auth AUTH  Optional Auth token if necessary to write to Firebase.
34 |   -t THREADS, --threads THREADS
35 |                         Number of parallel threads to use, default 8.
36 |   -s, --silent          Silences the server response, speeding up the
37 |                         connection.
38 |   -p, --priority_mode   Run this script in priority mode after running it in
39 |                         normal mode to write all priority values.
40 | ```
41 | 
--------------------------------------------------------------------------------
/import.py:
--------------------------------------------------------------------------------
 1 | import ijson
 2 | import requests
 3 | import argparse
 4 | import json
 5 | import sys
 6 | import re
 7 | import traceback
 8 | import pp
 9 | import time
10 | 
11 | 
12 | def main(args):
13 |     print("started at {0}".format(time.time()))
14 | 
15 |     parser = ijson.parse(open(args.json_file))
16 |     session = requests.Session()
17 |     parallelJobs = pp.Server()
18 |     parallelJobs.set_ncpus(args.threads)
19 | 
20 |     for prefix, event, value in parser:
21 |         if value is not None and event != 'map_key':
22 | 
23 |             # ijson sends the prefix as a string of keys connected by periods,
24 |             # but Firebase uses periods for special values such as priority.
25 |             # 1. Find '..', and store the indexes of the second period
26 |             doublePeriodIndexes = [m.start() + 1 for m in re.finditer(r'\.\.', prefix)]
27 |             # 2. Replace all '.' with ' '
28 |             prefix = prefix.replace('.', ' ')
29 |             # 3. Use stored indexes of '..' to recreate second periods in the pairs of periods
30 |             prefixList = list(prefix)
31 |             for index in doublePeriodIndexes:
32 |                 prefixList[index] = '.'
33 |             prefix = "".join(prefixList)
34 |             # 4. Split on whitespace
35 |             prefixes = prefix.split(' ')
36 |             lastPrefix = prefixes[-1]
37 |             prefixes = prefixes[:-1]
38 | 
39 |             url = args.firebase_url
40 |             for prefix in prefixes:
41 |                 url += prefix + '/'
42 |             url += '.json'
43 |             if args.silent:
44 |                 url += '?print=silent'
45 | 
46 |             if not args.priority_mode:
47 |                 if lastPrefix == '.priority':
48 |                     continue
49 |             else:
50 |                 if lastPrefix != '.priority':
51 |                     continue
52 | 
53 |             if event == 'number':
54 |                 dataObj = {lastPrefix: float(value)}
55 |             else:
56 |                 dataObj = {lastPrefix: value}
57 | 
58 |             try:
59 |                 parallelJobs.submit(sendData, (url, dataObj, session, args), (), ("json", "requests"))
60 |             except Exception:
61 |                 print('Caught an error: ' + traceback.format_exc())
62 |                 print(prefix, event, value)
63 | 
64 |     # If we don't wait for all jobs to finish, the script will end and kill all still-open threads
65 |     parallelJobs.wait()
66 |     print("finished at {0}".format(time.time()))
67 | 
68 | 
69 | def sendData(url, dataObject, session, args):
70 |     if args.auth is not None:
71 |         authObj = {'auth': args.auth}
72 |         session.patch(url, data=json.dumps(dataObject), params=authObj)
73 |     else:
74 |         session.patch(url, data=json.dumps(dataObject))
75 | 
76 | 
77 | if __name__ == '__main__':
78 |     argParser = argparse.ArgumentParser(description="Import a large json file into a Firebase via json Streaming.\
79 |                                         Uses HTTP PATCH requests. Two-pass script, run once normally,\
80 |                                         then again in --priority_mode.")
81 |     argParser.add_argument('firebase_url', help="Specify the Firebase URL (e.g. https://test.firebaseio.com/dest/path/).")
82 |     argParser.add_argument('json_file', help="The JSON file to import.")
83 |     argParser.add_argument('-a', '--auth', help="Optional Auth token if necessary to write to Firebase.")
84 |     argParser.add_argument('-t', '--threads', type=int, default=8, help='Number of parallel threads to use, default 8.')
85 |     argParser.add_argument('-s', '--silent', action='store_true',
86 |                            help="Silences the server response, speeding up the connection.")
87 |     argParser.add_argument('-p', '--priority_mode', action='store_true',
88 |                            help='Run this script in priority mode after running it in normal mode to write all priority values.')
89 | 
90 |     main(argParser.parse_args())
91 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | argparse
3 | ijson
4 | pp
--------------------------------------------------------------------------------
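The four-step prefix escaping in import.py (turning an ijson prefix into a Firebase key path) can be illustrated with a standalone sketch. The `prefix_to_path` helper name is chosen here for illustration and is not part of the script:

```python
import re

def prefix_to_path(prefix):
    # ijson joins nested keys with '.', but Firebase keys such as
    # '.priority' themselves begin with '.', which shows up as '..'
    # in the prefix. Record the position of the second period in
    # each '..' pair so it can be restored later.
    double_period_indexes = [m.start() + 1 for m in re.finditer(r'\.\.', prefix)]
    # Replace every '.' with a space, then put back the recorded
    # periods that belong to special key names.
    chars = list(prefix.replace('.', ' '))
    for index in double_period_indexes:
        chars[index] = '.'
    # Split on the remaining spaces to recover the list of keys.
    return ''.join(chars).split(' ')

# A value stored under users/u1 with a '.priority' key arrives
# from ijson with the prefix 'users.u1..priority':
print(prefix_to_path('users.u1..priority'))  # ['users', 'u1', '.priority']
```

Note this works only because no ordinary Firebase key may contain a period, so any `..` in the prefix must mark a special key like `.priority`.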