├── .gitignore
├── README.md
├── __init__.py
├── crawler.py
├── filter_profile.py
├── pagination.py
├── profile_tag_edit.py
├── targets.txt
├── test.py
└── util.py

/.gitignore:
--------------------------------------------------------------------------------
profiles/
profiles_example/
__pycache__/
lib/
bin/
*.json
userdata/
include/
.Python
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Instagram Crawler
This crawler was made because most of the crawlers out there seem to either require a browser or a developer account. This Instagram crawler uses a private API of Instagram, so no developer account is required. However, it needs your Instagram account information because it calls the API through your user endpoints.

Instagram may or may not approve of this method. Instagram is known to regularly shut down user accounts suspected of traffic hoarding. Use at your own risk.

This README assumes, to an extent, the reader's knowledge of [graphs](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) and [graph search algorithms](https://en.wikipedia.org/wiki/Graph_traversal#Graph_traversal_algorithms). Regardless, it should be understandable without that background.

## Installation
First install [Instagram Private API](https://github.com/ping/instagram_private_api). Kudos for a great project!
```sh
$ pip install git+https://github.com/ping/instagram_private_api.git@1.2.7
```

Then download or clone this project into a folder.
```sh
$ git clone https://github.com/simonseo/instacrawler-privateapi.git
```

[Sign up to Instagram](https://www.instagram.com/) if you don't have an account. Take note of your username and password.


## Get Crawlin'

If you run `__init__.py` in the project folder from a shell, it prints the command options. If this usage message shows up, everything probably works. You can also run `python __init__.py -h` for more information about the options.
```sh
$ python __init__.py
usage: __init__.py [-h] -u USERNAME -p PASSWORD [-f TARGETFILE] [-t TARGET]
```

To get crawlin', you need to provide your
1. Instagram username
1. Instagram password
1. either an Instagram ID (target, `-t`) or a text file with one Instagram ID per row (targetfile, `-f`)

### Examples
#### Single Root Node
If you want to start at one specific user node, provide the ID/username/handle with the `-t` option. `selenagomez` is a good place to start because it is one of the most followed accounts.
```sh
$ python __init__.py -u <username> -p <password> -t selenagomez
```

#### Multiple Root Nodes
If you want to crawl from multiple user nodes, list the IDs in a separate file and pass the filename with the `-f` option. Example:
```sh
$ python __init__.py -u <username> -p <password> -f "people I stalk.txt"
```

In `people I stalk.txt` you should list the accounts that you want to start at:
```
instagram
selenagomez
realdonaldtrump
president_vladimir_putin
```

Wait a bit and a folder will be created with the crawled profiles saved as JSON files.
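Each crawled profile is saved as `<user_id>.json` under the configured `profile_path` (default `./profiles`). As a minimal sketch, assuming the default output path, you could load the results back into Python like this:
```python
import glob
import json

# Load every crawled profile from the default output folder (./profiles)
profiles = []
for path in glob.glob('./profiles/*.json'):
    with open(path, 'r') as fp:
        profiles.append(json.load(fp))

# Each profile holds basic account info plus a list of processed posts
for profile in profiles:
    print(profile['username'], profile['follower_count'], 'followers')
    for post in profile['posts']:
        print(' ', post['date'], post['like_count'], 'likes', post['tags'])
```
The post fields (`date`, `pic_url`, `like_count`, `comment_count`, `caption`, `tags`) match what `beautify_post` in `crawler.py` writes out.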
## Config
Inside `__init__.py` there is a config dictionary. Each config option is explained in the comments.
Note that `min_collect_media` and `max_collect_media` are ignored if `min_timestamp` is provided as a number.
```
config = {
    'search_algorithm' : 'BFS', # Possible values: BFS, DFS
    'profile_path' : './profiles', # Path where output data gets saved
    'max_followers' : 10, # How many followers to collect per user
    'max_following' : 15, # How many followings to collect per user
    'min_collect_media' : 10, # Minimum number of media items to collect per person. If min_timestamp is specified, this is ignored
    'max_collect_media' : 10, # Maximum number of media items to collect per person. If min_timestamp is specified, this is ignored
    'max_collect_users' : 1000, # How many users to collect in total
    # 'min_timestamp' : int(time() - 60*60*24*30*2), # Only collect posts newer than this Unix timestamp (in seconds). Use None to disable
    'min_timestamp' : None
}
```
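For example, to collect only posts from roughly the last two months (the same cutoff as the commented-out line above), you could set `min_timestamp` before starting the crawl. This is a sketch; pick whatever cutoff you need:
```python
from time import time

# Unix timestamp (in seconds) for roughly two months ago;
# posts older than this cutoff will not be collected,
# and min/max_collect_media will be ignored
config['min_timestamp'] = int(time() - 60*60*24*30*2)
```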
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
import json
import os.path
import argparse
import multiprocessing as mp
import csv
from time import time
from collections import deque
from util import file_to_list
from crawler import crawl
try:
    from instagram_private_api import (
        Client, __version__ as client_version)
except ImportError:
    import sys
    sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
    from instagram_private_api import (
        Client, __version__ as client_version)


if __name__ == '__main__':
    # Example command:
    # python __init__.py -u "yyy" -p "zzz" --targetfile "names.txt"
    parser = argparse.ArgumentParser(description='Crawling')
    parser.add_argument('-u', '--username', dest='username', type=str, required=True)
    parser.add_argument('-p', '--password', dest='password', type=str, required=True)
    parser.add_argument('-f', '--targetfile', dest='targetfile', type=str, required=False)
    parser.add_argument('-t', '--target', dest='target', type=str, required=False)
    parser.add_argument('--hashtag', dest='use_hashtag', action='store_true')
    parser.set_defaults(use_hashtag=False)

    args = parser.parse_args()
    config = {
        'search_algorithm' : 'BFS', # Possible values: BFS, DFS
        'profile_path' : './profiles', # Path where output data gets saved
        'max_followers' : 10, # How many followers to collect per user
        'max_following' : 15, # How many followings to collect per user
        'min_collect_media' : 10, # Minimum number of media items to collect per person/hashtag. If min_timestamp is specified, this is ignored
        'max_collect_media' : 10, # Maximum number of media items to collect per person/hashtag. If min_timestamp is specified, this is ignored
        'max_collect_users' : 1000, # How many users to collect in total
        # 'min_timestamp' : int(time() - 60*60*24*30*2), # Only collect posts newer than this Unix timestamp (in seconds). Use None to disable
        'min_timestamp' : None
    }

    try:
        if args.target:
            origin_names = [args.target]
        elif args.targetfile:
            origin_names = file_to_list(args.targetfile)
        else:
            raise Exception('No crawl target given. Provide a username with the -t option or a file of usernames with -f')
        print('Client version: %s' % client_version)
        print(origin_names)
        api = Client(args.username, args.password)
    except Exception as e:
        raise Exception("Unable to initiate API:", e)
    else:
        print("Initiated API")

    # Open (or create) the files that persist crawl state between runs
    if not os.path.exists('./userdata/'):
        os.makedirs('userdata')
    v = "./userdata/visited.csv"
    open(v, 'a').close()
    visited_nodes = mp.Manager().list([int(node) for node in file_to_list(v)])
    s = "./userdata/skipped.csv"
    open(s, 'a').close()
    skipped_nodes = mp.Manager().list([int(node) for node in file_to_list(s)])

    try:
        jobs = []
        # Start one crawler process per root node
        for origin in (api.username_info(username) for username in origin_names):
            # crawl(api, origin, config, visited_nodes, skipped_nodes)
            p = mp.Process(target=crawl, args=(api, origin, config, visited_nodes, skipped_nodes))
            jobs.append(p)
            p.start()
    except KeyboardInterrupt:
        print('Jobs terminated')
    except Exception as e:
        print(e)
    for p in jobs:
        p.join()

    # Persist visited/skipped user IDs for the next run
    with open(v, "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        writer.writerow(visited_nodes)
    with open(s, "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        writer.writerow(skipped_nodes)
--------------------------------------------------------------------------------
/crawler.py:
--------------------------------------------------------------------------------
import json
import os
from collections import deque
from re import findall
from time import time
from util import randselect, byteify, file_to_list
import csv

def crawl(api, origin, config, visited_nodes, skipped_nodes):
    print('Crawling started at origin user', origin['user']['username'], 'with ID', origin['user']['pk'])
    que = deque([])
    que.append(origin['user']['pk'])

    while len(visited_nodes) < config['max_collect_users']:
        if len(que) <= 0:
            raise Exception('Queue empty!')
        user_id = que.popleft()
        if user_id in visited_nodes or user_id in skipped_nodes:
            print('Already visited or skipped')
            continue
        if visit_profile(api, user_id, config):
            visited_nodes.append(user_id)
        else:
            skipped_nodes.append(user_id)
        following, followers = get_community(api, user_id, config)
        extend_que(following, followers, que, config)
def visit_profile(api, user_id, config):
    while True:
        try:
            profile = api.user_info(user_id)
            print('Visiting:', profile['user']['username'])
            processed_profile = {
                'user_id' : user_id,
                'username' : profile['user']['username'],
                'full_name' : profile['user']['full_name'],
                'profile_pic_url' : profile['user']['profile_pic_url'],
                'media_count' : profile['user']['media_count'],
                'follower_count' : profile['user']['follower_count'],
                'following_count' : profile['user']['following_count'],
                'posts' : []
            }
            feed = get_posts(api, user_id, config)
            posts = [beautify_post(api, post) for post in feed]
            posts = list(filter(lambda x: x is not None, posts))
            if len(posts) < config['min_collect_media']:
                return False
            else:
                processed_profile['posts'] = posts[:config['max_collect_media']]

            if not os.path.exists(config['profile_path'] + os.sep):
                os.makedirs(config['profile_path'])
            with open(config['profile_path'] + os.sep + str(user_id) + '.json', 'w') as file:
                json.dump(processed_profile, file, indent=2)
        except Exception as e:
            print('exception while visiting profile', e)
            # Skip private accounts instead of retrying them forever
            if api.friendships_show(user_id)['is_private']:
                return False
            if str(e) == '-':
                return False
        else:
            return True

def beautify_post(api, post):
    if post['media_type'] != 1:  # If post is not a single-image media item
        return None
    keys = post.keys()
    processed_media = {
        'date' : post['taken_at'],
        'pic_url' : post['image_versions2']['candidates'][0]['url'],
        'like_count' : post['like_count'] if 'like_count' in keys else 0,
        'comment_count' : post['comment_count'] if 'comment_count' in keys else 0,
        'caption' : post['caption']['text'] if 'caption' in keys and post['caption'] is not None else ''
    }
    # Extract hashtags from the caption (handles tags written without spaces, e.g. #a#b)
    processed_media['tags'] = findall(r'#[^#\s]*', processed_media['caption'])
    # print(processed_media['tags'])
    return processed_media

def get_posts(api, user_id, config):
    feed = []
    results = api.user_feed(user_id, min_timestamp=config['min_timestamp'])
    feed.extend(results.get('items', []))

    # When a time window is given, one page of results is enough
    if config['min_timestamp'] is not None:
        return feed

    next_max_id = results.get('next_max_id')
    while next_max_id and len(feed) < config['max_collect_media']:
        print("next_max_id", next_max_id, "len(feed) < max_collect_media", len(feed) < config['max_collect_media'])
        results = api.user_feed(user_id, max_id=next_max_id)
        feed.extend(results.get('items', []))
        next_max_id = results.get('next_max_id')

    return feed

def get_community(api, user_id, config):
    # Retry on transient API errors until both lists are fetched
    while True:
        try:
            following = []
            uuid = api.generate_uuid(return_hex=False, seed='0')
            results = api.user_following(user_id, rank_token=uuid)
            following.extend(results.get('users', []))
            next_max_id = results.get('next_max_id')
            while next_max_id and len(following) < config['max_following']:
                results = api.user_following(user_id, rank_token=uuid, max_id=next_max_id)
                following.extend(results.get('users', []))
                next_max_id = results.get('next_max_id')

            followers = []
            results = api.user_followers(user_id, rank_token=uuid)
            followers.extend(results.get('users', []))
            next_max_id = results.get('next_max_id')
            while next_max_id and len(followers) < config['max_followers']:
                results = api.user_followers(user_id, rank_token=uuid, max_id=next_max_id)
                followers.extend(results.get('users', []))
                next_max_id = results.get('next_max_id')

            print(user_id, 'has', len(following), 'following and', len(followers), 'followers')
            return following, followers
        except Exception as e:
            print('exception while getting community', e)

def extend_que(following, followers, que, config):
    try:
        if config['search_algorithm'] == 'BFS':
            # BFS: append new user IDs to the back of the queue
            que.extend(randselect([u['pk'] for u in following], config['max_following']))
            que.extend(randselect([u['pk'] for u in followers], config['max_followers']))
        elif config['search_algorithm'] == 'DFS':
            # DFS: push new user IDs to the front of the queue
            que.extendleft(randselect([u['pk'] for u in following], config['max_following']))
            que.extendleft(randselect([u['pk'] for u in followers], config['max_followers']))
        else:
            raise Exception('Please provide a proper search algorithm in config (BFS or DFS)')
    except Exception as e:
        print('exception while extending que', e)
    # Sanity check: every queued value should be a numeric user ID
    for user_id in que:
        try:
            float(user_id)
        except ValueError:
            raise Exception('wrong value put into que')
--------------------------------------------------------------------------------
/filter_profile.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
# -*- coding: utf-8 -*-
# @File Name: filter_profile.py
# @Created: 2017-05-19 20:10:56 Simon Myunggun Seo (simon.seo@nyu.edu)
# @Updated: 2017-05-20 08:41:30 Simon Seo (simon.seo@nyu.edu)

'''This script was created to deal with profiles that have fewer than 10 posts.
It removes any crawled profile that does not have exactly 10 posts.'''

import os
import json

rootdir = './profiles_3'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # print(os.path.join(subdir, file))
        filepath = subdir + os.sep + file
        # filepath = "profiles/1057969838.json"
        if filepath.endswith(".json"):
            with open(filepath, 'r') as fp:
                print(filepath)
                profile = json.load(fp)
            l = len(profile['posts'])
            if l != 10:
                print('removing {}'.format(filepath))
                os.remove(filepath)
--------------------------------------------------------------------------------
/pagination.py:
--------------------------------------------------------------------------------
import json
import os.path
import logging
import argparse
try:
    from instagram_private_api import (
        Client, __version__ as client_version)
except ImportError:
    import sys
    sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
    from instagram_private_api import (
        Client, __version__ as client_version)


if __name__ == '__main__':

    logging.basicConfig()
    logger = logging.getLogger('instagram_private_api')
    logger.setLevel(logging.WARNING)

    # Example command:
    # python pagination.py -u "yyy" -p "zzz"
    parser = argparse.ArgumentParser(description='Pagination demo')
    parser.add_argument('-u', '--username', dest='username', type=str, required=True)
    parser.add_argument('-p', '--password', dest='password', type=str, required=True)
    parser.add_argument('-debug', '--debug', action='store_true')

    args = parser.parse_args()
    if args.debug:
        logger.setLevel(logging.DEBUG)

    print('Client version: %s' % client_version)
    api = Client(args.username, args.password)

    # user_id = '2958144170'
    origin = api.username_info('simon_oncepiglet')
    user_id = origin['user']['pk']
    followers = []
    print(api.user_info(user_id))
    uuid = api.generate_uuid(return_hex=False, seed='0')
    results = api.user_followers(user_id, rank_token=uuid)

    followers.extend(results.get('users', []))

    next_max_id = results.get('next_max_id')
    while next_max_id:
        try:
            print(api.user_info(next_max_id))
        except Exception as e:
            print(e)
        results = api.user_followers(user_id, rank_token=uuid, max_id=next_max_id)
        followers.extend(results.get('users', []))
        if len(followers) >= 500:  # get only the first 500 or so
            break
        next_max_id = results.get('next_max_id')

    followers.sort(key=lambda x: x['pk'])
    # print list of usernames
    print(json.dumps([u['username'] for u in followers], indent=2))
--------------------------------------------------------------------------------
/profile_tag_edit.py:
--------------------------------------------------------------------------------
'''This script was created to deal with posts that had tags in the form of #smth#smth, without a space.
The current crawler should handle such cases properly.'''

import os
import json

rootdir = './profiles'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # print(os.path.join(subdir, file))
        filepath = subdir + os.sep + file
        # filepath = "profiles/1057969838.json"
        if filepath.endswith(".json"):
            with open(filepath, 'r') as fp:
                print(filepath)
                profile = json.load(fp)
            # Split concatenated tags like '#a#b' into ['#a', '#b']
            for post in profile['posts']:
                temp = []
                d = '#'
                for tag in post['tags']:
                    temp.extend([d + e for e in tag.split(d) if e])
                post['tags'] = temp
            with open(filepath, 'w') as fp:
                json.dump(profile, fp, indent=2)
--------------------------------------------------------------------------------
/targets.txt:
--------------------------------------------------------------------------------
selenagomez
chelstoth9
instagram
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
import json
import os.path
import argparse
try:
    from instagram_private_api import (
        Client, __version__ as client_version)
except ImportError:
    import sys
    sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
    from instagram_private_api import (
        Client, __version__ as client_version)


if __name__ == '__main__':

    # Example command:
    # python test.py -u "yyy" -p "zzz"
    parser = argparse.ArgumentParser(description='Pagination demo')
    parser.add_argument('-u', '--username', dest='username', type=str, required=True)
    parser.add_argument('-p', '--password', dest='password', type=str, required=True)

    args = parser.parse_args()

    print('Client version: %s' % client_version)
    api = Client(args.username, args.password)

    # user_id = '2958144170'
    origin = api.username_info('simon_oncepiglet')
    user_id = origin['user']['pk']
    followers = []
    print(api.user_info(user_id))
    uuid = api.generate_uuid(return_hex=False, seed='0')
    results = api.user_followers(user_id, rank_token=uuid)

    followers.extend(results.get('users', []))

    next_max_id = results.get('next_max_id')
    while next_max_id:
        try:
            print(api.user_info(next_max_id))
        except Exception as e:
            print(e)
        results = api.user_followers(user_id, rank_token=uuid, max_id=next_max_id)
        followers.extend(results.get('users', []))
        if len(followers) >= 500:  # get only the first 500 or so
            break
        next_max_id = results.get('next_max_id')

    followers.sort(key=lambda x: x['pk'])
    # print list of usernames
    print(json.dumps([u['username'] for u in followers], indent=2))
--------------------------------------------------------------------------------
/util.py:
--------------------------------------------------------------------------------
from random import sample
import csv

def randselect(lst, num):
    '''Randomly pick up to num items from lst, sampling from at most the first 5*num items.'''
    l = len(lst)
    if l <= num:
        # Fewer items than requested: return all of them in random order
        return sample(lst, l)
    if l > 5*num:
        # Long list: restrict sampling to the first 5*num items
        return sample(lst[:5*num], num)
    return sample(lst, num)

def byteify(input):
    '''Recursively encode string values of a decoded JSON object as UTF-8 bytes.
    Mainly useful on Python 2, where json.load returns unicode strings.'''
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.items()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, str):
        return input.encode('utf-8')
    else:
        return input

def file_to_list(filename):
    '''Read a CSV file into a 1-dimensional list.
    A file with one value per row returns the first column;
    a file with a single row returns that row as a list.'''
    data = []
    with open(filename, 'r') as f:
        contents = csv.reader(f.read().splitlines())
        count = 0
        try:
            for c in contents:
                count += 1
                data.append(c)
        except Exception as e:
            print("count", count)
            raise e

    if len(data) >= 2:
        return [d[0] for d in data]
    elif len(data) == 1:
        return data[0]
    else:
        return data
--------------------------------------------------------------------------------