├── images └── .gitkeep ├── pages └── .gitkeep ├── .gitignore ├── README.md ├── filter.txt ├── cat-to-text.py └── marktplaats-example.html /images/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pages/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | pages/* 2 | index.html 3 | images/* 4 | ads.json 5 | test-*.html 6 | 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Marktplaats parser with filters 2 | DEMO: [http://ponify.nl/mp/test-1.html](http://ponify.nl/mp/test-1.html) 3 | 4 | Marktplaats has a limited category overview which displays 25 ad's max and does not allow filtering. This simple python crawler attempts to fix that by parsing and filtering the marktplaats category pages. It also creates a, IMHO more usable overview without all the nonsense. Furthermore it caches advertisements offline, so this can be run locally and is still usable when an internet connection is not available. 5 | -------------------------------------------------------------------------------- /filter.txt: -------------------------------------------------------------------------------- 1 | newray 2 | ringband 3 | Gelezen 4 | DVD 5 | oorkonde 6 | Bundesbahn 7 | Reichsbahn 8 | nieuwenhuis 9 | herdenkingsbord 10 | wandbord 11 | kop en schotel 12 | affiche 13 | sneeuwbal 14 | schilderij 15 | stars der schiene 16 | Video express 17 | Van den burg beeldproducties 18 | Van den burg 19 | hema 20 | koffiebeker 21 | kunst 22 | lucky life 23 | Hornby 24 | vintage 25 | lehman 26 | blikken 27 | Enter 28 | ho spoor 29 | memory 30 | antiek 31 | sinterklaas 32 | van reeuwijk 33 | kluwer 34 | antieke 35 | Van den burg beeldproducties 36 | Van den burg 37 | DVD stars 38 | dvdstars 39 | aille standaard 40 | houten 41 | speelgoed 42 | decoartie 43 | postzegels 44 | aandelen 45 | lima 46 | deze vaak als exacte kopie van het origineel 47 | Emaille standaard trein en tram serie 48 | Emaille standaard trein en tram serie.deze 49 | minitrains 50 | Wilhelmusjes 51 | video-kurier 52 | speelduur 53 | opwindbare 54 | stockton 55 | darlington 56 | geillustreerde 57 | treingids 58 | Trambladen 59 | schwebebahn 60 | h0 61 | modelbouw 62 | boek 63 | tijdschrift 64 | magazin 65 | magazine 66 | De trein hoort erbij 67 | Sporen over zee 68 | gebonden 69 | folder 70 | ansichtkaarten 71 | ansichtkaart 72 | foto 73 | Brochure 74 | jaargang 75 | nachschlagewerk 76 | treinkaartjes 77 | 150 jaar spoorwegen 78 | Jubileumuitgave 79 | railrunner 80 | reisgidsen 81 | prent 82 | Matchbox 83 | hardcover 84 | lepeltjes 85 | ansichten 86 | Elsevier 87 | Uitgegeven 88 | kaart 89 | enveloppe 90 | envelop 91 | Kaartspel 92 | kippenlijn 93 | waterlandse 94 | bahnpost 95 | Spoorwegen in Nederland 96 | Wereld Spoorwegatlas 97 | dolby digital 98 | roco 99 | knipsel 100 | Gestempeld 101 | Spoorwegmaterieel in Nederland 102 | Dieseltreinen in Nederland 103 | fleischman 104 | fabdor 105 | Treinen van A tot Z 106 | marklin 107 | kalender 108 | krant 109 | post it 110 | Overstappen aan de blaakschedijk 111 | Blikken trein 112 | blikken pioneer 113 | schuifspel 114 | Miniatuur 115 | redacteuren 116 | uitgeverij 117 | Theelepel 118 | opwindlocomotieven 119 | Kunststof treinstation 120 | miniatuur 121 | lepeltje 122 | kolen wagon 123 | fietslabels 124 | grote alken 125 | encyclopedia 126 | postzegel 127 | jigsaw 128 | uitgever 129 | uitgave 130 | dienstregeling 131 | dienstregelingen 132 | aandeel 133 | videobanden 134 | schuifpuzzel 135 | trsfo 136 | trafo 137 | den oudsten 138 | this, that & the other 139 | Op de rails 140 | rail hobby 141 | trix 142 | bahnland ddr 143 | brief 144 | De blauwe bus 145 | The Stock Book of the Nene Valley Railway 146 | Bahnnews 147 | kleinbeelddia 148 | gravure 149 | paul henken 150 | Bierpul 151 | Daar komt de trein 152 | time tables 153 | maandblad 154 | De spoorwegen van Afrika 155 | puntfriteszakjes 156 | Van dishoeck 157 | Db blick-punkt 158 | dia's 159 | onderzetters 160 | lage bekers 161 | hoge bekers 162 | christkindelsmarkt 163 | videoband 164 | de haagse paardetrams 165 | reeskamp 166 | De amsterdamse tram in het verleden 167 | legpuzzel 168 | vhs banden 169 | Leideritz 170 | spoorweg journaal 171 | Hema treinenset 172 | blikken rails 173 | Die stampende stomende locomotieven 174 | Samen moet het lukken 175 | kees volkers 176 | Treinenloop en vogelvlucht 177 | de spoorwegen van de united states en canada 178 | t. L. Hameeteman 179 | harry's kringloop 180 | ralf roman rossberg 181 | Treinen rixon bucknall helmond 182 | locomotievenencyclopedie 183 | alan kent 184 | 50 buren van ns 185 | Bestijg de trein nooit zonder uw valies met dromen 186 | isbn 187 | de eerste moeilijke jaren 188 | de doofpot in alphen aan den rijn 189 | die entwicklung der lokomotive band 190 | Lokaalspoorwegen in Twente en de Achterhoek 191 | NOCH Katalog 192 | vulcan ankerbahn 193 | Een Stukje Werkplaatsgeschiedenis 194 | A.D. Hildebrand 195 | pola licht masten 196 | railhobby 197 | Van Stoomtram-Locomotieven en Dieseltreinen 198 | De trein 1839-1939 199 | Aus den annalen der uster-oetwil-bahn 200 | reisgids 201 | elmec 202 | tourist map 203 | Van wijck jurriaanse 204 | Trams en tramlijnen 30 205 | asbak 206 | kwartet 207 | delfs blauw 208 | Gilson Riecke 209 | mecano 210 | kibri 211 | faller 212 | 40 jahre rekoloks der deutschen reichsbahn 213 | minuten durende video 214 | lgb 215 | trams en tramlijnen 216 | de blauwe tram 217 | Alkenreeks 218 | tegel 219 | tegels 220 | Lokaalspoorwegen in TWENTE 221 | anwb 222 | tegeltje 223 | caxton 224 | Märkli 225 | rokal 226 | bekers 227 | mokken 228 | Uitgegeven 229 | video casssette 230 | Tempo Doeloe 231 | hildebrand 232 | catalogus 233 | treinbaan 234 | Spoor en Trein 235 | Tramweggids 236 | grote alk 237 | vhs band 238 | serviesgoed 239 | bierglas 240 | future express 241 | Ongelopen 242 | Thomas the tank engine 243 | Dvd e.r. Video express 244 | Rio grande taigatrommel 245 | dordrecht van brug tot brug 246 | serieusbiedingen 247 | wekker 248 | treinservies 249 | Franstalig 250 | uitgegeven 251 | asbakje 252 | Marius broos 253 | NS-locomotievendepot Heerlen 254 | De mooiste treinreizen vd wereld pocket 255 | balpen 256 | De Amsterdamse Blauwen 257 | Stoppen op Doorreis 258 | tram amsterdam 259 | ge de bot 260 | atlas collections 261 | verzonken spoor 262 | der deutschen bundesbah 263 | schuyt & co 264 | puzzel 265 | atlas treinen 266 | knvto 267 | jan de bruin 268 | guus ferree 269 | Carel van Gestel 270 | Henk Bouman 271 | jouef 272 | Van den burg beeldproducties 273 | autobusgids 274 | Meccano 275 | blikkentreinen 276 | calssic 277 | tramconductrice 278 | Hesselink 279 | Drinkglas 280 | vhs banden 281 | legpuzzel 282 | de raaf 283 | stibans 284 | onderzetters 285 | onderzetter 286 | dishoeck 287 | atlas collectie 288 | grachten panden 289 | rekenmachine 290 | hollands Spoor door het verleden 291 | spiegeltje 292 | spiegel 293 | janetzky 294 | atlas treintje 295 | atlas trein 296 | Vrachtauto 297 | houten locomotief 298 | buttons 299 | frans 2de hands 300 | Kristallen 301 | Kristal 302 | timetable 303 | opname 304 | winterscheopname -------------------------------------------------------------------------------- /cat-to-text.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | # Copyright (C) 2014 - Remy van Elst 4 | 5 | # This program is free software: you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation, either version 3 of the License, or 8 | # (at your option) any later version. 9 | 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | from bs4 import BeautifulSoup 19 | import os, json, sys, cgi, csv, urllib2, datetime, base64, re, urlparse, hashlib, urllib, HTMLParser 20 | from multiprocessing import Pool 21 | from multiprocessing.dummy import Pool as ThreadPool 22 | 23 | # Max pages from marktplaats to parse 24 | max_pages = 102 25 | # Max ads on new overview page 26 | max_page_items = 100 27 | 28 | pool = ThreadPool() 29 | 30 | base_url = "http://www.marktplaats.nl/z/verzamelen/spoorwegen-en-tramwegen.html?categoryId=944&sortBy=SortIndex&sortOrder=decreasing¤tPage=" 31 | title = "Spoorwegen" 32 | 33 | filter = [] 34 | 35 | with open("./filter.txt") as filterfile: 36 | for line in filterfile: 37 | filter.append(line.rstrip('\n')) 38 | num = 0 39 | 40 | def create_folder(directory): 41 | """Creates directory if not exists yet""" 42 | if not os.path.exists(directory): 43 | os.makedirs(directory) 44 | 45 | def uniqify(seq): 46 | """Removes duplicates from a list and keeps order""" 47 | seen = set() 48 | seen_add = seen.add 49 | return [ x for x in seq if x not in seen and not seen_add(x)] 50 | 51 | def page_to_soup(page, number=0): 52 | """Uses urllib2 to get a webpage and returns a beautifulsoup object of the html. If page number is given this is appended at end of url.""" 53 | if number >= 1: 54 | page_url = page + str(number) 55 | else: 56 | page_url = page 57 | req = urllib2.Request(page_url, headers={'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/4.0; InfoPath.2; SV1; .NET CLR 2.0.50727; WOW64)'}) 58 | try: 59 | page = urllib2.urlopen(req).read() 60 | except urllib2.HTTPError as e: 61 | print("Could not retreive HTTP page %s" % page_url) 62 | sys.exit(1) 63 | page_soup = BeautifulSoup(page) 64 | return page_soup 65 | 66 | def remove_double_whitespace(string): 67 | """Replaces double whitespaces with a space and newlines with a double space""" 68 | return re.sub(' +', ' ', re.sub('\n', ' ', str(string))) 69 | 70 | def save_image(url, filename): 71 | """Saves image from url to filename and returns filename""" 72 | urllib.urlretrieve(url, filename) 73 | return filename 74 | 75 | def parse_overview_page(page_soup): 76 | """Parses a Marktplaats.nl advertisement category overview page and returns an dict with data.""" 77 | ads = [] 78 | for item in page_soup.find_all(attrs={'class': "defaultSnippet"}): 79 | stop_loop = False 80 | item_soup = BeautifulSoup(item.encode('utf-8')) 81 | item_title = item_soup.find(attrs={'class':'mp-listing-title'}).get_text().encode('utf-8') 82 | item_descr = item_soup.find(attrs={'class':'mp-listing-description'}) 83 | if item_descr: 84 | item_descr = item_soup.find(attrs={'class':'mp-listing-description'}).get_text().encode('utf-8') 85 | item_descr_ext = item_soup.find(attrs={'class':'mp-listing-description-extended'}) 86 | if item_descr_ext: 87 | item_descr_ext = item_descr_ext.get_text().encode('utf-8') 88 | item_price = item_soup.find(attrs={'class':'column-price'}).get_text().encode('utf-8') 89 | item_location = item_soup.find(attrs={'class':'location-name'}).get_text().encode('utf-8') 90 | item_seller = item_soup.find(attrs={'class':'seller-name'}).get_text().encode('utf-8') 91 | item_attrs = item_soup.find(attrs={'class':'mp-listing-title'}).get_text().encode('utf-8') 92 | try: 93 | item_date = item_soup.find(attrs={'class':'column-date'}).get_text().encode('utf-8') 94 | except AttributeError as e: 95 | item_date = e 96 | item_url = "" 97 | seller_url = "" 98 | for link in item_soup.find_all('a'): 99 | parse = urlparse.urlparse(link.get('href')) 100 | if parse.netloc == "www.marktplaats.nl" and str(parse.path).startswith("/a/"): 101 | item_url = link.get('href') 102 | if parse.netloc == "www.marktplaats.nl" and str(parse.path).startswith("/verkopers/"): 103 | seller_url = link.get('href') 104 | item_prio = item_soup.find(attrs={'class':'mp-listing-priority-product'}) 105 | if item_prio: 106 | item_prio = item_prio.get_text().encode('utf-8') 107 | item_img_src = item_soup.img['src'] 108 | if remove_double_whitespace(item_prio) == "Topadvertentie": 109 | stop_loop = True 110 | print("Filtering out sponsored ad %s" % item_url) 111 | for filteritem in filter: 112 | if filteritem.lower() in str(remove_double_whitespace(item_title)).lower() or filteritem in str(remove_double_whitespace(item_descr_ext)).lower() or filteritem in str(remove_double_whitespace(item_descr)).lower(): 113 | stop_loop = True 114 | print(("Filtering out ad %s with word trigger %s.") % ( item_url, filteritem.lower()) ) 115 | if not stop_loop: 116 | ad_data = {} 117 | ad_data["title"] = remove_double_whitespace(item_title) 118 | ad_data["descr"] = remove_double_whitespace(item_descr) 119 | ad_data["descr_extra"] = remove_double_whitespace(item_descr_ext) 120 | ad_data["price"] = remove_double_whitespace(item_price) 121 | ad_data["location"] = remove_double_whitespace(item_location) 122 | ad_data["seller"] = remove_double_whitespace(item_seller) 123 | ad_data["attrs"] = remove_double_whitespace(item_attrs) 124 | ad_data["date"] = remove_double_whitespace(item_date) 125 | ad_data["item_url"] = item_url 126 | ad_data["seller_url"] = seller_url 127 | ad_data["prio"] = remove_double_whitespace(item_prio) 128 | ad_data["img_url"] = "http:" + item_img_src 129 | stuff_to_hash = str(ad_data["title"] + ad_data["seller"] + ad_data["location"] + ad_data["price"]) 130 | hash_object = hashlib.sha1(stuff_to_hash) 131 | hex_dig = hash_object.hexdigest() 132 | ad_data["uid"] = str(hex_dig) 133 | print("Parsing ad %s" % str(hex_dig)) 134 | ads.append(ad_data) 135 | return ads 136 | 137 | def create_ad_overview_json_file(ads_list): 138 | """Creates an json file for every ad in the ads list in its uid folder.""" 139 | for ads in ads_list: 140 | for ad in ads: 141 | json_ad = json.dumps(ad) 142 | create_folder("pages/" + ad["uid"]) 143 | with open("pages/" + ad["uid"] + "/overview_page.json", "wb") as file: 144 | file.write(json_ad) 145 | 146 | def create_overview_page(ads_list, page_number, max_pages, filename): 147 | """Creates an overview page from a list with ad json data.""" 148 | prev_pagination = [] 149 | if page_number > 1: 150 | for prev_page_num in range(1, page_number): 151 | prev_pagination.append(("
  • %s
  • " ) % (filename + "-" + str(prev_page_num) + ".html", str(prev_page_num))) 152 | next_pagination = [] 153 | if page_number != max_pages: 154 | for next_page_num in range(page_number + 1, max_pages + 1): 155 | next_pagination.append(("
  • %s
  • " ) % (filename + "-" + str(next_page_num) + ".html", str(next_page_num))) 156 | current_pagination = [] 157 | current_pagination.append("
  • %s(current)
  • " % str(page_number)) 158 | 159 | if page_number > 1: 160 | prev_filename = filename + "-" + str(page_number - 1) + ".html" 161 | prev_html = ("Previous Page" % prev_filename) 162 | else: 163 | prev_filename = "#" 164 | prev_html = ("") 165 | if page_number == max_pages: 166 | next_filename = "#" 167 | next_html = ("") 168 | else: 169 | next_filename = filename + "-" + str(page_number + 1) + ".html" 170 | next_html = ("Next Page" % next_filename) 171 | filename = filename + "-" + str(page_number) + ".html" 172 | 173 | with open(filename, "w") as file: 174 | file.write("Overview") 175 | file.write('') 176 | file.write('
    ') 177 | file.write('
    ') 178 | file.write(("

    %s

    ") % ( title )) 179 | file.write('
    Overview page #%s
    ' % str(page_number)) 180 | file.write('

    Last updated: %s.

    ' % str(datetime.datetime.now())) 181 | file.write("
    ") 182 | file.write("
    ") 183 | file.write(prev_html) 184 | file.write("
      ") 185 | for line in prev_pagination: 186 | file.write(line) 187 | for line in current_pagination: 188 | file.write(line) 189 | for line in next_pagination: 190 | file.write(line) 191 | file.write("
    ") 192 | file.write(next_html) 193 | file.write("
    ") 194 | file.write("
    ") 195 | file.write("") 196 | file.write("") 197 | file.write("") 198 | file.write("") 199 | file.write("") 200 | file.write("") 201 | file.write("\n") 202 | for ad in ads_list: 203 | file.write("") 204 | file.write("") 215 | file.write("") 222 | file.write("") 229 | file.write("") 233 | file.write("\n") 234 | file.write("
    FotoInfoVerkoperPrijs
    image") 218 | file.write(ad["title"].encode('ascii', 'xmlcharrefreplace')) 219 | file.write("
    ") 220 | file.write(ad["descr"].encode('ascii', 'xmlcharrefreplace')) 221 | file.write("
    ") 225 | file.write(ad["seller"].encode('ascii', 'xmlcharrefreplace')) 226 | file.write(" | ") 227 | file.write(ad["location"].encode('ascii', 'xmlcharrefreplace')) 228 | file.write("") 230 | file.write("") 231 | file.write(ad["price"].encode('ascii', 'xmlcharrefreplace')) 232 | file.write("
    ") 235 | file.write("
    ") 236 | file.write("
    ") 237 | file.write(prev_html) 238 | file.write("
      ") 239 | for line in prev_pagination: 240 | file.write(line) 241 | for line in current_pagination: 242 | file.write(line) 243 | for line in next_pagination: 244 | file.write(line) 245 | file.write("
    ") 246 | file.write(next_html) 247 | file.write("
    ") 248 | file.write("
    ") 251 | print("Written overview page to %s" % filename) 252 | 253 | def parse_ad_page(page_soup, uid, url): 254 | """Parses a Marktplaats.nl advertisement page and returns a dict with ad data""" 255 | content = {} 256 | content["images"] = [] 257 | for item in page_soup.find_all(attrs={'class': "listing"}): 258 | item_soup = BeautifulSoup(item.encode('utf-8')) 259 | content["uid"] = uid 260 | content["url"] = url 261 | try: 262 | content["title"] = item_soup.find(attrs={'id':'title'}).get_text().encode('utf-8') 263 | except AttributeError as e: 264 | print(('\033[91mError processing UID %s: %s\033[0m') % (str(uid), str(e))) 265 | return content 266 | content["title"] = cgi.escape(content["title"]) 267 | content["descr"] = item_soup.find(attrs={'id':'vip-ad-description'}).get_text().encode('utf-8') 268 | content["views"] = item_soup.find(attrs={'id':'view-count'}).get_text().encode('utf-8') 269 | content["price"] = item_soup.find(attrs={'id':'vip-ad-price-container'}).get_text().encode('utf-8') 270 | content["shipping"] = item_soup.find(attrs={'id':'vip-ad-shipping-cost'}).get_text().encode('utf-8') 271 | item_photo_carousel = item_soup.find(attrs={'id':'vip-carousel'}) 272 | item_images = item_photo_carousel.attrs['data-images-xl'] 273 | item_image_urls = item_images.split("&//") 274 | for image in item_image_urls: 275 | if image: 276 | parse = urlparse.urlparse(image) 277 | content["images"].append("http://" + parse.netloc + parse.path) 278 | return content 279 | 280 | def create_item_page(content, uid): 281 | """Creates an item advertisement page based on content json and uid. Also downloads and saves all ad images if not exist""" 282 | create_folder("pages/" + uid) 283 | if os.path.exists("pages/" + uid + "/index.html") or os.path.exists("pages/" + uid + "/content.json"): 284 | return True 285 | print("Creating page for ad %s" % uid) 286 | with open("pages/" + uid + "/content.json", "w") as file: 287 | file.write(json.dumps(content)) 288 | file.close() 289 | with open("pages/" + uid + "/index.html", "wb") as file: 290 | if 'descr' in content: 291 | file.write("%s" % content["descr"].decode('utf-8').encode('ascii', 'xmlcharrefreplace')) 292 | file.write('
    ') 293 | file.write("

    %s

    " % (content["url"], content["title"].decode('utf-8').encode('ascii', 'xmlcharrefreplace'))) 294 | file.write("") 295 | file.write("" % (content["descr"].decode('utf-8').encode('ascii', 'xmlcharrefreplace'))) 296 | file.write("" % (content["price"].decode('utf-8').encode('ascii', 'xmlcharrefreplace'))) 297 | file.write("" % (content["views"].decode('utf-8').encode('ascii', 'xmlcharrefreplace'))) 298 | file.write("" % (content["shipping"].decode('utf-8').encode('ascii', 'xmlcharrefreplace'))) 299 | file.write("" % content["url"]) 300 | try: 301 | for counter, img_url in enumerate(content["images"]): 302 | if not os.path.exists("pages/" + uid + "/" + str(counter) + ".jpg"): 303 | save_image(img_url, "pages/" + uid + "/" + str(counter) + ".jpg") 304 | file.write("" % (str(counter) + ".jpg")) 305 | except Exception as e: 306 | file.write("") 307 | file.write("
    Beschrijving%s
    Prijs%s
    Views%s
    Verzendmethode%s
    View on Marktplaats
    image
    image
    ") 308 | file.write("

    ") 311 | else: 312 | file.write("%s") 313 | return 314 | 315 | def get_url_from_uid_json_file(uid): 316 | """Parse the overview json file and get the item URL from it""" 317 | with open("pages/" + uid + "/overview_page.json", "r") as file: 318 | return json.loads(file.read())["item_url"] 319 | 320 | def process_ad_page_full(uid): 321 | """Does all the functions to get and parse an ad, used for the thread pool""" 322 | # uncomment below to find out failint uid 323 | print("Parsing ad page for UID: %s" % str(uid)) 324 | if not os.path.exists("pages/" + uid + "/index.html"): 325 | url = get_url_from_uid_json_file(uid) 326 | print("Original URL: %s " % str(url)) 327 | ad_page_soup = page_to_soup(url) 328 | content = parse_ad_page(ad_page_soup, uid, url) 329 | create_item_page(content, uid) 330 | 331 | def main(): 332 | global max_pages 333 | global pax_page_items 334 | global pool 335 | global num 336 | 337 | ads_list = [] 338 | #comment out below if only want to test parsing and rendering. 339 | for number in range(1,max_pages): 340 | overview_page_soup = page_to_soup(base_url, number) 341 | ads_list.append(parse_overview_page(overview_page_soup)) 342 | print("Parsed overview page %i" % number) 343 | create_ad_overview_json_file(ads_list) 344 | 345 | uids = [] 346 | for ads in ads_list: 347 | for ad in ads: 348 | uids.append(ad["uid"]) 349 | 350 | if os.path.exists("ads.json"): 351 | print("Reading ads.json") 352 | with open("ads.json", "r") as file: 353 | try: 354 | load_uids = json.load(file) 355 | except ValueError as error: 356 | print("ads.json exists but error: %s" % error) 357 | load_uids = [] 358 | else: 359 | print("Creating ads.json") 360 | load_uids = [] 361 | 362 | for item in load_uids: 363 | uids.append(item) 364 | 365 | uids = uniqify(uids) 366 | with open("ads.json", "w") as file: 367 | file.write(json.dumps(uids)) 368 | 369 | # either this one, with threads 370 | pool = ThreadPool(10) 371 | uid_pool = pool.map(process_ad_page_full, uids) 372 | # or this one, without threads 373 | #for uid in uids: 374 | # process_ad_page_full(uid) 375 | 376 | split_uid_list = [uids[x:x+max_page_items] for x in range(0, len(uids),max_page_items)] 377 | for counter, uid_list in enumerate(split_uid_list): 378 | counter = counter + 1 379 | ads_list = [] 380 | for uid in uid_list: 381 | with open("pages/" + uid + "/overview_page.json") as file: 382 | ads_list.append(json.load(file)) 383 | max_pages = len(split_uid_list) 384 | create_overview_page(ads_list, counter, max_pages, "test") 385 | 386 | 387 | if __name__ == "__main__": 388 | main() 389 | -------------------------------------------------------------------------------- /marktplaats-example.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | ≥ Marktplaats - De plek om nieuwe en tweedehands spullen te kopen en verkopen 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 188 | 189 | 190 | 2473 | 2474 | 2475 | 2476 | 2477 | 2478 | 2479 | 2480 | 2481 | 2482 | 2483 | 2484 | 2485 | 2486 | 2594 | 2595 | 2596 | 2597 | 2598 | 2599 | 2600 | 2601 | 2602 | 2603 | 2604 | 2605 | 2606 | 2607 | 2608 | 2609 | 2610 | 2611 | 2612 | 2613 | 2614 | 2615 | 2616 | 2617 | 2618 | 2619 | 2620 | 2621 | 2622 | 2623 | 2624 | 2664 | 2665 | 2666 | 2667 | 2668 | 2669 | 2670 | 2671 | 2672 | --------------------------------------------------------------------------------