├── requirements.txt
├── deploy.sh
├── LICENSE
├── README.md
└── cookidump.py


/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4
2 | selenium>=4.8.0
3 | 


--------------------------------------------------------------------------------
/deploy.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | NAME="cookidump"
 4 | 
 5 | if [ "$#" != "1" ]; then
 6 | 	echo "Usage: $0 <comment>"
 7 | 	exit 0
 8 | fi
 9 | 
10 | git add .
11 | git commit -m "$1"
12 | git push -f $NAME master
13 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2022 Enrico Cambiaso
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # cookidump
  2 | 
  3 | Easily dump cookidoo recipes from the official website
  4 | 
  5 | ### Description ###
  6 | 
  7 | This program allows you to dump all recipes on [Cookidoo](https://cookidoo.co.uk) websites (available for different countries) for offline and posticipate reading.
  8 | Those recipes are valid in particular for [Thermomix/Bimby](https://en.wikipedia.org/wiki/Thermomix) devices.
  9 | In order to dump the recipes, a valid subscription is needed.
 10 | 
 11 | The initial concept of this program was based on [jakubszalaty/cookidoo-parser](https://github.com/jakubszalaty/cookidoo-parser).
 12 | 
 13 | ### Mentioning ###
 14 | 
 15 | If you intend to scientifically investigate or extend cookidump, please consider citing the following paper.
 16 | 
 17 | ```
 18 | @article{cambiaso2022cookidump,
 19 | title = {Web security and data dumping: The Cookidump case},
 20 | journal = {Software Impacts},
 21 | volume = {14},
 22 | pages = {100426},
 23 | year = {2022},
 24 | issn = {2665-9638},
 25 | doi = {https://doi.org/10.1016/j.simpa.2022.100426},
 26 | url = {https://www.sciencedirect.com/science/article/pii/S2665963822001105},
 27 | author = {Enrico Cambiaso and Maurizio Aiello},
 28 | keywords = {Cyber-security, Data dump, Database security, Browser automation},
 29 | abstract = {In the web security field, data dumping activities are often related to a malicious exploitation. In this paper, we focus on data dumping activities executed legitimately by scraping/storing data shown on the browser. We evaluate such operation by proposing Cookidump, a tool able to dump all recipes available on the Cookidoo© website portal. While such scenario is not relevant, in terms of security and privacy, we discuss the impact of such kind of activity for other scenarios including web applications hosting sensitive information.}
 30 | }
 31 | ```
 32 | 
 33 | Further information can be found at [https://www.sciencedirect.com/science/article/pii/S2665963822001105](https://www.sciencedirect.com/science/article/pii/S2665963822001105).
 34 | 
 35 | ### Features ###
 36 | 
 37 | * Easy to run
 38 | * Easy to open HTML output
 39 | * Output including a surfable list of dumped recipes
 40 | * Customizable searches
 41 | 
 42 | ### Installation ###
 43 | 
 44 | #### nix ####
 45 | 
 46 | ```
 47 | nix run github:auino/cookidump -- <outputdir> [--separate-json]
 48 | ```
 49 | 
 50 | Nix provisions `google-chrome` together with `chromedriver`. Only 
 51 | `<outputdir>` and `[--separate-json]` arguments are expected.
 52 | 
 53 | #### manual ####
 54 | 
 55 | 1. Clone the repository:
 56 | 
 57 | ```
 58 | git clone https://github.com/auino/cookidump.git
 59 | ```
 60 | 
 61 | 2. `cd` into the download folder
 62 | 
 63 | 3. Install [Python](https://www.python.org) requirements:
 64 | 
 65 | ```
 66 | pip install -r requirements.txt
 67 | ```
 68 | 
 69 | 4. Install the [Google Chrome](https://chrome.google.com) browser, if not already installed
 70 | 
 71 | 5. Download the [Chrome WebDriver](https://sites.google.com/chromium.org/driver/) and save it on the `cookidump` folder
 72 | 
 73 | 6. You are ready to dump your recipes
 74 | 
 75 | ### Usage ###
 76 | 
 77 | Simply run the following command to start the program. The program is interactive to simplify it's usage.
 78 | 
 79 | ```
 80 | python cookidump.py [--separate-json] <webdriverfile> <outputdir>
 81 | ```
 82 | 
 83 | where:
 84 | * `webdriverfile` identifies the path to the downloaded [Chrome WebDriver](https://sites.google.com/chromium.org/driver/) (for instance, `chromedriver.exe` for Windows hosts, `./chromedriver` for Linux and macOS hosts)
 85 | * `outputdir` identifies the path of the output directory (will be created, if not already existent)
 86 | * `--separate-json` allows to generate a separate JSON file for each recipe, instead of one aggregate file including all recipes
 87 | 
 88 | The program will open a [Google Chrome](https://chrome.google.com) window and wait until you are logged in into your [Cookidoo](https://cookidoo.co.uk) account (different countries are supported).
 89 | 
 90 | After that, follow intructions provided by the script itself to proceed with the dump.
 91 | 
 92 | #### Considerations ####
 93 | 
 94 | By following script instructions, it is also possible to apply custom filters to export selected recipes (for instance, in base of the dish, title and ingredients, Thermomix/Bimby version, etc.).
 95 | 
 96 | Output is represented by an `index.html` file, included in `outputdir`, plus a set of recipes inside of structured folders.
 97 | By opening the generated `index.html` file on your browser, it is possible to have a list of recipes downloaded and surf to the desired recipe.
 98 | 
 99 | The number of exported recipes is limited to around `1000` for each execution.
100 | Hence, use of filters may help in this case to reduce the number of recipes exported.
101 | 
102 | ### Other approaches ###
103 | 
104 | A different approach, previously adopted, is based on the retrieval of structured data on recipes.
105 | More information can be found on the [datastructure branch](https://github.com/auino/cookidump/tree/datastructure).
106 | Output is represented in this case in a different (structured) format, hence, it has to be interpreted. Such interpretation is not implemented in the linked previous commit.
107 | 
108 | Another community-driven approach, supported by some AI, has been released in the [community branch](https://github.com/auino/cookidump/tree/community).
109 | 
110 | ### TODO ###
111 | 
112 | * Bypass the limited number of exported recipes
113 | * Parse downloaded recipes to store them on a database, or to generate a unique linked PDF
114 | * Make Chrome run headless for better speeds
115 | * Set up a dedicated container for the program
116 | 
117 | ### Supporters ###
118 | 
119 | * [@vikramsoni2](https://github.com/vikramsoni2), regarding JSON saves plus minor enhancements
120 | * [@mrwogu](https://github.com/mrwogu), regarding additional information to be extracted on the generated JSON file, plus suggestions on the possibility to save recipes on dedicated JSON files
121 | * [@nilskrause](https://github.com/NilsKrause), regarding argument parsing and updates on the link to download the Chrome WebDriver
122 | * [@NightProgramming](https://github.com/NightProgramming), regarding the use of selenium version 3
123 | * [@morela](https://github.com/morela), regarding the update of the tool to support a newer version of Selenium
124 | * [@ndjc](https://github.com/ndjc), fixing some deprecation warnings
125 | 
126 | ### Disclaimer ###
127 | 
128 | The authors of this program are not responsible of the usage of it.
129 | This program is released only for research and dissemination purposes.
130 | Also, the program provides users the ability to locally and temporarily store recipes accessible through a legit subscription.
131 | Before using this program, check Cookidoo subscription terms of service, according to the country related to the exploited subscription. 
132 | Sharing of the obtained recipes is not a legit activity and the authors of this program are not responsible of any illecit and sharing activity accomplished by the users.
133 | 
134 | ### Contacts ###
135 | 
136 | You can find me on Twitter as [@auino](https://twitter.com/auino).
137 | 


--------------------------------------------------------------------------------
/cookidump.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python3
  2 | 
  3 | # cookidump
  4 | # Original GitHub project:
  5 | # https://github.com/auino/cookidump
  6 | 
  7 | import os
  8 | import io
  9 | import re
 10 | import time
 11 | import json
 12 | import pathlib
 13 | import argparse
 14 | import platform
 15 | from bs4 import BeautifulSoup
 16 | from selenium import webdriver
 17 | from urllib.parse import urlparse
 18 | from urllib.request import urlretrieve
 19 | from selenium.webdriver.common.by import By
 20 | from selenium.webdriver.common.keys import Keys
 21 | from selenium.webdriver.chrome.options import Options
 22 | from selenium.webdriver.chrome.service import Service
 23 | from selenium.webdriver.common.action_chains import ActionChains
 24 | 
 25 | PAGELOAD_TO = 3
 26 | SCROLL_TO = 1
 27 | MAX_SCROLL_RETRIES = 5
 28 | 
 29 | def startBrowser(chrome_driver_path):
 30 |     """Starts browser with predefined parameters"""
 31 |     chrome_options = Options()
 32 |     if "GOOGLE_CHROME_PATH" in os.environ:
 33 |         chrome_options.binary_location = os.getenv('GOOGLE_CHROME_PATH')
 34 |     #chrome_options.add_argument('--headless')
 35 |     chrome_service = Service(chrome_driver_path)
 36 |     driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
 37 |     return driver
 38 | 
 39 | def listToFile(browser, baseDir):
 40 |     """Gets html from search list and saves in html file"""
 41 |     filename = '{}index.html'.format(baseDir)
 42 |     # creating directories, if needed
 43 |     path = pathlib.Path(filename)
 44 |     path.parent.mkdir(parents=True, exist_ok=True)
 45 |     # getting web page source
 46 |     #html = browser.page_source
 47 |     html = browser.execute_script("return document.documentElement.outerHTML")
 48 |     # saving the page
 49 |     with io.open(filename, 'w', encoding='utf-8') as f: f.write(html)
 50 | 
 51 | def imgToFile(outputdir, recipeID, img_url):
 52 |     img_path = '{}images/{}.jpg'.format(outputdir, recipeID)
 53 |     path = pathlib.Path(img_path)
 54 |     path.parent.mkdir(parents=True, exist_ok=True)
 55 |     urlretrieve(img_url, img_path)
 56 |     return '../images/{}.jpg'.format(recipeID)
 57 | 
 58 | def recipeToFile(browser, filename):
 59 |     """Gets html of the recipe and saves in html file"""
 60 |     # creating directories, if needed
 61 |     path = pathlib.Path(filename)
 62 |     path.parent.mkdir(parents=True, exist_ok=True)
 63 |     # getting web page source
 64 |     html = browser.page_source
 65 |     # saving the page
 66 |     with io.open(filename, 'w', encoding='utf-8') as f: f.write(html)
 67 | 
 68 | def recipeToJSON(browser, recipeID):
 69 |     html = browser.page_source
 70 |     soup = BeautifulSoup(html, 'html.parser')
 71 | 
 72 |     recipe = {}
 73 |     recipe['id'] = recipeID
 74 |     recipe['language'] = soup.select_one('html').attrs['lang']
 75 |     recipe['title'] = soup.select_one(".recipe-card__title").text
 76 |     recipe['rating_count'] = re.sub(r'\D', '', soup.select_one(".core-rating__label").text, flags=re.IGNORECASE)
 77 |     recipe['rating_score'] = soup.select_one(".core-rating__counter").text
 78 |     recipe['tm-versions'] = [v.text.replace('\n','').strip().lower() for v in soup.select(".recipe-card__tm-version core-badge")]
 79 |     recipe.update({ l.text : l.next_sibling.strip() for l in soup.select("core-feature-icons label span") })
 80 |     recipe['ingredients'] = [re.sub(' +', ' ', li.text).replace('\n','').strip() for li in soup.select("#ingredients li")]
 81 |     recipe['nutritions'] = {}
 82 |     for item in list(zip(soup.select(".nutritions dl")[0].find_all("dt"), soup.select(".nutritions dl")[0].find_all("dd"))):
 83 |         dt, dl = item
 84 |         recipe['nutritions'].update({ dt.string.replace('\n','').strip().lower(): re.sub(r'\s{2,}', ' ', dl.string.replace('\n','').strip().lower()) })
 85 |     recipe['steps'] = [re.sub(' +', ' ', li.text).replace('\n','').strip() for li in soup.select("#preparation-steps li")]
 86 |     recipe['tags'] = [a.text.replace('#','').replace('\n','').strip().lower() for a in soup.select(".core-tags-wrapper__tags-container a")]
 87 | 
 88 |     return recipe
 89 | 
 90 | def run(webdriverfile, outputdir, separate_json):
 91 |     """Scraps all recipes and stores them in html"""
 92 |     print('[CD] Welcome to cookidump, starting things off...')
 93 |     # fixing the outputdir parameter, if needed
 94 |     if outputdir[-1:][0] != '/': outputdir += '/'
 95 |     locale = str(input('[CD] Complete the website domain: https://cookidoo.'))
 96 |     baseURL = 'https://cookidoo.{}/'.format(locale)
 97 |     brw = startBrowser(webdriverfile)
 98 |     # opening the home page
 99 |     brw.get(baseURL)
100 |     time.sleep(PAGELOAD_TO)
101 |     reply = input('[CD] Please login to your account and then enter y to continue: ')
102 |     # recipes base url
103 |     rbURL = 'https://cookidoo.{}/search/'.format(locale)
104 |     brw.get(rbURL)
105 |     time.sleep(PAGELOAD_TO)
106 |     # possible filters done here
107 |     reply = input('[CD] Set your filters, if any, and then enter y to continue: ')
108 |     # asking for additional details for output organization
109 |     custom_output_dir = input("[CD] enter the directory name to store the results (ex. vegeratian): ")
110 |     if custom_output_dir : outputdir += '{}/'.format(custom_output_dir)
111 |     # proceeding
112 |     print('[CD] Proceeding with scraping')
113 |     # removing the name
114 |     brw.execute_script("var element = arguments[0];element.parentNode.removeChild(element);", brw.find_element(By.TAG_NAME, 'core-user-profile'))
115 |     # clicking on cookie accept
116 |     try: brw.find_element(By.CLASS_NAME, 'accept-cookie-container').click()
117 |     except: pass
118 |     # showing all recipes
119 |     elementsToBeFound = int(brw.find_element(By.CLASS_NAME, 'items-start').text.split('\n')[-1].split(' ')[0])
120 |     previousElements = 0
121 |     while True:
122 |         # checking if ended or not
123 |         currentElements = len(brw.find_elements(By.CLASS_NAME, 'link--alt'))
124 |         if currentElements >= elementsToBeFound: break
125 |         # scrolling to the end
126 |         brw.execute_script("window.scrollTo(0, document.body.scrollHeight);")
127 |         time.sleep(SCROLL_TO)
128 |         # clicking on the "load more recipes" button
129 |         try:
130 |             brw.find_element(By.ID, 'load-more-page').click()
131 |             time.sleep(PAGELOAD_TO)
132 |         except: pass
133 |         print('Scrolling [{}/{}]'.format(currentElements, elementsToBeFound))
134 |         # checking if I can't load more elements
135 |         count = count + 1 if previousElements == currentElements else 0
136 |         if count >= MAX_SCROLL_RETRIES: break
137 |         previousElements = currentElements
138 | 
139 |     print('Scrolling [{}/{}]'.format(currentElements, elementsToBeFound))
140 | 
141 |     # saving all recipes urls
142 |     els = brw.find_elements(By.CLASS_NAME, 'link--alt')
143 |     recipesURLs = []
144 |     for el in els:
145 |         recipeURL = el.get_attribute('href')
146 |         recipesURLs.append(recipeURL)
147 |         recipeID = recipeURL.split('/')[-1:][0]
148 |         brw.execute_script("arguments[0].setAttribute(arguments[1], arguments[2]);", el, 'href', './recipes/{}.html'.format(recipeID))
149 | 
150 |     # removing search bar
151 |     try: brw.execute_script("var element = arguments[0];element.parentNode.removeChild(element);", brw.find_element(By.TAG_NAME, 'core-search-bar'))
152 |     except: pass
153 | 
154 |     # removing scripts
155 |     for s in brw.find_elements(By.TAG_NAME, 'script'):
156 |         try: brw.execute_script("var element = arguments[0];element.parentNode.removeChild(element);", s)
157 |         except: pass
158 | 
159 |     # saving the list to file
160 |     listToFile(brw, outputdir)
161 | 
162 |     # filter recipe Url list because it contains terms-of-use, privacy, disclaimer links too
163 |     recipesURLs = [l for l in recipesURLs if 'recipe' in l]
164 | 
165 |     # getting all recipes
166 |     print("Getting all recipes...")
167 |     c = 0
168 |     recipeData = []
169 |     for recipeURL in recipesURLs:
170 |         try:
171 |             # building urls
172 |             u = str(urlparse(recipeURL).path)
173 |             if u[0] == '/': u = '.'+u
174 |             recipeID = u.split('/')[-1:][0]
175 |             # opening recipe url
176 |             brw.get(recipeURL)
177 |             time.sleep(PAGELOAD_TO)
178 |             # removing the base href header
179 |             try: brw.execute_script("var element = arguments[0];element.parentNode.removeChild(element);", brw.find_element(By.TAG_NAME, 'base'))
180 |             except: pass
181 |             # removing the name
182 |             brw.execute_script("var element = arguments[0];element.parentNode.removeChild(element);", brw.find_element(By.TAG_NAME, 'core-user-profile'))
183 |             # changing the top url
184 |             brw.execute_script("arguments[0].setAttribute(arguments[1], arguments[2]);", brw.find_element(By.CLASS_NAME, 'page-header__home'), 'href', '../../index.html')
185 |             # saving recipe image
186 |             img_url = brw.find_element(By.ID, 'recipe-card__image-loader').find_element(By.TAG_NAME, 'img').get_attribute('src')
187 |             local_img_path = imgToFile(outputdir, recipeID, img_url)
188 |             # change the image url to local
189 |             brw.execute_script("arguments[0].setAttribute(arguments[1], arguments[2]);", brw.find_element(By.CLASS_NAME, 'core-tile__image'), 'srcset', '')
190 |             brw.execute_script("arguments[0].setAttribute(arguments[1], arguments[2]);", brw.find_element(By.CLASS_NAME, 'core-tile__image'), 'src', local_img_path)
191 |             # saving the file
192 |             recipeToFile(brw, '{}recipes/{}.html'.format(outputdir, recipeID))
193 |             # extracting JSON info
194 |             recipe = recipeToJSON(brw, recipeID)
195 |             # saving JSON file, if needed
196 |             if separate_json:
197 |                 print('[CD] Writing recipe to JSON file')
198 |                 with open('{}recipes/{}.json'.format(outputdir, recipeID), 'w') as outfile: json.dump(recipe, outfile)
199 |             else:
200 |                 recipeData.append(recipe)
201 |             # printing information
202 |             c += 1
203 |             if c % 10 == 0: print('Dumped recipes: {}/{}'.format(c, len(recipesURLs)))
204 |         except: pass
205 | 
206 |     # save JSON file, if needed
207 |     if not separate_json:
208 |         print('[CD] Writing recipes to JSON file')
209 |         with open('{}data.json'.format(outputdir), 'w') as outfile: json.dump(recipeData, outfile)
210 | 
211 |     # logging out
212 |     logoutURL = 'https://cookidoo.{}/profile/logout'.format(locale)
213 |     brw.get(logoutURL)
214 |     time.sleep(PAGELOAD_TO)
215 | 
216 |     # closing session
217 |     print('[CD] Closing session\n[CD] Goodbye!')
218 |     brw.close()
219 | 
220 | if  __name__ =='__main__':
221 |     parser = argparse.ArgumentParser(description='Dump Cookidoo recipes from a valid account')
222 |     parser.add_argument('webdriverfile', type=str, help='the path to the Chrome WebDriver file')
223 |     parser.add_argument('outputdir', type=str, help='the output directory')
224 |     parser.add_argument('-s', '--separate-json', action='store_true', help='Create a separate JSON file for each recipe; otherwise, a single data file will be generated')
225 |     args = parser.parse_args()
226 |     run(args.webdriverfile, args.outputdir, args.separate_json)
227 | 


--------------------------------------------------------------------------------