├── code
    ├── run.sh
    ├── home.py
    ├── buildspec.yml
    └── Dockerfile
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
└── CONTRIBUTING.md


/code/run.sh:
--------------------------------------------------------------------------------
#!/bin/sh
# Start a virtual X display in the background, then run the scraper.
Xvfb -ac -nolisten inet6 :99 &
python3 /opt/home.py

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.

--------------------------------------------------------------------------------
/code/home.py:
--------------------------------------------------------------------------------

import time
import datetime
import boto3
from botocore.exceptions import ClientError
from selenium import webdriver

from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service


options = Options()
# Run Firefox without a visible window.
options.add_argument("-headless")

# Assumes Selenium 4, which takes the geckodriver path through a Service object;
# the executable_path keyword is only accepted by older Selenium releases.
driver = webdriver.Firefox(options=options, service=Service(executable_path='/opt/geckodriver'))

print('start your scraping project')


--------------------------------------------------------------------------------
/code/buildspec.yml:
--------------------------------------------------------------------------------
version: 0.2
phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      # get-login-password also works with AWS CLI v2, where get-login is no longer available.
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
      - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG

  post_build:
    commands:
      - echo done


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/code/Dockerfile:
--------------------------------------------------------------------------------
FROM amazonlinux
RUN yum update -y
RUN yum install -y \
    gcc \
    openssl-devel \
    zlib-devel \
    libffi-devel \
    python3 \
    python3-pip \
    git \
    Xvfb \
    gtk3 \
    dbus-glib \
    wget && \
    yum -y clean all
RUN yum -y groupinstall development
WORKDIR /opt

RUN wget -O- "https://download.mozilla.org/?product=firefox-latest-ssl&os=linux64&lang=en-US" | tar -jx -C /usr/local/
RUN ln -s /usr/local/firefox/firefox /usr/bin/firefox

RUN pip3 install --no-cache-dir selenium boto3 xvfbwrapper


RUN wget https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz
RUN tar -xf geckodriver-v0.26.0-linux64.tar.gz
RUN ls -lta
RUN rm geckodriver-v0.26.0-linux64.tar.gz

RUN chmod +x geckodriver
# ENV persists into the running container; a variable exported in a RUN step
# is lost as soon as that build step finishes.
ENV DISPLAY=:99
#RUN Xvfb -ac -nolisten inet6 :99 &


COPY run.sh /opt/run.sh
COPY home.py /opt/home.py
RUN chmod +x /opt/run.sh
ENTRYPOINT ["/opt/run.sh", "--no-save"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## A web scraper in a Docker container hosted on AWS

This example shows how to build and run a Docker image that contains the Firefox web browser and Python libraries such as Selenium, so that you can host a web scraper on AWS.

## Instructions

The example contains a CloudFormation script that rebuilds the project and its infrastructure automatically on AWS. Update the `home.py` file in the `code` folder with your own logic.
Then archive the contents of the `code` folder as `code.zip` and upload the archive to your S3 bucket, which the CloudFormation script refers to as `S3HostingBucket`.

## Note

This solution relies on the Firefox browser, which receives frequent updates that include important security fixes. Make sure that you are running the latest version, or consider alternatives. For example, Selenium requires a web browser, whereas other scraping libraries can run without one.

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can.
Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *master* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to. Because our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), any 'help wanted' issue is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.

--------------------------------------------------------------------------------
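
The packaging step described in README.md (archive the contents of the `code` folder as `code.zip`, then upload it to the bucket named by `S3HostingBucket`) could be scripted roughly as follows. This is a minimal sketch and not part of the repository: the function names `build_code_archive` and `upload_code_archive` are hypothetical, the object key `code.zip` is an assumption about what the CloudFormation script expects, and the upload presumes that boto3 and AWS credentials are available.

```python
import zipfile
from pathlib import Path
from typing import List


def build_code_archive(code_dir: str, archive_path: str = "code.zip") -> List[str]:
    """Zip the files inside code_dir so they sit at the root of the archive.

    Hypothetical helper: returns the archive member names that were written.
    """
    root = Path(code_dir)
    target = Path(archive_path).resolve()
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories and the archive itself if it lives inside code_dir.
            if not path.is_file() or path.resolve() == target:
                continue
            # Store "Dockerfile", not "code/Dockerfile".
            arcname = path.relative_to(root).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names


def upload_code_archive(archive_path: str, bucket: str, key: str = "code.zip") -> None:
    """Upload the archive to S3 (assumed key: code.zip).

    boto3 is imported here so that the zipping step works without it.
    """
    import boto3  # assumption: AWS credentials are configured in the environment

    boto3.client("s3").upload_file(archive_path, bucket, key)
```

Calling `build_code_archive("code")` from the repository root would write `code.zip` to the current directory; `upload_code_archive("code.zip", "<your bucket name>")` would then push it to the bucket you passed to the CloudFormation script.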